Similar literature (20 results)
1.
MOTIVATION: Machine learning methods such as neural networks, support vector machines, and other classification and regression methods rely on iterative optimization of model quality over the method's parameter space. Model quality measures (accuracies, correlations, etc.) are frequently overly optimistic because the training sets are dominated by particular families and subfamilies. To overcome this bias, the dataset is usually reduced by filtering out closely related objects. However, such filtering uses fixed similarity thresholds and ignores part of the training information. RESULTS: We suggest a novel approach to calculating prediction model quality based on assigning to each data point an inverse density weight derived from the postulated distance metric. We demonstrate that the new weighted measures estimate model generalization better and are consistent with machine learning theory. The Vapnik-Chervonenkis theorem was reformulated and applied to derive space-uniform error estimates. Two examples illustrate the advantages of inverse density weighting. First, we demonstrate on a set with a built-in bias that the unweighted cross-validation procedure leads to an overly optimistic quality estimate, while the density-weighted quality estimates are more realistic. Second, an analytical equation for weighted quality estimates was used to derive an SVM model for signal peptide prediction using the full set of known signal peptides instead of the usual filtered subset.
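The idea of inverse density weighting in the abstract above can be illustrated with a minimal sketch. The density estimate used here (the number of points within a fixed radius) and the names `inverse_density_weights`, `weighted_accuracy`, and `radius` are assumptions made for illustration; the abstract does not state the exact weighting formula.

```python
from typing import Callable, List, Sequence


def inverse_density_weights(
    points: Sequence[float],
    distance: Callable[[float, float], float],
    radius: float,
) -> List[float]:
    """Weight each point by 1 / (number of points within `radius`, itself included)."""
    weights = []
    for p in points:
        neighbours = sum(1 for q in points if distance(p, q) <= radius)
        weights.append(1.0 / neighbours)  # neighbours >= 1 because p counts itself
    return weights


def weighted_accuracy(correct: Sequence[bool], weights: Sequence[float]) -> float:
    """Fraction of the total weight that belongs to correctly predicted points."""
    return sum(w for ok, w in zip(correct, weights) if ok) / sum(weights)


# Hypothetical toy data: three near-duplicates predicted correctly,
# one isolated point predicted incorrectly.
points = [0.0, 0.1, 0.2, 5.0]
correct = [True, True, True, False]

weights = inverse_density_weights(points, lambda a, b: abs(a - b), radius=0.5)
plain = sum(correct) / len(correct)          # unweighted accuracy
weighted = weighted_accuracy(correct, weights)
```

In this toy case the dense cluster counts as roughly one effective data point, so the weighted estimate is lower than the unweighted one, which mirrors the over-optimism the abstract describes.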

2.
In the drug discovery process, the metabolic fate of drugs is crucially important for preventing drug-drug interactions. Therefore, P450 isozyme selectivity prediction is an important task in screening for drugs with appropriate metabolic profiles. Recently, large-scale activity data for five P450 isozymes (CYP1A2, CYP2C9, CYP3A4, CYP2D6, and CYP2C19) have been obtained using quantitative high-throughput screening with a bioluminescence assay. Although some isozymes share similar selectivities, conventional supervised learning algorithms learn a prediction model for each P450 isozyme independently. They are unable to exploit the activity data of the other P450 isozymes to improve the predictive performance for each isozyme's selectivity. To address this issue, we apply transfer learning, which uses the activity data of the other isozymes to learn a prediction model from multiple P450 isozymes. We evaluate the model's predictive performance on the large-scale selectivity dataset for the five P450 isozymes. Experimental results show that, overall, our algorithm outperforms conventional supervised learning algorithms such as the support vector machine (SVM), weighted k-nearest neighbor classifier, Bagging, AdaBoost, and latent semantic indexing (LSI). Moreover, our results show that the predictive performance of our algorithm is improved by exploiting the activity data of multiple P450 isozymes in the learning process. Our algorithm can be an effective tool for predicting the P450 selectivity of new chemical entities using multiple P450 isozyme activity data.

3.
4.
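Robust RBF modeling of glutathione fermentation based on an entropy criterion   Total citations: 1, self-citations: 0, citations by others: 1
When modeling the glutathione fermentation process, noise in the experimental data often degrades the prediction accuracy and generalization ability of the model. To address this problem, a new RBF neural network modeling method based on an entropy criterion is proposed. Compared with traditional modeling methods based on the MSE criterion function, the new method learns the model parameters from the overall distribution structure of the training samples, which effectively avoids the overfitting and poor generalization of traditional MSE-based RBF networks. The model was applied to modeling a real glutathione fermentation process. The experimental results show that the method achieves high prediction accuracy and generalization ability as well as good robustness, and therefore has potential application value for glutathione fermentation modeling.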

5.
In this work, we evaluate the performance of machine learning approaches for predicting successful eradication of aquatic invasive species (AIS) and assess the extent to which eradication of an invasive species depends on certain specified ecological features of the target ecosystem and/or features that characterize the planned intervention. We studied the outcomes of 143 planned attempts to eradicate AIS, where each attempt was described by ecological and eradication-strategy-related features of the target ecosystem. We considered several machine learning approaches to determine whether one could produce a classifier that accurately predicts whether an invasive species will be eradicated. To assess each learner’s performance, we examined its tenfold cross-validated prediction accuracy as well as the false positive rate, the F-measure, and the Area Under the ROC Curve. We also used Kaplan–Meier survival analysis to determine which features are relevant to predicting the time required for each eradication program. Across the five typical machine learning approaches, our analysis suggests that decision tree learners work well and have the best performance. In particular, by examining the trained decision tree model, we found that if the occupied area was not large and/or containment of AIS dispersal was employed, the eradication of AIS was likely to be successful. We also trained decision tree models using only the ecological features and found that their performance was comparable with that of models trained using all features. As our trained decision tree models are accurate, decision makers can use them to estimate the result of proposed actions before committing to a specific strategy.

6.
Air pollution is a serious threat to both the ecological environment and the physical health of individuals. Therefore, accurate air quality prediction is urgent and necessary for pollution mitigation and residents’ travel. However, few existing models predict air quality based on the dynamic spatiotemporal correlation of air pollutants. In this paper, a novel deep learning model combining the dynamic graph convolutional network and the multi-channel temporal convolutional network (DGC-MTCN) is proposed for air quality prediction. To efficiently represent the time-varying spatial dependencies, a new spatiotemporal dynamic correlation calculation method based on gray relation analysis is proposed to construct dynamic adjacency matrices. Then, the spatiotemporal features are extracted by the graph convolutional network and the multi-channel temporal convolutional network. Two real-world air quality datasets collected from Beijing and Fushun are used to verify the performance of the proposed model. The experimental results show that, compared with other baselines, the DGC-MTCN model has excellent prediction accuracy. Especially for multi-step prediction and for prediction at different stations, our model shows better temporal stability and generalization ability.
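As a rough illustration of how gray relation analysis could yield time-varying adjacency matrices, the sketch below computes the classical gray relational grade between station series inside a sliding window. This is the textbook formulation with distinguishing coefficient `rho = 0.5`; the function names, the window scheme, and the toy readings are assumptions, and the paper's actual correlation method may differ.

```python
from typing import List, Sequence


def gray_relational_grades(
    reference: Sequence[float],
    others: Sequence[Sequence[float]],
    rho: float = 0.5,
) -> List[float]:
    """Classical gray relational grade of each series in `others` w.r.t. `reference`."""
    diffs = [[abs(r - x) for r, x in zip(reference, series)] for series in others]
    flat = [d for row in diffs for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:  # every series is identical to the reference
        return [1.0] * len(others)
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in diffs
    ]


def dynamic_adjacency(
    stations: Sequence[Sequence[float]], window: int
) -> List[List[List[float]]]:
    """One adjacency matrix per sliding window; row i uses station i as reference."""
    n_steps = len(stations[0])
    matrices = []
    for start in range(n_steps - window + 1):
        chunk = [s[start:start + window] for s in stations]
        matrices.append([gray_relational_grades(ref, chunk) for ref in chunk])
    return matrices


# Hypothetical, already normalized pollutant readings for three stations.
stations = [
    [1.0, 2.0, 3.0, 3.0],
    [1.0, 2.0, 3.5, 3.0],
    [3.0, 1.0, 0.0, 2.0],
]
matrices = dynamic_adjacency(stations, window=3)
first = matrices[0]
```

Note that the resulting matrices are not necessarily symmetric, because the minimum and maximum differences depend on which station is the reference; a real pipeline would also normalize the raw series before comparing them.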

7.
When the standard approach of predicting protein function by sequence homology fails, alternative methods can be used that require only the amino acid sequence. One such approach uses machine learning to predict protein function directly from amino acid sequence features. However, two issues must be addressed before successful functional prediction can take place: identifying discriminatory features, and overcoming the challenge of a large imbalance in the training data. We show that applying feature subset selection followed by undersampling of the majority class generates significantly better support vector machine (SVM) classifiers than standard machine learning approaches. As well as revealing that the selected features could advance our understanding of the relationship between sequence and function, we show that undersampling to produce fully balanced data significantly improves performance. The best discriminating ability is achieved using SVMs together with feature selection and full undersampling; this approach strongly outperforms other competitive learning algorithms. We conclude that this combined approach can generate powerful machine learning classifiers for predicting protein function directly from sequence.
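The "full undersampling" step mentioned above can be sketched as random undersampling of every class down to the size of the smallest one. This is a generic illustration; the function name `undersample`, the fixed seed, and the toy data are assumptions, and feature selection and SVM training are deliberately left out.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def undersample(
    samples: Sequence, labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List, List]:
    """Randomly drop samples so that every class keeps as many as the rarest class."""
    rng = random.Random(seed)  # seeded for reproducibility
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced.extend((x, y) for x in rng.sample(xs, n_min))

    rng.shuffle(balanced)  # avoid returning the classes in blocks
    return [x for x, _ in balanced], [y for _, y in balanced]


# Hypothetical imbalanced set: eight negatives (ids 0-7), two positives (ids 8-9).
sample_ids = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_ids, balanced_labels = undersample(sample_ids, labels, seed=1)
```

The balanced subset would then be passed to whatever classifier is being trained; repeating the procedure with different seeds is a common way to reduce the variance introduced by discarding majority-class samples.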

8.
Identifying thermostability from amino acid sequence information would be helpful in computational screening for thermostable proteins. We have developed a method to discriminate thermophilic and mesophilic proteins based on support vector machines. Using self-consistency validation, 5-fold cross-validation, and independent testing with other datasets, this module achieved overall accuracies of 94.2%, 90.5%, and 92.4%, respectively. When evaluated using these three validation methods, the performance of this SVM-based module was better than that of classifiers built using alternative machine learning and statistical algorithms, including artificial neural networks, Bayesian statistics, and decision trees. The influence of protein size on prediction accuracy was also addressed.

9.
The biopharmaceutical industry continuously seeks to optimize the critical quality attributes to maintain the reliability and cost-effectiveness of its products. Such optimization demands a scalable and optimal control strategy to meet the process constraints and objectives. This work uses a model predictive controller (MPC) to compute an optimal feeding strategy leading to maximized cell growth and metabolite production in fed-batch cell culture processes. The lack of high-fidelity physics-based models and the high complexity of cell culture processes motivated us to use machine learning algorithms in the forecast model to aid our development. We took advantage of linear regression, the Gaussian process and neural network models in the MPC design to maximize the daily protein production for each batch. The control scheme of the cell culture process solves an optimization problem while maintaining all metabolites and cell culture process variables within the specification. The linear and nonlinear models are developed based on real cell culture process data, and the performance of the designed controllers is evaluated by running several real-time experiments.

10.
Accurate morbidity prediction can contribute greatly to the efficiency of medical services. Gastrointestinal infectious diseases are largely influenced by environmental pollutants, but predicting their morbidity from pollution indicators is difficult because of the complex relationship between the pollutants and the infections. This study presents a deep neural network (DNN) model for estimating the morbidity of gastrointestinal infections based on 129 types of pollutants contained in soil and water. The DNN uses a deep Boltzmann machine (DBM) to model the unknown probabilistic relationship between the pollutants, and employs a Gaussian mixture model (GMM) to output the estimated morbidity. We also propose an evolutionary algorithm for efficiently training the DNN. Experiments on a data set from four counties in central China show that the proposed model can estimate morbidity much more accurately than traditional neural network and linear regression models.

11.
We describe a supervised prediction method for the diagnosis of acute myeloid leukemia (AML) from patient samples based on flow cytometry measurements. We use a data-driven approach with machine learning methods to train a computational model that takes in flow cytometry measurements from a single patient and gives a confidence score for the patient being AML-positive. Our solution is based on a regularized logistic regression model that aggregates AML test statistics calculated from individual test tubes with different cell populations and fluorescent markers. The model construction is entirely data-driven and no prior biological knowledge is used. The described solution scored 100% classification accuracy in the DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukaemia Challenge against a gold standard consisting of 20 AML-positive and 160 healthy patients. Here we perform a more extensive validation of the prediction model's performance and further improve and simplify our original method, showing that statistically equal results can be obtained by using simple average marker intensities as features in the logistic regression model. In addition to the logistic-regression-based model, we also present other classification models and compare their performance quantitatively. The key benefit of our prediction method compared to other solutions with similar performance is that our model uses only a small fraction of the flow cytometry measurements, making our solution highly economical.

12.
Papermaking wastewater accounts for a large proportion of industrial wastewater, and it is essential to obtain accurate and reliable effluent indices in real time. Considering the complexity, nonlinearity, and time variability of wastewater treatment processes, a dynamic kernel extreme learning machine (DKELM) method is proposed to predict a key effluent quality index, the chemical oxygen demand (COD). A time lag coefficient is introduced and a kernel function is embedded into the extreme learning machine (ELM) to extract dynamic information and obtain better prediction accuracy. A case study modeling a wastewater treatment process is used to evaluate the performance of the proposed DKELM. The results illustrate that both the training and prediction accuracy of the DKELM model are superior to those of other models. For the prediction of effluent COD, the coefficient of determination of the DKELM model is increased by 27.52%, 21.36%, 10.42%, and 10.81% compared with partial least squares, ELM, dynamic ELM, and kernel ELM, respectively.
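Two ingredients named in the abstract above, a time lag and the coefficient of determination, can be sketched without the kernel ELM itself. The sketch assumes that the time lag means augmenting each input with its preceding samples, which the abstract does not confirm; `lagged_inputs`, `r_squared`, and the toy readings are illustrative names and values only.

```python
from typing import List, Sequence


def lagged_inputs(series: Sequence[float], lag: int) -> List[List[float]]:
    """Row for time t holds [x(t-lag), ..., x(t)]; the first `lag` steps are dropped."""
    return [list(series[t - lag:t + 1]) for t in range(lag, len(series))]


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when y_true is constant")
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Hypothetical influent readings turned into lag-2 input rows.
rows = lagged_inputs([1.0, 2.0, 3.0, 4.0], lag=2)

# A perfect predictor scores 1.0; always predicting the mean scores 0.0.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Rows built this way would be the inputs to the regression model, and `r_squared` on held-out data gives the kind of figure the abstract compares across methods.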

13.
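Machine learning algorithms and their applications in environmental microbiology research   Total citations: 1, self-citations: 0, citations by others: 1
Chen He  Tao Ye  Mao Zhendu  Xing Peng 《Acta Microbiologica Sinica》2022,62(12):4646-4662
Microorganisms are ubiquitous in the environment. They are key participants in biogeochemical cycles and environmental evolution, and they also play important roles in environmental monitoring, ecological remediation, and protection. With the development of high-throughput technologies, large volumes of microbial data are being generated. Applying machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and societal applications in areas such as microbial biomarker identification, pollutant prediction, and environmental quality prediction. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of the input data through methods such as clustering and dimensionality reduction, and thereby integrates and categorizes microbial data. Supervised learning trains a model on microbial datasets that have both features and labels; when presented with data that have features but no labels, the model can infer the labels, which enables classification, identification, and prediction for new data. However, complex machine learning algorithms usually focus on the accuracy of model predictions at the expense of interpretability. Machine learning models can often be regarded as "black boxes" that predict a particular outcome, meaning that little is known about how the model arrives at its predictions. To apply machine learning more widely in microbiome research and to improve our ability to extract valuable microbial information, it is particularly important to understand machine learning algorithms in depth and to improve model interpretability. This article introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models from microbiome data, including feature selection, algorithm selection, model construction, and evaluation. It reviews applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environment, discusses methods for improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.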

14.
The thermostability of proteins is particularly relevant for enzyme engineering. Developing a computational method to identify mesophilic proteins would be helpful for protein engineering and design. In this work, we developed a support vector machine-based method to predict thermophilic proteins using information on amino acid distribution and selected amino acid pairs. A reliable benchmark dataset including 915 thermophilic proteins and 793 non-thermophilic proteins was constructed for training and testing the proposed models. Results showed that 93.8% of thermophilic proteins and 92.7% of non-thermophilic proteins could be correctly predicted using jackknife cross-validation. The high prediction success rate indicates that this model can be applied to designing stable proteins.

15.
Meissner M  Koch O  Klebe G  Schneider G 《Proteins》2009,74(2):344-352
We present machine learning approaches for turn prediction from the amino acid sequence. Different turn classes and types were considered based on a novel turn classification scheme. We trained one unsupervised classifier (a self-organizing map) and two kernel-based classifiers, namely the support vector machine and a probabilistic neural network. Turn versus non-turn classification was carried out for turn families containing intramolecular hydrogen bonds and three to six residues. Support vector machine classifiers yielded a Matthews correlation coefficient (MCC) of approximately 0.6 and a prediction accuracy of 80%. Probabilistic neural networks were developed for beta-turn type prediction. The method was able to distinguish between five types of beta-turns, yielding MCC > 0.5 and at least 80% overall accuracy. We conclude that the proposed new turn classification is distinct and well-defined, and that machine learning classifiers are suited for sequence-based turn prediction. Their potential for sequence-based prediction of turn structures is discussed.
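Because the abstract above reports both a Matthews correlation coefficient and an accuracy, a small sketch of how the two relate may help. The confusion-matrix counts below are hypothetical values chosen so that the accuracy is 80% and the MCC is 0.6; they are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0.0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical balanced case: 80% accuracy corresponds to an MCC of 0.6.
balanced_acc = accuracy(tp=40, tn=40, fp=10, fn=10)
balanced_mcc = mcc(tp=40, tn=40, fp=10, fn=10)

# Hypothetical imbalanced case: always predicting "non-turn" looks accurate,
# but the MCC shows that the classifier carries no information.
trivial_acc = accuracy(tp=0, tn=90, fp=0, fn=10)
trivial_mcc = mcc(tp=0, tn=90, fp=0, fn=10)
```

The second case shows why the MCC is often reported next to accuracy for sequence-based prediction, where one class (here, non-turn residues) can dominate the data.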

16.
Market impact cost is the most significant component of implicit transaction costs, and reducing it can lower the overall transaction cost, although it cannot be measured directly. In this paper, we employ state-of-the-art nonparametric machine learning models (neural networks, a Bayesian neural network, a Gaussian process, and support vector regression) to predict market impact cost accurately and to provide a predictive model that is versatile in the number of variables. We collected a large amount of real single-transaction data from the US stock market via the Bloomberg Terminal and generated three independent input variables. As a result, most nonparametric machine learning models outperformed a state-of-the-art benchmark parametric model, the I-star model, in four error measures. Although these models encounter certain difficulties in directly separating the permanent and temporary costs, nonparametric machine learning models can be good alternatives for reducing transaction costs by considerably improving prediction performance.

17.
A support vector machine (SVM) modeling approach for short-term load forecasting is proposed. The SVM learning scheme is applied to the power load data, forcing the network to learn the inherent internal temporal property of the power load sequence. We also study the performance when other related input variables such as temperature and humidity are considered. The proposed SVM modeling approach has been tested and compared with feed-forward neural network and cosine radial basis function neural network approaches. Numerical results show that the SVM approach yields better generalization capability and lower prediction error than those neural network approaches.

18.
Abstract

An accurate and rapid toxic gas concentration prediction model plays an important role in the emergency response to sudden gas leaks. However, it is difficult for existing dispersion models to meet accuracy and efficiency requirements at the same time. Although some researchers have considered developing new forecasting models with traditional machine learning, such as the back propagation (BP) neural network and the support vector machine (SVM), the prediction results obtained from such models still need to be improved in terms of accuracy. Therefore, new prediction models based on deep learning are proposed in this paper. Deep learning has obvious advantages over traditional machine learning in prediction and classification. Deep belief networks (DBNs) and convolutional neural networks (CNNs) are used to build new dispersion models here. Both models are compared with the Gaussian plume model, a computational fluid dynamics (CFD) model, and models based on traditional machine learning in terms of accuracy, prediction time, and computation time. The experimental results show that the CNN model performs better across all evaluation indexes.

19.

Background

Osteoarthritis (OA) is the most common form of arthritis. Analgesics are widely used in the treatment of arthritis, which may increase the risk of cardiovascular diseases by 20% to 50% overall. There are few studies on the side effects of OA medication, especially risk prediction models for the side effects of analgesics. In addition, most prediction models do not provide clinically useful interpretable rules to explain the reasoning behind their predictions. In order to assist OA patients, we use the eXtreme Gradient Boosting (XGBoost) method to balance the accuracy and interpretability of the prediction model.

Results

In this study, we used the XGBoost model as a classifier. It is a supervised machine learning method that can predict side effects of analgesics for OA patients and identify high-risk features (RFs) of cardiovascular diseases caused by analgesics. Electronic Medical Records (EMRs) derived from public knee OA studies were used to train the model. The XGBoost model outperforms four well-known machine learning algorithms and identifies the risk features from the biomedical literature. In addition, the model can provide decision support for the use of analgesics in OA patients.

Conclusion

We used the XGBoost method, compared against other machine learning methods, to predict side effects of analgesics for OA patients from EMRs and to select the individual informative RFs. The model has good predictability and interpretability, which is valuable for both medical researchers and patients.

20.
Prediction of both conserved and nonconserved microRNA targets in animals   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: MicroRNAs (miRNAs) are involved in many diverse biological processes and may regulate the functions of thousands of genes. However, one major issue in miRNA studies is the lack of bioinformatics programs that accurately predict miRNA targets. Animal miRNAs have limited sequence complementarity to their gene targets, which makes it challenging to build target prediction models with high specificity. RESULTS: Here we present a new miRNA target prediction program based on support vector machines (SVMs) and a large microarray training dataset. By systematically analyzing public microarray data, we have identified statistically significant features that are important to target downregulation. Heterogeneous prediction features have been non-linearly integrated in an SVM machine learning framework to train our target prediction model, MirTarget2. About half of the predicted miRNA target sites in human are not conserved in other organisms. Our prediction algorithm has been validated with independent experimental data for its improved performance in predicting a large number of miRNA down-regulated gene targets. AVAILABILITY: All the predicted targets were imported into the online database miRDB, which is freely accessible at http://mirdb.org.
