Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Accurately estimating probabilities from observations is important for probabilistic approaches to problems in computational biology. In this paper we present a biologically motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method extends substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation, and it has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities from observed counts in an alignment and is shown to perform comparably to previous methods. The method is also applied to estimate probability distributions over protein families, where it improves protein classification accuracy.
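As a point of reference for estimating probabilities from observed counts, the sketch below shows the simplest Bayesian baseline: the posterior mean under a single Dirichlet prior (pseudocounts). It is not the mixture-of-common-ancestors method of the abstract, and the function name, the three-letter alphabet, and the pseudocount values are illustrative assumptions.

```python
def posterior_mean(counts, pseudocounts):
    """Posterior mean probabilities under one Dirichlet prior (assumed baseline).

    counts:       observed counts per symbol, e.g. from one alignment column
    pseudocounts: Dirichlet parameters; their keys define the alphabet
    """
    alphabet = list(pseudocounts)
    total = sum(counts.get(a, 0) + pseudocounts[a] for a in alphabet)
    return {a: (counts.get(a, 0) + pseudocounts[a]) / total for a in alphabet}


# Hypothetical column with three observed A, one L, and no V.
observed = {"A": 3, "L": 1}
prior = {"A": 0.5, "L": 0.5, "V": 0.5}
probs = posterior_mean(observed, prior)
print(probs)
```

The unobserved symbol V still receives a nonzero probability, which is the property that count-based estimators of this kind are designed to provide.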

3.
One common use of binary response regression methods is classification based on an arbitrary probability threshold dictated by the particular application. Since this threshold is given to us a priori, it is sensible to incorporate it into our estimation procedure. Specifically, for the linear logistic model, we solve a set of locally weighted score equations, using a kernel-like weight function centered at the threshold. The bandwidth for the weight function is selected by cross-validation of a novel hybrid loss function that combines classification error and a continuous measure of divergence between observed and fitted values; other possible cross-validation functions based on more common binary classification metrics are also examined. This work has much in common with robust estimation, but differs from previous approaches in this area in its focus on prediction, specifically classification into high- and low-risk groups. Simulation results are given showing the reduction in error rates that can be obtained with this method when compared with maximum likelihood estimation, especially under certain forms of model misspecification. Analysis of a melanoma dataset is presented to illustrate the use of the method in practice.

4.
The fundamental biological importance and complexity of allosterically regulated proteins stem from their central role in signal transduction and cellular processes. Recently, machine-learning approaches have been developed and actively deployed to facilitate theoretical and experimental studies of protein dynamics and allosteric mechanisms. In this review, we survey recent developments in applications of machine-learning methods for studies of allosteric mechanisms, prediction of allosteric effects and allostery-related physicochemical properties, and allosteric protein engineering. We also review the applications of machine-learning strategies for characterization of allosteric mechanisms and drug design targeting SARS-CoV-2. Continuous development and task-specific adaptation of machine-learning methods for protein allosteric mechanisms will have an increasingly important role in bridging a wide spectrum of data-intensive experimental and theoretical technologies.

5.
Accurate class probability estimation is important for medical decision making but is challenging, particularly when the number of candidate features exceeds the number of cases. Special methods have been developed for nonprobabilistic classification, but relatively little attention has been given to class probability estimation with numerous candidate variables. In this paper, we investigate overfitting in the development of regularized class probability estimators. We investigate the relation between overfitting and accurate class probability estimation in terms of mean square error. Using simulation studies based on real datasets, we found that some degree of overfitting can be desirable for reducing mean square error. We also introduce a mean square error decomposition for class probability estimation that helps clarify the relationship between overfitting and prediction accuracy.

6.
The use of penalized logistic regression for cancer classification using microarray expression data is presented. Two dimension reduction methods are each combined with penalized logistic regression so that both classification accuracy and computational speed are enhanced. Two other machine-learning methods, support vector machines and least-squares regression, were chosen for comparison. Our methods are shown to achieve equal or better results. They also have the advantage that the output probability can be given explicitly and the regression coefficients are easier to interpret. Several other aspects pertinent to the application of our methods to cancer classification, such as the selection of penalty parameters and components, are also discussed.

7.
彭哲也  唐紫珺  谢民主 《遗传》2018,40(3):218-226
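Complex diseases result from gene-gene and gene-environment interactions, and detecting high-dimensional gene interactions poses a major computational challenge. Over the past 20 years, machine-learning methods have been used to detect gene-gene interactions with some success. This article reviews research progress on machine-learning methods for detecting gene interactions. It systematically introduces the principles and limitations of neural networks (NN), random forest (RF), support vector machines (SVM), and multifactor dimensionality reduction (MDR) for detecting gene interactions in genome-wide association studies (GWAS), and discusses prospects for future research.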

8.
9.
10.
Statistical analysis on landmark-based shape spaces has diverse applications in morphometrics, medical diagnostics, machine vision and other areas. These shape spaces are non-Euclidean quotient manifolds. To conduct nonparametric inferences, one may define notions of centre and spread on this manifold and work with their estimates. However, it is useful to consider full likelihood-based methods, which allow nonparametric estimation of the probability density. This article proposes a broad class of mixture models constructed using suitable kernels on a general compact metric space and then on the planar shape space in particular. Following a Bayesian approach with a nonparametric prior on the mixing distribution, conditions are obtained under which the Kullback-Leibler property holds, implying large support and weak posterior consistency. Gibbs sampling methods are developed for posterior computation, and the methods are applied to problems in density estimation and classification with shape-based predictors. Simulation studies show improved estimation performance relative to existing approaches.

11.
The field of phylogenetic tree estimation has been dominated by three broad classes of methods: distance-based approaches, parsimony and likelihood-based methods (including maximum likelihood (ML) and Bayesian approaches). Here we introduce two new approaches to tree inference: pairwise likelihood estimation and a distance-based method that estimates the number of substitutions along the paths through the tree. Our results include the derivation of the formulae for the probability that two leaves will be identical at a site given a number of substitutions along the path connecting them. We also derive the posterior probability of the number of substitutions along a path between two sequences. The calculations for the posterior probabilities are exact for group-based, symmetric models of character evolution, but are only approximate for more general models.

12.
The estimation of genetic ancestry in human populations has important applications in medical genetic studies. Genetic ancestry is used to control for population stratification in genetic association studies and to understand the genetic basis for ethnic differences in disease susceptibility. In this review, we present an overview of genetic ancestry estimation in human disease studies, followed by a review of popular software and methods used for this estimation.

13.
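Epitope prediction is one of the important directions in immunoinformatics research and can provide important leads for experiments. A B-cell epitope, or antigenic determinant, is the part of an antigen that is specifically recognized and bound by a B-cell receptor or antibody. In fact, nearly 90% of B-cell epitopes are conformational. Even when the tertiary structure of the antigen protein is known, B-cell epitope prediction remains a major challenge. Using worked examples, this article describes the main current methods and algorithms for conformational B-cell epitope prediction: machine-learning prediction, non-machine-learning computational prediction, identification methods based on phage display data, and some general protein-protein interface prediction methods that can also be used for conformational B-cell epitope prediction. It also introduces the latest related prediction software and Web service resources and outlines future research trends.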

14.
Classification is one of the most widely applied tasks in ecology. Ecologists have to deal with noisy, high-dimensional data that often are non-linear and do not meet the assumptions of conventional statistical procedures. To overcome this problem, machine-learning methods have been adopted as ecological classification methods. We compared five machine-learning based classification techniques (classification trees, random forests, artificial neural networks, support vector machines, and automatically induced rule-based fuzzy models) in a biological conservation context. The study case was that of the ocellated turkey (Meleagris ocellata), a bird endemic to the Yucatan peninsula that has suffered considerable decreases in local abundance and distributional area during the last few decades. On a grid of 10 × 10 km cells superimposed on the peninsula, we analysed relationships between environmental and social explanatory variables and ocellated turkey abundance changes between 1980 and 2000. Abundance change was expressed in three classes (decrease, no change, and increase) and in 14 more detailed classes, respectively. Modelling performance varied considerably between methods, with random forests and classification trees being the most efficient ones as measured by overall classification error and the normalised mutual information index. Artificial neural networks yielded the worst results along with linear discriminant analysis, which was included as a conventional statistical approach. We evaluated not only classification accuracy but also characteristics such as time effort, classifier comprehensibility and method intricacy, aspects that determine the success of a classification technique among ecologists and conservation biologists as well as in communication with managers and decision makers.
We recommend the combined use of classification trees and random forests due to the easy interpretability of classifiers and the high comprehensibility of the method.

15.
Genome-wide case-control association studies aim at identifying significant differential markers between sick and healthy populations. With the development of large-scale technologies allowing the genotyping of thousands of single nucleotide polymorphisms (SNPs) comes the multiple testing problem and the practical issue of selecting the most probable set of associated markers. Several False Discovery Rate (FDR) estimation methods have been developed and tuned mainly for differential gene expression studies. However, they are based on hypotheses and designs that are not necessarily relevant in genetic association studies. In this article we present a universal methodology to estimate the FDR of genome-wide association results. It uses a single global probability value per SNP and is applicable in practice for any study design, using any statistic. We have benchmarked this algorithm on simulated data and shown that it outperforms previous methods in cases requiring non-parametric estimation. We exemplified the usefulness of the method by applying it to the analysis of experimental genotyping data of three Multiple Sclerosis case-control association studies.
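For readers unfamiliar with FDR control from one p-value per marker, the following minimal sketch implements the standard Benjamini-Hochberg adjustment. It is shown only as a well-known baseline, not as the methodology proposed in the abstract; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-SNP p-values.
snp_p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(snp_p_values)
print(q_values)
```

Markers whose adjusted value falls below a chosen level (for example 0.05) would be reported at that estimated FDR.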

16.
Current approaches to RNA structure prediction range from physics-based methods, which rely on thousands of experimentally measured thermodynamic parameters, to machine-learning (ML) techniques. While the methods for parameter estimation are successfully shifting toward ML-based approaches, the model parameterizations have so far remained fairly constant. We study the potential contribution of increasing the amount of information utilized by RNA folding prediction models to the improvement of their prediction quality. This is achieved by proposing novel models, which refine previous ones by examining more types of structural elements, and larger sequential contexts for these elements. Our proposed fine-grained models are made practical thanks to the availability of large training sets, advances in machine learning, and recent accelerations to RNA folding algorithms. We show that the application of more detailed models indeed improves prediction quality, while the corresponding running time of the folding algorithm remains fast. An additional important outcome of this experiment is a new RNA folding prediction model (coupled with a freely available implementation), which results in a significantly higher prediction quality than that of previous models. This final model has about 70,000 free parameters, several orders of magnitude more than previous models. Being trained and tested over the same comprehensive data sets, our model achieves a score of 84% according to the F1-measure over correctly predicted base pairs (i.e., 16% error rate), compared to the previously best reported score of 70% (i.e., 30% error rate). That is, the new model yields an error reduction of about 50%. Trained models and source code are available at www.cs.bgu.ac.il/~negevcb/contextfold.
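The "about 50%" figure follows from the two error rates quoted in the abstract (30% previously, 16% for the new model). A small worked check, with a hypothetical helper name:

```python
def relative_error_reduction(old_error, new_error):
    """Fraction of the old error that the new model removes."""
    return (old_error - new_error) / old_error


# Error rates quoted in the abstract: 30% before, 16% after.
reduction = relative_error_reduction(0.30, 0.16)
print(f"{reduction:.1%}")  # prints 46.7%
```

So the relative reduction is roughly 47%, which the abstract rounds to about 50%.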

17.
The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data requires the same caution as the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only do potential features easily outnumber samples by a factor of 10^3, but it is also easy to neglect information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers (e.g. Support Vector Machine (SVM)) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing the stability and predictive value of the resulting list of biomarkers is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies.
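The selection-bias point can be illustrated with a small sketch in which feature ranking is repeated inside every cross-validation split, so that held-out samples never influence which features are kept. This is not the DAP itself: the mean-difference ranking, the nearest-centroid classifier, the fold layout, and the toy data are all assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def rank_features(X, y):
    """Rank feature indices by |mean(class 1) - mean(class 0)|, largest first."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def predict(X_train, y_train, x, features):
    """Nearest-centroid prediction restricted to the selected features."""
    def squared_distance(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in features)
    return min((0, 1), key=squared_distance)


def cv_accuracy(X, y, k_folds=4, n_keep=1):
    """Cross-validated accuracy with feature selection redone in each fold."""
    n, correct = len(X), 0
    for fold in range(k_folds):
        train = [i for i in range(n) if i % k_folds != fold]
        test = [i for i in range(n) if i % k_folds == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Ranking sees training rows only, so no information leaks from test rows.
        features = rank_features(X_tr, y_tr)[:n_keep]
        correct += sum(predict(X_tr, y_tr, X[i], features) == y[i] for i in test)
    return correct / n


# Toy data: feature 0 tracks the label, features 1 and 2 are uninformative.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
samples = [[labels[i] + 0.01 * i, (i % 3) * 0.1, 0.5] for i in range(8)]
accuracy = cv_accuracy(samples, labels)
print(accuracy)
```

Ranking the features once on the full dataset and then cross-validating only the classifier would be the leaky variant that the abstract warns against.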

18.
Causal inference methods have been developed for longitudinal observational study designs where confounding is thought to occur over time. In particular, one may estimate and contrast the population mean counterfactual outcome under specific exposure patterns. In such contexts, confounders of the longitudinal treatment-outcome association are generally identified using domain-specific knowledge. However, this may leave an analyst with a large set of potential confounders that may hinder estimation. Previous approaches to data-adaptive model selection for this type of causal parameter were limited to the single time-point setting. We develop a longitudinal extension of a collaborative targeted minimum loss-based estimation (C-TMLE) algorithm that can be applied to perform variable selection in the models for the probability of treatment with the goal of improving the estimation of the population mean counterfactual outcome under a fixed exposure pattern. We investigate the properties of this method through a simulation study, comparing it to G-Computation and inverse probability of treatment weighting. We then apply the method in a real-data example to evaluate the safety of trimester-specific exposure to inhaled corticosteroids during pregnancy in women with mild asthma. The data for this study were obtained from the linkage of electronic health databases in the province of Quebec, Canada. The C-TMLE covariate selection approach allowed for a reduction of the set of potential confounders, which included baseline and longitudinal variables.

19.

Conservation translocations are increasingly used to manage threatened species and restore ecosystems. Translocations increase the risk of disease outbreaks in the translocated and recipient populations. Qualitative disease risk analyses have been used to assess the magnitude of any effect of disease, and the probability of the disease occurring, in association with a translocation. Currently, multiple alternative qualitative disease risk analysis packages are available to practitioners. Here we compare the ease of use, expertise required, transparency, and results of three different qualitative disease risk analyses, using a translocation of the endangered New Zealand passerine, the hihi (Notiomystis cincta), as a model. We show that the three methods use fundamentally different approaches to define hazards. Different methods are used to produce estimates of the risk from disease, and the estimates differ for the same hazards. Transparency of the process varies between methods, from no referencing or explanation of the evidence used to justify decisions through to full documentation of resources, decisions and assumptions made. Evidence to support decisions on the estimation of risk from disease is important because it enables knowledge acquired in the future, for example from translocation outcomes, to be used to improve risk estimation for future translocations. The information documenting each disease risk analysis differs, as does the emphasis of the questions asked within each package. The expertise required to commence a disease risk analysis varies: one method includes an action flow chart tailored for the non-wildlife health specialist, but in all three methods completing the disease risk analysis requires wildlife health specialists with epidemiological and pathological knowledge.
We show that disease risk analysis package choice may play a greater role in the overall risk estimation of the effect of disease on animal populations involved in a translocation than might previously have been realised.

