Similar Literature
20 similar documents retrieved.
1.
In the nucleotide substitution model for molecular evolution, a major task in the exploration of an evolutionary process is to estimate the substitution number per site of a protein or DNA sequence. The usual estimators are based on the observed proportion of differing sites between the two nucleotide sequences. A more objective approach, however, is to report a confidence interval that conveys precision rather than a point estimate alone. The conventional confidence intervals used in the literature for the substitution number are constructed by normal approximation. The construction and performance of confidence intervals for evolutionary models have received little attention in the literature. In this article, the performance of these conventional confidence intervals for the one-parameter and two-parameter models is explored. Results show that the coverage probabilities of these intervals are unsatisfactory when the true substitution number is small. Since the substitution number may be small in many evolutionary processes, the conventional confidence interval cannot provide accurate information in these cases. Improved confidence intervals for the one-parameter model with desirable coverage probability are proposed in this article. A numerical study shows that the new confidence intervals substantially improve on the conventional ones.
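For the one-parameter (Jukes-Cantor) model, the conventional interval criticized above is easy to sketch: the substitution number is estimated from the observed proportion of differing sites, and a normal-approximation interval follows from the delta method. Below is a minimal Python sketch; the function name `jc69_ci` and the example values are illustrative, not from the article.

```python
import math

def jc69_ci(p_hat, n, z=1.96):
    """Normal-approximation CI for the JC69 substitution number per site.
    p_hat: observed proportion of differing sites; n: number of sites.
    Standard error via the delta method."""
    if p_hat >= 0.75:
        raise ValueError("JC69 distance undefined for p >= 3/4")
    d_hat = -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)
    se = math.sqrt(p_hat * (1.0 - p_hat) / n) / (1.0 - 4.0 * p_hat / 3.0)
    return d_hat, max(0.0, d_hat - z * se), d_hat + z * se

print(jc69_ci(0.05, 500))   # a small true distance, the problematic regime
```

At small distances the lower bound is often truncated at zero, one visible symptom of the poor coverage the article reports.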

2.
A common problem in molecular phylogenetics is choosing a model of DNA substitution that explains the DNA sequence alignment well without introducing superfluous parameters. A number of methods have been used to choose among a small set of candidate substitution models, such as the likelihood ratio test, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Bayes factors. Current implementations of any of these criteria suffer from the limitation that only a small set of models is examined, or that the test does not allow easy comparison of non-nested models. In this article, we expand the pool of candidate substitution models to include all possible time-reversible models. This set includes seven models that have already been described. We show how Bayes factors can be calculated for these models using reversible jump Markov chain Monte Carlo, and apply the method to 16 DNA sequence alignments. For each data set, we compare the model with the best Bayes factor to the best models chosen using AIC and BIC. We find that the best model under any of these criteria is not necessarily the most complicated one; models with an intermediate number of substitution types typically do best. Moreover, almost all of the models chosen as best do not constrain a transition rate to equal a transversion rate, suggesting that the transition/transversion rate bias plays the largest role in determining which models are selected. Importantly, the reversible jump Markov chain Monte Carlo algorithm described here allows phylogeny (and other phylogenetic model parameters) to be estimated while accounting for uncertainty in the model of DNA substitution.
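For reference, the candidate pool here is the set of ways of partitioning the six exchangeability rates (AC, AG, AT, CG, CT, GT) into classes, which is counted by a Bell number. The sketch below (helper names are ours) counts the pool and states the standard AIC/BIC formulas used in the comparison.

```python
from math import comb, log

def bell(n):
    """Bell number via B(m+1) = sum_k C(m, k) B(k): set partitions of n items."""
    b = [1]
    for m in range(n):
        b.append(sum(comb(m, k) * b[k] for k in range(m + 1)))
    return b[n]

# The six exchangeability rates (AC, AG, AT, CG, CT, GT) can be grouped into
# rate classes in B(6) ways: the full pool of time-reversible models.
print(bell(6))   # 203

def aic(lnL, k):           # Akaike Information Criterion
    return -2 * lnL + 2 * k

def bic(lnL, k, n_sites):  # Bayesian Information Criterion
    return -2 * lnL + k * log(n_sites)
```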

3.
A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees. This algorithm is similar to the standard EM method for edge-length estimation, except that during iterations of the Structural EM algorithm the topology is improved as well as the edge lengths. Our algorithm alternates between two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that searching for better topologies inside the M-step can be done efficiently, in contrast to standard methods for topology search. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. The convergence point, however, can be a suboptimal one. To escape such "local optima," we further enhance our basic EM procedure by incorporating moves in the flavor of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that, for protein sequences, even our basic algorithm finds more plausible trees than existing methods for searching maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.

4.
The hepatitis B virus (HBV) has a circular DNA genome of about 3,200 base pairs. Economical use of the genome, with overlapping reading frames, may have led to severe constraints on nucleotide substitutions along the genome and to highly variable substitution rates among nucleotide sites. Nucleotide sequences from 13 complete HBV genomes were compared to examine such variability of substitution rates among sites and to examine the phylogenetic relationships among the HBV variants. The maximum likelihood method was employed to fit models of DNA sequence evolution that can account for the complexity of the pattern of nucleotide substitution. Comparison of the models suggests that the rates of substitution differ among genes and codon positions; for example, the third codon position changes at a rate over ten times higher than the second. Furthermore, substantial variation of substitution rates was detected even after the effects of genes and codon positions were corrected for; that is, rates differ at different sites of the same gene or at the same codon position. Such corrected rates were also found to be positively correlated at adjacent sites, indicating the existence of conserved and variable domains in the proteins encoded by the viral genome. A multiparameter model validates the earlier finding that the variation in nucleotide conservation is not random around the HBV genome. The test for the existence of a molecular clock suggests that substitution rates are more or less constant among lineages. The phylogenetic relationships among the viral variants were also examined. Although the data do not seem to contain sufficient information to resolve the details of the phylogeny, it appears quite certain that the serotypes of the viral variants do not reflect their genetic relatedness.

5.
Studies are carried out on the uniqueness of the stationary point of the likelihood function for estimating molecular phylogenetic trees, yielding proof that there exists at most one stationary point, i.e., the maximum point, in the parameter range for the one-parameter model of nucleotide substitution. The proof is simple yet applicable to any type of tree topology with an arbitrary number of operational taxonomic units (OTUs). The proof ensures that any valid approximation algorithm will reach the unique maximum point under the conditions mentioned above. An algorithm incorporating Newton's approximation method is then compared with the conventional one by computer simulation. The results show that the newly developed algorithm always requires less CPU time than the conventional one, while both algorithms lead to identical molecular phylogenetic trees, in accordance with the proof.
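As a toy illustration of the Newton approach for the one-parameter model, the pairwise maximum likelihood substitution number solves p(d) = x/n with p(d) = (3/4)(1 - exp(-4d/3)), and Newton's method finds that root rapidly. This sketch shows the idea only, not the paper's tree-wide algorithm; names and values are illustrative.

```python
import math

def jc69_mle_newton(x, n, d0=0.1, tol=1e-10, max_iter=50):
    """Newton iteration for the ML substitution number under the one-parameter
    model: find d solving p(d) = x/n, with p(d) = (3/4)(1 - exp(-4d/3))."""
    p_hat = x / n
    assert p_hat < 0.75, "MLE does not exist for p >= 3/4"
    d = d0
    for _ in range(max_iter):
        f = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0)) - p_hat
        d -= f / math.exp(-4.0 * d / 3.0)   # f'(d) = exp(-4d/3)
        if abs(f) < tol:
            break
    return d

print(jc69_mle_newton(25, 500))   # ~0.0517, matching -0.75*log(1 - 4p/3)
```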

6.

Background  

In recent years, model-based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than a maximum likelihood model being derived from the dataset under examination. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner.

7.
Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of the amino acids being exchanged. Recent developments have shown that models which allow amino acid pairs to have independent rates of substitution offer improved fit over single-rate models. However, these approaches have been limited by the need for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. The GA is used to assign amino acid substitution pairs to a series of rate classes, the number of which is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies, and branch lengths, are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution, and thus gene- and organism-specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, making it useful for the evolutionary fingerprinting of genes.
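A heavily simplified sketch of the GA encoding follows: each of the 190 amino acid pairs is assigned to one of a few rate classes, and the population of assignments is mutated and selected. The fitness used here (within-class variance of placeholder "observed" rates) is only a stand-in; the paper scores assignments by phylogenetic maximum likelihood.

```python
import random

random.seed(1)

# All 190 unordered amino acid pairs; an "individual" assigns each pair to a class.
PAIRS = [(i, j) for i in range(20) for j in range(i + 1, 20)]

def fitness(assign, observed):
    # Stand-in fitness: reward assignments whose classes group similar observed
    # exchange rates (the real method scores assignments by likelihood).
    classes = {}
    for pair, c in zip(PAIRS, assign):
        classes.setdefault(c, []).append(observed[pair])
    score = 0.0
    for vals in classes.values():
        m = sum(vals) / len(vals)
        score -= sum((v - m) ** 2 for v in vals)  # penalize within-class spread
    return score

def mutate(assign, n_classes, rate=0.02):
    # Reassign each pair to a random class with small probability.
    return [random.randrange(n_classes) if random.random() < rate else c
            for c in assign]

def ga(observed, n_classes=4, pop=30, gens=200):
    popn = [[random.randrange(n_classes) for _ in PAIRS] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda a: fitness(a, observed), reverse=True)
        elite = popn[: pop // 2]                  # keep the better half
        popn = elite + [mutate(random.choice(elite), n_classes) for _ in elite]
    return popn[0]

# Usage with synthetic 'observed' rates:
# observed = {pair: random.random() for pair in PAIRS}; best = ga(observed)
```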

8.
MOTIVATION: The computation of large phylogenetic trees with statistical models such as maximum likelihood or Bayesian inference is computationally extremely intensive. It has repeatedly been demonstrated that these models recover the true tree, or a tree topologically closer to the true tree, more frequently than less elaborate methods such as parsimony or neighbor joining. Due to the combinatorial and computational complexity, the size of trees that can be computed on a biologist's PC workstation within reasonable time is limited to approximately 100 taxa. RESULTS: In this paper we present the latest release of our program RAxML-III for rapid maximum likelihood-based inference of large evolutionary trees, which allows for computation of 1,000-taxon trees in less than 24 hours on a single PC processor. We compare RAxML-III to the currently fastest implementations for maximum likelihood and Bayesian inference: PHYML and MrBayes. Whereas RAxML-III performs worse than PHYML and MrBayes on synthetic data, it clearly outperforms both programs on all real data alignments used, in terms of both speed and final likelihood values. AVAILABILITY AND SUPPLEMENTARY INFORMATION: RAxML-III, including all alignments and final trees mentioned in this paper, is freely available as open source code at http://wwwbode.cs.tum/~stamatak CONTACT: stamatak@cs.tum.edu

9.
This paper presents various algorithmic approaches for computing the maximum likelihood estimator of the mixing distribution of a one-parameter family of densities, and provides a unifying computer-oriented framework for the statistical analysis of unobserved heterogeneity (i.e., observations stemming from different subpopulations) in a univariate sample. Both the case with an unknown number of population subgroups and the case with a known number are considered in the computer package C.A.MAN (Computer Assisted Mixture Analysis), with emphasis on the former. It includes an algorithmic menu with choices of the EM algorithm, the vertex exchange algorithm, a combination of both, and the vertex direction method. To ensure reliable convergence, a step-length menu is provided for the three latter methods, each achieving monotonicity for the direction of choice. C.A.MAN also has the option to work with restricted support size, that is, the case in which the number of components is known a priori. In the latter case, the EM algorithm is used. Applications of mixture modelling to medical problems are discussed.
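For the restricted-support case (number of components known a priori), the EM algorithm alternates posterior membership probabilities (E-step) with weighted parameter updates (M-step). Below is a minimal sketch for a Poisson mixture, a typical one-parameter family in this setting; this is an illustration of the algorithm, not C.A.MAN code.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)

def em_poisson_mixture(ys, lams, weights, n_iter=200):
    """EM for a Poisson mixture with a fixed number of components
    (the restricted-support case). lams/weights hold starting values."""
    k = len(lams)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation
        resp = []
        for y in ys:
            num = [w * poisson_pmf(y, l) for w, l in zip(weights, lams)]
            tot = sum(num)
            resp.append([v / tot for v in num])
        # M-step: weighted updates of mixing weights and component means
        for j in range(k):
            rj = [r[j] for r in resp]
            s = sum(rj)
            weights[j] = s / len(ys)
            lams[j] = sum(r * y for r, y in zip(rj, ys)) / s
    return lams, weights

ys = [0, 1, 1, 2, 7, 8, 9, 10]   # two apparent subpopulations
print(em_poisson_mixture(ys, lams=[1.0, 5.0], weights=[0.5, 0.5]))
```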

10.
Codon-based substitution models have been widely used to identify amino acid sites under positive selection in comparative analysis of protein-coding DNA sequences. The nonsynonymous/synonymous substitution rate ratio (dN/dS, denoted ω) is used as a measure of selective pressure at the protein level, with ω > 1 indicating positive selection. Statistical distributions are used to model the variation in ω among sites, allowing a subset of sites to have ω > 1 while the rest of the sequence may be under purifying selection with ω < 1. An empirical Bayes (EB) approach is then used to calculate posterior probabilities that a site comes from the site class with ω > 1. Current implementations, however, use the naive EB (NEB) approach and fail to account for sampling errors in maximum likelihood estimates of model parameters, such as the proportions and ω ratios for the site classes. In small data sets lacking information, this approach may lead to unreliable posterior probability calculations. In this paper, we develop a Bayes empirical Bayes (BEB) approach to the problem, which assigns a prior to the model parameters and integrates over their uncertainties. We compare the new and old methods on real and simulated data sets. The results suggest that in small data sets the new BEB method does not generate false positives as the old NEB approach did, while in large data sets it retains the good power of the NEB approach for inferring positively selected sites.
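The NEB calculation the paper improves on is a plug-in Bayes rule: the posterior that a site belongs to class k is p_k f(x_h | ω_k) normalized over classes, with the p_k and ω_k fixed at their maximum likelihood estimates. A minimal sketch with placeholder likelihood values (the BEB fix instead integrates over a prior on these parameters):

```python
def neb_posteriors(class_probs, site_likelihoods):
    """class_probs: ML estimates of the class proportions p_k.
    site_likelihoods: f(x_h | omega_k) for one site, one value per class."""
    joint = [p * f for p, f in zip(class_probs, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Three site classes with omega = 0.1, 1.0, 4.2 (the last one under positive
# selection); the likelihood values are illustrative placeholders.
print(neb_posteriors([0.6, 0.3, 0.1], [1e-5, 1.5e-5, 9e-5]))
```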

11.
Estimation of variance components by Monte Carlo (MC) expectation maximization (EM) restricted maximum likelihood (REML) is computationally efficient for large data sets and complex linear mixed effects models. However, efficiency may be lost due to the large number of iterations the EM algorithm requires. To decrease computing time, we explored the use of faster-converging Newton-type algorithms within MC REML implementations. The implemented algorithms were: MC Newton-Raphson (NR), where the information matrix was generated via sampling; MC average information (AI), where the information was computed as an average of observed and expected information; and MC Broyden's method, where the zero of the gradient was sought using a quasi-Newton-type algorithm. Performance of these algorithms was evaluated using simulated data. The final estimates were in good agreement with the corresponding analytical ones. MC NR REML and MC AI REML converged faster than MC EM REML and gave standard errors for the estimates as a by-product. MC NR REML required a larger number of MC samples, while each MC AI REML iteration demanded extra solutions of the mixed model equations, one per parameter to be estimated. MC Broyden's method required the largest number of MC samples with our small data set and did not give standard errors for the parameters directly. We studied the performance of three different convergence criteria for the MC AI REML algorithm. Our results indicate the importance of defining a suitable convergence criterion and critical value in order to obtain an efficient Newton-type method utilizing a MC algorithm. Overall, the use of MC algorithms with Newton-type methods proved feasible, and the results encourage testing of these methods on other kinds of large-scale problems.

12.
Zeh J, Poole D, Miller G, Koski W, Baraff L, Rugh D. Biometrics 2002, 58(4):832-840
Annual survival probability of bowhead whales, Balaena mysticetus, was estimated using both Bayesian and maximum likelihood implementations of Cormack and Jolly-Seber (JS) models for capture-recapture estimation in open populations, as well as reduced-parameter generalizations of these models. Aerial photographs of naturally marked bowheads collected between 1981 and 1998 provided the data. The marked whales first photographed in a particular year provided the initial 'capture' and 'release' of those whales, and photographs in subsequent years provided the 'recaptures'. The Cormack model, often called the Cormack-Jolly-Seber (CJS) model, and the program MARK were used to identify the model with a single survival probability and time-varying capture probabilities as the most appropriate for these data. When survival was constrained to be at most one, the maximum likelihood estimate computed by MARK was one, invalidating confidence interval computations based on the asymptotic standard error or profile likelihood. A Bayesian Markov chain Monte Carlo (MCMC) implementation of the model was therefore used to produce a posterior distribution for annual survival. The corresponding reduced-parameter JS model was also fit via MCMC because it is the more appropriate of the two models for these photoidentification data. Because the CJS model ignores much of the information on capture probabilities provided by the data, its results are less precise and more sensitive to the prior distributions used than results from the JS model. With uniform priors on (0, 1) for annual survival and capture probabilities, the posterior mean for the bowhead survival rate from the JS model is 0.984, and 95% of the posterior probability lies between 0.948 and 1. This high estimated survival rate is consistent with other bowhead life history data.
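The likelihood underlying the CJS analysis is built from per-whale capture histories. A minimal sketch of one history's probability, conditional on first capture, under constant survival and time-varying capture probabilities (parameter values are illustrative):

```python
def cjs_history_prob(h, phi, p):
    """Probability of one capture history under the CJS model, conditional on
    first capture. h: 0/1 sightings per occasion; phi: constant annual
    survival; p: capture probability for each occasion."""
    T = len(h)
    first = h.index(1)
    last = T - 1 - h[::-1].index(1)
    # chi[t]: probability of never being seen after occasion t
    chi = [1.0] * T
    for t in range(T - 2, -1, -1):
        chi[t] = (1 - phi) + phi * (1 - p[t + 1]) * chi[t + 1]
    prob = 1.0
    for t in range(first, last):
        prob *= phi * (p[t + 1] if h[t + 1] else 1 - p[t + 1])
    return prob * chi[last]

# A whale photographed in year 1, missed in year 2, re-photographed twice:
print(cjs_history_prob([1, 0, 1, 1], phi=0.98, p=[0.3, 0.25, 0.4, 0.35]))
```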

13.
Unbalanced repeated-measures models with structured covariance matrices
The question of how to analyze unbalanced or incomplete repeated-measures data is a common problem facing analysts. We address this problem through maximum likelihood analysis using a general linear model for expected responses and arbitrary structural models for the within-subject covariances. Models that can be fit include standard univariate and multivariate models with incomplete data, random-effects models, and models with time-series and factor-analytic error structures. We describe Newton-Raphson and Fisher scoring algorithms for computing maximum likelihood estimates, and generalized EM algorithms for computing restricted and unrestricted maximum likelihood estimates. An example fitting several models to a set of growth data is included.
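The Newton-Raphson and Fisher scoring iterations mentioned here share the generic update theta <- theta + I(theta)^{-1} U(theta), with U the score and I the (expected) information. A generic sketch with placeholder score/information callables, plus a toy usage on a normal mean with known variance:

```python
import numpy as np

def fisher_scoring(theta, score, info, n_iter=50, tol=1e-8):
    """Generic scoring iteration: theta <- theta + solve(I(theta), U(theta)).
    score/info are model-specific callables supplied by the user."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iter):
        step = np.linalg.solve(info(theta), score(theta))
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Toy usage: MLE of a normal mean with known variance 2.0 (converges in one step)
ys = np.array([1.2, 0.8, 1.5, 0.9])
mu = fisher_scoring([0.0],
                    score=lambda t: np.array([ys.sum() - len(ys) * t[0]]) / 2.0,
                    info=lambda t: np.array([[len(ys) / 2.0]]))
print(mu)   # [1.1], the sample mean
```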

14.
This article investigates maximum likelihood estimation with saturated and unsaturated models for correlated exchangeable binary data, when a sample of independent clusters of varying sizes is available. We discuss various parameterizations of these models, and propose using the EM algorithm to obtain maximum likelihood estimates. The methodology is illustrated by applications to a study of familial disease aggregation and to the design of a proposed group randomized cancer prevention trial.

15.
Schafer DW. Biometrics 2001, 57(1):53-61
This paper presents an EM algorithm for semiparametric likelihood analysis of linear, generalized linear, and nonlinear regression models with measurement errors in explanatory variables. A structural model is used in which probability distributions are specified for (a) the response and (b) the measurement error. A distribution is also assumed for the true explanatory variable but is left unspecified and is estimated by nonparametric maximum likelihood. For various types of extra information about the measurement error distribution, the proposed algorithm makes use of available routines that would be appropriate for likelihood analysis of (a) and (b) if the true x were available. Simulations suggest that the semiparametric maximum likelihood estimator retains a high degree of efficiency relative to the structural maximum likelihood estimator based on correct distributional assumptions and can outperform maximum likelihood based on an incorrect distributional assumption. The approach is illustrated on three examples with a variety of structures and types of extra information about the measurement error distribution.

16.
Nonlinear reproductive dispersion models with random effects are a generalization and extension of exponential-family nonlinear random-effects models and nonlinear reproductive dispersion models. By treating the random effects in the model as hypothetical missing data and applying the Metropolis-Hastings (MH) algorithm, a Monte Carlo EM (MCEM) algorithm is proposed for maximum likelihood estimation of the model parameters, and a simulation study and a real-data example demonstrate the feasibility of the algorithm.
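A toy sketch of the MCEM mechanics just described: the random effects are treated as missing data and sampled by Metropolis-Hastings in the Monte Carlo E-step, and the variance component is updated in the M-step. The Gaussian random-intercept model below is chosen only for compactness (its E-step is actually tractable); all names and settings are illustrative.

```python
import math, random

random.seed(0)
VE = 1.0  # known residual variance, an assumption of this toy example

def log_joint(b, ys, vb):
    # log p(b) + log p(ys | b) for the random-intercept Gaussian model
    lp = -0.5 * b * b / vb
    lp += sum(-0.5 * (y - b) ** 2 / VE for y in ys)
    return lp

def mh_sample_b(ys, vb, n=200, step=0.5):
    # Metropolis-Hastings draws of one subject's random effect
    b, draws = 0.0, []
    for _ in range(n):
        prop = b + random.gauss(0, step)
        if math.log(random.random()) < log_joint(prop, ys, vb) - log_joint(b, ys, vb):
            b = prop
        draws.append(b)
    return draws[n // 2:]   # discard burn-in

def mcem(data, vb=1.0, iters=30):
    for _ in range(iters):
        draws = [mh_sample_b(ys, vb) for ys in data]           # MC E-step
        vb = sum(sum(d * d for d in ds) / len(ds) for ds in draws) / len(data)
    return vb                                                  # M-step update

print(mcem([[1.1, 0.9], [-2.0, -1.7], [0.3, 0.5]]))
```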

17.
We have developed a rapid parsimony method for reconstructing ancestral nucleotide states that allows calculation of initial branch lengths that are good approximations to optimal maximum-likelihood estimates under several commonly used substitution models. Use of these approximate branch lengths (rather than fixed arbitrary values) as starting points significantly reduces the time required for iteration to a solution that maximizes the likelihood of a tree. These branch lengths are close enough to the optimal values that they can be used without further iteration to calculate approximate maximum-likelihood scores that are very close to the "exact" scores found by iteration. Several strategies are described for using these approximate scores to substantially reduce times needed for maximum-likelihood tree searches.

18.
In recent years, likelihood ratio tests (LRTs) based on DNA and protein sequence data have been proposed for testing various evolutionary hypotheses. Because conducting an LRT requires an evolutionary model of nucleotide or amino acid substitution, which is almost always unknown, it becomes important to investigate the robustness of LRTs to violations of the assumptions of these evolutionary models. Computer simulation was used to examine the performance of LRTs of the molecular clock, transition/transversion bias, and among-site rate variation under different substitution models. The results showed that when correct models are used, LRTs perform quite well even when the DNA sequences are as short as 300 nt. However, LRTs were found to be biased under incorrect models. The extent of bias varies considerably, depending on the hypotheses tested, the substitution models assumed, and the lengths of the sequences used, among other things. A preliminary simulation study also suggests that LRTs based on parametric bootstrapping may be more sensitive to substitution models than standard LRTs are. When an assumed substitution model is grossly wrong and a more realistic model is available, LRTs can often reject the wrong model; thus, the performance of LRTs may be improved by using a more appropriate model. On the other hand, many factors of molecular evolution have not been considered in any substitution model built so far, and the possible influence of this neglect on LRTs is often overlooked. The dependence of LRTs on substitution models calls for caution in interpreting test results and highlights the importance of clarifying the substitution patterns of genes and proteins and of building more realistic models.
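The standard LRT discussed here compares twice the log-likelihood difference to a chi-square distribution with degrees of freedom equal to the number of extra parameters; for the molecular-clock test with s sequences that is s - 2. A minimal sketch with illustrative log-likelihood values:

```python
from scipy.stats import chi2

def lrt(lnL_null, lnL_alt, df):
    """Standard LRT: 2*(lnL1 - lnL0) is compared to chi^2 with df equal to
    the number of extra free parameters under the alternative."""
    stat = 2.0 * (lnL_alt - lnL_null)
    return stat, chi2.sf(stat, df)

# Molecular-clock test on s sequences: the non-clock tree has s - 2 extra
# branch-length parameters. Log-likelihoods below are illustrative.
s = 10
stat, p = lrt(lnL_null=-2475.3, lnL_alt=-2462.1, df=s - 2)
print(stat, p)   # reject the clock if p is small
```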

19.
Maximum likelihood methods for cure rate models with missing covariates
Chen MH, Ibrahim JG. Biometrics 2001, 57(1):43-52
We propose maximum likelihood methods for parameter estimation for a novel class of semiparametric survival models with a cure fraction, in which the covariates are allowed to be missing. We allow the covariates to be either categorical or continuous and specify a parametric distribution for the covariates that is written as a sequence of one-dimensional conditional distributions. We propose a novel EM algorithm for maximum likelihood estimation and derive standard errors by using Louis's formula (Louis, 1982, Journal of the Royal Statistical Society, Series B 44, 226-233). Computational techniques using the Monte Carlo EM algorithm are discussed and implemented. A real data set involving a melanoma cancer clinical trial is examined in detail to demonstrate the methodology.

20.
Several forms of maximum likelihood models are applied to aligned amino acid sequence data coded for in the mitochondrial DNA of six species (chicken, frog, human, bovine, mouse, and rat). These models range in form from relatively simple models of the type currently used for inferring phylogenetic tree structure to models more complex than any used previously. No major discrepancies are found between the optimal trees inferred by any of these methods, but there are huge differences in adequacy of fit. A very significant finding is that the fit of any of these models is vastly improved by allowing a certain proportion of the amino acid sites to be invariant. An even more important, although disquieting, finding is that none of these models fits well, as judged by standard statistical criteria. The primary reason is that amino acid sites undergo substitution according to a very heterogeneous process. Because most phylogenetic inference is accomplished by choosing the optimal tree under the assumption that a homogeneous process acts on the sites, this article's results raise the possibility that some such conclusions are invalid. The seriousness of this problem depends upon the robustness of the phylogenetic inference procedure to departures from the underlying model.
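The invariant-sites device that so improves the fit is a two-class mixture at each site: with probability p_inv the site is invariant, which is possible only if it is observed constant (likelihood equal to the frequency of its single state); otherwise it evolves under the ordinary model. A minimal sketch with placeholder values:

```python
import math

def site_lnL_pinv(is_constant, pi_char, f_var, p_inv):
    """is_constant: the site shows a single state; pi_char: equilibrium
    frequency of that state; f_var: site likelihood under the ordinary
    substitution model; p_inv: proportion of invariant sites."""
    L = (1 - p_inv) * f_var
    if is_constant:
        L += p_inv * pi_char
    return math.log(L)

# A constant alanine site versus a variable site (placeholder values):
print(site_lnL_pinv(True, 0.08, 1e-4, p_inv=0.3))
print(site_lnL_pinv(False, 0.0, 3e-6, p_inv=0.3))
```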
