Similar literature (20 results)
1.
Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or nonnested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
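
The AIC comparison this abstract describes can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the model names, log-likelihoods and parameter counts below are hypothetical, and in a real analysis the parameter count would usually also include branch lengths.

```python
import math

def akaike_weights(log_likelihoods, num_params):
    """Return (AIC scores, AIC differences, Akaike weights) for candidate models."""
    # AIC = 2k - 2 lnL, where k is the number of free parameters
    aic = [2 * k - 2 * lnl for lnl, k in zip(log_likelihoods, num_params)]
    best = min(aic)
    delta = [a - best for a in aic]
    # Relative likelihood of each model given the data, then normalise to sum to 1
    rel = [math.exp(-0.5 * d) for d in delta]
    total = sum(rel)
    weights = [r / total for r in rel]
    return aic, delta, weights

# Hypothetical values for illustration only: (log-likelihood, free parameters)
models = {"JC": (-2150.0, 0), "HKY": (-2100.0, 4), "GTR": (-2098.0, 8)}
names = list(models)
aic, delta, weights = akaike_weights(
    [models[n][0] for n in names],
    [models[n][1] for n in names],
)

for name, a, d, w in zip(names, aic, delta, weights):
    print(f"{name:4s} AIC={a:8.1f} delta={d:6.1f} weight={w:.4f}")
```

Because the weights sum to one across the candidate set, nested and nonnested models can be ranked together, which is the property the abstract contrasts with pairwise likelihood ratio tests.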

2.
Models of sequence evolution play an important role in molecular evolutionary studies. The use of inappropriate models of evolution may bias the results of the analysis and lead to erroneous conclusions. Several procedures for selecting the best-fit model of evolution for the data at hand have been proposed, like the likelihood ratio test (LRT) and the Akaike (AIC) and Bayesian (BIC) information criteria. The relative performance of these model-selecting algorithms has not yet been studied under a range of different model trees. In this study, the influence of branch length variation upon model selection is characterized. This is done by simulating sequence alignments under a known model of nucleotide substitution, and recording how often this true model is recovered by different model-fitting strategies. Results of this study agree with previous simulations and suggest that model selection is reasonably accurate. However, different model selection methods showed distinct levels of accuracy. Some LRT approaches showed better performance than the AIC or BIC information criteria. Within the LRTs, model selection is affected by the complexity of the initial model selected for the comparisons, and only slightly by the order in which different parameters are added to the model. A specific hierarchy of LRTs, which starts from a simple model of evolution, performed overall better than other possible LRT hierarchies, or than the AIC or BIC. Received: 2 October 2000 / Accepted: 4 January 2001
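
A single step of a likelihood ratio test hierarchy might look like the sketch below. It assumes the two models are nested and that the test statistic follows a plain chi-square distribution with degrees of freedom equal to the number of extra parameters. That assumption does not hold when a parameter is fixed at a boundary, as with some rate-heterogeneity parameters. The log-likelihoods are hypothetical, and the chi-square tail is computed from the standard closed-form series for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    z = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # standard normal density at chi
    if df % 2 == 1:
        # Odd df: two-sided normal tail plus a finite series
        total = math.erfc(chi / math.sqrt(2))
        term = chi
        for r in range(1, (df - 1) // 2 + 1):
            if r > 1:
                term *= x / (2 * r - 1)
            total += 2 * z * term
        return total
    # Even df: exp(-x/2) times a finite series
    series, term = 1.0, 1.0
    for r in range(1, (df - 2) // 2 + 1):
        term *= x / (2 * r)
        series += term
    return math.sqrt(2 * math.pi) * z * series

def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) for a nested pair of models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods: the alternative adds one free parameter
stat, p_value = likelihood_ratio_test(-2105.0, -2100.0, df=1)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.4g}")
```

In a hierarchy, the outcome of each such test decides which pair of models is compared next, which is why the starting model and the order of tests can influence the final choice.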

3.
Murphy and colleagues reported that the mammalian phylogeny was resolved by Bayesian phylogenetics. However, the DNA sequences they used had many alignment gaps and undetermined nucleotide sites. We therefore reanalyzed their data by minimizing unshared nucleotide sites and retaining as many species as possible (13 species). In constructing phylogenetic trees, we used the Bayesian, maximum likelihood (ML), maximum parsimony (MP), and neighbor-joining (NJ) methods with different substitution models. These trees were constructed by using both protein and DNA sequences. The results showed that the posterior probabilities for Bayesian trees were generally much higher than the bootstrap values for ML, MP, and NJ trees. Two different Bayesian topologies for the same set of species were sometimes supported by high posterior probabilities, implying that two different topologies can be judged to be correct by Bayesian phylogenetics. This suggests that the posterior probability in Bayesian analysis can be excessively high as an indication of statistical confidence and therefore Murphy et al.'s tree, which largely depends on Bayesian posterior probability, may not be correct.

4.
In order to have confidence in model-based phylogenetic analysis, the model of nucleotide substitution adopted must be selected in a statistically rigorous manner. Several model-selection methods are applicable to maximum likelihood (ML) analysis, including the hierarchical likelihood-ratio test (hLRT), Akaike information criterion (AIC), Bayesian information criterion (BIC), and decision theory (DT), but their performance relative to empirical data has not been investigated thoroughly. In this study, we use 250 phylogenetic data sets obtained from TreeBASE to examine the effects that choice in model selection has on ML estimation of phylogeny, with an emphasis on optimal topology, bootstrap support, and hypothesis testing. We show that the use of different methods leads to the selection of two or more models for approximately 80% of the data sets and that the AIC typically selects more complex models than alternative approaches. Although ML estimation with different best-fit models results in incongruent tree topologies approximately 50% of the time, these differences are primarily attributable to alternative resolutions of poorly supported nodes. Furthermore, topologies and bootstrap values estimated with ML using alternative statistically supported models are more similar to each other than to topologies and bootstrap values estimated with ML under the Kimura two-parameter (K2P) model or maximum parsimony (MP). In addition, Swofford-Olsen-Waddell-Hillis (SOWH) tests indicate that ML trees estimated with alternative best-fit models are usually not significantly different from each other when evaluated with the same model. However, ML trees estimated with statistically supported models are often significantly suboptimal to ML trees made with the K2P model when both are evaluated with K2P, indicating that not all models perform in an equivalent manner. 
Nevertheless, the use of alternative statistically supported models generally does not affect tests of monophyletic relationships under either the Shimodaira-Hasegawa (S-H) or SOWH methods. Our results suggest that although choice in model selection has a strong impact on optimal tree topology, it rarely affects evolutionary inferences drawn from the data because differences are mainly confined to poorly supported nodes. Moreover, since ML with alternative best-fit models tends to produce more similar estimates of phylogeny than ML under the K2P model or MP, the use of any statistically based model-selection method is vastly preferable to forgoing the model-selection process altogether.

5.
Likelihood methods for detecting temporal shifts in diversification rates
Maximum likelihood is a potentially powerful approach for investigating the tempo of diversification using molecular phylogenetic data. Likelihood methods distinguish between rate-constant and rate-variable models of diversification by fitting birth-death models to phylogenetic data. Because model selection in this context is a test of the null hypothesis that diversification rates have been constant over time, strategies for selecting best-fit models must minimize Type I error rates while retaining power to detect rate variation when it is present. Here I examine model selection, parameter estimation, and power to reject the null hypothesis using likelihood models based on the birth-death process. The Akaike information criterion (AIC) has often been used to select among diversification models; however, I find that selecting models based on the lowest AIC score leads to a dramatic inflation of the Type I error rate. When appropriately corrected to reduce Type I error rates, the birth-death likelihood approach performs as well as or better than the widely used gamma statistic, at least when diversification rates have shifted abruptly over time. Analyses of datasets simulated under a range of rate-variable diversification scenarios indicate that the birth-death likelihood method has much greater power to detect variation in diversification rates when extinction is present. Furthermore, this method appears to be the only approach available that can distinguish between a temporal increase in diversification rates and a rate-constant model with nonzero extinction. I illustrate use of the method by analyzing a published phylogeny for Australian agamid lizards.

6.
The use of parameter-rich substitution models in molecular phylogenetics has been criticized on the basis that these models can cause a reduction both in accuracy and in the ability to discriminate among competing topologies. We have explored the relationship between nucleotide substitution model complexity and nonparametric bootstrap support under maximum likelihood (ML) for six data sets for which the true relationships are known with a high degree of certainty. We also performed equally weighted maximum parsimony analyses in order to assess the effects of ignoring branch length information during tree selection. We observed that maximum parsimony gave the lowest mean estimate of bootstrap support for the correct set of nodes relative to the ML models for every data set except one. For several data sets, we established that the exact distribution used to model among-site rate variation was critical for a successful phylogenetic analysis. Site-specific rate models were shown to perform very poorly relative to gamma and invariable sites models for several of the data sets most likely because of the gross underestimation of branch lengths. The invariable sites model also performed poorly for several data sets where this model had a poor fit to the data, suggesting that addition of the gamma distribution can be critical. Estimates of bootstrap support for the correct nodes often increased under gamma and invariable sites models relative to equal rates models. Our observations are contrary to the prediction that such models cause reduced confidence in phylogenetic hypotheses. Our results raise several issues regarding the process of model selection, and we briefly discuss model selection uncertainty and the role of sensitivity analyses in molecular phylogenetics.

7.

Background  

Explicit evolutionary models are required in maximum-likelihood and Bayesian inference, the two methods that are overwhelmingly used in phylogenetic studies of DNA sequence data. Appropriate selection of nucleotide substitution models is important because the use of incorrect models can mislead phylogenetic inference. To better understand the performance of different model-selection criteria, we used 33,600 simulated data sets to analyse the accuracy, precision, dissimilarity, and biases of the hierarchical likelihood-ratio test, Akaike information criterion, Bayesian information criterion, and decision theory.

8.
We review recent models to estimate phylogenetic trees under the multispecies coalescent. Although the distinction between gene trees and species trees has come to the fore of phylogenetics, only recently have methods been developed that explicitly estimate species trees. Of the several factors that can cause gene tree heterogeneity and discordance with the species tree, deep coalescence due to random genetic drift in branches of the species tree has been modeled most thoroughly. Bayesian approaches to estimating species trees utilize two likelihood functions, one of which has been widely used in traditional phylogenetics and involves the model of nucleotide substitution, and the second of which is less familiar to phylogeneticists and involves the probability distribution of gene trees given a species tree. Other recent parametric and nonparametric methods for estimating species trees involve parsimony criteria, summary statistics, supertree and consensus methods. Species tree approaches are an appropriate goal for systematics, appear to work well in some cases where concatenation can be misleading, and suggest that sampling many independent loci will be paramount. Such methods can also be challenging to implement because of the complexity of the models and computational time. In addition, further elaboration of the simplest of coalescent models will be required to incorporate commonly known issues such as deviation from the molecular clock, gene flow and other genetic forces.

9.
Reversible-jump Markov chain Monte Carlo (RJ-MCMC) is a technique for simultaneously evaluating multiple related (but not necessarily nested) statistical models that has recently been applied to the problem of phylogenetic model selection. Here we use a simulation approach to assess the performance of this method and compare it to Akaike weights, a measure of model uncertainty that is based on the Akaike information criterion. Under conditions where the assumptions of the candidate models matched the generating conditions, both Bayesian and AIC-based methods perform well. The 95% credible interval contained the generating model close to 95% of the time. However, the size of the credible interval differed with the Bayesian credible set containing approximately 25% to 50% fewer models than an AIC-based credible interval. The posterior probability was a better indicator of the correct model than the Akaike weight when all assumptions were met but both measures performed similarly when some model assumptions were violated. Models in the Bayesian posterior distribution were also more similar to the generating model in their number of parameters and were less biased in their complexity. In contrast, Akaike-weighted models were more distant from the generating model and biased towards slightly greater complexity. The AIC-based credible interval appeared to be more robust to the violation of the rate homogeneity assumption. Both AIC and Bayesian approaches suggest that substantial uncertainty can accompany the choice of model for phylogenetic analyses, suggesting that alternative candidate models should be examined in analysis of phylogenetic data. [AIC; Akaike weights; Bayesian phylogenetics; model averaging; model selection; model uncertainty; posterior probability; reversible jump.]
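
One common way to build the 95% set of models compared in this abstract is to rank models by weight and accumulate until the target mass is reached. The sketch below assumes that construction. The function name and the weights are hypothetical, and the same routine would apply to posterior model probabilities from RJ-MCMC or to Akaike weights.

```python
def credible_set(weights, level=0.95):
    """Smallest set of top-ranked models whose cumulative weight reaches `level`."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for name, weight in ranked:
        chosen.append(name)
        cumulative += weight
        if cumulative >= level:
            break
    return chosen

# Hypothetical model weights (posterior probabilities or Akaike weights)
example_weights = {"GTR+G": 0.55, "HKY+G": 0.30, "TrN+G": 0.11, "K80+G": 0.04}
print(credible_set(example_weights))
```

The size of the returned list is the quantity the abstract compares between the Bayesian and AIC-based approaches: a smaller set at the same coverage indicates less model uncertainty.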

10.
The blind use of models of nucleotide substitution in evolutionary analyses is a common practice in the viral community. Typically, a simple model of evolution like the Kimura two-parameter model is used for estimating genetic distances and phylogenies, either because other authors have used it or because it is the default in various phylogenetic packages. Using two statistical approaches to model fitting, hierarchical likelihood ratio tests and the Akaike information criterion, we show that different viral data sets are better explained by different models of evolution. We demonstrate our results with the analysis of HIV-1 sequences from a hierarchy of samples: sequences within individuals, individuals within subtypes, and subtypes within groups. We also examine results for three different gene regions: gag, pol, and env. The Kimura two-parameter model was not selected as the best-fit model for any of these data sets, despite its widespread use in phylogenetic analyses of HIV-1 sequences. Furthermore, the model complexity increased with increasing sequence divergence. Finally, the molecular-clock hypothesis was rejected in most of the data sets analyzed, throwing into question clock-based estimates of divergence times for HIV-1. The importance of models in evolutionary analyses and their repercussions on the derived conclusions are discussed.
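
For reference, the Kimura two-parameter distance that this abstract describes as a common default can be computed from the observed proportions of transitions (P) and transversions (Q) as d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The sketch below is illustrative only. The sequences are made up, and skipping gaps and ambiguity codes is an assumption about how such sites should be handled.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / sites, transversions / sites
    arg1, arg2 = 1 - 2 * p - q, 1 - 2 * q
    if arg1 <= 0 or arg2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)

# Made-up sequences: one transition (A->G) and one transversion (C->A) in ten sites
print(k2p_distance("ACGTACGTAC", "GCGTACGTAA"))
```

The model has only one free substitution parameter (the transition/transversion bias) and assumes equal base frequencies, which is part of why richer models tend to be preferred by formal model selection as divergence increases.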

11.
We propose a Bayesian method for testing molecular clock hypotheses for use with aligned sequence data from multiple taxa. Our method utilizes a nonreversible nucleotide substitution model to avoid the necessity of specifying either a known tree relating the taxa or an outgroup for rooting the tree. We employ reversible jump Markov chain Monte Carlo to sample from the posterior distribution of the phylogenetic model parameters and conduct hypothesis testing using Bayes factors, the ratio of the posterior to prior odds of competing models. Here, the Bayes factors reflect the relative support of the sequence data for equal rates of evolutionary change between taxa versus unequal rates, averaged over all possible phylogenetic parameters, including the tree and root position. As the molecular clock model is a restriction of the more general unequal rates model, we use the Savage-Dickey ratio to estimate the Bayes factors. The Savage-Dickey ratio provides a convenient approach to calculating Bayes factors in favor of sharp hypotheses. Critical to calculating the Savage-Dickey ratio is a determination of the prior induced on the modeling restrictions. We demonstrate our method on a well-studied mtDNA sequence data set consisting of nine primates. We find strong support against a global molecular clock, but do find support for a local clock among the anthropoids. We provide mathematical derivations of the induced priors on branch length restrictions assuming equally likely trees. These derivations also have more general applicability to the examination of prior assumptions in Bayesian phylogenetics.
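
The two quantities named in this abstract can be written out as follows. The notation is an assumption made for illustration and is not taken from the paper: M_0 is the restricted (clock) model, M_1 the general model, D the data, \phi the parameter that the restriction fixes at \phi_0, and \psi the remaining parameters.

```latex
% Bayes factor: ratio of posterior odds to prior odds, equal to the ratio of marginal likelihoods
B_{01} = \frac{P(M_0 \mid D) / P(M_1 \mid D)}{P(M_0) / P(M_1)}
       = \frac{p(D \mid M_0)}{p(D \mid M_1)}

% Savage-Dickey ratio: valid when M_0 is M_1 restricted to \phi = \phi_0 and
% the prior on \psi under M_0 equals p(\psi \mid \phi = \phi_0, M_1)
B_{01} = \frac{p(\phi = \phi_0 \mid D, M_1)}{p(\phi = \phi_0 \mid M_1)}
```

The denominator of the second expression is the prior density that the general model places on the restriction, which is why the abstract treats the induced prior as critical to the calculation.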

12.
Phylogenetic analysis of large datasets using complex nucleotide substitution models under a maximum likelihood framework can be computationally infeasible, especially when attempting to infer confidence values by way of nonparametric bootstrapping. Recent developments in phylogenetics suggest the computational burden can be reduced by using Bayesian methods of phylogenetic inference. However, few empirical phylogenetic studies exist that explore the efficiency of Bayesian analysis of large datasets. To this end, we conducted an extensive phylogenetic analysis of the wide-ranging and geographically variable Eastern Fence Lizard (Sceloporus undulatus). Maximum parsimony, maximum likelihood, and Bayesian phylogenetic analyses were performed on a combined mitochondrial DNA dataset (12S and 16S rRNA, ND1 protein-coding gene, and associated tRNA; 3,688 bp total) for 56 populations of S. undulatus (78 total terminals including other S. undulatus group species and outgroups). Maximum parsimony analysis resulted in numerous equally parsimonious trees (82,646 from equally weighted parsimony and 335 from weighted parsimony). The majority rule consensus tree derived from the Bayesian analysis was topologically identical to the single best phylogeny inferred from the maximum likelihood analysis, but required approximately 80% less computational time. The mtDNA data provide strong support for the monophyly of the S. undulatus group and the paraphyly of "S. undulatus" with respect to S. belli, S. cautus, and S. woodi. Parallel evolution of ecomorphs within "S. undulatus" has masked the actual number of species within this group. This evidence, along with convincing patterns of phylogeographic differentiation, suggests "S. undulatus" represents at least four lineages that should be recognized as evolutionary species.

13.
We have applied Bayesian and maximum likelihood methods of phylogenetic estimation to data from four mitochondrial genes (COI, COII, 12S, and 16S) and a single nuclear gene (EF1alpha) from several genera of New Zealand, Australian, and New Caledonian cicada taxa. We specifically focused on the heterogeneity of phylogenetic signal among the different data partitions and the biogeographic origins of the New Zealand cicada fauna. The Bayesian analyses circumvent many of the problems associated with other statistical tests for comparing data partitions. We took an information-theoretic approach to model selection based on the Akaike Information Criterion (AIC). This approach indicated that there was considerable uncertainty in identifying the best-fit model for some of the partitions. Additionally, a large amount of uncertainty was associated with many parameter estimates from the substitution model. However, a sensitivity analysis on the combined dataset indicated that the model selection uncertainty had little effect on estimates of topology because these estimates were largely insensitive to changes in the assumed model. This outcome suggests strong signal in our data. Our analyses support a New Caledonian affiliation of the New Zealand cicada genera Maoricicada, Kikihia, and Rhodopsalta and Australian affinities for the genera Amphipsalta and Notopsalta. This result was surprising, given that previous cicada biologists suspected a close relationship between Amphipsalta, Notopsalta, and Rhodopsalta based on genitalic characters. Relationships among the closely related genera Maoricicada, Kikihia, and Rhodopsalta were poorly resolved, the mitochondrial data and the EF1alpha data favoring different arrangements within this clade.

14.
A common problem in molecular phylogenetics is choosing a model of DNA substitution that does a good job of explaining the DNA sequence alignment without introducing superfluous parameters. A number of methods have been used to choose among a small set of candidate substitution models, such as the likelihood ratio test, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Bayes factors. Current implementations of any of these criteria suffer from the limitation that only a small set of models is examined, or that the test does not allow easy comparison of non-nested models. In this article, we expand the pool of candidate substitution models to include all possible time-reversible models. This set includes seven models that have already been described. We show how Bayes factors can be calculated for these models using reversible jump Markov chain Monte Carlo, and apply the method to 16 DNA sequence alignments. For each data set, we compare the model with the best Bayes factor to the best models chosen using AIC and BIC. We find that the best model under any of these criteria is not necessarily the most complicated one; models with an intermediate number of substitution types typically do best. Moreover, almost all of the models that are chosen as best do not constrain a transition rate to be the same as a transversion rate, suggesting that it is the transition/transversion rate bias that plays the largest role in determining which models are selected. Importantly, the reversible jump Markov chain Monte Carlo algorithm described here allows estimation of phylogeny (and other phylogenetic model parameters) to be performed while accounting for uncertainty in the model of DNA substitution.
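
One way to see what "all possible time-reversible models" means is to treat each model as a grouping of the six exchangeability rates (AC, AG, AT, CG, CT, GT) into classes that share a rate. Under that reading, which is an assumption about how the model space is defined here, the candidates are the set partitions of six items, and there are 203 of them (the sixth Bell number). The sketch below enumerates the partitions as restricted growth strings and is not the RJ-MCMC sampler itself.

```python
RATE_NAMES = ("AC", "AG", "AT", "CG", "CT", "GT")

def restricted_growth_strings(n):
    """Yield each set partition of n items as a tuple of class labels.

    The first item always gets label 0, and each later item may reuse an
    existing label or open the next unused one, so every partition appears once.
    """
    def extend(prefix, max_label):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))
    yield from extend([], -1)

def as_code(partition):
    """Format a partition in the common 1-based notation, e.g. 121121."""
    return "".join(str(label + 1) for label in partition)

rate_models = list(restricted_growth_strings(len(RATE_NAMES)))
print(len(rate_models), "candidate rate structures")
print(as_code((0, 0, 0, 0, 0, 0)), "one shared rate (as in JC69 and F81)")
print(as_code((0, 1, 0, 0, 1, 0)), "transitions vs transversions (as in K80 and HKY85)")
print(as_code((0, 1, 2, 3, 4, 5)), "six distinct rates (GTR)")
```

A reversible-jump sampler moves among these rate structures by merging or splitting rate classes, so the time spent in each structure estimates its posterior probability.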

15.
In phylogenetic analyses of molecular sequence data, partitioning involves estimating independent models of molecular evolution for different sets of sites in a sequence alignment. Choosing an appropriate partitioning scheme is an important step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for large multilocus DNA data sets. We demonstrate that these methods significantly outperform previous approaches, including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently proposed hierarchical clustering method. We have implemented these methods in an open-source program, PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of information-theoretic metrics (e.g., the Bayesian information criterion, Akaike information criterion [AIC], and corrected AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OSX 10.4 and above. The program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.
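
The three information-theoretic metrics named in this abstract can be compared with a short sketch. It is not PartitionFinder's code or interface. The scheme names, log-likelihoods and parameter counts are hypothetical, and using the alignment length as the sample size n is an assumption.

```python
import math

def aic(lnl, k):
    """Akaike information criterion."""
    return 2 * k - 2 * lnl

def aicc(lnl, k, n):
    """AIC with the small-sample correction; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("sample size too small for the AICc correction")
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(lnl, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * lnl

# Hypothetical schemes: (total log-likelihood, total free parameters)
schemes = {"unpartitioned": (-5200.0, 10), "by_codon_position": (-5170.0, 30)}
n_sites = 1500  # assumption: sample size taken as the number of alignment sites

scores = {
    name: {"AIC": aic(lnl, k), "AICc": aicc(lnl, k, n_sites), "BIC": bic(lnl, k, n_sites)}
    for name, (lnl, k) in schemes.items()
}
best = {
    metric: min(scores, key=lambda name: scores[name][metric])
    for metric in ("AIC", "AICc", "BIC")
}
print(best)
```

With these made-up numbers the AIC and AICc prefer the partitioned scheme while the BIC prefers the unpartitioned one, because the BIC penalty per parameter grows with the logarithm of the sample size.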

16.
We analyzed the phylogeny of the Neotropical pitvipers within the Porthidium group (including intra-specific through inter-generic relationships) using 1.4 kb of DNA sequences from two mitochondrial protein-coding genes (ND4 and cyt-b). We investigated how Bayesian Markov chain Monte-Carlo (MCMC) phylogenetic hypotheses based on this 'mesoscale' dataset were affected by analysis under various complex models of nucleotide evolution that partition models across the dataset. We develop an approach, employing three statistics (Akaike weights, Bayes factors, and relative Bayes factors), for examining the performance of complex models in order to identify the best-fit model for data analysis. Our results suggest that: (1) model choice may have important practical effects on phylogenetic conclusions even for mesoscale datasets, (2) the use of a complex partitioned model did not produce widespread increases or decreases in nodal posterior probability support, and (3) most differences in resolution resulting from model choice were concentrated at deeper nodes. Our phylogenetic estimates of relationships among members of the Porthidium group (genera: Atropoides, Cerrophidion, and Porthidium) resolve the monophyly of the three genera. Bayesian MCMC results suggest that Cerrophidion and Porthidium form a clade that is the sister taxon to Atropoides. In addition to resolving the intra-specific relationships among a majority of Porthidium group taxa, our results highlight phylogeographic patterns across Middle and South America and suggest that each of the three genera may harbor undescribed species diversity.

17.
Here we present a model of nucleotide substitution in protein-coding regions that also encode the formation of conserved RNA structures. In such regions, apparent evolutionary context dependencies exist, both between nucleotides occupying the same codon and between nucleotides forming a base pair in the RNA structure. The overlap of these fundamental dependencies is sufficient to cause "contagious" context dependencies which cascade across many nucleotide sites. Such large-scale dependencies challenge the use of traditional phylogenetic models in evolutionary inference because they explicitly assume evolutionary independence between short nucleotide tuples. In our model we address this by replacing context dependencies within codons by annotation-specific heterogeneity in the substitution process. Through a general procedure, we fragment the alignment into sets of short nucleotide tuples based on both the protein coding and the structural annotation. These individual tuples are assumed to evolve independently, and the different tuple sets are assigned different annotation-specific substitution models shared between their members. This allows us to build a composite model of the substitution process from components of traditional phylogenetic models. We applied this to a data set of full-genome sequences from the hepatitis C virus where five RNA structures are mapped within the coding region. This allowed us to partition the effects of selection on different structural elements and to test various hypotheses concerning the relation of these effects. Of particular interest, we found evidence of a functional role of loop and bulge regions, as these were shown to evolve according to a different and more constrained selective regime than the nonpairing regions outside the RNA structures. Other potential applications of the model include comparative RNA structure prediction in coding regions and RNA virus phylogenetics.

18.
The evolutionary patterns of hepatitis C virus (HCV), including the best-fitting nucleotide substitution model and the molecular clock hypothesis, were investigated by analyzing full-genome sequences available in the HCV database. The likelihood ratio test allowed us to discriminate among different evolutionary hypotheses. The phylogeny of the six major HCV types was accurately inferred, and the final tree was rooted by reconstructing the hypothetical HCV common ancestor with the maximum likelihood method. The presence of phylogenetic noise and the relative nucleotide substitution rates in the different HCV genes were also examined. These results offer a general guideline for the future of HCV phylogenetic analysis and also provide important insights on HCV origin and evolution. Received: 13 January 2001 / Accepted: 21 June 2001

19.
Miyazawa S. PLoS ONE. 2011;6(12):e28892.
BACKGROUND: A mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the average fixation probability depending on the type of amino acid replacement, has advantages over nucleotide, amino acid, and empirical codon substitution models in evolutionary analysis of protein-coding sequences. It can approximate a wide range of codon substitution processes. If no selection pressure on amino acids is taken into account, it will become equivalent to a nucleotide substitution model. If mutation rates are assumed not to depend on the codon type, then it will become essentially equivalent to an amino acid substitution model. Mutation at the nucleotide level and selection at the amino acid level can be separately evaluated. RESULTS: The present scheme for single nucleotide mutations is equivalent to the general time-reversible model, but multiple nucleotide changes in infinitesimal time are allowed. Selective constraints on the respective types of amino acid replacements are tailored to each gene in a linear function of a given estimate of selective constraints. Their good estimates are those calculated by maximizing the respective likelihoods of empirical amino acid or codon substitution frequency matrices. Akaike and Bayesian information criteria indicate that the present model performs far better than the other substitution models for all five phylogenetic trees of highly-divergent to highly-homologous sequences of chloroplast, mitochondrial, and nuclear genes. It is also shown that multiple nucleotide changes in infinitesimal time are significant in long branches, although they may be caused by compensatory substitutions or other mechanisms. The variation of selective constraint over sites fits the datasets significantly better than variable mutation rates, except for 10 slow-evolving nuclear genes of 10 mammals. 
A critical finding for phylogenetic analysis is that assuming variable mutation rates over sites leads to the overestimation of branch lengths.

20.
jModelTest: phylogenetic model averaging
jModelTest is a new program for the statistical selection of models of nucleotide substitution based on "Phyml" (Guindon and Gascuel 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52:696-704.). It implements 5 different selection strategies, including "hierarchical and dynamical likelihood ratio tests," the "Akaike information criterion," the "Bayesian information criterion," and a "decision-theoretic performance-based" approach. This program also calculates the relative importance and model-averaged estimates of substitution parameters, including a model-averaged estimate of the phylogeny. jModelTest is written in Java and runs under Mac OSX, Windows, and Unix systems with a Java Runtime Environment installed. The program, including documentation, can be freely downloaded from the software section at http://darwin.uvigo.es.
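
The "relative importance" and "model-averaged estimates" mentioned in this abstract are commonly computed from model weights as sketched below. This is an illustration of the general idea, not jModelTest's implementation. The weights, parameter labels and estimates are hypothetical, and averaging only over the models that contain the parameter is one of several possible conventions.

```python
def parameter_importance(weights, model_params):
    """Sum the weights of every model that includes each parameter."""
    importance = {}
    for model, weight in weights.items():
        for param in model_params[model]:
            importance[param] = importance.get(param, 0.0) + weight
    return importance

def model_averaged_estimate(weights, estimates):
    """Weighted mean of a parameter over the models that estimate it."""
    numerator = sum(weights[model] * value for model, value in estimates.items())
    denominator = sum(weights[model] for model in estimates)
    return numerator / denominator

# Hypothetical Akaike weights for four candidate models
model_weights = {"HKY": 0.10, "HKY+G": 0.60, "HKY+I": 0.05, "HKY+I+G": 0.25}
# Which rate-heterogeneity parameters each model contains
model_params = {
    "HKY": [],
    "HKY+G": ["gamma_shape"],
    "HKY+I": ["prop_invariable"],
    "HKY+I+G": ["gamma_shape", "prop_invariable"],
}
# Hypothetical gamma shape estimates from the two models that include it
alpha_estimates = {"HKY+G": 0.5, "HKY+I+G": 0.9}

importance = parameter_importance(model_weights, model_params)
alpha_avg = model_averaged_estimate(model_weights, alpha_estimates)
print(importance)
print(alpha_avg)
```

An importance value close to one means that nearly all of the weight sits on models containing that parameter, regardless of which single model ranks first.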
