Similar articles
1.
Models of sequence evolution play an important role in molecular evolutionary studies. The use of inappropriate models of evolution may bias the results of the analysis and lead to erroneous conclusions. Several procedures for selecting the best-fit model of evolution for the data at hand have been proposed, like the likelihood ratio test (LRT) and the Akaike (AIC) and Bayesian (BIC) information criteria. The relative performance of these model-selecting algorithms has not yet been studied under a range of different model trees. In this study, the influence of branch length variation upon model selection is characterized. This is done by simulating sequence alignments under a known model of nucleotide substitution, and recording how often this true model is recovered by different model-fitting strategies. Results of this study agree with previous simulations and suggest that model selection is reasonably accurate. However, different model selection methods showed distinct levels of accuracy. Some LRT approaches showed better performance than the AIC or BIC information criteria. Within the LRTs, model selection is affected by the complexity of the initial model selected for the comparisons, and only slightly by the order in which different parameters are added to the model. A specific hierarchy of LRTs, which starts from a simple model of evolution, performed overall better than other possible LRT hierarchies, or than the AIC or BIC. Received: 2 October 2000 / Accepted: 4 January 2001
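As a rough illustration of the criteria compared in this abstract, the following Python sketch computes the AIC, the BIC, and the LRT statistic from maximized log-likelihoods. The log-likelihood values, the free-parameter counts (substitution parameters only, branch lengths ignored), and the use of alignment length as the BIC sample size are all assumptions made for the example; none of them come from the study.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik


def lrt_statistic(log_lik_null, log_lik_alt):
    """Likelihood ratio statistic 2 (ln L_alt - ln L_null) for nested models."""
    return 2 * (log_lik_alt - log_lik_null)


# Hypothetical values: (maximized log-likelihood, number of free parameters).
CANDIDATES = {
    "JC": (-2500.0, 0),
    "K2P": (-2450.0, 1),
    "HKY": (-2440.0, 4),
    "GTR": (-2438.0, 8),
}
ALIGNMENT_LENGTH = 1000  # assumed sample size for the BIC


def best_by(criterion):
    """Return the candidate model with the lowest value of the criterion."""
    return min(CANDIDATES, key=lambda name: criterion(*CANDIDATES[name]))


best_aic = best_by(aic)
best_bic = best_by(lambda log_lik, k: bic(log_lik, k, ALIGNMENT_LENGTH))
```

With these made-up numbers the two criteria disagree: the BIC penalizes each extra parameter by ln(1000), about 6.9, instead of 2, so it settles on a simpler model than the AIC does.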

2.
In order to have confidence in model-based phylogenetic analysis, the model of nucleotide substitution adopted must be selected in a statistically rigorous manner. Several model-selection methods are applicable to maximum likelihood (ML) analysis, including the hierarchical likelihood-ratio test (hLRT), Akaike information criterion (AIC), Bayesian information criterion (BIC), and decision theory (DT), but their performance relative to empirical data has not been investigated thoroughly. In this study, we use 250 phylogenetic data sets obtained from TreeBASE to examine the effects that choice in model selection has on ML estimation of phylogeny, with an emphasis on optimal topology, bootstrap support, and hypothesis testing. We show that the use of different methods leads to the selection of two or more models for approximately 80% of the data sets and that the AIC typically selects more complex models than alternative approaches. Although ML estimation with different best-fit models results in incongruent tree topologies approximately 50% of the time, these differences are primarily attributable to alternative resolutions of poorly supported nodes. Furthermore, topologies and bootstrap values estimated with ML using alternative statistically supported models are more similar to each other than to topologies and bootstrap values estimated with ML under the Kimura two-parameter (K2P) model or maximum parsimony (MP). In addition, Swofford-Olsen-Waddell-Hillis (SOWH) tests indicate that ML trees estimated with alternative best-fit models are usually not significantly different from each other when evaluated with the same model. However, ML trees estimated with statistically supported models are often significantly suboptimal to ML trees made with the K2P model when both are evaluated with K2P, indicating that not all models perform in an equivalent manner. 
Nevertheless, the use of alternative statistically supported models generally does not affect tests of monophyletic relationships under either the Shimodaira-Hasegawa (S-H) or SOWH methods. Our results suggest that although choice in model selection has a strong impact on optimal tree topology, it rarely affects evolutionary inferences drawn from the data because differences are mainly confined to poorly supported nodes. Moreover, since ML with alternative best-fit models tends to produce more similar estimates of phylogeny than ML under the K2P model or MP, the use of any statistically based model-selection method is vastly preferable to forgoing the model-selection process altogether.

3.
In the reconstruction of a large phylogenetic tree, the most difficult part is usually the problem of how to explore the topology space to find the optimal topology. We have developed a "divide-and-conquer" heuristic algorithm in which an initial neighbor-joining (NJ) tree is divided into subtrees at internal branches having bootstrap values higher than a threshold. The topology search is then conducted by using the maximum-likelihood method to reevaluate all branches with a bootstrap value lower than the threshold while keeping the other branches intact. Extensive simulation showed that our simple method, the neighbor-joining maximum-likelihood (NJML) method, is highly efficient in improving NJ trees. Furthermore, the performance of the NJML method is nearly equal to or better than existing time-consuming heuristic maximum-likelihood methods. Our method is suitable for reconstructing relatively large molecular phylogenetic trees (number of taxa ≥ 16).

4.
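The cross-validation statistic CV used in model selection is improved to yield RCV, a new criterion for checking whether a model is appropriate. RCV incorporates the information in CV together with the goodness of fit, the number of parameters to be estimated in the model, and the sample size, and it is more stable and discriminates between models better than AIC, BIC, or CV.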

5.
Quartet-mapping, a generalization of the likelihood-mapping procedure.
Likelihood-mapping (LM) was suggested as a method of displaying the phylogenetic content of an alignment. However, statistical properties of the method have not been studied. Here we analyze the special case of a four-species tree generated under a range of evolution models and compare the results with those of a natural extension of the likelihood-mapping approach, geometry-mapping (GM), which is based on the method of statistical geometry in sequence space. The methods are compared in their abilities to indicate the correct topology. The performance of both methods in detecting the star topology is especially explored. Our results show that LM tends to reject a star tree more often than GM. When assumptions about the evolutionary model of the maximum-likelihood reconstruction are not matched by the true process of evolution, then LM shows a tendency to favor one tree, whereas GM correctly detects the star tree except for very short outer branch lengths with a statistical significance of >0.95 for all models. LM, on the other hand, reconstructs the correct bifurcating tree with a probability of >0.95 for most branch length combinations even under models with varying substitution rates. The parameter domain for which GM recovers the true tree is much smaller. When the exterior branch lengths are larger than a (analytically derived) threshold value depending on the tree shape (rather than the evolutionary model), GM reconstructs a star tree rather than the true tree. We suggest a combined approach of LM and GM for the evaluation of starlike trees. This approach offers the possibility of testing for significant positive interior branch lengths without extensive statistical and computational efforts.

6.
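Knot attributes of planted Korean pine based on mixed-effects models
Based on data for 1534 knots from 60 planted Korean pine trees at the Mengjiagang Forest Farm in Heilongjiang Province, mixed-effects prediction models for knot attributes (basal diameter, sound knot length, age at death, and angle) were built with the NLMIXED and GLIMMIX procedures in SAS. The Akaike information criterion (AIC), the Bayesian information criterion (BIC), the log-likelihood (-2LL), and the likelihood ratio test (LRT) were used to compare the accuracy of the fitted models. The results showed that mixed models accounting for tree-level effects were more accurate than conventional regression models in every case. For knot basal diameter, the best mixed-effects model was the one with the random parameters b_1 and b_2; for sound knot length, the one with the random parameters b_1 and b_3; for knot age at death, the generalized linear mixed model with a random parameter on knot basal diameter; and for knot angle, the generalized linear mixed model with random effects on the intercept, knot basal diameter, and sound knot length. Mixed-effects models describe knot attributes more effectively than conventional regression models. Korean pine is a major timber species in Northeast China, and combining knot attribute prediction models with a sound pruning regime can improve wood quality.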

7.
In phylogenetic inference by maximum-parsimony (MP), minimum-evolution (ME), and maximum-likelihood (ML) methods, it is customary to conduct extensive heuristic searches of MP, ME, and ML trees, examining a large number of different topologies. However, these extensive searches tend to give incorrect tree topologies. Here we show by extensive computer simulation that when the number of nucleotide sequences (m) is large and the number of nucleotides used (n) is relatively small, the simple MP or ML tree search algorithms such as the stepwise addition (SA) plus nearest neighbor interchange (NNI) search and the SA plus subtree pruning regrafting (SPR) search are as efficient as the extensive search algorithms such as the SA plus tree bisection-reconnection (TBR) search in inferring the true tree. In the case of ME methods, the simple neighbor-joining (NJ) algorithm is as efficient as or more efficient than the extensive NJ+TBR search. We show that when ME methods are used, the simple p distance generally gives better results in phylogenetic inference than more complicated distance measures such as the Hasegawa-Kishino-Yano (HKY) distance, even when nucleotide substitution follows the HKY model. When ML methods are used, the simple Jukes-Cantor (JC) model of phylogenetic inference generally shows a better performance than the HKY model even if the likelihood value for the HKY model is much higher than that for the JC model. This indicates that at least in the present case, selection of a substitution model by using the likelihood ratio test or the AIC index is not appropriate. When n is small relative to m and the extent of sequence divergence is high, the NJ method with p distance often shows a better performance than ML methods with the JC model. However, when the level of sequence divergence is low, this is not the case.

8.
A common problem in molecular phylogenetics is choosing a model of DNA substitution that does a good job of explaining the DNA sequence alignment without introducing superfluous parameters. A number of methods have been used to choose among a small set of candidate substitution models, such as the likelihood ratio test, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Bayes factors. Current implementations of any of these criteria suffer from the limitation that only a small set of models are examined, or that the test does not allow easy comparison of non-nested models. In this article, we expand the pool of candidate substitution models to include all possible time-reversible models. This set includes seven models that have already been described. We show how Bayes factors can be calculated for these models using reversible jump Markov chain Monte Carlo, and apply the method to 16 DNA sequence alignments. For each data set, we compare the model with the best Bayes factor to the best models chosen using AIC and BIC. We find that the best model under any of these criteria is not necessarily the most complicated one; models with an intermediate number of substitution types typically do best. Moreover, almost all of the models that are chosen as best do not constrain a transition rate to be the same as a transversion rate, suggesting that it is the transition/transversion rate bias that plays the largest role in determining which models are selected. Importantly, the reversible jump Markov chain Monte Carlo algorithm described here allows estimation of phylogeny (and other phylogenetic model parameters) to be performed while accounting for uncertainty in the model of DNA substitution.

9.
In phylogenetic analyses with combined multigene or multiprotein data sets, accounting for differing evolutionary dynamics at different loci is essential for accurate tree prediction. Existing maximum likelihood (ML) and Bayesian approaches are computationally intensive. We present an alternative approach that is orders of magnitude faster. The method, Distance Rates (DistR), estimates rates based upon distances derived from gene/protein sequence data. Simulation studies indicate that this technique is accurate compared with other methods and robust to missing sequence data. The DistR method was applied to a fungal mitochondrial data set, and the rate estimates compared well to those obtained using existing ML and Bayesian approaches. Inclusion of the protein rates estimated from the DistR method into the ML calculation of trees as a branch length multiplier resulted in a significantly improved fit as measured by the Akaike Information Criterion (AIC). Furthermore, bootstrap support for the ML topology was significantly greater when protein rates were used, and some evident errors in the concatenated ML tree topology (i.e., without protein rates) were corrected. [Bayesian credible intervals; DistR method; multigene phylogeny; PHYML; rate heterogeneity.]

10.
Phylogenetic analyses frequently rely on models of sequence evolution that detail nucleotide substitution rates, nucleotide frequencies, and site-to-site rate heterogeneity. These models can influence hypothesis testing and can affect the accuracy of phylogenetic inferences. Maximum likelihood methods of simultaneously constructing phylogenetic tree topologies and estimating model parameters are computationally intensive, and are not feasible for sample sizes of 25 or greater using personal computers. Techniques that initially construct a tree topology and then use this non-maximized topology to estimate ML substitution rates, however, can quickly arrive at a model of sequence evolution. The accuracy of this two-step estimation technique was tested using simulated data sets with known model parameters. The results showed that for a star-like topology, as is often seen in human immunodeficiency virus type 1 (HIV-1) subtype B sequences, a random starting topology could produce nucleotide substitution rates that were not statistically different than the true rates. Samples were isolated from 100 HIV-1 subtype B infected individuals from the United States and a 620 nt region of the env gene was sequenced for each sample. The sequence data were used to obtain a substitution model of sequence evolution specific for HIV-1 subtype B env by estimating nucleotide substitution rates and the site-to-site heterogeneity in 100 individuals from the United States. The method of estimating the model should provide users of large data sets with a way to quickly compute a model of sequence evolution, while the nucleotide substitution model we identified should prove useful in the phylogenetic analysis of HIV-1 subtype B env sequences. Received: 4 October 2000 / Accepted: 1 March 2001

11.
Membrane proteins move in heterogeneous environments with spatially (sometimes temporally) varying friction and with biochemical interactions with various partners. It is important to reliably distinguish different modes of motion to improve our knowledge of the membrane architecture and to understand the nature of interactions between membrane proteins and their environments. Here, we present an analysis technique for single molecule tracking (SMT) trajectories that can determine the preferred model of motion that best matches observed trajectories. The method is based on Bayesian inference to calculate the posterior probability of an observed trajectory according to a certain model. Information theory criteria, such as the Bayesian information criterion (BIC), the Akaike information criterion (AIC), and modified AIC (AICc), are used to select the preferred model. The considered group of models includes free Brownian motion, and confined motion in 2nd or 4th order potentials. We determine the best information criteria for classifying trajectories. We tested its limits through simulations matching large sets of experimental conditions and we built a decision tree. This decision tree first uses the BIC to distinguish between free Brownian motion and confined motion. In a second step, it classifies the confining potential further using the AIC. We apply the method to experimental Clostridium perfringens ε-toxin (CPεT) receptor trajectories to show that these receptors are confined by a spring-like potential. An adaptation of this technique was applied on a sliding window in the temporal dimension along the trajectory. We applied this adaptation to experimental CPεT trajectories that lose confinement due to disaggregation of confining domains. This new technique adds another dimension to the discussion of SMT data.
The mode of motion of a receptor might hold more biologically relevant information than the diffusion coefficient or domain size and may be a better tool to classify and compare different SMT experiments.
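The two-step decision tree described in this abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: the model labels, the parameter counts in `N_PARAMS`, the rule of comparing free motion against the better-scoring confined model, and the log-likelihood values in the usage note are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical free-parameter counts per motion model (assumed for illustration).
N_PARAMS = {"free": 1, "order2": 2, "order4": 2}


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik


def classify(log_liks, n_points):
    """Classify one trajectory from its per-model maximized log-likelihoods.

    Step 1 uses the BIC to separate free Brownian motion from confined motion.
    Step 2 uses the AIC to choose between 2nd and 4th order confining potentials.
    """
    bics = {m: bic(log_liks[m], N_PARAMS[m], n_points) for m in N_PARAMS}
    # Assumption: free motion is compared with the better of the two confined models.
    if bics["free"] <= min(bics["order2"], bics["order4"]):
        return "free"
    aics = {m: aic(log_liks[m], N_PARAMS[m]) for m in ("order2", "order4")}
    return min(aics, key=aics.get)
```

For example, `classify({"free": -500.0, "order2": -480.0, "order4": -470.0}, 100)` takes the confined branch in step 1 and then prefers the 4th order potential in step 2.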

12.
The maximum-likelihood (ML) solution to a simple phylogenetic estimation problem is obtained analytically. The problem is estimation of the rooted tree for three species using binary characters with a symmetrical rate of substitution under the molecular clock. ML estimates of branch lengths and log-likelihood scores are obtained analytically for each of the three rooted binary trees. Estimation of the tree topology is equivalent to partitioning the sample space (space of possible data outcomes) into subspaces, within each of which one of the three binary trees is the ML tree. Distance-based least squares and parsimony-like methods produce essentially the same estimate of the tree topology, although differences exist among methods even under this simple model. This seems to be the simplest case, but has many of the conceptual and statistical complexities involved in phylogeny estimation. The solution to this real phylogeny estimation problem will be useful for studying the problem of significance evaluation.

13.
Selecting the best-fit model of nucleotide substitution
Despite the relevant role of models of nucleotide substitution in phylogenetics, choosing among different models remains a problem. Several statistical methods for selecting the model that best fits the data at hand have been proposed, but their absolute and relative performance has not yet been characterized. In this study, we compare under various conditions the performance of different hierarchical and dynamic likelihood ratio tests, and of Akaike and Bayesian information methods, for selecting best-fit models of nucleotide substitution. We specifically examine the role of the topology used to estimate the likelihood of the different models and the importance of the order in which hypotheses are tested. We do this by simulating DNA sequences under a known model of nucleotide substitution and recording how often this true model is recovered by the different methods. Our results suggest that model selection is reasonably accurate and indicate that some likelihood ratio test methods perform overall better than the Akaike or Bayesian information criteria. The tree used to estimate the likelihood scores does not influence model selection unless it is a randomly chosen tree. The order in which hypotheses are tested, and the complexity of the initial model in the sequence of tests, influence model selection in some cases. Model fitting in phylogenetics has been suggested for many years, yet many authors still arbitrarily choose their models, often using the default models implemented in standard computer programs for phylogenetic estimation. We show here that a best-fit model can be readily identified. Consequently, given the relevance of models, model fitting should be routine in any phylogenetic analysis that uses models of evolution.

14.
Directional selection and the site-frequency spectrum.
C D Bustamante, J Wakeley, S Sawyer, D L Hartl. Genetics 2001, 159(4): 1779-1788
In this article we explore statistical properties of the maximum-likelihood estimates (MLEs) of the selection and mutation parameters in a Poisson random field population genetics model of directional selection at DNA sites. We derive the asymptotic variances and covariance of the MLEs and explore the power of the likelihood ratio tests (LRT) of neutrality for varying levels of mutation and selection as well as the robustness of the LRT to deviations from the assumption of free recombination among sites. We also discuss the coverage of confidence intervals on the basis of two standard-likelihood methods. We find that the LRT has high power to detect deviations from neutrality and that the maximum-likelihood estimation performs very well when the ancestral states of all mutations in the sample are known. When the ancestral states are not known, the test has high power to detect deviations from neutrality for negative selection but not for positive selection. We also find that the LRT is not robust to deviations from the assumption of independence among sites.

15.
Comparisons are made of the accuracy of the restricted maximum-likelihood, Wagner parsimony, and UPGMA (unweighted pair-group method using arithmetic averages) clustering methods to estimate phylogenetic trees. Data matrices were generated by constructing simulated stochastic evolution in a multidimensional gene-frequency space using a simple genetic-drift model (Brownian-motion, random-walk) with constant rates of divergence in all lineages. Ten different phylogenetic tree topologies of 20 operational taxonomic units (OTUs), representing a range of tree shapes, were used. Felsenstein's restricted maximum-likelihood method, Wagner parsimony, and UPGMA clustering were used to construct trees from the resulting data matrices. The computations for the restricted maximum-likelihood method were performed on a Cray-1 supercomputer since the required calculations (especially when optimized for the vector hardware) are performed substantially faster than on more conventional computing systems. The overall level of accuracy of tree reconstruction depends on the topology of the true phylogenetic tree. The UPGMA clustering method, especially when genetic-distance coefficients are used, gives the most accurate estimates of the true phylogeny (for our model with constant evolutionary rates). For large numbers of loci, all methods give similar results, but trends in the results imply that the restricted maximum-likelihood method would produce the most accurate trees if sample sizes were large enough.

16.
Anisimova M, Nielsen R, Yang Z. Genetics 2003, 164(3): 1229-1236
Maximum-likelihood methods based on models of codon substitution accounting for heterogeneous selective pressures across sites have proved to be powerful in detecting positive selection in protein-coding DNA sequences. Those methods are phylogeny based and do not account for the effects of recombination. When recombination occurs, such as in population data, no unique tree topology can describe the evolutionary history of the whole sequence. This violation of assumptions raises serious concerns about the likelihood method for detecting positive selection. Here we use computer simulation to evaluate the reliability of the likelihood-ratio test (LRT) for positive selection in the presence of recombination. We examine three tests based on different models of variable selective pressures among sites. Sequences are simulated using a coalescent model with recombination and analyzed using codon-based likelihood models ignoring recombination. We find that the LRT is robust to low levels of recombination (with fewer than three recombination events in the history of a sample of 10 sequences). However, at higher levels of recombination, the type I error rate can be as high as 90%, especially when the null model in the LRT is unrealistic, and the test often mistakes recombination as evidence for positive selection. The test that compares the more realistic models M7 (beta) against M8 (beta and omega) is more robust to recombination, where the null model M7 allows the positive selection pressure to vary between 0 and 1 (and so does not account for positive selection), and the alternative model M8 allows an additional discrete class with omega = d(N)/d(S) that could be estimated to be >1 (and thus accounts for positive selection). Identification of sites under positive selection by the empirical Bayes method appears to be less affected than the LRT by recombination.

17.
18.
Hummingbirds are an important model system in avian biology, but to date the group has been the subject of remarkably few phylogenetic investigations. Here we present partitioned Bayesian and maximum likelihood phylogenetic analyses for 151 of approximately 330 species of hummingbirds and 12 outgroup taxa based on two protein-coding mitochondrial genes (ND2 and ND4), flanking tRNAs, and two nuclear introns (AK1 and BFib). We analyzed these data under several partitioning strategies ranging between unpartitioned and a maximum of nine partitions. In order to select a statistically justified partitioning strategy following partitioned Bayesian analysis, we considered four alternative criteria including Bayes factors, modified versions of the Akaike information criterion for small sample sizes (AICc), Bayesian information criterion (BIC), and a decision-theoretic methodology (DT). Following partitioned maximum likelihood analyses, we selected a best-fitting strategy using hierarchical likelihood ratio tests (hLRTs), the conventional AICc, BIC, and DT, concluding that the most stringent criterion, the performance-based DT, was the most appropriate methodology for selecting amongst partitioning strategies. In the context of our well-resolved and well-supported phylogenetic estimate, we consider the historical biogeography of hummingbirds using ancestral state reconstructions of (1) primary geographic region of occurrence (i.e., South America, Central America, North America, Greater Antilles, Lesser Antilles), (2) Andean or non-Andean geographic distribution, and (3) minimum elevational occurrence. These analyses indicate that the basal hummingbird assemblages originated in the lowlands of South America, that most of the principal clades of hummingbirds (all but Mountain Gems and possibly Bees) originated on this continent, and that there have been many (at least 30) independent invasions of other primary landmasses, especially Central America.

19.
The root of a phylogenetic tree is fundamental to its biological interpretation, but standard substitution models do not provide any information on its position. Here, we describe two recently developed models that relax the usual assumptions of stationarity and reversibility, thereby facilitating root inference without the need for an outgroup. We compare the performance of these models on a classic test case for phylogenetic methods, before considering two highly topical questions in evolutionary biology: the deep structure of the tree of life and the root of the archaeal radiation. We show that all three alignments contain meaningful rooting information that can be harnessed by these new models, thus complementing and extending previous work based on outgroup rooting. In particular, our analyses exclude the root of the tree of life from the eukaryotes or Archaea, placing it on the bacterial stem or within the Bacteria. They also exclude the root of the archaeal radiation from several major clades, consistent with analyses using other rooting methods. Overall, our results demonstrate the utility of non-reversible and non-stationary models for rooting phylogenetic trees, and identify areas where further progress can be made.

20.
Increasingly, large data sets pose a challenge for computationally intensive phylogenetic methods such as Bayesian Markov chain Monte Carlo (MCMC). Here, we investigate the performance of common MCMC proposal distributions in terms of median and variance of run time to convergence on 11 data sets. We introduce two new Metropolized Gibbs Samplers for moving through "tree space." MCMC simulation using these new proposals shows faster average run time and dramatically improved predictability in performance, with a 20-fold reduction in the variance of the time to estimate the posterior distribution to a given accuracy. We also introduce conditional clade probabilities and demonstrate that they provide a superior means of approximating tree topology posterior probabilities from samples recorded during MCMC.
