Similar literature
 20 similar documents found
1.
We examine whether phylogenetic methods provide biased estimates of tree shape with respect to the random branching model. We investigate the performance of five commonly used phylogenetic methods using computer simulation: (1) maximum parsimony; (2) neighbor joining; (3) UPGMA with an outgroup taxon; (4) UPGMA without an outgroup taxon; and (5) maximum likelihood. All methods provide estimates of tree shape that are, on average, more asymmetrical than the true tree, especially when rates of evolution are high. We suggest a simple explanation for the bias and propose a modified test of tree shape that corrects for it.

2.
The robustness (sensitivity to violation of assumptions) of the maximum-likelihood and neighbor-joining methods was examined using simulation. Maximum likelihood and neighbor joining were implemented with Jukes-Cantor, Kimura, and gamma models of DNA substitution. Simulations were performed in which the assumptions of the methods were violated to varying degrees on three model four-taxon trees. The performance of the methods was evaluated with respect to ability to correctly estimate the unrooted four-taxon tree. Maximum likelihood outperformed neighbor joining in 29 of the 36 cases in which the assumptions of both methods were satisfied. In 133 of the 180 simulations in which the assumptions of the maximum-likelihood and neighbor-joining methods were violated, maximum likelihood outperformed neighbor joining. These results are consistent with a general superiority of maximum likelihood over neighbor joining under comparable conditions. They extend and clarify an earlier study that found an advantage for neighbor joining over maximum likelihood for gamma-distributed mutation rates.
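
As a rough illustration of the simplest substitution model named above, the Jukes-Cantor correction converts an observed proportion of differing sites into an estimated evolutionary distance. This is a minimal sketch, not the implementation used in the study; the function name `jc69_distance` and the choice to skip gaps and ambiguous bases are assumptions made for the example.

```python
import math


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor (JC69) distance between two aligned DNA sequences.

    Hypothetical helper for illustration: sites where either sequence
    has a gap or an ambiguous base are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    # Keep only sites where both sequences carry an unambiguous base.
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")

    # Observed proportion of differing sites (the p-distance).
    p = sum(a != b for a, b in pairs) / len(pairs)

    # The correction is undefined at saturation (p >= 3/4).
    if p >= 0.75:
        return math.inf

    # JC69 correction for multiple substitutions at the same site.
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Pairwise distances of this kind are the usual input to a distance method such as neighbor joining.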

3.
Phylogenetic analysis using parsimony and likelihood methods
The assumptions underlying the maximum-parsimony (MP) method of phylogenetic tree reconstruction were intuitively examined by studying the way the method works. Computer simulations were performed to corroborate the intuitive examination. Parsimony appears to involve very stringent assumptions concerning the process of sequence evolution, such as constancy of substitution rates between nucleotides, constancy of rates across nucleotide sites, and equal branch lengths in the tree. For practical data analysis, the requirement of equal branch lengths means similar substitution rates among lineages (the existence of an approximate molecular clock), relatively long interior branches, and also few species in the data. However, a small amount of evolution is neither a necessary nor a sufficient requirement of the method. The difficulties involved in the application of current statistical estimation theory to tree reconstruction were discussed, and it was suggested that the approach proposed by Felsenstein (1981,J. Mol. Evol. 17: 368–376) for topology estimation, as well as its many variations and extensions, differs fundamentally from the maximum likelihood estimation of a conventional statistical parameter. Evidence was presented showing that the Felsenstein approach does not share the asymptotic efficiency of the maximum likelihood estimator of a statistical parameter. Computer simulations were performed to study the probability that MP recovers the true tree under a hierarchy of models of nucleotide substitution; its performance relative to the likelihood method was especially noted. The results appeared to support the intuitive examination of the assumptions underlying MP. When a simple model of nucleotide substitution was assumed to generate data, the probability that MP recovers the true topology could be as high as, or even higher than, that for the likelihood method. 
When the assumed model became more complex and realistic, e.g., when substitution rates were allowed to differ between nucleotides or across sites, the probability that MP recovers the true topology, and especially its performance relative to that of the likelihood method, generally deteriorated. As the complexity of the process of nucleotide substitution in real sequences is well recognized, the likelihood method appears preferable to parsimony. However, the development of a statistical methodology for the efficient estimation of the tree topology remains a difficult open problem.

4.
We propose two approximate methods (one based on parsimony and one on pairwise sequence comparison) for estimating the pattern of nucleotide substitution and a parsimony-based method for estimating the gamma parameter for variable substitution rates among sites. The matrix of substitution rates that represents the substitution pattern can be recovered through its relationship with the observable matrix of site pattern frequencies in pairwise sequence comparisons. In the parsimony approach, the ancestral sequences reconstructed by the parsimony algorithm were used, and the two sequences compared are those at the ends of a branch in the phylogenetic tree. The method for estimating the gamma parameter was based on a reinterpretation of the numbers of changes at sites inferred by parsimony. Three data sets were analyzed to examine the utility of the approximate methods compared with the more reliable likelihood methods. The new methods for estimating the substitution pattern were found to produce estimates quite similar to those obtained from the likelihood analyses. The new method for estimating the gamma parameter was effective in reducing the bias in conventional parsimony estimates, although it also overestimated the parameter. The approximate methods are computationally very fast and appear useful for analyzing large data sets, for which use of the likelihood method requires excessive computation.

5.
We have investigated the effects of different among-site rate variation models on the estimation of substitution model parameters, branch lengths, topology, and bootstrap proportions under minimum evolution (ME) and maximum likelihood (ML). Specifically, we examined equal rates, invariable sites, gamma-distributed rates, and site-specific rates (SSR) models, using mitochondrial DNA sequence data from three protein-coding genes and one tRNA gene from species of the New Zealand cicada genus Maoricicada. Estimates of topology were relatively insensitive to the substitution model used; however, estimates of bootstrap support, branch lengths, and R-matrices (underlying relative substitution rate matrix) were strongly influenced by the assumptions of the substitution model. We identified one situation where ME and ML tree building became inaccurate when implemented with an inappropriate among-site rate variation model. Although SSR models often fit the data better than do invariable sites and gamma rates models, SSR models have some serious weaknesses. First, SSR rate parameters are not comparable across data sets, unlike the proportion of invariable sites or the alpha shape parameter of the gamma distribution. Second, the extreme among-site rate variation within codon positions is problematic for SSR models, which explicitly assume rate homogeneity within each rate class. Third, the SSR models appear to give severe underestimates of R-matrices and branch lengths relative to invariable sites and gamma rates models in this example. We recommend performing phylogenetic analyses under a range of substitution models to test the effects of model assumptions not only on estimates of topology but also on estimates of branch length and nodal support.

6.
Phylogenies are often thought to be more dependent on the specifics of the sequence alignment than on the method of reconstruction. Simulation of sequences containing insertion and deletion events was performed in order to determine the role that alignment accuracy plays during phylogenetic inference. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (ultrametric equal branch length, ultrametric random branch length, nonultrametric random branch length). Comparisons between hypothesized alignments and true alignments enabled determination of two measures of alignment accuracy, that of the total data set and that of individual branches. In general, our results indicate that as alignment error increases, topological accuracy decreases. This trend was much more pronounced for data sets derived from more pectinate topologies. In contrast, for balanced, ultrametric, equal branch length tree shapes, alignment inaccuracy had little average effect on tree reconstruction. These conclusions are based on average trends of many analyses under different conditions, and any one specific analysis, independent of the alignment accuracy, may recover very accurate or inaccurate topologies. Maximum likelihood and Bayesian methods, in general, outperformed neighbor joining and maximum parsimony in terms of tree reconstruction accuracy. Results also indicated that as the lengths of the branch and of the neighboring branches increase, alignment accuracy decreases, and the length of the neighboring branches is the major factor in topological accuracy. Thus, multiple-sequence alignment can be an important factor in downstream effects on topological reconstruction.

7.
A maximum likelihood framework for estimating site-specific substitution rates is presented that does not require any prior assumptions about the rate distribution. We show that, when the branching pattern of the underlying tree is known, the analysis of pairs of positions is sufficient to estimate site-specific rates. In the absence of a known topology, we introduce an iterative procedure to estimate simultaneously the branching pattern, the branch lengths, and site-specific substitution rates. Simulations show that the evolutionary rate of fast-evolving sites can be reliably inferred and that the accuracy of rate estimates depends mainly on the number of sequences in the data set. Thus, large sets of aligned sequences are necessary for reliable site-specific rate estimates. The method is applied to the complete mitochondrial DNA sequence of 53 humans, providing a complete picture of the site-specific substitution rates in human mitochondrial DNA.

8.
Few estimates of relative substitution rates, and the underlying mutation rates, exist between mitochondrial and nuclear genes in insects. Previous estimates for insects indicate a 2-9 times faster substitution rate in mitochondrial genes relative to nuclear genes. Here we use novel methods for estimating relative rates of substitution, which incorporate multiple substitutions, and apply these methods to a group of insects (lice, Order: Phthiraptera). First, we use a modification of copath analysis (branch length regression) to construct independent comparisons of rates, consisting of each branch in a phylogenetic tree. The branch length comparisons use maximum likelihood models to correct for multiple substitution. In addition, we estimate codon-specific rates under maximum likelihood for the different genes and compare these values. Estimates of the relative synonymous substitution rates between a mitochondrial (COI) and nuclear (EF-1alpha) gene in lice indicate a relative rate of several hundred to 1. This rapid relative mitochondrial rate (>100 times) is at least an order of magnitude faster than previous estimates for any group of organisms. Comparisons using the same methods for another group of insects (aphids) reveal that this extreme relative rate estimate is not simply attributable to the methods we used, because estimates from aphids are substantially lower. Taxon sampling affects the relative rate estimate, with comparisons involving more closely related taxa resulting in a higher estimate. Relative rate estimates also increase with model complexity, indicating that methods accounting for more multiple substitution estimate higher relative rates.

9.
Short phylogenetic distances between taxa occur, for example, in studies on ribosomal RNA-genes with slow substitution rates. For consistently short distances, it is proved that in the completely singular limit of the covariance matrix ordinary least squares (OLS) estimates are minimum variance or best linear unbiased (BLU) estimates of phylogenetic tree branch lengths. Although OLS estimates are in this situation equal to generalized least squares (GLS) estimates, the GLS chi-square likelihood ratio test will be inapplicable as it is associated with zero degrees of freedom. Consequently, an OLS normal distribution test or an analogous bootstrap approach will provide optimal branch length tests of significance for consistently short phylogenetic distances. As the asymptotic covariances between branch lengths will be equal to zero, it follows that the product rule can be used in tree evaluation to calculate an approximate simultaneous confidence probability that all interior branches are positive.
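
In symbols, the product rule mentioned above can be sketched as follows, where $P_i$ (notation assumed here, not taken from the abstract) is the confidence probability that interior branch $i$ is positive and $k$ is the number of interior branches:

```latex
P(\text{all } k \text{ interior branches} > 0) \;\approx\; \prod_{i=1}^{k} P_i
```

For instance, three interior branches each supported at 0.95 would give roughly $0.95^3 \approx 0.857$. The approximation relies on the covariances between branch lengths being close to zero, as the abstract states.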

10.
We conducted a simulation study of the phylogenetic methods UPGMA, neighbor joining, maximum parsimony, and maximum likelihood for a five-taxon tree under a molecular clock. The parameter space included a small region where maximum parsimony is inconsistent, so we tested inconsistency correction for parsimony and distance correction for neighbor joining. As expected, corrected parsimony was consistent. For these data, maximum likelihood with the clock assumption outperformed each of the other methods tested. The distance-based methods performed marginally better than did maximum parsimony and maximum likelihood without the clock assumption. Data correction was generally detrimental to accuracy, especially for short sequence lengths. We identified another region of the parameter space where, although consistent for a given method, some incorrect trees were each selected with up to twice the frequency of the correct (generating) tree for sequences of bounded length. These incorrect trees are those where the outgroup has been incorrectly placed. In addition to this problem, the placement of the outgroup sequence can have a confounding effect on the ingroup tree, whereby the ingroup is correct when using the ingroup sequences alone, but with the inclusion of the outgroup the ingroup tree becomes incorrect.

11.
Intraspecific variation is abundant in all types of systematic characters but is rarely addressed in simulation studies of phylogenetic method performance. We compared the accuracy of 15 phylogenetic methods using simulations to (1) determine the most accurate method(s) for analyzing polymorphic data (under simplified conditions) and (2) test if generalizations about the performance of phylogenetic methods based on previous simulations of fixed (nonpolymorphic) characters are robust to a very different evolutionary model that explicitly includes intraspecific variation. Simulated data sets consisted of allele frequencies that evolved by genetic drift. The phylogenetic methods included eight parsimony coding methods, continuous maximum likelihood, and three distance methods (UPGMA, neighbor joining, and Fitch-Margoliash) applied to two genetic distance measures (Nei's and the modified Cavalli-Sforza and Edwards chord distance). Two sets of simulations were performed. The first examined the effects of different branch lengths, sample sizes (individuals sampled per species), numbers of characters, and numbers of alleles per locus in the eight-taxon case. The second examined more extensively the effects of branch length in the four-taxon, two-allele case. Overall, the most accurate methods were likelihood, the additive distance methods (neighbor joining and Fitch-Margoliash), and the frequency parsimony method. Despite the use of a very different evolutionary model in the present article, many of the results are similar to those from simulations of fixed characters. Similarities include the presence of the "Felsenstein zone," where methods often fail, which suggests that long-branch attraction may occur among closely related species through genetic drift. 
Differences between the results of fixed and polymorphic data simulations include the following: (1) UPGMA is as accurate as or more accurate than nonfrequency parsimony methods across nearly all combinations of branch lengths, and (2) likelihood and the additive distance methods are not positively misled under any combination of branch lengths tested (even when the assumptions of the methods are violated and few characters are sampled). We found that sample size is an important determinant of accuracy and affects the relative success of methods (i.e., distance and likelihood methods outperform parsimony at small sample sizes). Attempts to generalize about the behavior of phylogenetic methods should consider the extreme examples offered by fixed-mutation models of DNA sequence data and genetic-drift models of allele frequencies.

12.
Recent studies have observed that Bayesian analyses of sequence data sets using the program MrBayes sometimes generate extremely large branch lengths, with posterior credibility intervals for the tree length (sum of branch lengths) excluding the maximum likelihood estimates. Suggested explanations for this phenomenon include the existence of multiple local peaks in the posterior, lack of convergence of the chain in the tail of the posterior, mixing problems, and misspecified priors on branch lengths. Here, we analyze the behavior of Bayesian Markov chain Monte Carlo algorithms when the chain is in the tail of the posterior distribution and note that all these phenomena can occur. In Bayesian phylogenetics, the likelihood function approaches a constant instead of zero when the branch lengths increase to infinity. The flat tail of the likelihood can cause poor mixing and undue influence of the prior. We suggest that the main cause of the extreme branch length estimates produced in many Bayesian analyses is the poor choice of a default prior on branch lengths in current Bayesian phylogenetic programs. The default prior in MrBayes assigns independent and identical distributions to branch lengths, imposing strong (and unreasonable) assumptions about the tree length. The problem is exacerbated by the strong correlation between the branch lengths and parameters in models of variable rates among sites or among site partitions. To resolve the problem, we suggest two multivariate priors for the branch lengths (called compound Dirichlet priors) that are fairly diffuse and demonstrate their utility in the special case of branch length estimation on a star phylogeny. Our analysis highlights the need for careful thought in the specification of high-dimensional priors in Bayesian analyses.
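
The flat tail of the likelihood can be illustrated with the Jukes-Cantor model, which is chosen here only as an assumed example (the abstract does not name a model). For two sequences separated by a branch of length $t$ (expected substitutions per site), the likelihood at a single site showing nucleotides $i$ and $j$ is:

```latex
L(t) = \tfrac{1}{4}\, p_{ij}(t), \qquad
p_{ij}(t) =
\begin{cases}
\tfrac{1}{4} + \tfrac{3}{4}\, e^{-4t/3}, & i = j \\[4pt]
\tfrac{1}{4} - \tfrac{1}{4}\, e^{-4t/3}, & i \neq j
\end{cases}
\qquad\Longrightarrow\qquad
\lim_{t \to \infty} L(t) = \tfrac{1}{16} > 0
```

Because the likelihood levels off at a positive constant rather than decaying to zero, the tail of the posterior in this simple case is shaped largely by the prior on branch lengths.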

13.
The nucleotide substitution matrix inferred from avian data sets using cytochrome b differs considerably from the models commonly used in phylogenetic analyses. To analyze the possible effects of this particular pattern of change in phylogeny estimation we performed a computer simulation in which we started with a real sequence and used the inferred model of change to produce a tree of 10 species. Maximum parsimony (MP), maximum likelihood (ML), and various distance methods were then used to recover the topology and the branch lengths. We used two kinds of data with varying levels of variation. In addition, we tested with the removal of third positions and different weighting schemes. At low levels of variation, MP was outstanding in recovering the topology (90% correct), while unweighted pair-group method, arithmetic average (UPGMA), regardless of distances used, was poor (40%). At the higher level, most methods had a chance of around 40%-58% of finding the true tree. However, in most cases, the trees found were only slightly wrong, with only one or a few branches misplaced. On the other hand, the use of a "wrong" model had serious effects on the estimation of branch lengths (distances). Although precision was high, accuracy was poor with most methods, giving branch lengths that were biased downward. When seeded with the true distance matrix, Fitch and NJ always found the true tree, while UPGMA frequently failed to do so. The effect of removing third positions was dramatic at low levels of variation, because only one MP program was able to find a true tree at all, albeit rarely, while none of the others ever did so. At higher levels, the situation was better, but still much worse than with the whole data set.

14.
Summary. Operator metrics are explicitly designed to measure evolutionary distances from nucleic acid sequences when substitution rates differ greatly among the organisms being compared, or when substitutions have been extensive. Unlike lengths calculated by the distance matrix and parsimony methods, in which substitutions in one branch of a tree can alter the measured length of another branch, lengths determined by operator metrics are not affected by substitutions outside the branch. In the method, lengths (operator metrics) corresponding to each of the branches of an unrooted tree are calculated. The metric length of a branch reconstructs the number of (transversion) differences between sequences at a tip and a node (or between nodes) of a tree. The theory is general and is fundamentally independent of differences in substitution rates among the organisms being compared. Mathematically, the independence has been obtained because the metrics are eigenvectors of fundamental equations which describe the evolution of all unrooted trees. Even under conditions when both the distance matrix method and a simple parsimony length method are shown to indicate lengths that are an order of magnitude too large or too small, the operator metrics are accurate. Examples, using data calculated with evolutionary rates and branchings designed to confuse the measurement of branch lengths and to camouflage the topology of the true tree, demonstrate the validity of operator metrics. The method is robust. Operator metric distances are easy to calculate, can be extended to any number of taxa, and provide a statistical estimate of their variances. The utility of the method is demonstrated by using it to analyze the origins and evolution of chloroplasts, mitochondria, and eubacteria.

15.
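Tian Peng (田鹏), Liu Zhanlin (刘占林). 《生物信息学》 2009, 7(3): 232-233
Building on existing distance methods for constructing phylogenetic trees, and drawing on parts of the theory behind the NJ and FM methods, we propose a new, simple method based on node introduction. Molecular phylogenetic trees were constructed with this method. The results show that the method is faster and that its results are fully consistent with those of the FM method.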

16.
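To explore the effect of evolutionary models on DNA barcode classification, this study obtained COI gene sequences from specimens of 44 species of Noctuidae collected at Wuling Mountain (雾灵山). Phylogenetic trees were constructed using neighbor joining, maximum parsimony, maximum likelihood, and Bayesian inference, and the success rates of 12 models for neighbor joining, 7 models for maximum likelihood, and 2 models for Bayesian inference were evaluated. The success rates of the 12 neighbor-joining models differed little and were relatively stable. The success rates of the different models under maximum likelihood and Bayesian inference differed markedly and were unstable. Maximum parsimony is not model-based, and its success rate was relatively stable. Neighbor joining and maximum likelihood have 6 models in common, and the success rates of these 6 models differed between the two methods. In addition, the molecular data included species represented by only a single sequence, which significantly reduced model success rates, indicating that DNA barcoding studies need multiple samples per species.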

17.
A stepwise algorithm for finding minimum evolution trees
A stepwise algorithm for reconstructing minimum evolution (ME) trees from evolutionary distance data is proposed. In each step, a taxon that potentially has a neighbor (another taxon connected to it with a single interior node) is first chosen and then its true neighbor is searched for iteratively. For m taxa, at most (m-1)!/2 trees are examined and the tree with the minimum sum of branch lengths (S) is chosen as the final tree. This algorithm provides simple strategies for restricting the tree space searched and allows us to implement efficient ways of dynamically computing the ordinary least squares estimates of S for the topologies examined. Using computer simulation, we found that the efficiency of the ME method in recovering the correct tree is similar to that of the neighbor-joining method (Saitou and Nei 1987). A more exhaustive search is unlikely to improve the efficiency of the ME method in finding the correct tree because the correct tree is almost always included in the tree space searched with this stepwise algorithm. The new algorithm finds trees for which S values may not be significantly different from that of the ME tree if the correct tree contains very small interior branches or if the pairwise distance estimates have large sampling errors. These topologies form a set of plausible alternatives to the ME tree and can be compared with each other using statistical tests based on the minimum evolution principle. The new algorithm makes it possible to use the ME method for large data sets.
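
To give a sense of scale for the (m-1)!/2 bound quoted above, the sketch below compares it with the total number of unrooted bifurcating topologies for m taxa, which is the standard count (2m-5)!!. The function names are hypothetical, and the code only does the arithmetic; it does not implement the stepwise search itself.

```python
import math


def stepwise_me_upper_bound(m: int) -> int:
    """Upper bound (m-1)!/2 on trees examined, as stated in the abstract."""
    if m < 3:
        raise ValueError("need at least 3 taxa")
    return math.factorial(m - 1) // 2


def num_unrooted_trees(m: int) -> int:
    """Number of unrooted bifurcating topologies, (2m-5)!! = 3 * 5 * ... * (2m-5)."""
    if m < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * m - 4, 2):  # odd factors 3, 5, ..., 2m-5
        count *= k
    return count


if __name__ == "__main__":
    for m in (5, 10, 15):
        print(m, stepwise_me_upper_bound(m), num_unrooted_trees(m))
```

For 10 taxa the bound is 181,440 trees against 2,027,025 possible topologies. The gap widens quickly as m grows, although the bound itself still grows factorially.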

18.
Until recently, phylogenetic analyses have been routinely based on homologous sequences of a single gene. Given the vast number of gene sequences now available, phylogenetic studies are now based on the analysis of multiple genes. Thus, it has become necessary to devise statistical methods to combine multiple molecular data sets. Here, we compare several models for combining different genes for the purpose of evaluating the likelihood of tree topologies. Three methods of branch length estimation were studied: assuming all genes have the same branch lengths (concatenate model), assuming that branch lengths are proportional among genes (proportional model), or assuming that each gene has a separate set of branch lengths (separate model). We also compared three models of among-site rate variation: the homogeneous model, a model that assumes one gamma parameter for all genes, and a model that assumes one gamma parameter for each gene. On the basis of two nuclear and one mitochondrial amino acid data sets, our results suggest that, depending on the data set chosen, either the separate model or the proportional model represents the most appropriate method for branch length analysis. For all the data sets examined, one gamma parameter for each gene represents the best model for among-site rate variation. Using these models we analyzed alternative mammalian tree topologies, and we describe the effect of the assumed model on the maximum likelihood tree. We show that the choice of the model has an impact on the best phylogeny obtained.

19.
We modified the phylogenetic program MrBayes 3.1.2 to incorporate the compound Dirichlet priors for branch lengths proposed recently by Rannala, Zhu, and Yang (2012. Tail paradox, partial identifiability and influential priors in Bayesian branch length inference. Mol. Biol. Evol. 29:325-335.) as a solution to the problem of branch-length overestimation in Bayesian phylogenetic inference. The compound Dirichlet prior specifies a fairly diffuse prior on the tree length (the sum of branch lengths) and uses a Dirichlet distribution to partition the tree length into branch lengths. Six problematic data sets originally analyzed by Brown, Hedtke, Lemmon, and Lemmon (2010. When trees grow too long: investigating the causes of highly inaccurate Bayesian branch-length estimates. Syst. Biol. 59:145-161) are reanalyzed using the modified version of MrBayes to investigate properties of Bayesian branch-length estimation using the new priors. While the default exponential priors for branch lengths produced extremely long trees, the compound Dirichlet priors produced posterior estimates that are much closer to the maximum likelihood estimates. Furthermore, the posterior tree lengths were quite robust to changes in the parameter values in the compound Dirichlet priors, for example, when the prior mean of tree length changed over several orders of magnitude. Our results suggest that the compound Dirichlet priors may be useful for correcting branch-length overestimation in phylogenetic analyses of empirical data sets.
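
The two-stage structure described above (a prior on the tree length, then a Dirichlet split into branch lengths) can be sketched by drawing one sample from such a prior. This is an illustrative sketch only, not the MrBayes implementation: the gamma distribution for the tree length, the symmetric Dirichlet, and all parameter names and default values are assumptions made for the example.

```python
import random


def sample_compound_dirichlet(n_branches, tl_shape=1.0, tl_mean=1.0,
                              concentration=1.0, rng=None):
    """Draw (tree_length, branch_lengths) from a compound Dirichlet-style prior.

    Assumptions (not from the abstract): the tree length is gamma
    distributed, and the Dirichlet is symmetric with a single
    concentration parameter.
    """
    rng = rng or random.Random()

    # Stage 1: tree length T ~ Gamma(shape, scale), with mean = shape * scale.
    tree_length = rng.gammavariate(tl_shape, tl_mean / tl_shape)

    # Stage 2: proportions ~ Dirichlet(concentration, ...), generated by
    # normalising independent gamma draws.
    weights = [rng.gammavariate(concentration, 1.0) for _ in range(n_branches)]
    total = sum(weights)

    # Each branch receives its share of the tree length.
    branch_lengths = [tree_length * w / total for w in weights]
    return tree_length, branch_lengths


if __name__ == "__main__":
    tl, branches = sample_compound_dirichlet(7, rng=random.Random(1))
    print(round(tl, 4), [round(b, 4) for b in branches])
```

Because the branch lengths are shares of a single tree length, their sum is controlled directly by the tree-length prior, rather than growing with the number of branches as it does under independent priors on each branch.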

20.
We have developed a rapid parsimony method for reconstructing ancestral nucleotide states that allows calculation of initial branch lengths that are good approximations to optimal maximum-likelihood estimates under several commonly used substitution models. Use of these approximate branch lengths (rather than fixed arbitrary values) as starting points significantly reduces the time required for iteration to a solution that maximizes the likelihood of a tree. These branch lengths are close enough to the optimal values that they can be used without further iteration to calculate approximate maximum-likelihood scores that are very close to the "exact" scores found by iteration. Several strategies are described for using these approximate scores to substantially reduce times needed for maximum-likelihood tree searches.
