Similar Literature
 20 matching records found.
1.
We have developed a new method for reconstructing phylogenetic trees called random local neighbor-joining (RLNJ). Our method differs from the neighbor-joining (NJ) method of Saitou and Nei and affords a more thorough sampling of the solution space by randomly searching for a local pair of neighbors at each step. Results from applying the RLNJ method to yeast data show an increasing probability of obtaining a smaller S value (sum of branch lengths) than the NJ method as data sets with more taxa are analyzed, and repeated runs of the RLNJ method usually generate more than one topology with small S values. Computer simulation shows that the RLNJ method can significantly improve the chance of recovering the correct topology by supplying more than one candidate topology. In addition, the simulations show that, when the RLNJ method is used, the proportion of correct topologies (P(C)) increases as the number of distinct topologies decreases and as the proportion of the "most frequent topology" increases. Thus, the number of distinct topologies and the proportion of the "most frequent topology" can be used as auxiliary criteria to evaluate the reliability of a phylogenetic tree.
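The step that separates RLNJ from standard NJ, as described above, is that the pair to be joined is chosen at random among locally optimal candidates instead of always taking the single globally best pair. A minimal Python sketch of this idea follows; the interpretation of a "local pair of neighbors" as a pair whose Q value is minimal among all pairs involving either member, and all function names, are our own illustrative assumptions rather than the published algorithm.

```python
import math
import random

def rlnj(labels, dist, rng=random.Random(0)):
    """One RLNJ-style reconstruction: neighbor joining that, at each step,
    joins a randomly chosen locally optimal pair instead of the single
    globally best pair.  Returns a nested-tuple tree and the sum of branch
    lengths S.  (Illustrative reading of the abstract, not the authors' code.)"""
    nodes = list(labels)
    D = {frozenset((a, b)): dist[a][b] for a in labels for b in labels if a != b}
    d = lambda a, b: D[frozenset((a, b))]
    subtree = {x: x for x in nodes}
    S = 0.0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d(a, b) for b in nodes if b != a) for a in nodes}
        Q = {(a, b): (n - 2) * d(a, b) - r[a] - r[b]
             for i, a in enumerate(nodes) for b in nodes[i + 1:]}
        def is_local(pair):
            # "Local" here: no better pair exists among pairs sharing a member.
            a, b = pair
            best = min(q for (x, y), q in Q.items() if a in (x, y) or b in (x, y))
            return math.isclose(Q[pair], best, abs_tol=1e-12)
        candidates = [p for p in Q if is_local(p)]
        a, b = rng.choice(candidates)                        # the random local choice
        va = 0.5 * d(a, b) + (r[a] - r[b]) / (2 * (n - 2))   # standard NJ branch lengths
        vb = d(a, b) - va
        S += va + vb
        u = (a, b)                                           # new internal node
        subtree[u] = (subtree[a], subtree[b])
        for c in nodes:
            if c not in (a, b):
                D[frozenset((u, c))] = 0.5 * (d(a, c) + d(b, c) - d(a, b))
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    S += d(a, b)
    return (subtree[a], subtree[b]), S

# Tiny example: additive distances on a 4-taxon tree ((A,B),(C,D)); S should be 9.
dm = {"A": {"B": 3, "C": 6, "D": 7}, "B": {"A": 3, "C": 5, "D": 6},
      "C": {"A": 6, "B": 5, "D": 3}, "D": {"A": 7, "B": 6, "C": 3}}
tree, s = rlnj(["A", "B", "C", "D"], dm)
```

Running the function several times with different seeds and keeping the topologies with the smallest S mirrors the paper's use of repeated runs to sample more than one candidate topology.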

2.
We introduce a new approach to estimating the evolutionary distance between two sequences. This approach uses a tree with three leaves: two of them correspond to the studied sequences, whereas the third is chosen to handle long-distance estimation. The branch lengths of this tree are obtained by likelihood maximization and are then used to deduce the desired distance. This approach, called TripleML, improves the precision of evolutionary distance estimates and thus the topological accuracy of distance-based methods. TripleML can be used with neighbor-joining-like (NJ-like) methods not only to compute the initial distance matrix but also to estimate the new distances encountered during the agglomeration process. Computer simulations indicate that using TripleML significantly improves the topological accuracy of NJ, BioNJ, and Weighbor while keeping computation time reasonable. With randomly generated 24-taxon trees and realistic parameter values, combining NJ with TripleML reduces the number of wrongly inferred branches by about 11% (against 2.6% and 5.5% for BioNJ and Weighbor, respectively). Moreover, this combination requires only about 1.5 min to infer a phylogeny of 96 sequences of 1,200 nucleotides, as compared with 6.5 h for fastDNAml on the same machine (466 MHz PC).
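As we read the abstract, the core of the triplet-based estimate can be written compactly (the notation below is ours and sketches the idea rather than the exact TripleML procedure): for the two studied sequences a and b and the third helper sequence c, the three branch lengths of the three-leaf tree are fitted by maximum likelihood, and the distance is read off the fitted tree.

```latex
% Sketch of the triplet idea (our notation): t_a, t_b, t_c are the branch
% lengths of the three-leaf tree joining sequences a, b and the helper c.
\[
(\hat t_a, \hat t_b, \hat t_c) = \arg\max_{t_a, t_b, t_c} L(a, b, c \mid t_a, t_b, t_c),
\qquad
\hat d(a, b) = \hat t_a + \hat t_b .
\]
```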

3.
Choice of a substitution model is a crucial step in the maximum likelihood (ML) method of phylogenetic inference, and investigators tend to prefer complex mathematical models to simple ones. However, when complex models with many parameters are used, the extent of noise in statistical inference increases, so complex models may not recover the true topology with a higher probability than simple ones. This problem was studied using computer simulation. When the number of nucleotides used was relatively large (1000 bp), the HKY+Gamma model showed a smaller d_T (topological distance between the inferred and the true trees) than the JC and Kimura models. For shorter sequences (300 bp), simpler models and search algorithms, such as the JC model and the SA+NNI search, were found to be as efficient as more complicated models and searches in terms of topological distance, although the topologies obtained under the HKY+Gamma model had the highest likelihood values. The performance of the relatively simple SA+NNI search algorithm was essentially the same as that of the more extensive SA+TBR search under all models studied. In line with the conclusions reached by Takahashi and Nei [Mol. Biol. Evol. 17 (2000) 1251], our results indicate that simple models can be as efficient as complex models and that use of complex models does not necessarily give more reliable trees than simple models.
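The topological distance d_T referred to above is, as far as we can tell, the usual symmetric-difference (Robinson-Foulds) distance; for reference:

```latex
% Robinson-Foulds / symmetric-difference distance (standard definition),
% where Sigma(T) denotes the set of nontrivial bipartitions induced by tree T.
\[
d_T(T_1, T_2) = \bigl|\Sigma(T_1) \setminus \Sigma(T_2)\bigr|
              + \bigl|\Sigma(T_2) \setminus \Sigma(T_1)\bigr| .
\]
```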

4.
Among the criteria for evaluating the performance of a phylogenetic method, robustness to model violation is of particular practical importance, as complete a priori knowledge of evolutionary processes is typically unavailable. For studies of robustness in phylogenetic inference, a utility to add well-defined model violations to simulated data would be helpful. We therefore introduce ImOSM, a tool to imbed intermittent evolution as model violation into an alignment. Intermittent evolution refers to extra substitutions occurring randomly on branches of a tree, thus changing alignment site patterns. This means that the extra substitutions are placed on the tree after the typical process of sequence evolution is completed. We then study the robustness to various scenarios of model violation of widely used phylogenetic methods: maximum likelihood (ML), maximum parsimony (MP), and a distance-based method (BIONJ). Violation of rates-across-sites (RaS) heterogeneity and simultaneous violation of RaS and the transition/transversion ratio on two nonadjacent external branches hinder all the methods' recovery of the true topology for a four-taxon tree. For an eight-taxon balanced tree, the violations cause each of the three methods to infer a different topology. Both ML and MP fail, whereas BIONJ, which calculates the distances based on the ML-estimated parameters, reconstructs the true tree. Finally, we report that a test of model homogeneity and goodness-of-fit tests have enough power to detect such model violations. The outcome of the tests can help to gain confidence in the inferred trees. Therefore, we recommend using these tests in practical phylogenetic analyses.
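A minimal sketch of what placing extra substitutions on a branch after simulation amounts to, as we understand the abstract: pick sites at random and overwrite the state of every taxon below the chosen branch with a new state, so that the affected site patterns change. The function below is a deliberately simplified illustration (the names, the uniform choice of sites and states, and the neglect of within-clade change after the extra substitution are our assumptions), not ImOSM itself.

```python
import random

def add_intermittent_evolution(alignment, clade, n_extra, rng=random.Random(1)):
    """Perturb `alignment` (dict: taxon -> sequence string) by placing
    `n_extra` extra substitutions on the branch subtending `clade`: at each
    picked site, all taxa of the clade inherit one new random state.
    Simplified reading of the intermittent-evolution idea."""
    length = len(next(iter(alignment.values())))
    sites = rng.sample(range(length), n_extra)
    out = {taxon: list(seq) for taxon, seq in alignment.items()}
    for site in sites:
        new_state = rng.choice("ACGT")   # state produced by the extra substitution
        for taxon in clade:
            out[taxon][site] = new_state
    return {taxon: "".join(seq) for taxon, seq in out.items()}

# Example: add two extra substitutions on the external branch leading to taxon D.
aln = {"A": "ACGTACGTAC", "B": "ACGTACGTAC", "C": "ACGAACGTAC", "D": "ACGAACGTTC"}
perturbed = add_intermittent_evolution(aln, clade=["D"], n_extra=2)
```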

5.
Liu K, Linder CR, Warnow T. PLoS ONE 2011, 6(11): e27731
Statistical methods for phylogeny estimation, especially maximum likelihood (ML), offer high accuracy with excellent theoretical properties. However, RAxML, the current leading method for large-scale ML estimation, can require weeks or longer when used on datasets with thousands of molecular sequences. Faster methods for ML estimation, among them FastTree, have also been developed, but their performance relative to RAxML is not yet fully understood. In this study, we explore the performance of FastTree and RAxML, with respect to ML score, running time, and topological accuracy, on thousands of alignments (based on both simulated and biological nucleotide datasets) with up to 27,634 sequences. We find that when RAxML and FastTree are constrained to the same running time, FastTree produces topologically much more accurate trees in almost all cases. We also find that when RAxML is allowed to run to completion, it provides an advantage over FastTree in terms of the ML score but does not produce substantially more accurate tree topologies. Interestingly, the relative accuracy of trees computed using FastTree and RAxML depends in part on the accuracy of the sequence alignment and on dataset size, so that FastTree can be more accurate than RAxML on large datasets with relatively inaccurate alignments. Finally, the running times of RAxML and FastTree are dramatically different: when run to completion, RAxML can take several orders of magnitude longer than FastTree. Thus, our study shows that very large phylogenies can be estimated very quickly using FastTree, with little (and in some cases no) degradation in tree accuracy compared with RAxML.

6.
MOTIVATION: Maximum likelihood (ML) methods have become very popular for constructing phylogenetic trees from sequence data. However, despite noticeable recent progress, with large and difficult datasets (e.g. multiple genes with conflicting signals) current ML programs still require huge computing time and can become trapped in bad local optima of the likelihood function. When this occurs, the resulting trees may still show some of the defects (e.g. long-branch attraction) of the starting trees obtained using fast distance or parsimony programs. METHODS: Subtree pruning and regrafting (SPR) topological rearrangements are usually sufficient to search the tree space intensively. Here, we propose two new methods to make SPR moves more efficient. The first method uses a fast distance-based approach to detect the least promising candidate SPR moves, which are then simply discarded. The second method locally estimates the change in likelihood for each remaining potential SPR, as opposed to globally evaluating the entire tree for each possible move. These two methods are implemented in a new algorithm with a sophisticated filtering strategy, which efficiently selects potential SPRs and concentrates most of the likelihood computation on the promising moves. RESULTS: Experiments with real datasets comprising 35-250 taxa show that, while greatly reducing the amount of computation, our approach provides likelihood values at least as good as those of the best ML methods known so far and is very robust to poor starting trees. Furthermore, by combining our new SPR algorithm with local moves such as PHYML's nearest-neighbor interchanges, the time needed to find good solutions can sometimes be reduced even more.

7.
We address phylogenetic reconstruction when the data are generated from a mixture distribution. Such topics have gained considerable attention in the biological community with the clear evidence of heterogeneity of mutation rates. In our work we consider data coming from a mixture of trees that share a common topology but differ in their edge weights (i.e., branch lengths). We first show the pitfalls of popular methods, including maximum likelihood and Markov chain Monte Carlo algorithms. We then determine in which evolutionary models reconstructing the tree topology under a mixture distribution is (im)possible. We prove that every model whose transition matrices can be parameterized by an open set of multilinear polynomials either has non-identifiable mixture distributions, in which case reconstruction is impossible in general, or admits linear tests that identify the topology. This duality theorem relies on our notion of linear tests and uses ideas from convex programming duality. Linear tests are closely related to linear invariants, which were first introduced by Lake and are natural from an algebraic geometry perspective.

8.
Two approximate methods are proposed for maximum likelihood phylogenetic estimation that allow variable rates of substitution across nucleotide sites. Three data sets with quite different characteristics were analyzed to examine empirically the performance of these methods. The first, called the discrete gamma model, uses several categories of rates to approximate the gamma distribution, with equal probability for each category. The mean of each category is used to represent all the rates falling in that category. The performance of this method is found to be quite good, and four such categories appear to be sufficient to produce both an optimum, or near-optimum, fit of the model to the data and an acceptable approximation to the continuous distribution. The second method, called the fixed-rates model, classifies sites into several classes according to their rates predicted under the star tree. Sites in different classes are then assumed to be evolving at these fixed rates when other tree topologies are evaluated. Analyses of the data sets suggest that this method can produce reasonable results, but it seems to share some properties of a least-squares pairwise comparison; for example, interior branch lengths in non-best trees are often found to be zero. The computational requirements of the two methods are comparable to those of Felsenstein's (1981, J Mol Evol 17:368–376) model, which assumes a single rate for all sites.
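A short sketch of the first (discrete gamma) method: cut a mean-one gamma distribution into k equal-probability categories and represent each category by its conditional mean. The scipy-based computation below is one standard way to obtain those means; it is our choice of implementation, not the authors' code.

```python
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    """Rates for k equal-probability categories of a mean-1 gamma(alpha)
    distribution, each category represented by its conditional mean."""
    shape, scale = alpha, 1.0 / alpha                    # mean = shape * scale = 1
    # Category boundaries at the i/k quantiles of the gamma distribution.
    bounds = [gamma.ppf(i / k, shape, scale=scale) for i in range(k + 1)]
    bounds[-1] = float("inf")
    # The integral of x*f(x) over a slice equals mean * [F_{shape+1}(b) - F_{shape+1}(a)];
    # dividing by the slice probability 1/k gives the conditional mean of the slice.
    return [k * (gamma.cdf(bounds[i + 1], shape + 1, scale=scale)
                 - gamma.cdf(bounds[i], shape + 1, scale=scale))
            for i in range(k)]

print(discrete_gamma_rates(alpha=0.5, k=4))              # four rates averaging ~1
```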

9.
Clearcut: a fast implementation of relaxed neighbor joining
SUMMARY: Clearcut is an open-source implementation of the relaxed neighbor joining (RNJ) algorithm. While traditional neighbor joining (NJ) remains a popular method for distance-based phylogenetic tree reconstruction, it suffers from an O(N^3) time complexity, where N represents the number of taxa in the input. Due to this steep asymptotic time complexity, NJ cannot reasonably handle very large datasets. In contrast, RNJ realizes a typical-case time complexity on the order of N^2 log N without any significant qualitative difference in output. RNJ is particularly useful when inferring a very large tree or a large number of trees. In addition, RNJ retains the desirable property that it will always reconstruct the true tree given a matrix of additive pairwise distances. Clearcut implements RNJ as a C program, which takes either a set of aligned sequences or a pre-computed distance matrix as input and produces a phylogenetic tree. Alternatively, Clearcut can reconstruct phylogenies using an extremely fast standard NJ implementation. AVAILABILITY: Clearcut source code is available for download at: http://bioinformatics.hungry.com/clearcut

10.
We have developed a new tool, called fastDNAml, for constructing phylogenetic trees from DNA sequences. The program can be run on a wide variety of computers ranging from Unix workstations to massively parallel systems, and is available from the Ribosomal Database Project (RDP) by anonymous FTP. Our program uses a maximum likelihood approach and is based on version 3.3 of Felsenstein's dnaml program. Several enhancements, including algorithmic changes, significantly improve performance and reduce memory usage, making it feasible to construct even very large trees. Trees containing 40–100 taxa have been easily generated, and phylogenetic estimates are possible even when hundreds of sequences exist. We are currently using the tool to construct a phylogenetic tree based on 473 small subunit rRNA sequences from prokaryotes.

11.
The superiority of cladistic methods to both synthetic and phenetic methods is briefly advanced and reviewed. Cladistics creates testable hypotheses of phylogeny that also give a highly informative summary of the available data. Thus it best fits the criteria for a method for determining the general reference classification in biology. For protistologists in particular, cladistics is especially useful. Inundated by an abundance of ultrastructural, biochemical, and cell biological information, protistologists could be greatly helped by the informative way in which cladistics orders and summarizes data. In addition to classifying protist taxa, hypotheses about the evolution of cell organelles and cellular features could be scientifically formulated and tested by cladistics. Because cladistic classifications best summarize the data, they would also be best for making predictions about taxa and characters. They would, for the same reason, be the most stable. Widespread adoption of cladistic methods would serve to stabilize the now fluid state of protist taxonomy. It is for all of these reasons that such methods best suit the needs of the evolutionary protistologist.

12.
In phylogenetic analyses with combined multigene or multiprotein data sets, accounting for differing evolutionary dynamics at different loci is essential for accurate tree prediction. Existing maximum likelihood (ML) and Bayesian approaches are computationally intensive. We present an alternative approach that is orders of magnitude faster. The method, Distance Rates (DistR), estimates rates based upon distances derived from gene/protein sequence data. Simulation studies indicate that this technique is accurate compared with other methods and robust to missing sequence data. The DistR method was applied to a fungal mitochondrial data set, and the rate estimates compared well to those obtained using existing ML and Bayesian approaches. Inclusion of the protein rates estimated from the DistR method into the ML calculation of trees as a branch length multiplier resulted in a significantly improved fit as measured by the Akaike Information Criterion (AIC). Furthermore, bootstrap support for the ML topology was significantly greater when protein rates were used, and some evident errors in the concatenated ML tree topology (i.e., without protein rates) were corrected. [Bayesian credible intervals; DistR method; multigene phylogeny; PHYML; rate heterogeneity.]
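The "branch length multiplier" use of the estimated rates can be written out as follows (our notation for what the abstract describes): each gene or protein g receives a relative rate r_g, and the branch lengths of the shared topology are rescaled by r_g when that gene's likelihood is evaluated.

```latex
% Gene/protein-specific rates as branch-length multipliers (our notation):
% T is the shared topology, b the vector of branch lengths, D_g the data for
% gene g, and G the number of genes.
\[
\ln L(T, \mathbf{b}, \mathbf{r}) = \sum_{g} \ln L_g\bigl(D_g \mid T, \, r_g \mathbf{b}\bigr),
\qquad \text{with a normalization such as } \frac{1}{G}\sum_{g} r_g = 1 .
\]
```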

13.
Summary: Methods of classical segregation analysis were applied to a sample of 129 sibships with one or more individuals affected by neurofibromatosis-1 (NF-1). The sample consists only of subjects with NF-1; all the probands had been referred for genetic counselling because of café-au-lait spots, and a diagnostic protocol was invariably applied. No deviation from the segregation ratio expected for a fully penetrant Mendelian dominant gene was observed. A maximum likelihood estimate of the proportion of sporadic cases was obtained, and the mutation rate was estimated to be 6.5×10^-5 per gamete per generation (95% CI 5.0–8.1×10^-5).

14.
Stewart WC, Thompson EA. Biometrics 2006, 62(3): 728-734
As a result of previous large, multipoint linkage studies there is a substantial amount of existing marker data. Due to the increased sample size, genetic maps estimated from these data could be more accurate than publicly available maps. However, current methods for map estimation are restricted to data sets containing pedigrees with a small number of individuals, or cannot make full use of marker data that are observed at several loci on members of large, extended pedigrees. In this article, a maximum likelihood (ML) method for map estimation that can make full use of the marker data in a large, multipoint linkage study is described. The method is applied to replicate sets of simulated marker data involving seven linked loci, and pedigree structures based on the real multipoint linkage study of Abkevich et al. (2003, American Journal of Human Genetics 73, 1271-1281). The variance of the ML estimate is accurately estimated, and tests of both simple and composite null hypotheses are performed. An efficient procedure for combining map estimates over data sets is also suggested.

15.
Even when the maximum likelihood (ML) tree is a better estimate of the true phylogenetic tree than those produced by other methods, the result of a poor ML search may be no better than that of a more thorough search under some faster criterion. The ability to find the globally optimal ML tree is therefore important. Here, I compare a range of heuristic search strategies (and their associated computer programs) in terms of their success at locating the ML tree for 20 empirical data sets with 14 to 158 sequences and 411 to 120,762 aligned nucleotides. Three distinct topics are discussed: the success of the search strategies in relation to certain features of the data, the generation of starting trees for the search, and the exploration of multiple islands of trees. As a starting tree, there was little difference among the neighbor-joining tree based on absolute differences (including the BioNJ tree), the stepwise-addition parsimony tree (with or without nearest-neighbor-interchange (NNI) branch swapping), and the stepwise-addition ML tree. The latter produced the best ML score on average but was orders of magnitude slower than the alternatives. The BioNJ tree was second best on average. As search strategies, star decomposition and quartet puzzling were the slowest and produced the worst ML scores. The DPRml, IQPNNI, MultiPhyl, PhyML, PhyNav, and TreeFinder programs with default options produced qualitatively similar results, each locating a single tree that tended to be in an NNI suboptimum (rather than the global optimum) when the data set had low phylogenetic information. For such data sets, there were multiple tree islands with very similar ML scores. The likelihood surface only became relatively simple for data sets that contained approximately 500 aligned nucleotides for 50 sequences and 3,000 nucleotides for 100 sequences. The RAxML and GARLI programs allowed multiple islands to be explored easily, but both programs also tended to find NNI suboptima. A newly developed version of the likelihood ratchet using PAUP* successfully found the peaks of multiple islands, but its speed needs to be improved.

16.

Background  

Non-parametric bootstrapping is a widely used statistical procedure for assessing the confidence of model parameters based on the empirical distribution of the observed data [1] and, as such, it has become a common method for assessing tree confidence in phylogenetics [2]. Traditional non-parametric bootstrapping does not weight the trees inferred from resampled (i.e., pseudo-replicated) sequences. Hence, the quality of these trees is not taken into account when computing the bootstrap scores associated with the clades of the original phylogeny. As a consequence, trees with different bootstrap support, or trees providing a different fit to the corresponding pseudo-replicated sequences (the fit quality can be expressed through the LS, ML or parsimony score), traditionally contribute in the same way to the computation of the bootstrap support of the original phylogeny.
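For reference, the resampling step shared by the traditional procedure and the weighted variant this background motivates: alignment columns are drawn with replacement to build each pseudo-replicate, and a tree is inferred from each one. The helper below only performs the resampling; tree building and, in a weighted scheme, the scoring of each replicate tree are left abstract since they depend on the chosen criterion (LS, ML or parsimony).

```python
import random

def bootstrap_replicates(alignment, n_replicates, rng=random.Random(0)):
    """Yield pseudo-replicate alignments obtained by sampling alignment
    columns with replacement (standard non-parametric bootstrap)."""
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    for _ in range(n_replicates):
        cols = [rng.randrange(length) for _ in range(length)]
        yield {t: "".join(alignment[t][c] for c in cols) for t in taxa}

aln = {"A": "ACGTACGT", "B": "ACGAACGT", "C": "ACGAACTT"}
for rep in bootstrap_replicates(aln, n_replicates=3):
    pass  # infer a tree from `rep`; in a weighted scheme, also record its fit score
```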

17.
18.
Advocates of cladistic parsimony methods have invoked the philosophy of Karl Popper in an attempt to argue for the superiority of those methods over phylogenetic methods based on Ronald Fisher's statistical principle of likelihood. We argue that the concept of likelihood in general, and its application to problems of phylogenetic inference in particular, are highly compatible with Popper's philosophy. Examination of Popper's writings reveals that his concept of corroboration is, in fact, based on likelihood. Moreover, because probabilistic assumptions are necessary for calculating the probabilities that define Popper's corroboration, likelihood methods of phylogenetic inference--with their explicit probabilistic basis--are easily reconciled with his concept. In contrast, cladistic parsimony methods, at least as described by certain advocates of those methods, are less easily reconciled with Popper's concept of corroboration. If those methods are interpreted as lacking probabilistic assumptions, then they are incompatible with corroboration. Conversely, if parsimony methods are to be considered compatible with corroboration, then they must be interpreted as carrying implicit probabilistic assumptions. Thus, the non-probabilistic interpretation of cladistic parsimony favored by some advocates of those methods is contradicted by an attempt by the same authors to justify parsimony methods in terms of Popper's concept of corroboration. In addition to being compatible with Popperian corroboration, the likelihood approach to phylogenetic inference permits researchers to test the assumptions of their analytical methods (models) in a way that is consistent with Popper's ideas about the provisional nature of background knowledge.

19.
In phylogenetic inference by maximum-parsimony (MP), minimum-evolution (ME), and maximum-likelihood (ML) methods, it is customary to conduct extensive heuristic searches for MP, ME, and ML trees, examining a large number of different topologies. However, these extensive searches tend to give incorrect tree topologies. Here we show by extensive computer simulation that when the number of nucleotide sequences (m) is large and the number of nucleotides used (n) is relatively small, simple MP or ML tree search algorithms such as the stepwise addition (SA) plus nearest-neighbor interchange (NNI) search and the SA plus subtree pruning-regrafting (SPR) search are as efficient as extensive search algorithms such as the SA plus tree bisection-reconnection (TBR) search in inferring the true tree. In the case of ME methods, the simple neighbor-joining (NJ) algorithm is as efficient as or more efficient than the extensive NJ+TBR search. We show that when ME methods are used, the simple p distance generally gives better results in phylogenetic inference than more complicated distance measures such as the Hasegawa-Kishino-Yano (HKY) distance, even when nucleotide substitution follows the HKY model. When ML methods are used, the simple Jukes-Cantor (JC) model of phylogenetic inference generally shows better performance than the HKY model, even if the likelihood value for the HKY model is much higher than that for the JC model. This indicates that, at least in the present case, selecting a substitution model by using the likelihood ratio test or the AIC index is not appropriate. When n is small relative to m and the extent of sequence divergence is high, the NJ method with p distance often shows better performance than ML methods with the JC model. However, when the level of sequence divergence is low, this is not the case.
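The two kinds of distance measures contrasted above are easy to state side by side; a small sketch follows (we use the JC correction for illustration, since the abstract also discusses the JC model, while the HKY distance involves additional parameters):

```python
import math

def p_distance(s1, s2):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def jc_distance(s1, s2):
    """Jukes-Cantor correction of the p distance: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(s1, s2)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

s1, s2 = "ACGTACGTACGTACGT", "ACGAACGTACGTTCGT"
print(p_distance(s1, s2), jc_distance(s1, s2))   # 0.125 and ~0.1367
```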

20.