Similar articles
 Found 20 similar articles (search time: 46 ms)
1.
In phylogenetic inference by maximum-parsimony (MP), minimum-evolution (ME), and maximum-likelihood (ML) methods, it is customary to conduct extensive heuristic searches of MP, ME, and ML trees, examining a large number of different topologies. However, these extensive searches tend to give incorrect tree topologies. Here we show by extensive computer simulation that when the number of nucleotide sequences (m) is large and the number of nucleotides used (n) is relatively small, the simple MP or ML tree search algorithms such as the stepwise addition (SA) plus nearest neighbor interchange (NNI) search and the SA plus subtree pruning regrafting (SPR) search are as efficient as the extensive search algorithms such as the SA plus tree bisection-reconnection (TBR) search in inferring the true tree. In the case of ME methods, the simple neighbor-joining (NJ) algorithm is as efficient as or more efficient than the extensive NJ+TBR search. We show that when ME methods are used, the simple p distance generally gives better results in phylogenetic inference than more complicated distance measures such as the Hasegawa-Kishino-Yano (HKY) distance, even when nucleotide substitution follows the HKY model. When ML methods are used, the simple Jukes-Cantor (JC) model of phylogenetic inference generally shows a better performance than the HKY model even if the likelihood value for the HKY model is much higher than that for the JC model. This indicates that, at least in the present case, selection of a substitution model by using the likelihood ratio test or the AIC index is not appropriate. When n is small relative to m and the extent of sequence divergence is high, the NJ method with p distance often shows a better performance than ML methods with the JC model. However, when the level of sequence divergence is low, this is not the case.
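
For readers unfamiliar with the two simplest distance measures named in this abstract, the following is a minimal Python sketch of the p distance (proportion of differing sites) and the Jukes-Cantor correction applied to it. The function names `p_distance` and `jc_distance` are illustrative assumptions, not taken from the paper, and the sketch assumes gap-free aligned sequences of equal length.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jc_distance(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4p/3) for a p distance p."""
    if p >= 0.75:
        # The correction is undefined once p reaches the saturation level 3/4.
        raise ValueError("p distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected distance always exceeds the raw p distance for p > 0 and grows without bound as p approaches 0.75, which is one plausible reason the uncorrected measure can have lower variance on short, divergent alignments.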

2.
RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Gamma yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets of ≥4000 taxa it also runs 2-3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25,057 (1463 bp) and 2182 (51,089 bp) taxa, respectively. AVAILABILITY: icwww.epfl.ch/~stamatak

3.
Even when the maximum likelihood (ML) tree is a better estimate of the true phylogenetic tree than those produced by other methods, the result of a poor ML search may be no better than that of a more thorough search under some faster criterion. The ability to find the globally optimal ML tree is therefore important. Here, I compare a range of heuristic search strategies (and their associated computer programs) in terms of their success at locating the ML tree for 20 empirical data sets with 14 to 158 sequences and 411 to 120,762 aligned nucleotides. Three distinct topics are discussed: the success of the search strategies in relation to certain features of the data, the generation of starting trees for the search, and the exploration of multiple islands of trees. As a starting tree, there was little difference among the neighbor-joining tree based on absolute differences (including the BioNJ tree), the stepwise-addition parsimony tree (with or without nearest-neighbor-interchange (NNI) branch swapping), and the stepwise-addition ML tree. The latter produced the best ML score on average but was orders of magnitude slower than the alternatives. The BioNJ tree was second best on average. As search strategies, star decomposition and quartet puzzling were the slowest and produced the worst ML scores. The DPRml, IQPNNI, MultiPhyl, PhyML, PhyNav, and TreeFinder programs with default options produced qualitatively similar results, each locating a single tree that tended to be in an NNI suboptimum (rather than the global optimum) when the data set had low phylogenetic information. For such data sets, there were multiple tree islands with very similar ML scores. The likelihood surface only became relatively simple for data sets that contained approximately 500 aligned nucleotides for 50 sequences and 3,000 nucleotides for 100 sequences. The RAxML and GARLI programs allowed multiple islands to be explored easily, but both programs also tended to find NNI suboptima. A newly developed version of the likelihood ratchet using PAUP* successfully found the peaks of multiple islands, but its speed needs to be improved.

4.
The gene-duplication problem is to infer a species supertree from a collection of gene trees that are confounded by complex histories of gene-duplication events. This problem is NP-complete and thus requires efficient and effective heuristics. Existing heuristics perform a stepwise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. A classical local search problem is the NNI search problem, which is based on the nearest neighbor interchange operation. In this work, we 1) provide a novel near-linear time algorithm for the NNI search problem, 2) introduce extensions that significantly enlarge the search space of the NNI search problem, and 3) present algorithms for these extended versions that are asymptotically just as efficient as our algorithm for the NNI search problem. The exceptional speedup achieved in the extended NNI search problems makes the gene-duplication problem more tractable for large-scale phylogenetic analyses. We verify the performance of our algorithms in a comparison study using sets of large randomly generated gene trees.

5.
The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch (1977). Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR, TBR, ...) cannot be applied directly to duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree Pruning and Regrafting) rearrangement to valid duplication trees allows the whole duplication tree space to be explored. We use these restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves on all existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than by any other program.

6.
Kück P, Mayer C, Wägele JW, Misof B. PLoS ONE 2012, 7(5): e36593
The aim of our study was to test the robustness and efficiency of maximum likelihood with respect to different long branch effects on multiple-taxon trees. We simulated data of different alignment lengths under two different 11-taxon trees and a broad range of different branch length conditions. The data were analyzed with the true model parameters as well as with estimated and incorrect assumptions about among-site rate variation. If length differences between connected branches strongly increase, tree inference with the correct likelihood model assumptions can fail. We found that incorporating invariant sites together with Γ distributed site rates in the tree reconstruction (Γ+I) increases the robustness of maximum likelihood in comparison with models using only Γ. The results show that for some topologies and branch lengths the reconstruction success of maximum likelihood under the correct model is still low for alignments with a length of 100,000 base positions. Altogether, the high confidence that is put in maximum likelihood trees is not always justified under certain tree shapes even if alignment lengths reach 100,000 base positions.

7.
We introduce a mechanism for analytically deriving upper bounds on the maximum likelihood for genetic sequence data on sets of phylogenies. A simple 'partition' bound is introduced for general models. Tighter bounds are developed for the simplest model of evolution, the two state symmetric model of nucleotide substitution under the molecular clock. This follows earlier theoretical work which has been restricted to this model by analytic complexity. A weakness of current numerical computation is that reported 'maximum likelihood' results cannot be guaranteed, either for a specified tree (because of the possibility of multiple maxima) or over the full tree space (as the computation is intractable for large sets of trees). The bounds we develop here can be used to conclusively eliminate large proportions of tree space in the search for the maximum likelihood tree. This is vital in the development of a branch and bound search strategy for identifying the maximum likelihood tree. We report the results from a simulation study of approximately 10^6 data sets generated on clock-like trees of five leaves. In each trial a likelihood value of one specific instance of a parameterised tree is compared to the bound determined for each of the 105 possible rooted binary trees. The proportion of trees that are eliminated from the search for the maximum likelihood tree ranged from 92% to almost 98%, indicating a computational speed-up factor of between 12 and 44.
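
The two figures quoted at the end of this abstract can be checked with a few lines of Python. The sketch below uses the standard count of rooted binary trees on n labelled leaves, (2n-3)!!, and assumes (as a simplification, not stated in the abstract) that the speed-up factor is simply the reciprocal of the fraction of trees left to evaluate. The function names are illustrative.

```python
def num_rooted_binary_trees(n_leaves):
    """Number of rooted binary trees on n labelled leaves: (2n-3)!!."""
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-3.
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count


def speedup_from_eliminated(fraction_eliminated):
    """Speed-up if only the non-eliminated trees must be evaluated (assumed model)."""
    if not 0.0 <= fraction_eliminated < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction_eliminated)
```

Under this assumed model, eliminating 92% of the trees gives a factor of about 12.5, and a factor of 44 corresponds to eliminating roughly 97.7%, consistent with "almost 98%".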

8.
A better understanding of disease progression is beneficial for early diagnosis and appropriate individual therapy. Many different approaches for statistical modelling of cumulative disease progression have been proposed in the literature, ranging from simple path models to complex restricted Bayesian networks. Important fields of application are diseases such as cancer and HIV. Tumour progression is measured by means of chromosome aberrations, whereas people infected with HIV develop drug resistances because of genetic changes in the virus. These two very different diseases have typical courses of disease progression, which can be modelled partly by consecutive and partly by independent steps. This paper gives an overview of the different progression models and points out their advantages and drawbacks. Different models are compared via simulations to analyse how they work if some of their assumptions are violated. In a simulation study, we evaluate how models perform in terms of fitting induced multivariate probability distributions and topological relationships. We often find that the true model class used for generating data is outperformed by either a less or a more complex model class. The more flexible conjunctive Bayesian networks can be used to fit oncogenetic trees, whereas mixtures of oncogenetic trees with three tree components can be well fitted by mixture models with only two tree components.

9.
5S rDNA clones from 12 South American diploid Hordeum species containing the HH genome and 3 Eurasian diploid Hordeum species containing the II genome, including the cultivated barley Hordeum vulgare, were sequenced and their sequence diversity was analyzed. The 374 sequenced clones were assigned to "unit classes", which were further assigned to haplomes. Each haplome contained 2 unit classes. The naming of the unit classes reflected the haplomes, viz. both the long H1 and short I1 unit classes were identified with II genome diploids, and both the long H2 and long Y2 unit classes were recognized in South American HH genome diploids. Based upon an alignment of all sequences or alignments of representative sequences, we tested several evolutionary models, and then subjected the parameters of the models to a series of maximum likelihood (ML) analyses and various tests, including the molecular clock, and to a Bayesian evolutionary inference analysis using Markov chain Monte Carlo (MCMC). The best fitting model of nucleotide substitution was HKY+G (the Hasegawa, Kishino, Yano 1985 model with Gamma-distributed rates of nucleotide substitution). Results from both ML and MCMC imply that the long H1 and short I1 unit classes found in the II genome diploids diverged from each other at the same rate as the long H2 and long Y2 unit classes found in the HH genome diploids. The divergence among the unit classes, estimated to be circa 7 million years, suggests that the genus Hordeum may be a paleopolyploid.

10.
Higher-level relationships within, and the root of, Placentalia remain contentious issues. Resolution of the placental tree is important to the choice of mammalian genome projects and model organisms, as well as for understanding the biogeography of the eutherian radiation. We present phylogenetic analyses of 63 species representing all extant eutherian mammal orders for a new molecular phylogenetic marker, a 1.3 kb portion of exon 26 of the apolipoprotein B (APOB) gene. In addition, we analyzed a multigene concatenation that included APOB sequences and a previously published data set (Murphy et al., 2001b) of three mitochondrial and 19 nuclear genes, resulting in an alignment of over 17 kb for 42 placentals and two marsupials. Due to computational difficulties, previous maximum likelihood analyses of large, multigene concatenations for placental mammals have used quartet puzzling, less complex models of sequence evolution, or phylogenetic constraints to approximate a full maximum likelihood bootstrap. Here, we utilize a Unix load sharing facility to perform maximum likelihood bootstrap analyses for both the APOB and concatenated data sets with a GTR+Gamma+I model of sequence evolution, tree-bisection and reconnection branch-swapping, and no phylogenetic constraints. Maximum likelihood and Bayesian analyses of both data sets provide support for the superordinal clades Boreoeutheria, Euarchontoglires, Laurasiatheria, Xenarthra, Afrotheria, and Ostentoria (pangolins+carnivores), as well as for the monophyly of the orders Eulipotyphla, Primates, and Rodentia, all of which have recently been questioned. Both data sets recovered an association of Hippopotamidae and Cetacea within Cetartiodactyla, as well as hedgehog and shrew within Eulipotyphla. APOB showed strong support for an association of tarsier and Anthropoidea within Primates. Parsimony, maximum likelihood and Bayesian analyses with both data sets placed Afrotheria at the base of the placental radiation. Statistical tests that employed APOB to examine a priori hypotheses for the root of the placental tree rejected rooting on myomorphs and hedgehog, but did not discriminate between rooting at the base of Afrotheria, at the base of Xenarthra, or between Atlantogenata (Xenarthra+Afrotheria) and Boreoeutheria. An orthologous deletion of 363 bp in the aligned APOB sequences proved phylogenetically informative for the grouping of the order Carnivora with the order Pholidota into the superordinal clade Ostentoria. A smaller deletion of 237-246 bp was diagnostic of the superordinal clade Afrotheria.

11.
12.
Different types of random binary topological trees (like neuronal processes and rivers) occur with relative frequencies that can be explained in terms of growth models. It will be shown how the model parameter determining the mode of growth can be estimated with the maximum likelihood procedure from observed data. Monte Carlo simulations were used to study the distributional properties of this estimator which appeared to have a negligible bias. It is shown that the minimum chi-square procedure yields an estimate that is very close to the maximum likelihood estimate. Moreover, the goodness-of-fit of the growth model can be inferred directly from the chi-square statistic. To illustrate the procedures we examined axonal trees from the goldfish tectum. A notion of complete partition randomness is presented as an alternative to our growth hypotheses.

13.
Estimating the pattern of nucleotide substitution   Cited by: 43 (self-citations: 0, citations by others: 43)
Knowledge of the pattern of nucleotide substitution is important both to our understanding of molecular sequence evolution and to reliable estimation of phylogenetic relationships. The method of parsimony analysis, which has been used to estimate substitution patterns in real sequences, has serious drawbacks and leads to results difficult to interpret. In this paper a model-based maximum likelihood approach is proposed for estimating substitution patterns in real sequences. Nucleotide substitution is assumed to follow a homogeneous Markov process, and the general reversible process model (REV) and the unrestricted model without the reversibility assumption are used. These models are also applied to examine the adequacy of the model of Hasegawa et al. (J. Mol. Evol. 1985;22:160–174) (HKY85). Two data sets are analyzed. For the -globin pseudogenes of six primate species, the REV model fits the data much better than HKY85, while, for a segment of mtDNA sequences from nine primates, REV cannot provide a significantly better fit than HKY85 when rate variation over sites is taken into account in the models. It is concluded that the use of the REV model in phylogenetic analysis can be recommended, especially for large data sets or for sequences with extreme substitution patterns, while HKY85 may be expected to provide a good approximation. The use of the unrestricted model does not appear to be worthwhile.
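
A comparison of nested models such as HKY85 and REV is commonly made with a likelihood ratio test or with AIC. The Python sketch below shows the arithmetic only; it is not the paper's procedure. It assumes both models are fitted on the same tree, so only the substitution parameters differ (4 free parameters for HKY85, 8 for REV), and it uses the standard 5% chi-square critical value for 4 degrees of freedom (9.488). The function names and the log-likelihood values in the usage lines are hypothetical.

```python
HKY85_FREE_PARAMS = 4   # kappa + 3 free base frequencies
REV_FREE_PARAMS = 8     # 5 free exchangeabilities + 3 free base frequencies
CHI2_CRIT_DF4_5PCT = 9.488  # 5% critical value, chi-square with 4 d.f.


def lrt_statistic(lnl_simple, lnl_complex):
    """Likelihood ratio statistic 2 * (lnL_complex - lnL_simple) for nested models."""
    return 2.0 * (lnl_complex - lnl_simple)


def aic(lnl, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * lnl


def complex_model_preferred(lnl_simple, lnl_complex, critical_value):
    """True if the LRT rejects the simpler model at the given critical value."""
    return lrt_statistic(lnl_simple, lnl_complex) > critical_value


# Hypothetical log-likelihoods, chosen only to illustrate the calculation.
lnl_hky, lnl_rev = -1000.0, -990.0
print(lrt_statistic(lnl_hky, lnl_rev))
print(aic(lnl_hky, HKY85_FREE_PARAMS), aic(lnl_rev, REV_FREE_PARAMS))
```

A better fit by either criterion does not by itself guarantee a more accurate tree, which is the caution raised in the first abstract of this list.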

14.
A rapid heuristic algorithm for finding minimum evolution trees   Cited by: 2 (self-citations: 0, citations by others: 2)
The minimum sum of branch lengths (S), or the minimum evolution (ME) principle, has been shown to be a good optimization criterion in phylogenetic inference. Unfortunately, the number of topologies to be analyzed is computationally prohibitive when a large number of taxa are involved. Therefore, simplified, heuristic methods, such as the neighbor-joining (NJ) method, are usually employed instead. The NJ method analyzes only a small number of trees (compared with the size of the entire search space); so, the tree obtained may not be the ME tree (for which the S value is minimum over the entire search space). Different compromises between very restrictive and exhaustive search spaces have been proposed recently. In particular, the "stepwise algorithm" (SA) utilizes what is known in computer science as the "beam search," whereas the NJ method employs a "greedy search." SA is virtually guaranteed to find the ME trees while being much faster than exhaustive search algorithms. In this study we propose an even faster method for finding the ME tree. The new algorithm adjusts its search exhaustiveness (from greedy to complete) according to the statistical reliability of the tree node being reconstructed. It is also virtually guaranteed to find the ME tree. The performances and computational efficiencies of ME, SA, NJ, and our new method were compared in extensive simulation studies. The new algorithm was found to perform practically as well as the SA (and, therefore, ME) methods and slightly better than the NJ method. For searching for the globally optimal ME tree, the new algorithm is significantly faster than existing ones, thus making it relatively practical for obtaining all trees with an S value equal to or smaller than that of the NJ tree, even when a large number of taxa is involved.
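
To make the "greedy search" of the NJ method concrete, here is a minimal Python sketch of a single NJ selection step using the widely used Q criterion, Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of row i of the distance matrix. This illustrates the textbook NJ step only, not the new algorithm proposed in the abstract; the function name and the toy matrix are illustrative assumptions.

```python
def nj_pick_pair(dist):
    """Return ((i, j), q) for the pair of taxa minimising the NJ Q criterion."""
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three taxa")
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Toy symmetric distance matrix for five taxa (illustrative values only).
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(toy))
```

The step is greedy because the chosen pair is joined immediately and never revisited; a beam search such as the SA described above would instead keep several candidate joins alive at each step.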

15.
BACKGROUND/AIMS: Complex traits pose a particular challenge to standard methods for segregation analysis (SA), and for such traits it is difficult to assess the ability of complex SA (CSA) to approximate the true mode of inheritance. Here we use an oligogenic Bayesian Markov chain Monte Carlo method for SA (OSA) to verify results from a single-locus likelihood-based CSA for data on a quantitative measure of reading ability. METHODS: We compared the profile likelihood from CSA, maximized over the trait allele frequency, to the posterior distribution of genotype effects from OSA to explore differences in the overall parameter estimates from SA on the original phenotype data and the same data Winsorized to reduce the potential influence of three outlying data points. RESULTS: Bayesian OSA revealed two modes of inheritance, one of which coincided with the QTL model from CSA. Winsorizing abolished the model originally estimated by CSA; both CSA and OSA identified only the second OSA model. CONCLUSION: Differences between the results from the two methods alerted us to the presence of influential data points, and identified the QTL model best supported by the data. Thus, the Bayesian OSA proved a valuable tool for assessing and verifying inheritance models from CSA.

16.
IQPNNI: moving fast through tree space and stopping in time   Cited by: 12 (self-citations: 0, citations by others: 12)
An efficient tree reconstruction method (IQPNNI) is introduced to reconstruct a phylogenetic tree based on DNA or amino acid sequence data. Our approach combines various fast algorithms to generate a list of potential candidate trees. The key ingredient is the definition of so-called important quartets (IQs), which allow the computation of an intermediate tree in O(n^2) time for n sequences. The resulting tree is then further optimized by applying the nearest neighbor interchange (NNI) operation. Subsequently a random fraction of the sequences is deleted from the best tree found so far. The deleted sequences are then re-inserted in the smaller tree using the important quartet puzzling (IQP) algorithm. These steps are repeated several times and the best tree, with respect to the likelihood criterion, is considered as the inferred phylogenetic tree. Moreover, we suggest a rule which indicates when to stop the search. Simulations show that IQPNNI gives a slightly better accuracy than other programs tested. Moreover, we applied the approach to 218 small subunit rRNA sequences and 500 rbcL sequences. We found trees with higher likelihood compared to the results by others. A program to reconstruct DNA or amino acid based phylogenetic trees is available online (http://www.bi.uni-duesseldorf.de/software/iqpnni).
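
The NNI operation used in the optimisation step can be illustrated with a very small Python sketch. It assumes an internal edge whose two ends carry the subtrees (A, B) and (C, D); an NNI swaps one subtree across the edge, giving exactly two alternative arrangements. The nested-tuple representation and the function name are assumptions made for illustration and are not taken from the IQPNNI program.

```python
def nni_neighbors(a, b, c, d):
    """Two NNI rearrangements around the internal edge of ((a, b), (c, d)).

    Each argument is a subtree (a leaf name or a nested tuple). The current
    arrangement groups a with b; the neighbours group a with c, or a with d.
    """
    return [((a, c), (b, d)), ((a, d), (b, c))]


# Example: the subtrees may themselves be clades.
print(nni_neighbors("A", "B", ("C1", "C2"), "D"))
```

Because an unrooted binary tree on n leaves has n - 3 internal edges, a full NNI neighbourhood contains 2(n - 3) trees, which is what keeps each NNI pass cheap compared with SPR or TBR.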

17.
Maximum likelihood supertrees   Cited by: 2 (self-citations: 0, citations by others: 2)

18.
In popular use of Bayesian phylogenetics, a default branch-length prior is almost universally applied without knowing how a different prior would have affected the outcome. We performed Bayesian and maximum likelihood (ML) inference of phylogeny based on empirical nucleotide sequence data from a family of lichenized ascomycetes, the Psoraceae, the morphological delimitation of which has been controversial. We specifically assessed the influence of the combination of Bayesian branch-length prior and likelihood model on the properties of the Markov chain Monte Carlo tree sample, including node support, branch lengths, and taxon stability. Data included two regions of the mitochondrial ribosomal RNA gene, the internal transcribed spacer region of the nuclear ribosomal RNA gene, and the protein-coding largest subunit of RNA polymerase II. Data partitioning was performed using Bayes' factors, whereas the best-fitting model of each partition was selected using the Bayesian information criterion (BIC). Given the data and model, short Bayesian branch-length priors generate higher numbers of strongly supported nodes as well as short and topologically similar trees sampled from parts of tree space that are largely unexplored by the ML bootstrap. Long branch-length priors generate fewer strongly supported nodes and longer and more dissimilar trees that are sampled mostly from inside the range of tree space sampled by the ML bootstrap. Priors near the ML distribution of branch lengths generate the best marginal likelihood and the highest frequency of "rogue" (unstable) taxa. The branch-length prior was shown to interact with the likelihood model. Trees inferred under complex partitioned models are more affected by the stretching effect of the branch-length prior. Fewer nodes are strongly supported under a complex model given the same branch-length prior. Irrespective of model, internal branches make up a larger proportion of total tree length under the shortest branch-length priors compared with longer priors. Relative effects on branch lengths caused by the branch-length prior can be problematic to downstream phylogenetic comparative methods making use of the branch lengths. Furthermore, given the same branch-length prior, trees are on average more dissimilar under a simple unpartitioned model compared with a more complex partitioned model. The distribution of ML branch lengths was shown to better fit a gamma or Pareto distribution than an exponential one. Model adequacy tests indicate that the best-fitting model selected by the BIC is insufficient for describing data patterns in 5 of 8 partitions. More general substitution models are required to explain the data in three of these partitions, one of which also requires nonstationarity. The two mitochondrial ribosomal RNA gene partitions need heterotachous models. We found no significant correlations between, on the one hand, the amount of ambiguous data or the smallest branch-length distance to another taxon and, on the other hand, the topological stability of individual taxa. Integrating over several exponentially distributed means under the best-fitting model, node support for the family Psoraceae, including Psora, Protoblastenia, and the Micarea sylvicola group, is approximately 0.96. Support for the genus Psora is distinctly lower, but we found no evidence to contradict the current classification.

19.
The composite-likelihood estimator (CLE) of the population recombination rate considers only sites with exactly two alleles under a finite-sites mutation model (McVean, G. A. T., P. Awadalla, and P. Fearnhead. 2002. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160:1231-1241). While in such a model the identity of alleles is not considered, the CLE has been shown to be robust to minor misspecification of the underlying mutational model. However, there are many situations where the putative mutation and demographic history can be quite complex. One good example is rapidly evolving pathogens, like HIV-1. First we evaluated the performance of the CLE and the likelihood permutation test (LPT) under more complex, realistic models, including a general time reversible (GTR) substitution model, rate heterogeneity among sites (Gamma), positive selection, population growth, population structure, and noncontemporaneous sampling. Second, we relaxed some of the assumptions of the CLE allowing for a four-allele, GTR + Gamma model in an attempt to use the data more efficiently. Through simulations and the analysis of real data, we concluded that the CLE is robust to severe misspecifications of the substitution model, but underestimates the recombination rate in the presence of exponential growth, population mixture, selection, or noncontemporaneous sampling. In such cases, the use of more complex models slightly increases performance on some occasions, especially in the case of the LPT. Thus, our results provide for a more robust application of the estimation of recombination rates.

20.
Maximum likelihood (ML) for phylogenetic inference from sequence data remains a method of choice, but has computational limitations. In particular, it cannot be applied for a global search through all potential trees when the number of taxa is large, and hence a heuristic restriction in the search space is required. In this paper, we derive a quadratic approximation, QAML, to the likelihood function whose maximum is easily determined for a given tree. The derivation depends on Hadamard conjugation, and hence is limited to the simple symmetric models of Kimura and of Jukes and Cantor. Preliminary testing has demonstrated that the accuracy of QAML is close to that of ML.
