BEAST: Bayesian evolutionary analysis by sampling trees   Total citations: 2, self-citations: 0, citations by others: 2


The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented.

The effect of missing data on phylogenetic methods is a potentially important issue in our attempts to reconstruct the Tree of Life. If missing data are truly problematic, then it may be unwise to include species in an analysis that lack data for some characters (incomplete taxa) or to include characters that lack data for some species. Given the difficulty of obtaining data from all characters for all taxa (e.g., fossils), missing data might seriously impede efforts to reconstruct a comprehensive phylogeny that includes all species. Fortunately, recent simulations and empirical analyses suggest that missing data cells are not themselves problematic, and that incomplete taxa can be accurately placed as long as the overall number of characters in the analysis is large. However, these studies have so far only been conducted on parsimony, likelihood, and neighbor-joining methods. Although Bayesian phylogenetic methods have become widely used in recent years, the effects of missing data on Bayesian analysis have not been adequately studied. Here, we conduct simulations to test whether Bayesian analyses can accurately place incomplete taxa despite extensive missing data. In agreement with previous studies of other methods, we find that Bayesian analyses can accurately reconstruct the position of highly incomplete taxa (i.e., 95% missing data), as long as the overall number of characters in the analysis is large. These results suggest that highly incomplete taxa can be safely included in many Bayesian phylogenetic analyses.

Bayesian inference is becoming a common statistical approach to phylogenetic estimation because, among other reasons, it allows for rapid analysis of large data sets with complex evolutionary models. Conveniently, Bayesian phylogenetic methods use currently available stochastic models of sequence evolution. However, as with other model-based approaches, the results of Bayesian inference are conditional on the assumed model of evolution: inadequate models (models that poorly fit the data) may result in erroneous inferences. In this article, I present a Bayesian phylogenetic method that evaluates the adequacy of evolutionary models using posterior predictive distributions. By evaluating a model's posterior predictive performance, an adequate model can be selected for a Bayesian phylogenetic study. Although I present a single test statistic that assesses the overall (global) performance of a phylogenetic model, a variety of test statistics can be tailored to evaluate specific features (local performance) of evolutionary models to identify sources of failure. The method presented here, unlike the likelihood-ratio test and parametric bootstrap, accounts for uncertainty in the phylogeny and model parameters.
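The posterior predictive comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not name its test statistic, so the multinomial statistic over site patterns used here is an assumed choice, the replicate alignments are assumed to have been simulated elsewhere from posterior draws of the tree and model parameters, and the function names are hypothetical.

```python
import math
from collections import Counter


def multinomial_statistic(site_patterns):
    """Assumed global statistic: sum over unique patterns of n_i * ln(n_i / N)."""
    counts = Counter(site_patterns)
    n_sites = len(site_patterns)
    return sum(c * math.log(c / n_sites) for c in counts.values())


def posterior_predictive_p(observed, replicates):
    """Fraction of replicate data sets whose statistic is >= the observed one.

    `replicates` is assumed to hold alignments simulated from posterior draws,
    so uncertainty in tree and parameters is already reflected in them.
    """
    t_obs = multinomial_statistic(observed)
    hits = sum(1 for rep in replicates if multinomial_statistic(rep) >= t_obs)
    return hits / len(replicates)


# Toy usage: each string is one alignment column (site pattern) over four taxa.
observed = ["AAAA", "AAAA", "ACGT", "AAAA"]
replicates = [["AAAA"] * 4, ["AAAA", "CCCC", "GGGG", "TTTT"]]
p_value = posterior_predictive_p(observed, replicates)
```

A predictive probability near 0 or 1 would suggest that the assumed model generates data unlike the observed alignment with respect to this statistic.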

BEAST 2: A Software Platform for Bayesian Evolutionary Analysis   Total citations: 1, self-citations: 0, citations by others: 1
We present a new open source, extensible and flexible software platform for Bayesian evolutionary analysis called BEAST 2. This software platform is a re-design of the popular BEAST 1 platform to correct structural deficiencies that became evident as the BEAST 1 software evolved. Key among those deficiencies was the lack of post-deployment extensibility. BEAST 2 now has a fully developed package management system that allows third party developers to write additional functionality that can be directly installed to the BEAST 2 analysis platform via a package manager without requiring a new software release of the platform. This package architecture is showcased with a number of recently published new models encompassing birth-death-sampling tree priors, phylodynamics and model averaging for substitution models and site partitioning. A second major improvement is the ability to read/write the entire state of the MCMC chain to/from disk, allowing it to be easily shared between multiple instances of the BEAST software. This facilitates checkpointing and better support for multi-processor and high-end computing extensions. Finally, the functionality in new packages can be easily added to the user interface (BEAUti 2) by a simple XML template-based mechanism, because BEAST 2 has been re-designed to provide greater integration between the analysis engine and the user interface so that, for example, BEAST and BEAUti use exactly the same XML file format.
This is a PLOS Computational Biology Software Article.

Fair-balance paradox, star-tree paradox, and Bayesian phylogenetics   Total citations: 1, self-citations: 0, citations by others: 1
The star-tree paradox refers to the conjecture that the posterior probabilities for the three unrooted trees for four species (or the three rooted trees for three species if the molecular clock is assumed) do not approach 1/3 when the data are generated using the star tree and when the amount of data approaches infinity. It reflects the more general phenomenon of high and presumably spurious posterior probabilities for trees or clades produced by the Bayesian method of phylogenetic reconstruction, and it is perceived to be a manifestation of the deeper problem of the extreme sensitivity of Bayesian model selection to the prior on parameters. Analysis of the star-tree paradox has been hampered by the intractability of the integrals involved. In this article, I use Laplacian expansion to approximate the posterior probabilities for the three rooted trees for three species using binary characters evolving at a constant rate. The approximation enables calculation of posterior tree probabilities for arbitrarily large data sets. Both theoretical analysis of the analogous fair-coin and fair-balance problems and computer simulation for the tree problem confirmed the existence of the star-tree paradox. When the data size n → ∞, the posterior tree probabilities do not converge to 1/3 each, but they vary among data sets according to a statistical distribution. This distribution is characterized. Two strategies for resolving the star-tree paradox are explored: (1) a nonzero prior probability for the degenerate star tree and (2) an increasingly informative prior forcing the internal branch length toward zero. Both appear to be effective in resolving the paradox, but the latter is simpler to implement. The posterior tree probabilities are found to be very sensitive to the prior.
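The fair-coin analogue mentioned above can be explored numerically. The sketch below assumes one particular setup that the abstract does not spell out: a coin with heads probability exactly 1/2, two competing hypotheses (θ < 1/2 and θ > 1/2), and a Uniform(0, 1) prior on θ. Under those assumptions the posterior is Beta(k + 1, n − k + 1), and its tail probability at 1/2 reduces to a binomial sum that can be computed exactly. Function names are hypothetical.

```python
import random


def posterior_prob_theta_above_half(k, n):
    """P(theta > 1/2 | k heads in n tosses) under an assumed Uniform(0, 1) prior.

    Equals sum_{j=0}^{k} C(n+1, j) / 2^(n+1), computed with exact integers.
    """
    term = 1   # C(n+1, 0)
    total = 0
    for j in range(k + 1):
        total += term
        term = term * (n + 1 - j) // (j + 1)   # C(n+1, j+1) from C(n+1, j)
    return total / 2 ** (n + 1)


def simulate_fair_coin(n, datasets, seed=0):
    """Posterior probabilities for several data sets generated with theta = 1/2."""
    rng = random.Random(seed)
    probs = []
    for _ in range(datasets):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        probs.append(posterior_prob_theta_above_half(heads, n))
    return probs


# Usage: compare the spread of posterior probabilities at two data sizes.
small = simulate_fair_coin(n=50, datasets=10)
large = simulate_fair_coin(n=2000, datasets=10)
```

Inspecting `small` and `large` side by side gives a feel for whether the posterior probabilities settle toward 1/2 as n grows or keep varying from data set to data set, which is the behaviour the abstract reports for the tree problem.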

We provide a new automated statistical method for DNA barcoding based on a Bayesian phylogenetic analysis. The method is based on automated database sequence retrieval, alignment, and phylogenetic analysis using a custom-built program for Bayesian phylogenetic analysis. We show on real data that the method outperforms BLAST searches as a measure of confidence and can help eliminate 80% of all false assignments based on the best BLAST hit. However, the most important advance of the method is that it provides statistically meaningful measures of confidence. We apply the method to a re-analysis of previously published ancient DNA data and show that, with high statistical confidence, most of the published sequences are in fact of Neanderthal origin. However, there are several cases of chimeric sequences that are composed of a combination of both Neanderthal and modern human DNA.

Many empirical studies have revealed considerable differences between nonparametric bootstrapping and Bayesian posterior probabilities in terms of the support values for branches, despite claimed predictions about their approximate equivalence. We investigated this problem by simulating data, which were then analyzed by maximum likelihood bootstrapping and Bayesian phylogenetic analysis using identical models and reoptimization of parameter values. We show that Bayesian posterior probabilities are significantly higher than corresponding nonparametric bootstrap frequencies for true clades, but also that erroneous conclusions will be made more often. These errors are strongly accentuated when the models used for analyses are underparameterized. When data are analyzed under the correct model, nonparametric bootstrapping is conservative. Bayesian posterior probabilities are also conservative in this respect, but less so.
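For readers unfamiliar with the nonparametric bootstrap referred to above, the resampling step can be sketched as follows. This is a generic illustration, not the authors' pipeline: the alignment is assumed to be a dictionary from taxon name to an equal-length sequence string, tree inference on each replicate is assumed to happen elsewhere, and the function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping each column intact."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns)
            for taxon, seq in alignment.items()}


def bootstrap_support(clade_present_flags):
    """Bootstrap frequency: share of replicate trees that contain the clade."""
    return sum(clade_present_flags) / len(clade_present_flags)


# Usage with a toy alignment; each replicate would then be passed to a
# tree-inference program (not shown) and scored for presence of a clade.
alignment = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "AGGTTC"}
rng = random.Random(42)
replicate = bootstrap_alignment(alignment, rng)
```

The bootstrap frequency obtained this way is the quantity the abstract compares with the Bayesian posterior probability of the same clade.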

Bayesian phylogenetic methods require the selection of prior probability distributions for all parameters of the model of evolution. These distributions allow one to incorporate prior information into a Bayesian analysis, but even in the absence of meaningful prior information, a prior distribution must be chosen. In such situations, researchers typically seek to choose a prior that will have little effect on the posterior estimates produced by an analysis, allowing the data to dominate. Sometimes a prior that is uniform (assigning equal prior probability density to all points within some range) is chosen for this purpose. In reality, the appropriate prior depends on the parameterization chosen for the model of evolution, a choice that is largely arbitrary. There is an extensive Bayesian literature on appropriate prior choice, and it has long been appreciated that there are parameterizations for which uniform priors can have a strong influence on posterior estimates. We here discuss the relationship between model parameterization and prior specification, using the general time-reversible model of nucleotide evolution as an example. We present Bayesian analyses of 10 simulated data sets obtained using a variety of prior distributions and parameterizations of the general time-reversible model. Uniform priors can produce biased parameter estimates under realistic conditions, and a variety of alternative priors avoid this bias.
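The dependence of a "flat" prior on parameterization can be made concrete with a change of variables. The sketch below uses a hypothetical two-rate example rather than the full general time-reversible model: a rate ratio r is given a Uniform(0, upper) prior, and the same prior is then expressed on the proportion p = r / (1 + r). Since r = p / (1 − p), the induced density is 1 / (upper · (1 − p)²), which is far from uniform. The names and the value of `upper` are assumptions for illustration.

```python
def induced_density_of_proportion(p, upper):
    """Density of p = r / (1 + r) when r ~ Uniform(0, upper)."""
    if not 0.0 <= p < upper / (1.0 + upper):
        return 0.0
    # |dr/dp| = 1 / (1 - p)^2, multiplied by the uniform density 1 / upper.
    return 1.0 / (upper * (1.0 - p) ** 2)


def prior_mass_above(p0, upper):
    """Prior probability that p > p0, i.e. that r > p0 / (1 - p0)."""
    r0 = p0 / (1.0 - p0)
    return min(1.0, max(0.0, (upper - r0) / upper))


# Usage: a uniform prior on the ratio over (0, 100) places almost all prior
# mass on proportions above one half.
mass = prior_mass_above(0.5, 100.0)
density_low = induced_density_of_proportion(0.1, 100.0)
density_high = induced_density_of_proportion(0.9, 100.0)
```

In this toy case the prior that looks uninformative for r strongly favours large values of p, which illustrates the general point that a uniform prior is only uniform in one parameterization.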

We have applied Bayesian and maximum likelihood methods of phylogenetic estimation to data from four mitochondrial genes (COI, COII, 12S, and 16S) and a single nuclear gene (EF1alpha) from several genera of New Zealand, Australian, and New Caledonian cicada taxa. We specifically focused on the heterogeneity of phylogenetic signal among the different data partitions and the biogeographic origins of the New Zealand cicada fauna. The Bayesian analyses circumvent many of the problems associated with other statistical tests for comparing data partitions. We took an information-theoretic approach to model selection based on the Akaike Information Criterion (AIC). This approach indicated that there was considerable uncertainty in identifying the best-fit model for some of the partitions. Additionally, a large amount of uncertainty was associated with many parameter estimates from the substitution model. However, a sensitivity analysis on the combined dataset indicated that the model selection uncertainty had little effect on estimates of topology because these estimates were largely insensitive to changes in the assumed model. This outcome suggests strong signal in our data. Our analyses support a New Caledonian affiliation of the New Zealand cicada genera Maoricicada, Kikihia, and Rhodopsalta and Australian affinities for the genera Amphipsalta and Notopsalta. This result was surprising, given that previous cicada biologists suspected a close relationship between Amphipsalta, Notopsalta, and Rhodopsalta based on genitalic characters. Relationships among the closely related genera Maoricicada, Kikihia, and Rhodopsalta were poorly resolved, the mitochondrial data and the EF1alpha data favoring different arrangements within this clade.
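Model selection uncertainty under the AIC, as discussed above, is commonly summarized with Akaike weights. The following sketch shows the standard formulas (AIC = 2k − 2 ln L, and weights proportional to exp(−Δ/2)); the log-likelihoods and parameter counts in the usage lines are made-up placeholders, not values from the study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def akaike_weights(aic_values):
    """Relative support for each candidate model; weights sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Usage with placeholder numbers for three hypothetical substitution models.
scores = [aic(-5230.4, 5), aic(-5228.9, 9), aic(-5228.1, 10)]
weights = akaike_weights(scores)
```

When no single weight dominates, the best-fit model is uncertain, which is the situation the abstract describes for some partitions.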

As larger, more complex data sets are being used to infer phylogenies, accuracy of these phylogenies increasingly requires models of evolution that accommodate heterogeneity in the processes of molecular evolution. We investigated the effect of improper data partitioning on phylogenetic accuracy, as well as the type I error rate and sensitivity of Bayes factors, a commonly used method for choosing among different partitioning strategies in Bayesian analyses. We also used Bayes factors to test empirical data for the need to divide data in a manner that has no expected biological meaning. Posterior probability estimates are misleading when an incorrect partitioning strategy is assumed. The error was greatest when the assumed model was underpartitioned. These results suggest that model partitioning is important for large data sets. Bayes factors performed well, giving a 5% type I error rate, which is remarkably consistent with standard frequentist hypothesis tests. The sensitivity of Bayes factors was found to be quite high when the across-class model heterogeneity reflected that of empirical data. These results suggest that Bayes factors represent a robust method of choosing among partitioning strategies. Lastly, results of tests for the inclusion of unexpected divisions in empirical data mirrored the simulation results, although the outcome of such tests is highly dependent on accounting for rate variation among classes. We conclude by discussing other approaches for partitioning data, as well as other applications of Bayes factors.
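Choosing between partitioning strategies with a Bayes factor amounts to comparing marginal likelihoods. The sketch below assumes the log marginal likelihoods have already been estimated by some other means, and it uses a decision threshold of 2 ln BF > 10, a commonly cited rule of thumb that is an assumption here rather than something stated in the abstract. Function names are hypothetical.

```python
def two_ln_bayes_factor(log_marginal_1, log_marginal_0):
    """2 ln BF_10 from the log marginal likelihoods of models 1 and 0."""
    return 2.0 * (log_marginal_1 - log_marginal_0)


def prefer_more_partitioned(log_marginal_partitioned,
                            log_marginal_unpartitioned,
                            threshold=10.0):
    """True if the partitioned model is favoured beyond the assumed threshold."""
    statistic = two_ln_bayes_factor(log_marginal_partitioned,
                                    log_marginal_unpartitioned)
    return statistic > threshold


# Usage with placeholder log marginal likelihoods.
decision = prefer_more_partitioned(-10250.0, -10310.0)
```

Working on the log scale avoids numerical underflow, since marginal likelihoods of real alignments are far too small to represent directly.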

The main limiting factor in Bayesian MCMC analysis of phylogeny is typically the efficiency with which topology proposals sample tree space. Here we evaluate the performance of seven different proposal mechanisms, including most of those used in current Bayesian phylogenetics software. We sampled 12 empirical nucleotide data sets (ranging in size from 27 to 71 taxa and from 378 to 2,520 sites) under difficult conditions: short runs, no Metropolis-coupling, and an oversimplified substitution model producing difficult tree spaces (Jukes-Cantor with equal site rates). Convergence was assessed by comparison to reference samples obtained from multiple Metropolis-coupled runs. We find that proposals producing topology changes as a side effect of branch length changes (LOCAL and Continuous Change) consistently perform worse than those involving stochastic branch rearrangements (nearest neighbor interchange, subtree pruning and regrafting, tree bisection and reconnection, or subtree swapping). Among the latter, moves that use an extension mechanism to mix local with more distant rearrangements show better overall performance than those involving only local or only random rearrangements. Moves with only local rearrangements tend to mix well but have long burn-in periods, whereas moves with random rearrangements often show the reverse pattern. Combinations of moves tend to perform better than single moves. The time to convergence can be shortened considerably by starting with a good tree, but this comes at the cost of compromising convergence diagnostics based on overdispersed starting points. Our results have important implications for developers of Bayesian MCMC implementations and for the large group of users of Bayesian phylogenetics software.
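Of the rearrangements listed above, nearest neighbor interchange (NNI) is the simplest to write down. The sketch below is a deliberately reduced illustration, not any program's implementation: it represents only the four subtrees around a single internal branch as nested tuples, ignores branch lengths, and picks one of the two alternative arrangements uniformly at random. The representation and function names are assumptions.

```python
import random


def nni_neighbors(edge):
    """Return the two NNI alternatives around one internal branch.

    `edge` is ((a, b), (c, d)): subtrees a and b sit on one side of the
    internal branch, c and d on the other.
    """
    (a, b), (c, d) = edge
    return [((a, c), (b, d)), ((a, d), (b, c))]


def propose_nni(edge, rng):
    """Pick one of the two neighbouring arrangements uniformly at random."""
    return rng.choice(nni_neighbors(edge))


# Usage: subtrees may be leaf names or nested tuples standing for larger clades.
current = (("human", "chimp"), ("mouse", ("cow", "pig")))
proposal = propose_nni(current, random.Random(7))
```

Because each arrangement has exactly two NNI neighbours around a given branch and the choice is uniform, this toy move is symmetric; a full MCMC proposal would also need to choose the branch and handle branch lengths.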

We describe a procedure for model averaging of relaxed molecular clock models in Bayesian phylogenetics. Our approach allows us to model the distribution of rates of substitution across branches, averaged over a set of models, rather than conditioned on a single model. We implement this procedure and test it on simulated data to show that our method can accurately recover the true underlying distribution of rates. We applied the method to a set of alignments taken from a data set of 12 mammalian species and uncovered evidence that lognormally distributed rates better describe this data set than do exponentially distributed rates. Additionally, our implementation of model averaging permits accurate calculation of the Bayes factor(s) between two or more relaxed molecular clock models. Finally, we introduce a new computational approach for sampling rates of substitution across branches that improves the convergence of our Markov chain Monte Carlo algorithms in this context. Our methods are implemented under the BEAST 1.6 software package, available at http://beast-mcmc.googlecode.com.
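One way a model-averaged analysis can yield Bayes factors is through the sampled model indicator: the share of MCMC samples spent in each model estimates its posterior probability, and dividing posterior odds by prior odds gives the Bayes factor. The sketch below illustrates that general idea only; it is not the BEAST implementation, the indicator labels and equal prior probabilities are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def posterior_model_probs(indicator_samples):
    """Estimate posterior model probabilities from sampled model indicators."""
    counts = Counter(indicator_samples)
    n = len(indicator_samples)
    return {model: c / n for model, c in counts.items()}


def bayes_factor(indicator_samples, model_a, model_b, prior_a=0.5, prior_b=0.5):
    """Bayes factor of model_a over model_b = posterior odds / prior odds."""
    post = posterior_model_probs(indicator_samples)
    if post.get(model_b, 0.0) == 0.0:
        raise ValueError("model_b was never sampled; the estimate is undefined")
    posterior_odds = post.get(model_a, 0.0) / post[model_b]
    prior_odds = prior_a / prior_b
    return posterior_odds / prior_odds


# Usage with a made-up chain of indicator samples.
samples = ["lognormal"] * 75 + ["exponential"] * 25
bf = bayes_factor(samples, "lognormal", "exponential")
```

The estimate becomes unreliable when one model is visited very rarely, so long chains or adjusted prior model probabilities may be needed in practice.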

We study the phylogeny of the placental mammals using molecular data from all mitochondrial tRNAs and rRNAs of 54 species. We use probabilistic substitution models specific to evolution in base paired regions of RNA. A number of these models have been implemented in a new phylogenetic inference software package for carrying out maximum likelihood and Bayesian phylogenetic inferences. We describe our Bayesian phylogenetic method which uses a Markov chain Monte Carlo algorithm to provide samples from the posterior distribution of tree topologies. Our results show support for four primary mammalian clades, in agreement with recent studies of much larger data sets mainly comprising nuclear DNA. We discuss some issues arising when using Bayesian techniques on RNA sequence data.

Relaxed phylogenetics and dating with confidence   Total citations: 3, self-citations: 1, citations by others: 2
In phylogenetics, the unrooted model of phylogeny and the strict molecular clock model are two extremes of a continuum. Despite their dominance in phylogenetic inference, it is evident that both are biologically unrealistic and that the real evolutionary process lies between these two extremes. Fortunately, intermediate models employing relaxed molecular clocks have been described. These models open the gate to a new field of “relaxed phylogenetics.” Here we introduce a new approach to performing relaxed phylogenetic analysis. We describe how it can be used to estimate phylogenies and divergence times in the face of uncertainty in evolutionary rates and calibration times. Our approach also provides a means for measuring the clocklikeness of datasets and comparing this measure between different genes and phylogenies. We find no significant rate autocorrelation among branches in three large datasets, suggesting that autocorrelated models are not necessarily suitable for these data. In addition, we place these datasets on the continuum of clocklikeness between a strict molecular clock and the alternative unrooted extreme. Finally, we present analyses of 102 bacterial, 106 yeast, 61 plant, 99 metazoan, and 500 primate alignments. From these we conclude that our method is phylogenetically more accurate and precise than the traditional unrooted model while adding the ability to infer a timescale to evolution.

In this study a multilocus phylogenetic analysis of metalmark moths (Lepidoptera: Choreutidae) focused on resolving the higher‐level phylogeny of this group is presented. Through the analysis of this dataset, I explore different data‐partitioning strategies in Bayesian phylogenetic inference, and find that a partitioning strategy can have a large influence on the results of phylogenetic analysis. Depending on how the data are partitioned, there can be significant differences in branch support. I also test for the existence of the Bayesian star tree paradox, and its importance in this dataset, and find that it appears to inflate support for the clade including Rhobonda gaurisana, Hemerophila houttuinialis, H. diva and H. felis, but plays no role in other cases where the differences between maximum‐likelihood bootstraps and Bayesian posterior probabilities are large. The results of all the phylogenetic analyses strongly suggest that including Millieriinae in Choreutidae renders the family polyphyletic. The monophyly of the other two subfamilies, Brenthiinae and Choreutinae, as well as their sister‐group relationship, is strongly supported. Similarly, the monophyly of all the genera examined except Hemerophila is also well supported. To bring the classification of Choreutidae in line with our current understanding of the phylogenetic relationships in the family, I propose to exclude Millieriinae from Choreutidae, elevate it to Millieriidae Heppner, and place it as incertae sedis within Ditrysia.

The generally accepted hypothesis regarding the origin of fossorial mammals proposes adaptive convergence from open environments towards the use of subterranean environments. We evaluated this hypothesis for South American mole-mice using conventional and Bayesian frameworks, with independent evidence. By using a molecular approach based on Cytochrome b and IRBP sequences, we evaluated phylogenetic relationships, time of origin, the ancestral trait of fossoriality, and ancestral distributions of species belonging to the Andean Clade (Rodentia: Sigmodontinae). Our results indicate that the Andean Clade is strongly supported, with one clade grouping all fossorial forms and another grouping all cursorial species. We hypothesize that fossoriality originated in the Miocene/Pliocene transition, in the Temperate Forests of southern South America. We conclude that the origin of fossorial ecomorphological traits did not necessarily occur under a general model of open environments; rather, the origin of these traits depends on the ecological-historical relationship of the taxon with the environment.

Concepts of species proposed within the phylogenetic paradigm are critically reviewed. Most so-called phylogenetic species concepts rely heavily on factors immaterial to phylogenetic hypotheses. Thus, they have limited empirical content and offer weak bases on which to make decisions about real problems related to species. Any workable notion of species relies on an explicit character analysis, rather than on abstract properties of lineages, narrative predications and speculations on tokogenetic relationships. Species only exist conjecturally, as the smallest meaningful units for phylogenetic analysis, as based on character evidence. Such an idea considers species to be conjectures based on similarity, that are subsequently subject to testing by the results of analysis. Species, thus, are units of phylogenetic analysis in the same way as hypotheses of homology are units of comparable similarities, i.e. conjectures to be tested by congruence. Although monophyly need not be demonstrated for species-level taxa, hypotheses of relationships are the only basis to refute species limits and guide necessary rearrangements. The factor that leads to recognition of species is similarity in observed traits. The concept of life cycle is introduced as an important element in the discussion of species, as an efficient way to convey subsidiary notions of sexual dimorphism, polymorphism, polytypy and clusters of diagnosable semaphoronts. The notion of exemplars is used to expand the concept of species-as-individual-organisms into a more generally usable concept. Species are therefore proposed for a diagnosable sample of (observed or inferred) life cycles represented by exemplars all of which are hypothesized to attach to the same node in a cladogram, and which are not structured into other similarly diagnosable clusters. This definition is character-based, potentially testable by reference to a branching diagram, and dispenses with reference to ancestor-descendant relationships or regression into population concepts. It provides a workable basis on which to proceed with phylogenetic analysis and a basis for that analysis to refute or refine species limits. A protocol is offered for testing hypotheses of species boundaries in cladograms.

