首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Yu Y  Degnan JH  Nakhleh L 《PLoS genetics》2012,8(4):e1002660
Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa.  相似文献   

2.
Numerous simulation studies have investigated the accuracy of phylogenetic inference of gene trees under maximum parsimony, maximum likelihood, and Bayesian techniques. The relative accuracy of species tree inference methods under simulation has received less study. The number of analytical techniques available for inferring species trees is increasing rapidly, and in this paper, we compare the performance of several species tree inference techniques at estimating recent species divergences using computer simulation. Simulating gene trees within species trees of different shapes and with varying tree lengths (T) and population sizes (), and evolving sequences on those gene trees, allows us to determine how phylogenetic accuracy changes in relation to different levels of deep coalescence and phylogenetic signal. When the probability of discordance between the gene trees and the species tree is high (i.e., T is small and/or is large), Bayesian species tree inference using the multispecies coalescent (BEST) outperforms other methods. The performance of all methods improves as the total length of the species tree is increased, which reflects the combined benefits of decreasing the probability of discordance between species trees and gene trees and gaining more accurate estimates for gene trees. Decreasing the probability of deep coalescences by reducing also leads to accuracy gains for most methods. Increasing the number of loci from 10 to 100 improves accuracy under difficult demographic scenarios (i.e., coalescent units ≤ 4N(e)), but 10 loci are adequate for estimating the correct species tree in cases where deep coalescence is limited or absent. In general, the correlation between the phylogenetic accuracy and the posterior probability values obtained from BEST is high, although posterior probabilities are overestimated when the prior distribution for is misspecified.  相似文献   

3.
Gene tree distributions under the coalescent process   总被引:10,自引:0,他引:10  
Under the coalescent model for population divergence, lineage sorting can cause considerable variability in gene trees generated from any given species tree. In this paper, we derive a method for computing the distribution of gene tree topologies given a bifurcating species tree for trees with an arbitrary number of taxa in the case that there is one gene sampled per species. Applications for gene tree distributions include determining exact probabilities of topological equivalence between gene trees and species trees and inferring species trees from multiple datasets. In addition, we examine the shapes of gene tree distributions and their sensitivity to changes in branch lengths, species tree shape, and tree size. The method for computing gene tree distributions is implemented in the computer program COAL.  相似文献   

4.
Lineage sorting and introgression can lead to incongruence among gene phylogenies, complicating the inference of species trees for large groups of taxa that have recently and rapidly radiated. In addition, it can be difficult to determine which of these processes is responsible for this incongruence. We explore these issues with the radiation of New Zealand alpine cicadas of the genus Maoricicada Dugdale. Gene trees were estimated from four putative independent loci: mitochondrial DNA (2274 nucleotides), elongation factor 1-alpha (1275 nucleotides), period (1709 nucleotides), and calmodulin (678 nucleotides). We reconstructed phylogenies using maximum likelihood and Bayesian methods from 44 individuals representing the 19 species and subspecies of Maoricicada and two outgroups. Species-level relationships were reconstructed using a novel extension of gene tree parsimony, whereby gene trees were weighted by their Bayesian posterior probabilities. The inferred gene trees show marked incongruence in the placement of some taxa, especially the enigmatic forest and scrub dwelling species, M. iolanthe. Using the species tree estimated by gene tree parsimony, we simulated coalescent gene trees in order to test the null hypothesis that the nonrandom placement of M. iolanthe among gene trees has arisen by chance. Under the assumptions of constant population size, known generation time, and panmixia, we were able to reject this null hypothesis. Furthermore, because the two alternative placements of M. iolanthe are in each case with species that share a similar song structure, we conclude that it is more likely that an ancient introgression event rather than lineage sorting has caused this incongruence.  相似文献   

5.
Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals—each with many genes—splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted. We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are four species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for 5 or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendant branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.  相似文献   

6.
We explore model-based techniques of phylogenetic tree inference exercising Markov invariants. Markov invariants are group invariant polynomials and are distinct from what is known in the literature as phylogenetic invariants, although we establish a commonality in some special cases. We show that the simplest Markov invariant forms the foundation of the Log–Det distance measure. We take as our primary tool group representation theory, and show that it provides a general framework for analyzing Markov processes on trees. From this algebraic perspective, the inherent symmetries of these processes become apparent, and focusing on plethysms, we are able to define Markov invariants and give existence proofs. We give an explicit technique for constructing the invariants, valid for any number of character states and taxa. For phylogenetic trees with three and four leaves, we demonstrate that the corresponding Markov invariants can be fruitfully exploited in applied phylogenetic studies.  相似文献   

7.
The concordance of gene trees and species trees is reconsidered in detail, allowing for samples of arbitrary size to be taken from the species. A sense of concordance for gene tree and species tree topologies is clarified, such that if the "collapsed gene tree" produced by a gene tree has the same topology as the species tree, the gene tree is said to be topologically concordant with the species tree. The term speciodendric is introduced to refer to genes whose trees are topologically concordant with species trees. For a given three-species topology, probabilities of each of the three possible collapsed gene tree topologies are given, as are probabilities of monophyletic concordance and concordance in the sense of N. Takahata (1989), Genetics 122, 957-966. Increasing the sample size is found to increase the probability of topological concordance, but a limit exists on how much the topological concordance probability can be increased. Suggested sample sizes beyond which this probability can be increased only minimally are given. The results are discussed in terms of implications for molecular studies of phylogenetics and speciation.  相似文献   

8.
An analytical method is presented for constructing linear invariants. All linear invariants of a k-species tree can be derived from those of (k-1)-species trees using this method. The new method is simpler than that of Cavender, which relies on numerical computations. Moreover, the new method provides a convenient tool to study the relationships between linear invariants of the same tree or of different trees. All linear invariants of trees of up to five species are derived in this study. For four species, there are 16 independent linear invariants for each of the three possible unrooted trees, 14 of which are shared by two unrooted trees and 12 of these are shared by all three unrooted trees; the last types of linear invariants can be used to construct tests on the assumptions about nucleotide substitutions. The number of linear invariants for a tree is found to increase rapidly with the number of species.  相似文献   

9.
Because of the stochastic way in which lineages sort during speciation, gene trees may differ in topology from each other and from species trees. Surprisingly, assuming that genetic lineages follow a coalescent model of within-species evolution, we find that for any species tree topology with five or more species, there exist branch lengths for which gene tree discordance is so common that the most likely gene tree topology to evolve along the branches of a species tree differs from the species phylogeny. This counterintuitive result implies that in combining data on multiple loci, the straightforward procedure of using the most frequently observed gene tree topology as an estimate of the species tree topology can be asymptotically guaranteed to produce an incorrect estimate. We conclude with suggestions that can aid in overcoming this new obstacle to accurate genomic inference of species phylogenies.  相似文献   

10.
Incomplete lineage sorting can cause incongruence between the phylogenetic history of genes (the gene tree) and that of the species (the species tree), which can complicate the inference of phylogenies. In this article, I present a new coalescent-based algorithm for species tree inference with maximum likelihood. I first describe an improved method for computing the probability of a gene tree topology given a species tree, which is much faster than an existing algorithm by Degnan and Salter (2005). Based on this method, I develop a practical algorithm that takes a set of gene tree topologies and infers species trees with maximum likelihood. This algorithm searches for the best species tree by starting from initial species trees and performing heuristic search to obtain better trees with higher likelihood. This algorithm, called STELLS (which stands for Species Tree InfErence with Likelihood for Lineage Sorting), has been implemented in a program that is downloadable from the author's web page. The simulation results show that the STELLS algorithm is more accurate than an existing maximum likelihood method for many datasets, especially when there is noise in gene trees. I also show that the STELLS algorithm is efficient and can be applied to real biological datasets.  相似文献   

11.
ABSTRACT: BACKGROUND: The ancestries of genes form gene trees which do not necessarily have the same topology as the species tree due to incomplete lineage sorting. Available algorithms determining the probability of a gene tree given a species tree require exponential computational runtime. RESULTS: In this paper, we provide a polynomial time algorithm to calculate the probability of a ranked gene tree topology for a given species tree, where a ranked tree topology is a tree topology with the internal vertices being ordered. The probability of a gene tree topology can thus be calculated in polynomial time if the number of orderings of the internal vertices is a polynomial number. However, the complexity of calculating the probability of a gene tree topology with an exponential number of rankings for a given species tree remains unknown. CONCLUSIONS: Polynomial algorithms for calculating ranked gene tree probabilities may become useful in developing methodology to infer species trees based on a collection of gene trees, leading to a more accurate reconstruction of ancestral species relationships.  相似文献   

12.
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.  相似文献   

13.
When gene copies are sampled from various species, the resulting gene tree might disagree with the containing species tree. The primary causes of gene tree and species tree discord include incomplete lineage sorting, horizontal gene transfer, and gene duplication and loss. Each of these events yields a different parsimony criterion for inferring the (containing) species tree from gene trees. With incomplete lineage sorting, species tree inference is to find the tree minimizing extra gene lineages that had to coexist along species lineages; with gene duplication, it becomes to find the tree minimizing gene duplications and/or losses. In this paper, we present the following results: 1) The deep coalescence cost is equal to the number of gene losses minus two times the gene duplication cost in the reconciliation of a uniquely leaf labeled gene tree and a species tree. The deep coalescence cost can be computed in linear time for any arbitrary gene tree and species tree. 2) The deep coalescence cost is always not less than the gene duplication cost in the reconciliation of an arbitrary gene tree and a species tree. 3) Species tree inference by minimizing deep coalescence events is NP-hard.  相似文献   

14.
As an alternative to parsimony analyses, stochastic models have been proposed ( [Lewis, 2001] and [Nylander et al., 2004]) for morphological characters, so that maximum likelihood or Bayesian analyses may be used for phylogenetic inference. A key feature of these models is that they account for ascertainment bias, in that only varying, or parsimony-informative characters are observed. However, statistical consistency of such model-based inference requires that the model parameters be identifiable from the joint distribution they entail, and this issue has not been addressed.Here we prove that parameters for several such models, with finite state spaces of arbitrary size, are identifiable, provided the tree has at least eight leaves. If the tree topology is already known, then seven leaves suffice for identifiability of the numerical parameters. The method of proof involves first inferring a full distribution of both parsimony-informative and non-informative pattern joint probabilities from the parsimony-informative ones, using phylogenetic invariants. The failure of identifiability of the tree parameter for four-taxon trees is also investigated.  相似文献   

15.
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.  相似文献   

16.
We propose a model based approach to use multiple gene trees to estimate the species tree. The coalescent process requires that gene divergences occur earlier than species divergences when there is any polymorphism in the ancestral species. Under this scenario, speciation times are restricted to be smaller than the corresponding gene split times. The maximum tree (MT) is the tree with the largest possible speciation times in the space of species trees restricted by available gene trees. If all populations have the same population size, the MT is the maximum likelihood estimate of the species tree. It can be shown the MT is a consistent estimator of the species tree even when the MT is built upon the estimates of the true gene trees if the gene tree estimates are statistically consistent. The MT converges in probability to the true species tree at an exponential rate.  相似文献   

17.
Under a coalescent model for within-species evolution, gene trees may differ from species trees to such an extent that the gene tree topology most likely to evolve along the branches of a species tree can disagree with the species tree topology. Gene tree topologies that are more likely to be produced than the topology that matches that of the species tree are termed anomalous, and the region of branch-length space that gives rise to anomalous gene trees (AGTs) is the anomaly zone. We examine the occurrence of anomalous gene trees for the case of five taxa, the smallest number of taxa for which every species tree topology has a nonempty anomaly zone. Considering all sets of branch lengths that give rise to anomalous gene trees, the largest value possible for the smallest branch length in the species tree is greater in the five-taxon case (0.1934 coalescent time units) than in the previously studied case of four taxa (0.1568). The five-taxon case demonstrates the existence of three phenomena that do not occur in the four-taxon case. First, anomalous gene trees can have the same unlabeled topology as the species tree. Second, the anomaly zone does not necessarily enclose a ball centered at the origin in branch-length space, in which all branches are short. Third, as a branch length increases, it is possible for the number of AGTs to increase rather than decrease or remain constant. These results, which help to describe how the properties of anomalous gene trees increase in complexity as the number of taxa increases, will be useful in formulating strategies for evading the problem of anomalous gene trees during species tree inference from multilocus data.  相似文献   

18.
Multilocus genomic data sets can be used to infer a rich set of information about the evolutionary history of a lineage, including gene trees, species trees, and phylogenetic networks. However, user‐friendly tools to run such integrated analyses are lacking, and workflows often require tedious reformatting and handling time to shepherd data through a series of individual programs. Here, we present a tool written in Python—TREEasy—that performs automated sequence alignment (with MAFFT), gene tree inference (with IQ‐Tree), species inference from concatenated data (with IQ‐Tree and RaxML‐NG), species tree inference from gene trees (with ASTRAL, MP‐EST, and STELLS2), and phylogenetic network inference (with SNaQ and PhyloNet). The tool only requires FASTA files and nine parameters as inputs. The tool can be run as command line or through a Graphical User Interface (GUI). As examples, we reproduced a recent analysis of staghorn coral evolution, and performed a new analysis on the evolution of the “WGD clade” of yeast. The latter revealed novel patterns that were not identified by previous analyses. TREEasy represents a reliable and simple tool to accelerate research in systematic biology ( https://github.com/MaoYafei/TREEasy ).  相似文献   

19.
The relationship between speciation times and the corresponding times of gene divergence is of interest in phylogenetic inference as a means of understanding the past evolutionary dynamics of populations and of estimating the timing of speciation events. It has long been recognized that gene divergence times might substantially pre-date speciation events. Although the distribution of the difference between these has previously been studied for the case of two populations, this distribution has not been explicitly computed for larger species phylogenies. Here we derive a simple method for computing this distribution for trees of arbitrary size. A two-stage procedure is proposed which (i) considers the probability distribution of the time from the speciation event at the root of the species tree to the gene coalescent time conditionally on the number of gene lineages available at the root; and (ii) calculates the probability mass function for the number of gene lineages at the root. This two-stage approach dramatically simplifies numerical analysis, because in the first step the conditional distribution does not depend on an underlying species tree, while in the second step the pattern of gene coalescence prior to the species tree root is irrelevant. In addition, the algorithm provides intuition concerning the properties of the distribution with respect to the various features of the underlying species tree. The methodology is complemented by developing probabilistic formulae and software, written in R. The method and software are tested on five-taxon species trees with varying levels of symmetry. The examples demonstrate that more symmetric species trees tend to have larger mean coalescent times and are more likely to have a unimodal gamma-like distribution with a long right tail, while asymmetric trees tend to have smaller mean coalescent times with an exponential-like distribution. In addition, species trees with longer branches generally have shorter mean coalescent times, with branches closest to the root of the tree being most influential.  相似文献   

20.
Phylogenomics has largely succeeded in its aim of accurately inferring species trees, even when there are high levels of discordance among individual gene trees. These resolved species trees can be used to ask many questions about trait evolution, including the direction of change and number of times traits have evolved. However, the mapping of traits onto trees generally uses only a single representation of the species tree, ignoring variation in the gene trees used to construct it. Recognizing that genes underlie traits, these results imply that many traits follow topologies that are discordant with the species topology. As a consequence, standard methods for character mapping will incorrectly infer the number of times a trait has evolved. This phenomenon, dubbed “hemiplasy,” poses many problems in analyses of character evolution. Here we outline these problems, explaining where and when they are likely to occur. We offer several ways in which the possible presence of hemiplasy can be diagnosed, and discuss multiple approaches to dealing with the problems presented by underlying gene tree discordance when carrying out character mapping. Finally, we discuss the implications of hemiplasy for general phylogenetic inference, including the possible drawbacks of the widespread push for “resolved” species trees.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号