首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
MOTIVATION: Inferring species phylogenies with a history of gene losses and duplications is a challenging and an important task in computational biology. This problem can be solved by duplication-loss models in which the primary step is to reconcile a rooted gene tree with a rooted species tree. Most modern methods of phylogenetic reconstruction (from sequences) produce unrooted gene trees. This limitation leads to the problem of transforming unrooted gene tree into a rooted tree, and then reconciling rooted trees. The main questions are 'What about biological interpretation of choosing rooting?', 'Can we find efficiently the optimal rootings?', 'Is the optimal rooting unique?'. RESULTS: In this paper we present a model of reconciling unrooted gene tree with a rooted species tree, which is based on a concept of choosing rooting which has minimal reconciliation cost. Our analysis leads to the surprising property that all the minimal rootings have identical distributions of gene duplications and gene losses in the species tree. It implies, in our opinion, that the concept of an optimal rooting is very robust, and thus biologically meaningful. Also, it has nice computational properties. We present a linear time and space algorithm for computing optimal rooting(s). This algorithm was used in two different ways to reconstruct the optimal species phylogeny of five known yeast genomes from approximately 4700 gene trees. Moreover, we determined locations (history) of all gene duplications and gene losses in the final species tree. It is interesting to notice that the top five species trees are the same for both methods. AVAILABILITY: Software and documentation are freely available from http://bioputer.mimuw.edu.pl/~gorecki/urec  相似文献   

2.
Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals—each with many genes—splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted. We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are four species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for 5 or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendant branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.  相似文献   

3.
One of the criteria for inferring a species tree from a collection of gene trees, when gene tree incongruence is assumed to be due to incomplete lineage sorting (ILS), is Minimize Deep Coalescence (MDC). Exact algorithms for inferring the species tree from rooted, binary trees under MDC were recently introduced. Nevertheless, in phylogenetic analyses of biological data sets, estimated gene trees may differ from true gene trees, be incompletely resolved, and not necessarily rooted. In this article, we propose new MDC formulations for the cases where the gene trees are unrooted/binary, rooted/non-binary, and unrooted/non-binary. Further, we prove structural theorems that allow us to extend the algorithms for the rooted/binary gene tree case to these cases in a straightforward manner. In addition, we devise MDC-based algorithms for cases when multiple alleles per species may be sampled. We study the performance of these methods in coalescent-based computer simulations.  相似文献   

4.
Using topological summaries of gene trees as a basis for species tree inference is a promising approach to obtain acceptable speed on genomic-scale datasets, and to avoid some undesirable modeling assumptions. Here we study the probabilities of splits on gene trees under the multispecies coalescent model, and how their features might inform species tree inference. After investigating the behavior of split consensus methods, we investigate split invariants—that is, polynomial relationships between split probabilities. These invariants are then used to show that, even though a split is an unrooted notion, split probabilities retain enough information to identify the rooted species tree topology for trees of 5 or more taxa, with one possible 6-taxon exception.  相似文献   

5.
Background

Discovering the location of gene duplications and multiple gene duplication episodes is a fundamental issue in evolutionary molecular biology. The problem introduced by Guigó et al. in 1996 is to map gene duplication events from a collection of rooted, binary gene family trees onto theirs corresponding rooted binary species tree in such a way that the total number of multiple gene duplication episodes is minimized. There are several models in the literature that specify how gene duplications from gene families can be interpreted as one duplication episode. However, in all duplication episode problems gene trees are rooted. This restriction limits the applicability, since unrooted gene family trees are frequently inferred by phylogenetic methods.

Results

In this article we show the first solution to the open problem of episode clustering where the input gene family trees are unrooted. In particular, by using theoretical properties of unrooted reconciliation, we show an efficient algorithm that reduces this problem into the episode clustering problems defined for rooted trees. We show theoretical properties of the reduction algorithm and evaluation of empirical datasets.

Conclusions

We provided algorithms and tools that were successfully applied to several empirical datasets. In particular, our comparative study shows that we can improve known results on genomic duplication inference from real datasets.

  相似文献   

6.

Background  

Lateral genetic transfer can lead to disagreements among phylogenetic trees comprising sequences from the same set of taxa. Where topological discordance is thought to have arisen through genetic transfer events, tree comparisons can be used to identify the lineages that may have shared genetic information. An 'edit path' of one or more transfer events can be represented with a series of subtree prune and regraft (SPR) operations, but finding the optimal such set of operations is NP-hard for comparisons between rooted trees, and may be so for unrooted trees as well.  相似文献   

7.
DupTree is a new software program for inferring rooted species trees from collections of gene trees using the gene tree parsimony approach. The program implements a novel algorithm that significantly improves upon the run time of standard search heuristics for gene tree parsimony, and enables the first truly genome-scale phylogenetic analyses. In addition, DupTree allows users to examine alternate rootings and to weight the reconciliation costs for gene trees. DupTree is an open source project written in C++. Availability: DupTree for Mac OS X, Windows, and Linux along with a sample dataset and an on-line manual are available at http://genome.cs.iastate.edu/CBL/DupTree  相似文献   

8.
Liu L  Yu L 《Systematic biology》2011,60(5):661-667
In this study, we develop a distance method for inferring unrooted species trees from a collection of unrooted gene trees. The species tree is estimated by the neighbor joining (NJ) tree built from a distance matrix in which the distance between two species is defined as the average number of internodes between two species across gene trees, that is, average gene-tree internode distance. The distance method is named NJ(st) to distinguish it from the original NJ method. Under the coalescent model, we show that if gene trees are known or estimated correctly, the NJ(st) method is statistically consistent in estimating unrooted species trees. The simulation results suggest that NJ(st) and STAR (another coalescence-based method for inferring species trees) perform almost equally well in estimating topologies of species trees, whereas the Bayesian coalescence-based method, BEST, outperforms both NJ(st) and STAR. Unlike BEST and STAR, the NJ(st) method can take unrooted gene trees to infer species trees without using an outgroup. In addition, the NJ(st) method can handle missing data and is thus useful in phylogenomic studies in which data sets often contain missing loci for some individuals.  相似文献   

9.
A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted. We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction, thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.  相似文献   

10.
Neutral macroevolutionary models, such as the Yule model, give rise to a probability distribution on the set of discrete rooted binary trees over a given leaf set. Such models can provide a signal as to the approximate location of the root when only the unrooted phylogenetic tree is known, and this signal becomes relatively more significant as the number of leaves grows. In this short note, we show that among models that treat all taxa equally, and are sampling consistent (i.e. the distribution on trees is not affected by taxa yet to be included), all such models, except one (the so-called PDA model), convey some information as to the location of the ancestral root in an unrooted tree.  相似文献   

11.
We developed a recurrence relation that counts the number of tandem duplication trees (either rooted or unrooted) that are consistent with a set of n tandemly repeated sequences generated under the standard unequal recombination (or crossover) model of tandem duplications. The number of rooted duplication trees is exactly twice the number of unrooted trees, which means that on average only two positions for a root on a duplication tree are possible. Using the recurrence, we tabulated these numbers for small values of n. We also developed an asymptotic formula that for large n provides estimates for these numbers. These numbers give a priori probabilities for phylogenies of the repeated sequences to be duplication trees. This work extends earlier studies where exhaustive counts of the numbers for small n were obtained. One application showed the significance of finding that most maximum-parsimony trees constructed from repeat sequences from human immunoglobins and T-cell receptors were tandem duplication trees. Those findings provided strong support to the proposed mechanisms of tandem gene duplication. The recurrence relation also suggests efficient algorithms to recognize duplication trees and to generate random duplication trees for simulation. We present a linear-time recognition algorithm.  相似文献   

12.
Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.  相似文献   

13.
An analytical method is presented for constructing linear invariants. All linear invariants of a k-species tree can be derived from those of (k-1)-species trees using this method. The new method is simpler than that of Cavender, which relies on numerical computations. Moreover, the new method provides a convenient tool to study the relationships between linear invariants of the same tree or of different trees. All linear invariants of trees of up to five species are derived in this study. For four species, there are 16 independent linear invariants for each of the three possible unrooted trees, 14 of which are shared by two unrooted trees and 12 of these are shared by all three unrooted trees; the last types of linear invariants can be used to construct tests on the assumptions about nucleotide substitutions. The number of linear invariants for a tree is found to increase rapidly with the number of species.  相似文献   

14.
Development of methods for estimating species trees from multilocus data is a current challenge in evolutionary biology. We propose a method for estimating the species tree topology and branch lengths using approximate Bayesian computation (ABC). The method takes as data a sample of observed rooted gene tree topologies, and then iterates through the following sequence of steps: First, a randomly selected species tree is used to compute the distribution of rooted gene tree topologies. This distribution is then compared to the observed gene topology frequencies, and if the fit between the observed and the predicted distributions is close enough, the proposed species tree is retained. Repeating this many times leads to a collection of retained species trees that are then used to form the estimate of the overall species tree. We test the performance of the method, which we call ST-ABC, using both simulated and empirical data. The simulation study examines both symmetric and asymmetric species trees over a range of branch lengths and sample sizes. The results from the simulation study show that the model performs very well, giving accurate estimates for both the topology and the branch lengths across the conditions studied, and that a sample size of 25 loci appears to be adequate for the method. Further, we apply the method to two empirical cases: a 4-taxon data set for primates and a 7-taxon data set for yeast. In both cases, we find that estimates obtained with ST-ABC agree with previous studies. The method provides efficient estimation of the species tree, and does not require sequence data, but rather the observed distribution of rooted gene topologies without branch lengths. Therefore, this method is a useful alternative to other currently available methods for species tree estimation.  相似文献   

15.
Fair-balance paradox, star-tree paradox, and Bayesian phylogenetics   总被引:1,自引:0,他引:1  
The star-tree paradox refers to the conjecture that the posterior probabilities for the three unrooted trees for four species (or the three rooted trees for three species if the molecular clock is assumed) do not approach 1/3 when the data are generated using the star tree and when the amount of data approaches infinity. It reflects the more general phenomenon of high and presumably spurious posterior probabilities for trees or clades produced by the Bayesian method of phylogenetic reconstruction, and it is perceived to be a manifestation of the deeper problem of the extreme sensitivity of Bayesian model selection to the prior on parameters. Analysis of the star-tree paradox has been hampered by the intractability of the integrals involved. In this article, I use Laplacian expansion to approximate the posterior probabilities for the three rooted trees for three species using binary characters evolving at a constant rate. The approximation enables calculation of posterior tree probabilities for arbitrarily large data sets. Both theoretical analysis of the analogous fair-coin and fair-balance problems and computer simulation for the tree problem confirmed the existence of the star-tree paradox. When the data size n --> infinity, the posterior tree probabilities do not converge to 1/3 each, but they vary among data sets according to a statistical distribution. This distribution is characterized. Two strategies for resolving the star-tree paradox are explored: (1) a nonzero prior probability for the degenerate star tree and (2) an increasingly informative prior forcing the internal branch length toward zero. Both appear to be effective in resolving the paradox, but the latter is simpler to implement. The posterior tree probabilities are found to be very sensitive to the prior.  相似文献   

16.
赵庆新  谭远德 《遗传》2002,24(4):447-454
肠道微生物与寄主具有复杂的、多方面的相互依存效应,这种依存效应所产生的共生关系或协同进化关系既可反映寄主间的系统演化关系,也可显示肠道微生物间的系统演化关系,共生关系或协同进化关系是由于寄主与肠道微生物两者之间存在着相互自然选择作用所形成的,在长期的进化历程中逐步发生的共生关系信息很可能被记录在DNA序列中。本文通过检测鱼鲤鱼科8种鱼中9种肠道菌群的分布含量对这9种菌群进行分析,且利用从GenBank调取这9种肠道细菌菌属的43个种或亚种的16S DNA序列的构建NJ树和MP树,将这6个科9个属43个种或亚种分为革兰氏阴性和革兰氏阳性两大类群(一级分枝)。在这两类群中,又以科为单位分为6个亚类群(二级分枝),而肠杆菌科中则以属为单位分为4个小类群(三级分枝),此外球状菌与杆状菌也能截然分开。将16S DNA的NJ树隐去所有的种,以属为单位所得到的以分枝形式的无根树在拓扑结构上与菌群分布含量(寄主范围)所构建的无根树相近,但芽孢杆菌在两种无根树的位置中有较大的差异。如果提高检测水平,扩大所检测的寄主对象,这种差异有可能消除。  相似文献   

17.
Supertree methods are used to construct a large tree over a large set of taxa from a set of small trees over overlapping subsets of the complete taxa set. Since accurate reconstruction methods are currently limited to a maximum of a few dozen taxa, the use of a supertree method in order to construct the tree of life is inevitable. Supertree methods are broadly divided according to the input trees: When the input trees are unrooted, the basic reconstruction unit is a quartet tree. In this case, the basic decision problem of whether there exists a tree that agrees with all quartets is NP-complete. On the other hand, when the input trees are rooted, the basic reconstruction unit is a rooted triplet and the above decision problem has a polynomial time algorithm. However, when there is no tree which agrees with all triplets, it would be desirable to find the tree that agrees with the maximum number of triplets. However, this optimization problem was shown to be NP-hard. Current heuristic approaches perform min cut on a graph representing the triplets inconsistency and return a tree that is guaranteed to satisfy some required properties. In this work, we present a different heuristic approach that guarantees the properties provided by the current methods and give experimental evidence that it significantly outperforms currently used methods. This method is based on a divide and conquer approach, where the min cut in the divide step is replaced by a max cut in a variant of the same graph. The latter is achieved by a lightweight semidefinite programming-like heuristic that leads to very fast running times  相似文献   

18.
We report on the only known case of independent discovery of unrooted trees in a historical science outside of biological systematics. The method of textual criticism (ecdotics, i.e., the building of text-version genealogies) created by French philologist Henri Quentin (1872–1935) proposes the use of a type of branching scheme equivalent to unrooted trees in phylogenetics. Because Quentin's method has never become the prevailing paradigm in philology, his insight into unrooted trees has not been noticed in previous studies comparing philology and phylogenetics. In fact, the modern use of unrooted trees in philology is seen as imported from phylogenetics. Quentin's procedure starts by building an unrooted tree (‘chain’) expressing the network of text versions (taxa) based on ‘variants’ (equivalent to unpolarized character states). Such undirected scheme is then rooted on the basis of extrinsic temporal information, thus resulting in a complete (rooted) hypothesis of relationships. Quentin asserts that the building of an unrooted tree precedes the determination of its orientation (rooting) and that the two procedures reflect distinct levels of structural organization, relying on different assumptions. Henri Quentin fully grasped the implications of time-reversible properties of unrooted trees and associated characters, in striking prescience of the same concepts developed in phylogenetics some 45 years later. The two versions of unrooted trees were developed entirely independently of each other and such convergence is testimony to the formal efficiency of approaching historical reconstruction in unrooted and rooted dimensions.  相似文献   

19.
Accuracy of phylogenetic trees estimated from DNA sequence data   总被引:4,自引:1,他引:3  
The relative merits of four different tree-making methods in obtaining the correct topology were studied by using computer simulation. The methods studied were the unweighted pair-group method with arithmetic mean (UPGMA), Fitch and Margoliash's (FM) method, thd distance Wagner (DW) method, and Tateno et al.'s modified Farris (MF) method. An ancestral DNA sequence was assumed to evolve into eight sequences following a given model tree. Both constant and varying rates of nucleotide substitution were considered. Once the DNA sequences for the eight extant species were obtained, phylogenetic trees were constructed by using corrected (d) and uncorrected (p) nucleotide substitutions per site. The topologies of the trees obtained were then compared with that of the model tree. The results obtained can be summarized as follows: (1) The probability of obtaining the correct rooted or unrooted tree is low unless a large number of nucleotide differences exists between different sequences. (2) When the number of nucleotide substitutions per sequence is small or moderately large, the FM, DW, and MF methods show a better performance than UPGMA in recovering the correct topology. The former group of methods is particularly good for obtaining the correct unrooted tree. (3) When the number of substitutions per sequence is large, UPGMA is at least as good as the other methods, particularly for obtaining the correct rooted tree. (4) When the rate of nucleotide substitution varies with evolutionary lineage, the FM, DW, and MF methods show a better performance in obtaining the correct topology than UPGMA, except when a rooted tree is to be produced from data with a large number of nucleotide substitutions per sequence.(ABSTRACT TRUNCATED AT 250 WORDS)   相似文献   

20.
Given a set of evolutionary trees on a same set of taxa, the maximum agreement subtree problem (MAST), respectively, maximum compatible tree problem (MCT), consists of finding a largest subset of taxa such that all input trees restricted to these taxa are isomorphic, respectively compatible. These problems have several applications in phylogenetics such as the computation of a consensus of phylogenies obtained from different data sets, the identification of species subjected to horizontal gene transfers and, more recently, the inference of supertrees, e.g., Trees Of Life. We provide two linear time algorithms to check the isomorphism, respectively, compatibility, of a set of trees or otherwise identify a conflict between the trees with respect to the relative location of a small subset of taxa. Then, we use these algorithms as subroutines to solve MAST and MCT on rooted or unrooted trees of unbounded degree. More precisely, we give exact fixed-parameter tractable algorithms, whose running time is uniformly polynomial when the number of taxa on which the trees disagree is bounded. The improves on a known result for MAST and proves fixed-parameter tractability for MCT.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号