Similar literature
20 similar articles found (search time: 437 ms)
1.
Multilocus genomic data sets can be used to infer a rich set of information about the evolutionary history of a lineage, including gene trees, species trees, and phylogenetic networks. However, user-friendly tools to run such integrated analyses are lacking, and workflows often require tedious reformatting and handling time to shepherd data through a series of individual programs. Here, we present TREEasy, a tool written in Python that performs automated sequence alignment (with MAFFT), gene tree inference (with IQ-Tree), species tree inference from concatenated data (with IQ-Tree and RAxML-NG), species tree inference from gene trees (with ASTRAL, MP-EST, and STELLS2), and phylogenetic network inference (with SNaQ and PhyloNet). The tool requires only FASTA files and nine parameters as inputs. It can be run from the command line or through a graphical user interface (GUI). As examples, we reproduced a recent analysis of staghorn coral evolution and performed a new analysis of the evolution of the "WGD clade" of yeast. The latter revealed novel patterns that were not identified by previous analyses. TREEasy represents a reliable and simple tool to accelerate research in systematic biology ( https://github.com/MaoYafei/TREEasy ).

2.
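To explore the effect of evolutionary models on DNA barcode-based classification, this study obtained COI gene sequences from specimens of 44 species of Noctuidae from Wuling Mountain. Phylogenetic trees were constructed using neighbor-joining, maximum parsimony, maximum likelihood, and Bayesian inference, and model success rates were evaluated for 12 neighbor-joining models, 7 maximum likelihood models, and 2 Bayesian inference models. The results showed that the success rates of the 12 neighbor-joining models differed little and were relatively stable; the success rates of the different maximum likelihood and Bayesian models differed markedly and were unstable; maximum parsimony is not model-based, and its success rate was relatively stable. Neighbor-joining and maximum likelihood share 6 models, and the success rates of these 6 models differed between the two methods. In addition, the molecular data included cases in which a species was represented by only one sequence, which significantly reduced model success rates, indicating that DNA barcoding studies need multiple samples per species.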

3.
Gene tree distributions under the coalescent process   (total citations: 10; self-citations: 0; citations by others: 10)
Under the coalescent model for population divergence, lineage sorting can cause considerable variability in gene trees generated from any given species tree. In this paper, we derive a method for computing the distribution of gene tree topologies given a bifurcating species tree for trees with an arbitrary number of taxa in the case that there is one gene sampled per species. Applications for gene tree distributions include determining exact probabilities of topological equivalence between gene trees and species trees and inferring species trees from multiple datasets. In addition, we examine the shapes of gene tree distributions and their sensitivity to changes in branch lengths, species tree shape, and tree size. The method for computing gene tree distributions is implemented in the computer program COAL.
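The smallest case of such a distribution can be written out by hand. The sketch below is not the COAL program and does not cover arbitrary numbers of taxa; it only evaluates the well-known three-species result, in which a species tree ((A,B),C) with internal branch length t (in coalescent units) yields the matching gene tree with probability 1 - (2/3)e^(-t) and each of the two discordant gene trees with probability (1/3)e^(-t). The function name and the dictionary layout are illustrative assumptions.

```python
import math


def three_taxon_topology_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the length of the internal branch in coalescent units
    (an assumed input; one gene copy is sampled per species).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce
    # within the internal branch of the species tree.
    p_no_coalescence = math.exp(-t)
    # If they fail, all three lineages enter the root population and
    # each of the three pairs is equally likely to coalesce first.
    discordant = p_no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0):
        print(t, three_taxon_topology_probs(t))
```

With t = 0 the three topologies are equally likely, and the concordant topology dominates as t grows, which is the sensitivity to branch length that the abstract describes.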

4.
To tree or not to tree   (total citations: 2; self-citations: 1; citations by others: 1)
The practice of tracking geographical divergence along a phylogenetic tree has added an evolutionary perspective to biogeographic analysis within single species. In spite of the popularity of phylogeography, there is an emerging problem. Recurrent mutation and recombination both create homoplasy, multiple evolutionary occurrences of the same character that are identical in state but not identical by descent. Homoplasic molecular data are phylogenetically ambiguous. Converting homoplasic molecular data into a tree represents an extrapolation, and there can be myriad candidate trees among which to choose. Derivative biogeographic analyses of 'the tree' are analyses of that extrapolation, and the results depend on the tree chosen. I explore the informational aspects of converting a multicharacter data set into a phylogenetic tree, and then explore what happens when that tree is used for population analysis. Three conclusions follow: (i) some trees are better than others; good trees are true to the data, whereas bad trees are not; (ii) for biogeographic analysis, we should use only good trees, which yield the same biogeographic inference as the phenetic data, but little more; and (iii) the reliable biogeographic inference is inherent in the phenetic data, not the trees.

5.
Numerous simulation studies have investigated the accuracy of phylogenetic inference of gene trees under maximum parsimony, maximum likelihood, and Bayesian techniques. The relative accuracy of species tree inference methods under simulation has received less study. The number of analytical techniques available for inferring species trees is increasing rapidly, and in this paper, we compare the performance of several species tree inference techniques at estimating recent species divergences using computer simulation. Simulating gene trees within species trees of different shapes and with varying tree lengths (T) and population sizes (N_e), and evolving sequences on those gene trees, allows us to determine how phylogenetic accuracy changes in relation to different levels of deep coalescence and phylogenetic signal. When the probability of discordance between the gene trees and the species tree is high (i.e., T is small and/or N_e is large), Bayesian species tree inference using the multispecies coalescent (BEST) outperforms other methods. The performance of all methods improves as the total length of the species tree is increased, which reflects the combined benefits of decreasing the probability of discordance between species trees and gene trees and gaining more accurate estimates for gene trees. Decreasing the probability of deep coalescences by reducing N_e also leads to accuracy gains for most methods. Increasing the number of loci from 10 to 100 improves accuracy under difficult demographic scenarios (i.e., coalescent units ≤ 4N_e), but 10 loci are adequate for estimating the correct species tree in cases where deep coalescence is limited or absent. In general, the correlation between the phylogenetic accuracy and the posterior probability values obtained from BEST is high, although posterior probabilities are overestimated when the prior distribution for N_e is misspecified.

6.
We propose a model-based approach that uses multiple gene trees to estimate the species tree. The coalescent process requires that gene divergences occur earlier than species divergences when there is any polymorphism in the ancestral species. Under this scenario, speciation times are restricted to be smaller than the corresponding gene split times. The maximum tree (MT) is the tree with the largest possible speciation times in the space of species trees restricted by available gene trees. If all populations have the same population size, the MT is the maximum likelihood estimate of the species tree. It can be shown that the MT is a consistent estimator of the species tree even when the MT is built upon estimates of the true gene trees, provided the gene tree estimates are statistically consistent. The MT converges in probability to the true species tree at an exponential rate.
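The constraint described above (each speciation time can be no larger than any gene split time for the same pair of species) suggests a simple construction, sketched here under stated assumptions rather than as the authors' implementation. Each gene tree is assumed to be summarized as a dictionary mapping a pair of species to their coalescence time; the elementwise minimum across genes gives the upper bound on each pairwise divergence time, and single-linkage clustering on those bounds yields the largest ultrametric tree that respects them. The function names and the input layout are hypothetical.

```python
from itertools import combinations


def min_coalescence_times(gene_times):
    """Elementwise minimum of pairwise coalescence times across gene trees.

    gene_times: list of dicts {frozenset({sp1, sp2}): time}, one per gene
    (an assumed summary of each gene tree).
    """
    pairs = gene_times[0].keys()
    return {pair: min(g[pair] for g in gene_times) for pair in pairs}


def single_linkage_tree(taxa, times):
    """Join clusters in order of their smallest between-cluster time.

    Returns (tree, merge_times), where tree is a nested tuple and
    merge_times lists the divergence time assigned to each join.
    """
    clusters = {frozenset([t]): t for t in taxa}  # members -> subtree
    merge_times = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = min(times[frozenset((x, y))] for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        subtree = (clusters.pop(a), clusters.pop(b))
        clusters[a | b] = subtree
        merge_times.append(d)
    (tree,) = clusters.values()
    return tree, merge_times


if __name__ == "__main__":
    AB, AC, BC = (frozenset(p) for p in (("A", "B"), ("A", "C"), ("B", "C")))
    gene1 = {AB: 1.0, AC: 3.0, BC: 3.0}
    gene2 = {AB: 2.5, AC: 2.0, BC: 2.5}
    bounds = min_coalescence_times([gene1, gene2])
    print(single_linkage_tree(["A", "B", "C"], bounds))
```

In the toy input, the two genes disagree on topology, but the pairwise minima still group A with B at time 1.0 and join C at time 2.0.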

7.
8.
The New World swallow genus Tachycineta comprises nine species that collectively have a wide geographic distribution and remarkable variation both within and among species in ecologically important traits. Existing phylogenetic hypotheses for Tachycineta are based on mitochondrial DNA sequences and thus provide estimates of a single gene tree. In this study we sequenced multiple individuals from each species at 16 nuclear intron loci. We used concatenated-gene approaches (Bayesian and maximum likelihood) as well as coalescent-based species tree inference to reconstruct phylogenetic relationships of the genus. We examined the concordance and conflict between the nuclear and mitochondrial trees and between concatenated and coalescent-based inferences. Our results provide an alternative phylogenetic hypothesis to the existing mitochondrial DNA estimate of phylogeny. This new hypothesis provides a more accurate framework in which to explore trait evolution and examine the evolution of the mitochondrial genome in this group.

9.
Assessments of the effects of gene tree error in coalescent analyses have widely ignored coalescent branch lengths (CBLs) despite their potential utility in estimating ancestral population demographics and detecting species tree anomaly zones. However, the ability of coalescent methods to obtain accurate estimates remains largely unexplored. Errors in gene trees should lead to underestimates of the true CBL, and for a given set of comparisons, longer CBLs should be more accurate. Here, we furthered our empirical understanding of how error in gene tree quality (i.e., locus informativeness and gene tree resolution) affects CBLs using four datasets composed of ultraconserved elements (UCE) or exons for clades that exhibit wide ranges of branch lengths. For each dataset, we compared the impact of locus informativeness (assessed using the number of parsimony-informative sites) and gene tree resolution on CBL estimates. Our results, in general, showed that CBLs were drastically shorter when estimates included loci with low informativeness. Gene tree resolution also had an impact on UCE datasets, with polytomous gene trees producing longer branches than randomly resolved gene trees. However, resolution did not appear to affect CBL estimates from the more informative exon datasets. Thus, as expected, gene tree quality affects CBL estimates, though this can generally be minimized by using moderate filtering to select more informative loci and/or by allowing polytomies in gene trees. These approaches, as well as additional contributions to improve CBL estimation, should lead to CBLs that are useful for addressing evolutionary and biological questions.

10.
MOTIVATION: When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ('phylogenomics') is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Our goal is to automate phylogenomics using explicit phylogenetic inference. A necessary component is an algorithm to infer speciation and duplication events in a given gene tree. RESULTS: We give an algorithm to infer speciation and duplication events on a gene tree by comparison to a trusted species tree. This algorithm has a worst-case running time of O(n^2), which is inferior to two previous algorithms that are approximately O(n) for a gene tree of n sequences. However, our algorithm is extremely simple, and its asymptotic worst-case behavior is only realized on pathological data sets. We show empirically, using 1750 gene trees constructed from the Pfam protein family database, that it appears to be a practical (and often superior) algorithm for analyzing real gene trees. AVAILABILITY: http://www.genetics.wustl.edu/eddy/forester.
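To make the idea of labeling gene tree nodes against a trusted species tree concrete, here is a minimal sketch of the standard LCA-mapping rule: a gene tree node is called a duplication when it maps to the same species-tree clade as one of its children, and a speciation otherwise. This is not the forester implementation and makes no attempt to match its running time; trees are assumed to be nested tuples whose leaves are species names, and all function names are illustrative.

```python
def leaf_set(tree):
    """Set of species names found at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def species_clades(species_tree):
    """All clades (as leaf sets) of the species tree, including single leaves."""
    clades = [leaf_set(species_tree)]
    if not isinstance(species_tree, str):
        for child in species_tree:
            clades.extend(species_clades(child))
    return clades


def lca_map(gene_species, clades):
    """Smallest species-tree clade that contains every species under a gene node."""
    return min((c for c in clades if gene_species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene tree nodes labeled as duplications by the LCA-mapping rule."""
    clades = species_clades(species_tree)

    def visit(node):
        if isinstance(node, str):
            return 0
        here = lca_map(leaf_set(node), clades)
        # Duplication: some child maps to the same species clade as its parent.
        is_dup = any(lca_map(leaf_set(child), clades) == here for child in node)
        return int(is_dup) + sum(visit(child) for child in node)

    return visit(gene_tree)


if __name__ == "__main__":
    species = (("human", "mouse"), "chicken")
    print(count_duplications((("human", "mouse"), "chicken"), species))
    print(count_duplications((("human", "mouse"), ("human", "mouse")), species))
```

A gene tree that matches the species tree needs no duplications, whereas a tree containing two human and mouse copies in sister clades is explained by one duplication at their common ancestor.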

11.
The fact that different phylogenomic data sets can lead to highly supported but inconsistent results suggests that conflict among gene trees in real data sets could be severe. We provide here a detailed exploration of gene tree space to investigate the relationships in Hymenoptera based on data obtained by Johnson et al. (Current Biology, 2013, 23, 2058), in which ants and Apoidea (bees and spheciform wasps) were recovered as sister groups, contradicting previous studies. We found high levels of topological variation among gene trees, several of them disagreeing with previously published hypotheses. To profile the dynamics of emerging support versus conflicting signal in combined analysis of data, we employed a novel method based on the incremental addition of randomized data to coalescence-based phylogenetic inference. Although the monophyly of Aculeata and of Formicidae was consistently recovered using as little as 6.5% of the 308 available markers, signal for the Formicidae + Apoidea clade prevailed only after more than 50% of the loci were sampled. Still, non-negligible support for alternative hypotheses remained until all genes were added to the analysis. Our results suggest that phylogenetic conflict is rather pervasive and not scattered as noise across individual gene trees because alternative topologies were recovered not from a specific subset, but from several random combinations of loci. Thus, even though phylogenetic signal recovered from full gene data sets was already dominant in much smaller ensembles, large amounts of data may indeed be necessary to overcome phylogenetic conflict.

12.
MOTIVATION: Maximum likelihood (ML) methods have become very popular for constructing phylogenetic trees from sequence data. However, despite noticeable recent progress, with large and difficult datasets (e.g. multiple genes with conflicting signals) current ML programs still require huge computing time and can become trapped in bad local optima of the likelihood function. When this occurs, the resulting trees may still show some of the defects (e.g. long branch attraction) of starting trees obtained using fast distance or parsimony programs. METHODS: Subtree pruning and regrafting (SPR) topological rearrangements are usually sufficient to intensively search the tree space. Here, we propose two new methods to make SPR moves more efficient. The first method uses a fast distance-based approach to detect the least promising candidate SPR moves, which are then simply discarded. The second method locally estimates the change in likelihood for any remaining potential SPRs, as opposed to globally evaluating the entire tree for each possible move. These two methods are implemented in a new algorithm with a sophisticated filtering strategy, which efficiently selects potential SPRs and concentrates most of the likelihood computation on the promising moves. RESULTS: Experiments with real datasets comprising 35-250 taxa show that, while indeed greatly reducing the amount of computation, our approach provides likelihood values at least as good as those of the best-known ML methods so far and is very robust to poor starting trees. Furthermore, combining our new SPR algorithm with local moves such as PHYML's nearest neighbor interchanges, the time needed to find good solutions can sometimes be reduced even more.

13.
Katoh K, Miyata T. FEBS Letters, 1999, 463(1-2): 129-132
Applying the tree bisection and reconnection (TBR) algorithm, we have developed a heuristic method (maximum likelihood (ML)-TBR) for inferring the ML tree based on tree topology search. For initial trees from which iterative processes start in ML-TBR, two cases were considered: one is 100 neighbor-joining (NJ) trees based on the bootstrap resampling and the other is 100 randomly generated trees. The same ML tree was obtained in both cases. All different iterative processes started from 100 independent initial trees ultimately converged on one optimum tree with the largest log-likelihood value, suggesting that a limited number of initial trees will be quite enough in ML-TBR. This also suggests that the optimum tree corresponds to the global optimum in tree topology space and thus probably coincides with the ML tree inferred by intact ML analysis. This method has been applied to the inference of the phylogenetic tree of the SOX family members. The mammalian testis-determining gene SRY is believed to have evolved from SOX-3, a member of the SOX family, based on several lines of evidence, including their sequence similarity, the location of SOX-3 on the X chromosome and some aspects of their expression. This model should be supported directly from the phylogenetic tree of the SOX family, but no evidence has been provided to date. A recently published NJ tree shows an implausibly remote origin of SRY, suggesting that a more sophisticated method is required for understanding this problem. The ML tree inferred by the present method showed that the SRYs of marsupial and placental mammals form a monophyletic cluster which had diverged from the mammalian SOX-3 in the early evolution of mammals.

14.

Background  

The ever-increasing wealth of genomic sequence information provides an unprecedented opportunity for large-scale phylogenetic analysis. However, species phylogeny inference is obfuscated by incongruence among gene trees due to evolutionary events such as gene duplication and loss, incomplete lineage sorting (deep coalescence), and horizontal gene transfer. Gene tree parsimony (GTP) addresses this issue by seeking a species tree that requires the minimum number of evolutionary events to reconcile a given set of incongruent gene trees. Despite its promise, the use of gene tree parsimony has been limited by the fact that existing software is either not fast enough to tackle large data sets or is restricted in the range of evolutionary events it can handle.

15.
Choosing among alternative trees of multigene families   (total citations: 4; self-citations: 0; citations by others: 4)
Estimation of gene trees is the first step in testing alternative hypotheses about the evolution of multigene families. The standard practice for inferring gene family history is to construct trees that meet some objective criteria based on the fit of the character state changes (nucleotide or amino acid changes) to the gene tree. Unfortunately, analysis of character state data can be misleading. In addition, this approach ignores information about the relationships of the species from which the genes have been sampled. In this paper I explore using statistics of fit between the character data and gene trees and the reconciliation of the gene and species trees for choosing among alternative evolutionary hypotheses of gene families. In particular, I advocate a two-pronged strategy for choosing among alternative gene trees. First, the character data are used to define a set of acceptable gene trees (i.e., trees that are not significantly different from the minimum length tree). Next, the set of acceptable gene trees is reconciled with a known species tree, and the gene tree requiring the fewest gene duplications and losses is adopted as the best estimate of evolutionary history. The approach is illustrated using three gene families: BMP, EGR, and LDH.

16.
Phylogenetic trees from multiple genes can be obtained in two fundamentally different ways. In one, gene sequences are concatenated into a super-gene alignment, which is then analyzed to generate the species tree. In the other, phylogenies are inferred separately from each gene, and a consensus of these gene phylogenies is used to represent the species tree. Here, we have compared these two approaches by means of computer simulation, using 448 parameter sets, including evolutionary rate, sequence length, base composition, and transition/transversion rate bias. In these simulations, we emphasized a worst-case scenario analysis in which 100 replicate datasets for each evolutionary parameter set (gene) were generated, and the replicate dataset that produced a tree topology showing the largest number of phylogenetic errors was selected to represent that parameter set. Both randomly selected and worst-case replicates were utilized to compare the consensus and concatenation approaches primarily using the neighbor-joining (NJ) method. We find that the concatenation approach yields more accurate trees, even when the sequences concatenated have evolved with very different substitution patterns and no attempts are made to accommodate these differences while inferring phylogenies. These results appear to hold true for parsimony and likelihood methods as well. The concatenation approach shows >95% accuracy with only 10 genes. However, this gain in accuracy is sometimes accompanied by reinforcement of certain systematic biases, resulting in spuriously high bootstrap support for incorrect partitions, whether we employ site, gene, or a combined bootstrap resampling approach. Therefore, it will be prudent to report the number of individual genes supporting an inferred clade in the concatenated sequence tree, in addition to the bootstrap support.

17.
Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals (each with many genes) splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted. We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are four species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for 5 or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendant branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.

18.
Testing the relations between tree parameters and the richness and composition of lichen communities in near-natural stands could be a first step to gather information for forest managers interested in conservation and in biodiversity assessment and monitoring. This work aims at evaluating the influence of tree age and age-related parameters on tree-level richness and community composition of lichens on spruce in an Alpine forest. The lichen survey was carried out in four sites used for long-term monitoring. In each site, tree age, diameter at breast height, tree height, the first branch height, and crown projection area were measured for each tree. Trees were stratified into three age classes: (1) <100 years old, immature trees usually not suitable for felling, (2) 100–200 years old, mature trees suitable for felling, and (3) >200 years old, over-mature trees normally rare or absent in managed stands. In each site, seven trees in each age class were selected randomly. Tree age and related parameters proved to influence both tree-level species richness and composition of lichen communities. Species richness increased with tree age and related parameters indicative of tree size. This relation could be interpreted as the result of different joint effects of age per se and tree size with its area-effect. Species turnover is also suspected to improve species richness on over-mature trees. Similarly to species richness, tree-level species composition can be partially explained by tree-related parameters. Species composition changed from young to old trees, several lichens being associated with over-mature trees. This pool of species, including nationally rare lichens, represents a community which is probably poorly developed in managed forests. 
In accordance with the general aims of near-to-nature forestry, the presence of over-mature trees should be enhanced in the future forest landscape of the Alps, especially in protected areas and Natura 2000 sites, where conservation purposes are explicitly included in the management guidelines.

19.

Background  

Several phylogenetic approaches have been developed to estimate species trees from collections of gene trees. However, maximum likelihood approaches for estimating species trees under the coalescent model are limited. Although the likelihood of a species tree under the multispecies coalescent model has already been derived by Rannala and Yang, it can be shown that the maximum likelihood estimate (MLE) of the species tree (topology, branch lengths, and population sizes) from gene trees under this formula does not exist. In this paper, we develop a pseudo-likelihood function of the species tree to obtain maximum pseudo-likelihood estimates (MPE) of species trees, with branch lengths of the species tree in coalescent units.

20.
Phylogeny reconstruction is a difficult computational problem, because the number of possible solutions increases with the number of included taxa. For example, for only 14 taxa, there are more than seven trillion possible unrooted phylogenetic trees. For this reason, phylogenetic inference methods commonly use clustering algorithms (e.g., the neighbor-joining method) or heuristic search strategies to minimize the amount of time spent evaluating nonoptimal trees. Even heuristic searches can be painfully slow, especially when computationally intensive optimality criteria such as maximum likelihood are used. I describe here a different approach to heuristic searching (using a genetic algorithm) that can tremendously reduce the time required for maximum-likelihood phylogenetic inference, especially for data sets involving large numbers of taxa. Genetic algorithms are simulations of natural selection in which individuals are encoded solutions to the problem of interest. Here, labeled phylogenetic trees are the individuals, and differential reproduction is effected by allowing the number of offspring produced by each individual to be proportional to that individual's rank likelihood score. Natural selection increases the average likelihood in the evolving population of phylogenetic trees, and the genetic algorithm is allowed to proceed until the likelihood of the best individual ceases to improve over time. An example is presented involving rbcL sequence data for 55 taxa of green plants. The genetic algorithm described here required only 6% of the computational effort required by a conventional heuristic search using tree bisection/reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.
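The growth in the number of candidate trees can be checked with the standard counting formulas for strictly bifurcating, leaf-labeled trees: (2n-5)!! unrooted trees and (2n-3)!! rooted trees for n taxa. The sketch below simply evaluates those formulas; the function names are illustrative. Under these formulas, 14 taxa give about 3.2 x 10^11 unrooted trees and about 7.9 x 10^12 rooted trees, so the "seven trillion" figure quoted in the abstract appears to correspond to the rooted count for 14 taxa (equivalently, the unrooted count for 15), which is worth confirming against the original paper.

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for odd k (returns 1 when k <= 1)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def n_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating labeled trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa):
    """Number of rooted bifurcating labeled trees: (2n - 3)!! for n >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (4, 10, 14, 15):
        print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Either count makes exhaustive evaluation impractical well before the 55 taxa of the rbcL example, which is the motivation for heuristic searches such as the genetic algorithm described above.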
