Phylogeny reconstruction is a difficult computational problem, because the number of possible solutions increases with the number of included taxa. For example, for only 14 taxa, there are more than seven trillion possible unrooted phylogenetic trees. For this reason, phylogenetic inference methods commonly use clustering algorithms (e.g., the neighbor-joining method) or heuristic search strategies to minimize the amount of time spent evaluating nonoptimal trees. Even heuristic searches can be painfully slow, especially when computationally intensive optimality criteria such as maximum likelihood are used. I describe here a different approach to heuristic searching (using a genetic algorithm) that can tremendously reduce the time required for maximum-likelihood phylogenetic inference, especially for data sets involving large numbers of taxa. Genetic algorithms are simulations of natural selection in which individuals are encoded solutions to the problem of interest. Here, labeled phylogenetic trees are the individuals, and differential reproduction is effected by allowing the number of offspring produced by each individual to be proportional to that individual's rank likelihood score. Natural selection increases the average likelihood in the evolving population of phylogenetic trees, and the genetic algorithm is allowed to proceed until the likelihood of the best individual ceases to improve over time. An example is presented involving rbcL sequence data for 55 taxa of green plants. The genetic algorithm described here required only 6% of the computational effort required by a conventional heuristic search using tree bisection/reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.
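The growth in the number of candidate trees mentioned in this abstract can be checked with the standard double-factorial counts for fully bifurcating, leaf-labeled trees: (2n-5)!! unrooted and (2n-3)!! rooted trees on n taxa. The sketch below is only an illustration; the function names are hypothetical and not from the cited work. Under these formulas, a value just under eight trillion is the rooted count for 14 taxa (equivalently, the unrooted count for 15 taxa), which is worth keeping in mind when reading the "seven trillion" figure.

```python
def odd_double_factorial(k: int) -> int:
    """Return 1 * 3 * 5 * ... * k for odd k (1 when k < 3)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def n_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating labeled trees: (2n - 5)!!."""
    if n_taxa < 3:
        return 1
    return odd_double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating labeled trees: (2n - 3)!!."""
    if n_taxa < 2:
        return 1
    return odd_double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (4, 10, 14, 15):
        print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Either way, exhaustive evaluation is out of reach well before 55 taxa, which is the motivation for heuristic and genetic-algorithm searches.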

Direct optimization of unaligned sequence characters provides a natural framework to explore the sensitivity of phylogenetic hypotheses to variation in analytical parameters. Phenotypic data, when combined into such analyses, are typically analyzed with static homology correspondences unlike the dynamic homology sequence data. Static homology characters may be expected to constrain the direct optimization and thus, potentially increase the similarity of phylogenetic hypotheses under different cost sets. However, whether a total-evidence approach increases the phylogenetic stability or not remains empirically largely unexplored. Here, I studied the impact of static homology data on sensitivity using six empirical data sets composed of several molecular markers and phenotypic data. The inclusion of static homology phenotypic data increased the average stability of phylogenetic hypothesis in five out of the six data sets. To investigate if any static homology characters would have similar effect, the analyses were repeated with randomized phenotypic data, and with one of the molecular markers fixed as static homology characters. These analyses had, on average, almost no effect on the phylogenetic stability, although the randomized phenotypic data sometimes resulted in even higher stability than empirical phenotypic data. The impact was related to the strength of the phylogenetic signal in the phenotypic data: higher average jackknife support of the phenotypic tree correlated with stronger stabilizing effect in the total-evidence analysis. Phenotypic data with a strong signal made the total-evidence trees topologically more similar to the phenotypic trees, thus, they constrained the dynamic homology correspondences of the sequence data. Characters that increase phylogenetic stability are particularly valuable for phylogenetic inference. 
These results indicate an important role and additive value of phenotypic data in increasing the stability of phylogenetic hypotheses in total-evidence analyses.


Background and Aims

Here evidence for reticulation in the pantropical orchid genus Polystachya is presented, using gene trees from five nuclear and plastid DNA data sets, first among only diploid samples (homoploid hybridization) and then with the inclusion of cloned tetraploid sequences (allopolyploids). Two groups of tetraploids are compared with respect to their origins and phylogenetic relationships.


Methods

Sequences from plastid regions, three low-copy nuclear genes and ITS nuclear ribosomal DNA were analysed for 56 diploid and 17 tetraploid accessions using maximum parsimony and Bayesian inference. Reticulation was inferred from incongruence between gene trees using supernetwork and consensus network analyses and from cloning and sequencing duplicated loci in tetraploids.

Key Results

Diploid trees from individual loci showed considerable incongruity but little reticulation signal when support from more than one gene tree was required to infer reticulation. This was coupled with generally low support in the individual gene trees. Sequencing the duplicated gene copies in tetraploids showed clearer evidence of hybrid evolution, including multiple origins of one group of tetraploids included in the study.


Conclusions

A combination of cloning duplicate gene copies in allotetraploids and consensus network comparison of gene trees allowed a phylogenetic framework for reticulation in Polystachya to be built. There was little evidence for homoploid hybridization, but our knowledge of the origins and relationships of three groups of allotetraploids is greatly improved by this study. One group showed evidence of multiple long-distance dispersals to achieve a pantropical distribution; another showed no evidence of multiple origins or long-distance dispersal but had greater morphological variation, consistent with hybridization between more distantly related parents.

Supermatrix and supertree are two methods for constructing a phylogenetic tree by using multiple data sets. However, these methods are not a panacea, as conflicting signals between data sets can lead to misinterpretation of the evolutionary history of taxa. In particular, the supermatrix approach is expected to be misleading if the species-tree signal is not dominant after the combination of the data sets. Moreover, most current supertree methods suffer from two limitations: (i) they ignore or misinterpret secondary (non-dominant) phylogenetic signals of the different data sets; and (ii) the logical basis of node robustness measures is unclear. To overcome these limitations, we propose a new approach, called SuperTRI, which is based on the branch support analyses of the independent data sets, and where the reliability of the nodes is assessed using three measures: the supertree Bootstrap percentage and two other values calculated from the separate analyses: the mean branch support (mean Bootstrap percentage or mean posterior probability) and the reproducibility index. The SuperTRI approach is tested on a data matrix including seven genes for 82 taxa of the family Bovidae (Mammalia, Ruminantia), and the results are compared to those found with the supermatrix approach. The phylogenetic analyses of the supermatrix and independent data sets were done using four methods of tree reconstruction: Bayesian inference, maximum likelihood, and unweighted and weighted maximum parsimony. The results indicate, firstly, that the SuperTRI approach shows less sensitivity to the four phylogenetic methods, secondly, that it is more accurate to interpret the relationships among taxa, and thirdly, that interesting conclusions on introgression and radiation can be drawn from the comparisons between SuperTRI and supermatrix analyses. To cite this article: A. Ropiquet et al., C. R. Biologies 332 (2009).

Multilocus genomic data sets can be used to infer a rich set of information about the evolutionary history of a lineage, including gene trees, species trees, and phylogenetic networks. However, user‐friendly tools to run such integrated analyses are lacking, and workflows often require tedious reformatting and handling time to shepherd data through a series of individual programs. Here, we present a tool written in Python—TREEasy—that performs automated sequence alignment (with MAFFT), gene tree inference (with IQ‐Tree), species inference from concatenated data (with IQ‐Tree and RAxML‐NG), species tree inference from gene trees (with ASTRAL, MP‐EST, and STELLS2), and phylogenetic network inference (with SNaQ and PhyloNet). The tool only requires FASTA files and nine parameters as inputs. The tool can be run from the command line or through a Graphical User Interface (GUI). As examples, we reproduced a recent analysis of staghorn coral evolution, and performed a new analysis on the evolution of the “WGD clade” of yeast. The latter revealed novel patterns that were not identified by previous analyses. TREEasy represents a reliable and simple tool to accelerate research in systematic biology ( https://github.com/MaoYafei/TREEasy ).

Nonparametric bootstrapping methods may be useful for assessing confidence in a supertree inference. We examined the performance of two supertree bootstrapping methods on four published data sets that each include sequence data from more than 100 genes. In "input tree bootstrapping," input gene trees are sampled with replacement and then combined in replicate supertree analyses; in "stratified bootstrapping," trees from each gene's separate (conventional) bootstrap tree set are sampled randomly with replacement and then combined. Generally, support values from both supertree bootstrap methods were similar or slightly lower than corresponding bootstrap values from a total evidence, or supermatrix, analysis. Yet, supertree bootstrap support also exceeded supermatrix bootstrap support for a number of clades. There was little overall difference in support scores between the input tree and stratified bootstrapping methods. Results from supertree bootstrapping methods, when compared to results from corresponding supermatrix bootstrapping, may provide insights into patterns of variation among genes in genome-scale data sets.
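The two resampling schemes contrasted in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: trees are treated as opaque strings, the function names are hypothetical, and the stratified scheme is assumed to draw exactly one tree per gene per replicate. The supertree step itself is left out.

```python
import random
from typing import List, Sequence

Tree = str  # placeholder type, e.g. a Newick string


def input_tree_bootstrap(gene_trees: Sequence[Tree], n_reps: int,
                         rng: random.Random) -> List[List[Tree]]:
    """Resample the input gene trees themselves, with replacement."""
    k = len(gene_trees)
    return [rng.choices(gene_trees, k=k) for _ in range(n_reps)]


def stratified_bootstrap(boot_sets: Sequence[Sequence[Tree]], n_reps: int,
                         rng: random.Random) -> List[List[Tree]]:
    """Draw one tree from each gene's own bootstrap tree set per replicate
    (assumption: one draw per gene)."""
    return [[rng.choice(trees) for trees in boot_sets] for _ in range(n_reps)]


if __name__ == "__main__":
    rng = random.Random(1)
    genes = ["((A,B),C,D);", "((A,C),B,D);", "((A,B),C,D);"]
    per_gene_boot = [["((A,B),C,D);", "((A,D),B,C);"],
                     ["((A,C),B,D);"],
                     ["((A,B),C,D);", "((A,C),B,D);"]]
    print(input_tree_bootstrap(genes, 2, rng))
    print(stratified_bootstrap(per_gene_boot, 2, rng))
```

Each replicate list would then be passed to a supertree method, and clade support read off as the fraction of replicate supertrees containing the clade.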

Trees inferred from DNA sequence data provide only limited insight into the phylogeny of seed plants because the living lineages (cycads, Ginkgo, conifers, gnetophytes, and angiosperms) represent fewer than half of the major lineages that have been detected in the fossil record. Nevertheless, phylogenetic trees of living seed plants inferred from sequence data can provide a test of relationships inferred in analyses that include fossils. So far, however, significant uncertainty persists because nucleotide data support several conflicting hypotheses. It is likely that improved sampling of gymnosperm diversity in nucleotide data sets will help alleviate some of the analytical issues encountered in the estimation of seed plant phylogeny, providing a more definitive test of morphological trees. Still, rigorous morphological analyses will be required to answer certain fundamental questions, such as the identity of the angiosperm sister group and the rooting of crown seed plants. Moreover, it will be important to identify approaches for incorporating insights from data that may be accurate but less likely than sequence data to generate results supported by high bootstrap values. How best to weigh evidence and distinguish among hypotheses when some types of data give high support values and others do not remains an important problem.

We explored the use of multidimensional scaling (MDS) of tree-to-tree pairwise distances to visualize the relationships among sets of phylogenetic trees. We found the technique to be useful for exploring "tree islands" (sets of topologically related trees among larger sets of near-optimal trees), for comparing sets of trees obtained from bootstrapping and Bayesian sampling, for comparing trees obtained from the analysis of several different genes, and for comparing multiple Bayesian analyses. The technique was also useful as a teaching aid for illustrating the progress of a Bayesian analysis and as an exploratory tool for examining large sets of phylogenetic trees. We also identified some limitations to the method, including distortions of the multidimensional tree space into two dimensions through the MDS technique, and the definition of the MDS-defined space based on a limited sample of trees. Nonetheless, the technique is a useful approach for the analysis of large sets of phylogenetic trees.
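The input to such an MDS analysis is a symmetric matrix of tree-to-tree distances. The sketch below builds one using the Robinson-Foulds (symmetric difference) distance; this metric is an assumption made for illustration, since the abstract does not name the distance used. Each tree is represented, hypothetically, as a set of its nontrivial bipartitions, with each bipartition stored as the frozenset of taxa on the side that excludes a fixed reference taxon.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set

Split = FrozenSet[str]


def rf_distance(splits_a: Set[Split], splits_b: Set[Split]) -> int:
    """Robinson-Foulds distance: bipartitions found in exactly one tree."""
    return len(splits_a ^ splits_b)


def distance_matrix(trees: Sequence[Set[Split]]) -> List[List[int]]:
    """Symmetric pairwise distance matrix, ready to hand to an MDS routine."""
    n = len(trees)
    d = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = rf_distance(trees[i], trees[j])
    return d


if __name__ == "__main__":
    # Five taxa A-E; splits are written as the side not containing E.
    t1 = {frozenset("AB"), frozenset("ABC")}   # ((A,B),C,(D,E))
    t2 = {frozenset("AC"), frozenset("ABC")}   # ((A,C),B,(D,E))
    t3 = {frozenset("AB"), frozenset("ABC")}   # same topology as t1
    for row in distance_matrix([t1, t2, t3]):
        print(row)
```

The resulting matrix can be passed to any MDS implementation that accepts precomputed dissimilarities; the two-dimensional embedding is where the distortions noted in the abstract arise.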

The order Cornales descends from the earliest split in the Asterid clade of flowering plants. Despite a few phylogenetic studies, relationships among families within Cornales remain unclear. In the present study, we increased taxon and character sampling to further resolve the relationships and to date the early diversification events of the order. We conducted phylogenetic analyses of sequence data from 26S rDNA and six chloroplast DNA (cpDNA) regions using parsimony (MP), maximum likelihood (ML), and Bayesian inference (BI) methods with different partition models and different data sets. We employed relaxed, uncorrelated molecular clocks on BEAST to date the phylogeny and examined the effects of different taxon sampling, fossil calibration, and data partitions. Our results from ML and BI analyses of the combined cpDNA sequences and combined cpDNA and 26S rDNA data suggested the monophyly of each family and the following familial relationships: ((Cornaceae-Alangiaceae)-(Curtisiaceae-Grubbiaceae))-(((Nyssaceae-Davidiaceae)-Mastixiaceae)-(Hydrostachyaceae-(Hydrangeaceae-Loasaceae))). These relationships were strongly supported by posterior probability and bootstrap values, except for the sister relationship between the N-D-M and H-H-L clades. The 26S rDNA data and some MP trees from cpDNA and total evidence suggested some alternative alignments for Hydrostachyaceae within Cornales, but results of SH tests indicated that these trees were significantly worse explanations of the total data. Phylogenetic dating with simultaneous calibration of multiple nodes suggested that the crown group of Cornales originated around the middle Cretaceous and rapidly radiated into several major clades. The origins of most families dated back to the late Cretaceous except for Curtisiaceae and Grubbiaceae which may have diverged in the very early Tertiary.
We found that reducing sampling density within families and analyzing partitioned data sets from coding and noncoding cpDNA, 26S rDNA, and combined data sets produced congruent estimation of divergence times, but reducing the number and changing positions of calibration points resulted in very different estimations.

Two qualitative taxonomic characters are potentially compatible if the states of each can be ordered into a character state tree in such a way that the two resulting character state trees are compatible. The number of potentially compatible pairs (NPCP) of qualitative characters from a data set may be considered to be a measure of its phylogenetic randomness. The value of NPCP depends on the number of evolutionary units (EUs), the number of characters, the number of states in the characters, the distributions of EUs among these states, and the amount and distribution of missing information and so does not directly indicate degree of phylogenetic randomness. Thus, for an observed data set, we used Monte Carlo methods to estimate the probability that a data set chosen equiprobably from among those identical (with respect to all the other above determining features) to the observed data set would have as high (or low) an NPCP as the observed data set. This probability, the realized significance of the observed NPCP, is attractive as an indication of phylogenetic randomness because it does not require the assumptions made by other such methods: No character state trees are assumed and consequently, only potential compatibility can be determined; no particular method of phylogenetic estimation is assumed; and no phylogenetic trees are constructed. We determined the values and significances of NPCP for analyses of 57 data sets taken from 53 published sources. All data sets from 37 of those sources exhibited realized significances of < 0.01, indicating high levels of phylogenetic nonrandomness. From each of the remaining 16 sources, at least one data set was more phylogenetically random. Inclusion of outgroups changed significance in some cases, but not always in the same direction. Data sets with significantly low NPCP may be consistent with an ancient hybrid origin (or other ancient polyphyletic gene exchange, crossing over, viral transfer, etc.) of the study group.  

This study makes use of three sources of data, morphology and two chloroplast DNA sequences, ndhF and rbcL, to resolve relationships in Gesneriaceae. Cladograms from each of the three data sets separately are not topologically congruent. Statistical indices suggest that each data set is congruent with the ndhF data although rbcL and morphology are themselves incongruent. Consensus methods provide no resolution of taxonomic relationships when trees from the different data sets are combined. Combining data sets generally results in cladograms that are more fully resolved than each of the data sets analyzed separately and support for the clades increases based on higher decay index and bootstrap values. These results indicate that there is a phylogenetic signal common to each of the data sets; however, the noise (errors due to homoplasy, mis-scoring, etc.) unique to each data source masks this signal. In combining the data, the evidence for the common evolutionary history in each data set overcomes the noise and is apparent in the resulting trees.

Genome-scale data sets result in an enhanced resolution of the phylogenetic inference by reducing stochastic errors. However, there is also an increase of systematic errors due to model violations, which can lead to erroneous phylogenies. Here, we explore the impact of systematic errors on the resolution of the eukaryotic phylogeny using a data set of 143 nuclear-encoded proteins from 37 species. The initial observation was that, despite the impressive amount of data, some branches had no significant statistical support. To demonstrate that this lack of resolution is due to a mutual annihilation of phylogenetic and nonphylogenetic signals, we created a series of data sets with slightly different taxon sampling. As expected, these data sets yielded strongly supported but mutually exclusive trees, thus confirming the presence of conflicting phylogenetic and nonphylogenetic signals in the original data set. To decide on the correct tree, we applied several methods expected to reduce the impact of some kinds of systematic error. Briefly, we show that (i) removing fast-evolving positions, (ii) recoding amino acids into functional categories, and (iii) using a site-heterogeneous mixture model (CAT) are three effective means of increasing the ratio of phylogenetic to nonphylogenetic signal. Finally, our results allow us to formulate guidelines for detecting and overcoming phylogenetic artefacts in genome-scale phylogenetic analyses.
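Step (ii), recoding amino acids into functional categories, can be illustrated with the widely used six-state Dayhoff grouping. This is a sketch under an explicit assumption: the abstract does not say which categories were used, so the grouping below is a common convention rather than the authors' scheme, and the function name is hypothetical.

```python
# Six-state Dayhoff grouping (assumed scheme, not taken from the abstract).
DAYHOFF6 = {
    "AGPST": "1",  # small
    "DENQ": "2",   # acidic and amide
    "HKR": "3",    # basic
    "ILMV": "4",   # aliphatic hydrophobic
    "FWY": "5",    # aromatic
    "C": "6",      # cysteine
}

# Flatten to a per-residue lookup table.
RECODE = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def recode(seq: str, unknown: str = "-") -> str:
    """Map each residue to its category; anything unrecognized becomes `unknown`."""
    return "".join(RECODE.get(aa, unknown) for aa in seq.upper())


if __name__ == "__main__":
    print(recode("ACDEF"))
```

Collapsing 20 states into a handful discards within-category substitutions, which are the ones most prone to saturation and compositional bias, at the cost of some phylogenetic signal.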

Wing venation provides useful characters with which to classify extant and fossil insects. Recently, quantification of its shape using landmarks has increased the potential of wing venation to distinguish taxa. However, the use of wing landmarks in phylogenetic analyses remains largely unexplored. Here, we tested landmark analysis under parsimony (LAUP) to include wing shape data in a phylogenetic analysis of hornets and yellow jackets. Using 68 morphological characters, nine genes and wing landmarks, we produced the first total‐evidence phylogeny of Vespinae. We also tested the influence of LAUP parameters using simulated landmarks. Our data confirmed that optimization parameters, alignment method, landmark number and, under low optimization parameters, the initial orientation of aligned shapes can influence LAUP results. Furthermore, single landmark configurations never accurately reflected the topology used for data simulation, but results were significantly close when compared to random topologies. Thus, wing landmark configurations were unreliable phylogenetic characters when treated independently, but provided some useful insights when combined with other data. Our phylogeny corroborated the monophyly of most groups proposed on the basis of morphology and showed the fossil Palaeovespa is distantly related to extant genera. Unstable relationships among genera suggest that rapid radiations occurred in the early history of the Vespinae.

All methods proposed to date for mapping landmark configurations on a phylogenetic tree start from an alignment generated by methods that make no use of phylogenetic information, usually by superimposing all configurations against a consensus configuration. In order to properly interpret differences between landmark configurations along the tree as changes in shape, the metric chosen to define the ancestral assignments should also form the basis to superimpose the configurations. Thus, we present here a method that merges both steps, map and align, into a single procedure that (for the given tree) produces a multiple alignment and ancestral assignments such that the sum of the Euclidean distances between the corresponding landmarks along tree nodes is minimized. This approach is an extension of the method proposed by Catalano et al. (2010. Phylogenetic morphometrics (I): the use of landmark data in a phylogenetic framework. Cladistics. 26:539-549) for mapping landmark data with parsimony as optimality criterion. In the context of phylogenetics, this method allows maximizing the degree to which similarity in landmark positions can be accounted for by common ancestry. In the context of morphometrics, this approach guarantees (heuristics aside) that all the transformations inferred on the tree represent changes in shape. The performance of the method was evaluated on different data sets, indicating that the method produces marked improvements in tree score (up to 5% compared with generalized superimpositions, up to 11% compared with ordinary superimpositions). These empirical results stress the importance of incorporating the phylogenetic information into the alignment step.
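The quantity being minimized, the sum of Euclidean distances between corresponding landmarks along the branches of the tree, can be written down directly. The sketch below only evaluates that score for given node configurations; it does not perform the superimposition or the ancestral optimization, and the data layout and function name are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Config = Sequence[Point]  # one landmark configuration, landmarks in fixed order


def landmark_tree_score(configs: Dict[str, Config],
                        edges: List[Tuple[str, str]]) -> float:
    """Sum, over all branches, of the distances between matching landmarks
    in the configurations at the two ends of the branch."""
    total = 0.0
    for parent, child in edges:
        for p, c in zip(configs[parent], configs[child]):
            total += math.dist(p, c)
    return total


if __name__ == "__main__":
    configs = {
        "root": [(0.0, 0.0), (1.0, 0.0)],
        "a": [(0.0, 0.0), (1.0, 1.0)],
        "b": [(3.0, 4.0), (1.0, 0.0)],
    }
    edges = [("root", "a"), ("root", "b")]
    print(landmark_tree_score(configs, edges))
```

A search procedure would vary both the rigid placement of each terminal configuration and the ancestral landmark positions to drive this score down, which is what distinguishes the approach from aligning first and mapping afterwards.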

A morphological data set and three sources of data from the chloroplast genome (two genes and a restriction site survey) were used to reconstruct the phylogenetic history of the pickerelweed family Pontederiaceae. The chloroplast data converged towards a single tree, presumably the true chloroplast phylogeny of the family. Unrooted trees estimated from each of the three chloroplast data sets were identical or extremely similar in shape to each other and mostly robustly supported. There was no evidence of significant heterogeneity among the data sets, and the few topological differences seen among unrooted trees from each chloroplast data set are probably artifacts of sampling error on short branches. Despite well-documented differences in rates of evolution for different characters in individual data sets, equally weighted parsimony permits accurate reconstructions of chloroplast relationships in Pontederiaceae. A separate morphology-based data set yielded trees that were very different from the chloroplast trees. Although there was substantial support from the morphological evidence for several major clades supported by chloroplast trees, most of the conflicting phylogenetic structure on the morphology trees was not robust. Nonetheless, several statistical tests of incongruence indicate significant heterogeneity between molecules and morphology. The source of this apparent incongruence appears to be a low ratio of phylogenetic signal to noise in the morphological data.

We introduce a new method for identifying optimal incomplete data sets from large sequence databases based on the graph theoretic concept of alpha-quasi-bicliques. The quasi-biclique method searches large sequence databases to identify useful phylogenetic data sets with a specified amount of missing data while maintaining the necessary amount of overlap among genes and taxa. The utility of the quasi-biclique method is demonstrated on large simulated sequence databases and on a data set of green plant sequences from GenBank. The quasi-biclique method greatly increases the taxon and gene sampling in the data sets while adding only a limited amount of missing data. Furthermore, under the conditions of the simulation, data sets with a limited amount of missing data often produce topologies nearly as accurate as those built from complete data sets. The quasi-biclique method will be an effective tool for exploiting sequence databases for phylogenetic information and also may help identify critical sequences needed to build large phylogenetic data sets.
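To make the idea concrete, the sketch below checks whether a chosen set of taxa and genes forms an alpha-quasi-biclique in a taxon-by-gene availability graph. The definition used here is an assumption, since the abstract does not spell it out: every selected taxon must have sequences for at least a (1 - alpha) fraction of the selected genes, and every selected gene must be sampled in at least a (1 - alpha) fraction of the selected taxa. The function name and data layout are hypothetical, and the search for an optimal subset is not shown.

```python
from typing import Iterable, Set, Tuple


def is_quasi_biclique(present: Set[Tuple[str, str]],
                      taxa: Iterable[str],
                      genes: Iterable[str],
                      alpha: float) -> bool:
    """present holds (taxon, gene) pairs for which a sequence exists."""
    taxa, genes = list(taxa), list(genes)
    min_genes = (1 - alpha) * len(genes)   # required coverage per taxon
    min_taxa = (1 - alpha) * len(taxa)     # required coverage per gene
    taxa_ok = all(sum((t, g) in present for g in genes) >= min_genes
                  for t in taxa)
    genes_ok = all(sum((t, g) in present for t in taxa) >= min_taxa
                   for g in genes)
    return taxa_ok and genes_ok


if __name__ == "__main__":
    present = {("t1", "g1"), ("t1", "g2"), ("t2", "g1"),
               ("t2", "g2"), ("t3", "g1")}
    print(is_quasi_biclique(present, ["t1", "t2", "t3"], ["g1", "g2"], 0.5))
    print(is_quasi_biclique(present, ["t1", "t2", "t3"], ["g1", "g2"], 0.0))
```

With alpha set to 0 the test reduces to an ordinary biclique, that is, a complete data matrix with no missing cells.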

Effects of taxonomic sampling and conflicting signal on the inference of seed plant trees supported in previous molecular analyses were explored using 13 single-locus data sets. Changing the number of taxa in single-locus analyses had limited effects on log likelihood differences between the gnepine (Gnetales plus Pinaceae) and gnetifer (Gnetales plus conifers) trees. Distinguishing among these trees also was little affected by the use of different substitution parameters. The 13-locus combined data set was partitioned into nine classes based on substitution rates. Sites evolving at intermediate rates had the best likelihood and parsimony scores on gnepine trees, and those evolving at the fastest rates had the best parsimony scores on Gnetales-sister trees (Gnetales plus other seed plants). When the fastest evolving sites were excluded from parsimony analyses, well-supported gnepine trees were inferred from the combined data and from each genomic partition. When all sites were included, Gnetales-sister trees were inferred from the combined data, whereas a different tree was inferred from each genomic partition. Maximum likelihood trees from the combined data and from each genomic partition were well-supported gnepine trees. A preliminary stratigraphic test highlights the poor fit of Gnetales-sister trees to the fossil data.

Traditionally, single-copy orthologs have been the gold standard in phylogenomics. Most phylogenomic studies identify putative single-copy orthologs using clustering approaches and retain families with a single sequence per species. This limits the amount of data available by excluding larger families. Recent advances have suggested several ways to include data from larger families. For instance, tree-based decomposition methods facilitate the extraction of orthologs from large families. Additionally, several methods for species tree inference are robust to the inclusion of paralogs and could use all of the data from larger families. Here, we explore the effects of using all families for phylogenetic inference by examining relationships among 26 primate species in detail and by analyzing five additional data sets. We compare single-copy families, orthologs extracted using tree-based decomposition approaches, and all families with all data. We explore several species tree inference methods, finding that identical trees are returned across nearly all subsets of the data and methods for primates. The relationships among Platyrrhini remain contentious; however, the species tree inference method matters more than the subset of data used. Using data from larger gene families drastically increases the number of genes available and leads to consistent estimates of branch lengths, nodal certainty and concordance, and inferences of introgression in primates. For the other data sets, topological inferences are consistent whether single-copy families or orthologs extracted using decomposition approaches are analyzed. Using larger gene families is a promising approach to include more data in phylogenomics without sacrificing accuracy, at least when high-quality genomes are available.
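The conventional filtering step described at the start of this abstract, retaining only families with a single sequence per species, can be sketched as below. The data layout and function name are hypothetical, and the strict rule of exactly one sequence for every species is an assumption; real pipelines often tolerate some missing species.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Member = Tuple[str, str]  # (species, sequence_id)


def single_copy_families(families: Dict[str, List[Member]],
                         species: Iterable[str]) -> Dict[str, List[Member]]:
    """Keep families with exactly one sequence from each listed species
    and no sequences from any other species."""
    wanted = set(species)
    kept = {}
    for fam_id, members in families.items():
        counts = Counter(sp for sp, _ in members)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            kept[fam_id] = members
    return kept


if __name__ == "__main__":
    families = {
        "fam1": [("human", "h1"), ("chimp", "c1"), ("gorilla", "g1")],
        "fam2": [("human", "h2"), ("human", "h3"), ("chimp", "c2"),
                 ("gorilla", "g2")],   # paralog in human: dropped
        "fam3": [("human", "h4"), ("chimp", "c3")],  # gorilla missing: dropped
    }
    print(sorted(single_copy_families(families, ["human", "chimp", "gorilla"])))
```

Families such as `fam2` above are the data that this filter discards and that decomposition-based or paralog-tolerant methods aim to recover.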

Despite the broad adoption of multispecies coalescent (MSC) methods for nuclear phylogenomics, they have yet to be applied to mitochondrial (mt) genomic data. As the potential sources of phylogenomic bias that MSC methods can address, such as incomplete lineage sorting, horizontal gene transfer and gene tree heterogeneity, have been found in mt genomic data, these approaches may improve the accuracy of phylogenetic inference with these data. In the present study, we examined the behaviour of MSC methods in reconstructing the phylogeny of Lepidoptera (butterflies and moths), a group for which mt genomic data are known to have strong resolving power. Traditional concatenation methods of analysing mt genomes for Lepidoptera infer topologies highly congruent with those generated from independent nuclear datasets. Individual mt gene trees performed poorly in recovering consensus relationships at deep levels (i.e. superfamily monophyly and inter-relationships) and only moderately well for shallow relationships (i.e. within Papilionoidea). In contrast, MSC analyses with ASTRAL performed strongly with almost complete concordance to both concatenated mt genome analyses and independent nuclear analyses at both deep and shallow phylogenetic scales. Outgroup choice had a limited impact on tree accuracy, with even phylogenetically distant outgroups still resulting in topologies highly congruent with results from nuclear datasets, although MSC analyses appeared to be marginally more affected by outgroup choice than concatenation analyses. In general, discordance between concatenation and MSC analyses was found at nodes whose resolution varied between previous nuclear phylogenomic studies. The sensitivity of individual relationships to analysis with MSC vs concatenation can thus be used to test the robustness of phylogenetic hypotheses. 
For insect phylogenetics, MSC is a reliable inference method for mt genomic data and is thus a useful complement to the already widely used concatenation approaches.

Despite the widespread perception that evolutionary inference from molecular sequences is a statistical problem, there has been very little attention paid to questions of experimental design. Previous consideration of this topic has led to little more than an empirical folklore regarding the choice of suitable genes for analysis, and to dispute over the best choice of taxa for inclusion in data sets. I introduce what I believe are new methods that permit the quantification of phylogenetic information in a sequence alignment. The methods use likelihood calculations based on Markov-process models of nucleotide substitution allied with phylogenetic trees, and allow a general approach to optimal experimental design. Two examples are given, illustrating realistic problems in experimental design in molecular phylogenetics and suggesting more general conclusions about the choice of genomic regions, sequence lengths and taxa for evolutionary studies.
