Similar articles
 Found 20 similar articles; search took 46 ms
1.
Over the past two decades, there has been a long-standing debate about the impact of taxon sampling on phylogenetic inference. Studies have been based on both real and simulated data sets, within actual and theoretical contexts, and using different inference methods, to study the impact of taxon sampling. In some cases, conflicting conclusions have been drawn for the same data set. The main questions explored in studies to date have been about the effects of using sparse data, adding new taxa, including more characters from genome sequences and using different (or concatenated) locus regions. These questions can be reduced to more fundamental ones about the assessment of data quality and the design guidelines of taxon sampling in phylogenetic inference experiments. This review summarizes progress to date in understanding the impact of taxon sampling on the accuracy of phylogenetic analysis.

2.

Background

Genome level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole genome DNA sequences of hundreds of organisms and large-scale EST databases, a large number of candidate genes for inclusion into phylogenetic analysis have become available. In this work, we exploit the burgeoning genomic data being generated for plant genomes to address one of the more important plant phylogenetic questions concerning the hierarchical relationships of the several major seed plant lineages (angiosperms, Cycadales, Ginkgoales, Gnetales, and Coniferales), which continues to be a work in progress, despite numerous studies using single, few or several genes and morphology datasets. Although most recent studies support the notion that gymnosperms and angiosperms are monophyletic and sister groups, they differ on the topological arrangements within each major group.

Methodology

We exploited the EST database to construct a supermatrix of DNA sequences (over 1,200 concatenated orthologous gene partitions for 17 taxa) to examine non-flowering seed plant relationships. This analysis employed programs that offer rapid and robust orthology determination of novel, short sequences from plant ESTs based on reference seed plant genomes. Our phylogenetic analysis retrieved an unbiased (with respect to gene choice), well-resolved and highly supported phylogenetic hypothesis that was robust to various outgroup combinations.

Conclusions

We evaluated character support and the relative contribution of numerous variables (e.g. gene number, missing data, partitioning schemes, taxon sampling and outgroup choice) on tree topology, stability and support metrics. Our results indicate that while missing characters and order of addition of genes to an analysis do not influence branch support, inadequate taxon sampling and limited choice of outgroup(s) can lead to spurious inference of phylogeny when dealing with phylogenomic scale data sets. As expected, support and resolution increase significantly as more informative characters are added, until reaching a threshold, beyond which support metrics stabilize, and the effect of adding conflicting characters is minimized.

3.
To study the evolution of mtDNA and the intergeneric relationships of New World Jays (Aves: Corvidae), we sequenced the entire mitochondrial DNA control region (CR) from 21 species representing all genera of New World jays, an Old World jay, crows, and a magpie. Using maximum likelihood methods, we found that both the transition/transversion ratio (κ) and among site rate variation (α) were higher in flanking domains I and II than in the conserved central domain and that the frequency of indels was highest in domain II. Estimates of κ and α were much more influenced by the density of taxon sampling than by alternative optimal tree topologies. We implemented a successive approximation method incorporating these parameters into phylogenetic analysis. In addition we compared our study in detail to a previous study using cytochrome b and morphology to examine the effect of taxon sampling, evolutionary rates of genes, and combined data on tree resolution. We found that the particular weighting scheme used had no effect on tree topology and little effect on tree robustness. Taxon sampling had a significant effect on tree robustness but little effect on the topology of the best tree. The CR data set differed nonsignificantly from the tree derived from the cytochrome b/morphological data set primarily in the placement of the genus Gymnorhinus, which is near the base of the CR tree. However, contrary to conventional taxonomy, the CR data set suggested that blue and black jays (Cyanocorax sensu lato) might be paraphyletic and that the brown jay Psilorhinus (=Cyanocorax) morio is the sister group to magpie jays (Calocitta), a phylogenetic hypothesis that is likely as parsimonious with regard to nonmolecular characters as monophyly of Cyanocorax. The CR tree also suggests that the common ancestor of NWJs was likely a cooperative breeder. 
Consistent with recent systematic theory, our data suggest that DNA sequences with high substitution rates such as the CR may nonetheless be useful in reconstructing relatively deep phylogenetic nodes in avian groups. Received: 10 November 1999 / Accepted: 16 March 2000

4.
5.
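An exploration of taxon sampling and the accuracy of phylogenetic analysis   Total citations: 6 (self-citations: 2; citations by others: 4)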
Appropriate and extensive taxon sampling is one of the most important determinants of accurate phylogenetic estimation. In addition, accuracy of inferences about evolutionary processes obtained from phylogenetic analyses is improved significantly by thorough taxon sampling efforts. Many recent efforts to improve phylogenetic estimates have focused instead on increasing sequence length or the number of overall characters in the analysis, and this often does have a beneficial effect on the accuracy of phylogenetic analyses. However, phylogenetic analyses of few taxa (but each represented by many characters) can be subject to strong systematic biases, which in turn produce high measures of repeatability (such as bootstrap proportions) in support of incorrect or misleading phylogenetic results. Thus, it is important for phylogeneticists to consider both the sampling of taxa and the sampling of characters in designing phylogenetic studies. Taxon sampling also improves estimates of evolutionary parameters derived from phylogenetic trees, and is thus important for improved applications of phylogenetic analyses. Analysis of sensitivity to taxon inclusion, the possible effects of long-branch attraction, and sensitivity of parameter estimation for model-based methods should be a part of any careful and thorough phylogenetic analysis. Furthermore, recent improvements in phylogenetic algorithms and in computational power have removed many constraints on analyzing large, thoroughly sampled data sets. Thorough taxon sampling is thus one of the most practical ways to improve the accuracy of phylogenetic estimates, as well as the accuracy of biological inferences that are based on these phylogenetic trees.

6.
The first step of any molecular phylogenetic analysis is the selection of the species and sequences to be included, the taxon sampling. Several pitfalls exist already at this stage. Sequences can contain errors, annotations in databases can be inaccurate and even the taxonomic classification of a species can be wrong. Usually, these artefacts become evident only after calculation of the phylogenetic tree. The taxon sampling then has to be corrected iteratively. This can become tedious and time consuming, as in most cases the taxon sampling is de-coupled from the further steps of the phylogenetic analysis. Here, we present the ITS2 Workbench (http://its2.bioapps.biozentrum.uni-wuerzburg.de/), which eliminates this problem by a tight integration of taxon sampling, secondary structure prediction, multiple alignment and phylogenetic tree calculation. The ITS2 Workbench has access to more than 280,000 ITS2 sequences and their structures provided by the ITS2 database, enabling sequence-structure based alignment and tree reconstruction. This allows the interactive improvement of the taxon sampling throughout the whole phylogenetic tree reconstruction process. Thus, the ITS2 Workbench enables a fast, interactive and iterative taxon sampling leading to more accurate ITS2 based phylogenies.

7.
8.
Many phylogenetic analyses that include numerous terminals but few genes show high resolution and branch support for relatively recently diverged clades, but lack of resolution and/or support for "basal" clades of the tree. The various benefits of increased taxon and character sampling have been widely discussed in the literature, albeit primarily based on simulations rather than empirical data. In this study, we used a well-sampled gene-tree analysis (based on 100 mitochondrial genomes of higher teleost fishes) to test empirically the efficiency of different methods of data sampling and phylogenetic inference to "correctly" resolve the basal clades of a tree (based on congruence with the reference tree constructed using all 100 taxa and 7990 characters). By itself, increased character sampling was an inefficient method by which to decrease the likelihood of "incorrect" resolution (i.e., incongruence with the reference tree) for parsimony analyses. Although increased taxon sampling was a powerful approach to alleviate "incorrect" resolution for parsimony analyses, it had the general effect of increasing the number of, and support for, "incorrectly" resolved clades in the Bayesian analyses. For both the parsimony and Bayesian analyses, increased taxon sampling, by itself, was insufficient to help resolve the basal clades, making this sampling strategy ineffective for that purpose. For this empirical study, the most efficient of the six approaches considered to resolve the basal clades when adding nucleotides to a dataset that consists of a single gene sampled for a small, but representative, number of taxa, is to increase character sampling and analyze the characters using the Bayesian method.

9.
Wilson JJ. PLoS ONE 2011, 6(9): e24769

Background

A common perception is that DNA barcode datamatrices have limited phylogenetic signal due to the small number of characters available per taxon. However, another school of thought suggests that the massively increased taxon sampling afforded through the use of DNA barcodes may considerably increase the phylogenetic signal present in a datamatrix. Here I test this hypothesis using a large dataset of macrolepidopteran DNA barcodes.

Methodology/Principal Findings

Taxon sampling was systematically increased in datamatrices containing macrolepidopteran DNA barcodes. Sixteen family groups were designated as concordance groups, and two quantitative measures, the taxon consistency index and the taxon retention index, were used to assess any changes in phylogenetic signal as a result of the increase in taxon sampling. DNA barcodes alone, even with maximal taxon sampling (500 species per family), were not sufficient to reconstruct monophyly of families, and increased taxon sampling generally increased the number of clades formed per family. However, the scores indicated a similar level of taxon retention (species from a family clustering together) in the cladograms as the number of species included in the datamatrix was increased, suggesting substantial phylogenetic signal below the ‘family’ branch.

Conclusions/Significance

The development of supermatrix, supertree or constrained tree approaches could enable the exploitation of the massive taxon sampling afforded through DNA barcodes for phylogenetics, connecting the twigs resolved by barcodes to the deep branches resolved through phylogenomics.

10.
Missing data are a widely recognized nuisance factor in phylogenetic analyses, and the fear of missing data may deter systematists from including characters that are highly incomplete. In this paper, I used simulations to explore the consequences of including sets of characters that contain missing data. More specifically, I tested whether the benefits of increasing the number of characters outweigh the costs of adding missing data cells to a matrix. The results show that the addition of a set of characters with missing data is generally more likely to increase phylogenetic accuracy than decrease it, but the potential benefits of adding these characters quickly disappear as the proportion of missing data increases. Furthermore, despite the overall trend, adding characters with missing data does decrease accuracy in some cases. In these situations, the missing data entries are not themselves misleading, but their presence may mimic the effects of limited taxon sampling, which can positively mislead. Criteria are discussed for predicting whether adding characters with missing data may increase or decrease accuracy. The results of this study also suggest that accuracy can be increased to a surprising degree by (1) "filling the holes" in a data matrix as much as possible (even when relatively few taxa are missing data), and (2) adding fewer characters scored for all taxa rather than adding a larger number of characters known for fewer taxa. Missing data can also be eliminated from an analysis through the exclusion of incomplete taxa rather than incomplete characters, but this approach may reduce the usefulness of the analysis and (in some cases) the accuracy of the estimated trees.

11.

Background  

Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish effects due to error in tree reconstruction from effects due to missing data per se. We approach this problem using an explicitly phylogenomic criterion of success, decisiveness, which refers to whether the pattern of taxon coverage allows for uniquely defining a single tree for all taxa.
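The idea of a taxon coverage pattern can be made concrete with a small sketch. The abstract does not state the decisiveness criterion itself, so the check below is an assumption chosen for illustration: it tests one simple sufficient condition, namely that every set of four taxa is sampled together in at least one locus. If that holds, every quartet of the full tree is fixed by some locus tree, and the quartets of a binary tree determine it. The function name and the input layout (a dict from locus name to a set of taxa) are hypothetical.

```python
from itertools import combinations


def uncovered_quadruples(coverage):
    """List the 4-taxon sets that no single locus samples in full.

    coverage: dict mapping a locus name to the set of taxa sampled for it
    (a hypothetical layout chosen for this sketch).
    An empty result means the sufficient condition described above holds.
    """
    taxa = sorted(set().union(*coverage.values()))
    loci = [frozenset(sampled) for sampled in coverage.values()]
    return [
        quad
        for quad in combinations(taxa, 4)
        if not any(set(quad) <= locus for locus in loci)
    ]


# Two loci that overlap in three taxa but never sample A and E together.
example = {
    "locus1": {"A", "B", "C", "D"},
    "locus2": {"B", "C", "D", "E"},
}
missing = uncovered_quadruples(example)
```

A non-empty result does not by itself show that the pattern fails to be decisive; it only shows that this particular sufficient condition is not met.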

12.
Recent studies have shown that addition or deletion of taxa from a data matrix can change the estimate of phylogeny. I used 29 data sets from the literature to examine the effect of taxon sampling on phylogeny estimation within data sets. I then used multiple regression to assess the effect of number of taxa, number of characters, homoplasy, strength of support, and tree symmetry on the sensitivity of data sets to taxonomic sampling. Sensitivity to sampling was measured by mapping characters from a matrix of culled taxa onto optimal trees for that reduced matrix and onto the pruned optimal tree for the entire matrix, then comparing the length of the reduced tree to the length of the pruned complete tree. Within-data-set patterns can be described by a second-order equation relating fraction of taxa sampled to sensitivity to sampling. Multiple regression analyses found number of taxa to be a significant predictor of sensitivity to sampling; retention index, number of informative characters, total support index, and tree symmetry were nonsignificant predictors. I derived a predictive regression equation relating fraction of taxa sampled and number of taxa potentially sampled to sensitivity to taxonomic sampling and calculated values for this equation within the bounds of the variables examined. The length difference between the complete tree and a subsampled tree was generally small (average difference of 0-2.9 steps), indicating that subsampling taxa is probably not an important problem for most phylogenetic analyses using up to 20 taxa.

13.
Missing data are commonly thought to impede a resolved or accurate reconstruction of phylogenetic relationships, and probabilistic analysis techniques are increasingly viewed as less vulnerable to the negative effects of data incompleteness than parsimony analyses. We test both assumptions empirically by conducting parsimony and Bayesian analyses on an approximately 1.5 × 10⁶‐cell (27 965 characters × 52 species) mustelid–procyonid molecular supermatrix with 62.7% missing entries. Contrary to the first assumption, phylogenetic relationships inferred from our analyses are fully (Bayesian) or almost fully (parsimony) resolved topologically with mostly strong support and also largely in accord with prior molecular estimations of mustelid and procyonid phylogeny derived with parsimony, Bayesian, and other probabilistic analysis techniques from smaller but complete or nearly complete data sets. Contrary to the second assumption, we found no compelling evidence in support of a relationship between the inferior performance of parsimony and taxon incompleteness (i.e. the proportion of missing character data for a taxon), although we found evidence for a connection between the inferior performance of parsimony and character incompleteness (i.e. no overlap in character data between some taxa). The relatively good performance of our analyses may be related to the large number of sampled characters, so that most taxa (even highly incomplete ones) are represented by a sufficient number of characters allowing both approaches to resolve their relationships. © The Willi Hennig Society 2009.
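For scale, 27 965 characters × 52 species gives 1 454 180 cells, so 62.7% missing corresponds to roughly 0.91 million empty cells. The two notions of incompleteness defined in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the matrix layout (a dict from taxon name to an equal-length string of states), the "?" missing-data symbol, and both function names are hypothetical and do not come from the study.

```python
from itertools import combinations

MISSING = "?"  # assumed symbol for a missing cell


def taxon_incompleteness(matrix):
    """Proportion of missing character data for each taxon."""
    return {
        taxon: sum(state == MISSING for state in row) / len(row)
        for taxon, row in matrix.items()
    }


def non_overlapping_pairs(matrix):
    """Pairs of taxa that share no character scored in both of them."""
    pairs = []
    for a, b in combinations(sorted(matrix), 2):
        shared = any(
            x != MISSING and y != MISSING
            for x, y in zip(matrix[a], matrix[b])
        )
        if not shared:
            pairs.append((a, b))
    return pairs


# Toy matrix: t1 and t2 are each half complete but never overlap.
toy = {"t1": "AC??", "t2": "??GT", "t3": "ACGT"}
per_taxon = taxon_incompleteness(toy)
disjoint = non_overlapping_pairs(toy)
```

In the toy matrix, t1 and t2 have the same proportion of missing data as each other, yet only their lack of overlap makes them a non-overlapping pair, which mirrors the distinction the abstract draws between the two measures.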

14.
The problem of missing data is often considered to be the most important obstacle in reconstructing the phylogeny of fossil taxa and in combining data from diverse characters and taxa for phylogenetic analysis. Empirical and theoretical studies show that including highly incomplete taxa can lead to multiple equally parsimonious trees, poorly resolved consensus trees, and decreased phylogenetic accuracy. However, the mechanisms that cause incomplete taxa to be problematic have remained unclear. It has been widely assumed that incomplete taxa are problematic because of the proportion or amount of missing data that they bear. In this study, I use simulations to show that the reduced accuracy associated with including incomplete taxa is caused by these taxa bearing too few complete characters rather than too many missing data cells. This seemingly subtle distinction has a number of important implications. First, the so-called missing data problem for incomplete taxa is, paradoxically, not directly related to their amount or proportion of missing data. Thus, the level of completeness alone should not guide the exclusion of taxa (contrary to common practice), and these results may explain why empirical studies have sometimes found little relationship between the completeness of a taxon and its impact on an analysis. These results also (1) suggest a more effective strategy for dealing with incomplete taxa, (2) call into question a justification of the controversial phylogenetic supertree approach, and (3) show the potential for the accurate phylogenetic placement of highly incomplete taxa, both when combining diverse data sets and when analyzing relationships of fossil taxa.

15.
Wiens JJ, Tiu J. PLoS ONE 2012, 7(8): e42925

Background

Phylogenies are essential to many areas of biology, but phylogenetic methods may give incorrect estimates under some conditions. A potentially common scenario of this type is when few taxa are sampled and terminal branches for the sampled taxa are relatively long. However, the best solution in such cases (i.e., sampling more taxa versus more characters) has been highly controversial. A widespread assumption in this debate is that added taxa must be complete (no missing data) in order to save analyses from the negative impacts of limited taxon sampling. Here, we evaluate whether incomplete taxa can also rescue analyses under these conditions (empirically testing predictions from an earlier simulation study).

Methodology/Principal Findings

We utilize DNA sequence data from 16 vertebrate species with well-established phylogenetic relationships. In each replicate, we randomly sample 4 species, estimate their phylogeny (using Bayesian, likelihood, and parsimony methods), and then evaluate whether adding in the remaining 12 species (which have 50, 75, or 90% of their data replaced with missing data cells) can improve phylogenetic accuracy relative to analyzing the 4 complete taxa alone. We find that in those cases where sampling few taxa yields an incorrect estimate, adding taxa with 50% or 75% missing data can frequently (>75% of relevant replicates) rescue Bayesian and likelihood analyses, recovering accurate phylogenies for the original 4 taxa. Even taxa with 90% missing data can sometimes be beneficial.

Conclusions

We show that adding taxa that are highly incomplete can improve phylogenetic accuracy in cases where analyses are misled by limited taxon sampling. These surprising empirical results confirm those from simulations, and show that the benefits of adding taxa may be obtained with unexpectedly small amounts of data. These findings have important implications for the debate on sampling taxa versus characters, and for studies attempting to resolve difficult phylogenetic problems.
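The replicate design described in this abstract (4 randomly chosen complete species plus 12 species with 50, 75, or 90% of their data replaced by missing cells) can be sketched as below. This is a minimal illustration, not the authors' pipeline: the abstract does not say how the cells to be blanked were chosen, so masking uniformly random positions is an assumption, as are the function name, the "?" symbol, and the dict-of-strings alignment layout. Tree estimation itself is out of scope here.

```python
import random


def make_replicate(alignment, n_complete=4, missing_fraction=0.5, seed=0):
    """Build one replicate: keep n_complete taxa intact, mask the rest.

    alignment: dict mapping taxon name to an aligned sequence string
    (hypothetical layout). Positions to mask are drawn at random, which
    is an assumption; the source does not specify the masking scheme.
    """
    rng = random.Random(seed)
    taxa = sorted(alignment)
    complete = set(rng.sample(taxa, n_complete))
    replicate = {}
    for taxon in taxa:
        cells = list(alignment[taxon])
        if taxon not in complete:
            n_mask = round(missing_fraction * len(cells))
            for position in rng.sample(range(len(cells)), n_mask):
                cells[position] = "?"
        replicate[taxon] = "".join(cells)
    return complete, replicate


# Toy alignment: 16 taxa, 20 sites each, 75% of cells masked in 12 of them.
toy_alignment = {f"sp{i:02d}": "ACGT" * 5 for i in range(16)}
complete_taxa, masked = make_replicate(
    toy_alignment, n_complete=4, missing_fraction=0.75, seed=1
)
```

Passing a fixed seed makes each replicate reproducible, which is convenient when the same 4-taxon subset has to be analysed once alone and once together with the masked taxa.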

16.
The exclusive use of characters coding for specific life stages may bias tree reconstruction. If characters from several life stages are coded, the type of coding becomes important. Here, we simulate the influence on tree reconstruction of morphological characters of Odonata larvae incorporated into a data matrix based on the adult body under different coding schemes. For testing purposes, our analysis is focused on a well‐supported hypothesis: the relationships of the suborders Zygoptera, ‘Anisozygoptera’, and Anisoptera. We studied the cephalic morphology of Epiophlebia, a key taxon among Odonata, and compared it with representatives of Zygoptera and Anisoptera in order to complement the data matrix. Odonate larvae are characterized by a peculiar morphology, such as the specific head form, mouthpart configuration, ridge configuration, cephalic musculature, and leg and gill morphology. Four coding strategies were used to incorporate the larval data: artificial coding (AC), treating larvae as independent terminal taxa; non‐multistate coding (NMC), preferring the adult life stage; multistate coding (MC); and coding larval and adult characters separately (SC) within the same taxon. As expected, larvae are ‘monophyletic’ in the AC strategy, but with anisopteran and zygopteran larvae as sister groups. Excluding larvae in the NMC approach leads to strong support for both monophyletic Odonata and Epiprocta, whereas MC erodes phylogenetic signal completely. This is an obvious result of the larval morphology leading to many multistate characters. SC results in the strongest support for Odonata, and Epiprocta receives the same support as with NMC. Our results show the deleterious effects of larval morphology on tree reconstruction when multistate coding is applied. Coding larval characters separately is still the best approach in a phylogenetic framework. © 2015 The Linnean Society of London

17.
There has been increasing interest in integrating a regional tree of life with community assembly rules in ecological research. This raises questions regarding the impacts of taxon sampling strategies at the regional versus global scales on the topology. To address this concern, we constructed two trees for the nitrogen-fixing clade: (i) a genus-level global tree including 1023 genera; and (ii) a regional tree comprising 303 genera, with taxon sampling limited to China. We used the supermatrix approach and performed maximum likelihood analyses on combined matK, rbcL, and trnL-F plastid sequences. We found that the topologies of the global and the regional trees of the N-fixing clade were generally congruent. However, whereas relationships among the four orders obtained with the global tree agreed with the accepted topology obtained in focused analyses with more genes, the regional topology obtained different relationships, albeit weakly supported. At a finer scale, the phylogenetic position of the family Myricaceae was found to be sensitive to sampling density. We expect that internal support throughout the phylogeny could be improved with denser taxon sampling. The taxon sampling approach (global vs. regional) did not have a major impact on fine-level branching patterns of the N-fixing clade. Thus, a well-resolved phylogeny with relatively dense taxon sampling strategy at the regional scale appears, in this case, to be a good representation of the overall phylogenetic pattern and could be used in ecological research. Otherwise, the regional tree should be adjusted according to the correspondingly reliable global tree.

18.
Comprehensive sampling of genomic biodiversity is fast becoming a reality for some genomic regions and complete organelle genomes. Genomic biodiversity is defined as large genomic sequences from many species, and here some recent work is reviewed that demonstrates the potential benefits of genomic biodiversity for molecular evolutionary analysis and phylogenetic reconstruction. This work shows that using likelihood-based approaches, taxon addition can dramatically improve phylogenetic reconstruction. Features or dynamics of the evolutionary process are much more easily inferred with large numbers of taxa, and large numbers are essential for discriminating differences in evolutionary patterns between sites. Accurate prediction of site-specific patterns can improve phylogenetic reconstruction by an amount equivalent to quadrupling sequence length. Genomic biodiversity is particularly central to research relating patterns of evolution, adaptation and coevolution to structural and functional features of proteins. Research on detecting coevolution between amino acid residues in proteins demonstrates a clear need for much greater numbers of closely related taxa to better discriminate site-specific patterns of interaction, and to allow more detailed analysis of coevolutionary interactions between subunits in protein complexes. It is argued that parsing out coevolutionary and other context-dependent substitution probabilities is essential for discriminating between coevolution and adaptation, and for more realistically modelling the evolution of proteins. Also reviewed is research that argues for increasing the efficiency of acquiring genomic biodiversity, and suggests that this might be done by simultaneously shotgun cloning and sequencing genomic mixtures from many species. 
Increased efficiency is a prerequisite if genomic biodiversity levels are to rapidly increase by orders of magnitude, and thus lead to dramatically improved understanding of interactions between protein structure, function and sequence evolution.

19.
Ancestral state reconstruction is a method used to study the evolutionary trajectories of quantitative characters on phylogenies. Although efficient methods for univariate ancestral state reconstruction under a Brownian motion model have been described for at least 25 years, to date no generalization has been described to allow more complex evolutionary models, such as multivariate trait evolution, non‐Brownian models, missing data, and within‐species variation. Furthermore, even for simple univariate Brownian motion models, most phylogenetic comparative R packages compute ancestral states via inefficient tree rerooting and full tree traversals at each tree node, making ancestral state reconstruction extremely time‐consuming for large phylogenies. Here, a computationally efficient method for fast maximum likelihood ancestral state reconstruction of continuous characters is described. The algorithm has linear complexity relative to the number of species and outperforms the fastest existing R implementations by several orders of magnitude. The described algorithm is capable of performing ancestral state reconstruction on a 1,000,000‐species phylogeny in fewer than 2 s using a standard laptop, whereas the next fastest R implementation would take several days to complete. The method is generalizable to more complex evolutionary models, such as phylogenetic regression, within‐species variation, non‐Brownian evolutionary models, and multivariate trait evolution. Because this method enables fast repeated computations on phylogenies of virtually any size, implementation of the described algorithm can drastically alleviate the computational burden of many otherwise prohibitively time‐consuming tasks requiring reconstruction of ancestral states, such as phylogenetic imputation of missing data, bootstrapping procedures, Expectation‐Maximization algorithms, and Bayesian estimation. 
The described ancestral state reconstruction algorithm is implemented in the Rphylopars functions anc.recon and phylopars.
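To give a feel for why ancestral state estimation under Brownian motion can run in time linear in the number of species, here is a minimal sketch of the classic pruning (independent-contrasts style) computation of the maximum likelihood root state for a single continuous trait. It is not the algorithm of the paper and not the Rphylopars implementation: it handles only the root and only the univariate Brownian case, and the nested-list tree encoding and function name are hypothetical. Branch lengths are assumed to be positive.

```python
def bm_root_estimate(node):
    """Return (estimate, extra_variance) for the subtree rooted at node.

    A leaf is a number (the observed trait value). An internal node is a
    list of (child, branch_length) pairs. Each child contributes with
    precision 1 / (branch_length + extra_variance_of_child), and the
    node's estimate is the precision-weighted mean of its children.
    Every node is visited once, so the work grows linearly with tree size.
    Recursion is used for brevity; very deep trees would need an
    explicit stack instead.
    """
    if not isinstance(node, list):
        return float(node), 0.0
    precision_sum = 0.0
    weighted_sum = 0.0
    for child, branch_length in node:
        value, extra = bm_root_estimate(child)
        precision = 1.0 / (branch_length + extra)
        precision_sum += precision
        weighted_sum += precision * value
    return weighted_sum / precision_sum, 1.0 / precision_sum


# Tree ((A:1, B:1):1, C:2) with trait values A=1, B=3, C=5.
toy_tree = [([(1.0, 1.0), (3.0, 1.0)], 1.0), (5.0, 2.0)]
root_value, root_extra = bm_root_estimate(toy_tree)
```

For the toy tree the estimate works out to 23/7, the same value given by the generalized least squares formula with the Brownian covariance matrix of the three tips. Estimates for the remaining internal nodes would need a second, root-to-tip pass that this sketch omits.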

20.
We examined how alignment of internal transcribed spacers of rDNA in fungi and plants changes with increasing genetic distance by successive removal of sequences from each data set followed by realignment and phylogenetic analysis. Increasing genetic distance can negatively affect phylogenetic reconstruction in two ways. First, it may cause errors in the alignment and therefore the homology hypotheses of the sequence characters. Second, it may cause errors in the homology assessments of character states because of multiple hits on individual branches. These two causes of error in phylogenetic inference were distinguished from one another in our analysis. The errors in alignment caused by increasing genetic distance were primarily due to inserting too few gaps and inserting gaps at the wrong positions. Errors in tree resolution, topology, and/or branch-support values were more often caused by multiple hits than by misaligned positions. This suggests that increasing genetic distance negatively affects our primary homology assessments of character states more severely than our primary homology assessments of characters. We suggest that increasing taxon sampling with the aim of subdividing long branches is a strategy for obtaining reliable alignments.
