首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
The rapid accumulation of whole-genome data has renewed interest in the study of genomic rearrangements. Comparative genomics, evolutionary biology, and cancer research all require models and algorithms to elucidate the mechanisms, history, and consequences of these rearrangements. However, even simple models lead to NP-hard problems, particularly in the area of phylogenetic analysis. Current approaches are limited to small collections of genomes and low-resolution data (typically a few hundred syntenic blocks). Moreover, whereas phylogenetic analyses from sequence data are deemed incomplete unless bootstrapping scores (a measure of confidence) are given for each tree edge, no equivalent to bootstrapping exists for rearrangement-based phylogenetic analysis. We describe a fast and accurate algorithm for rearrangement analysis that scales up, in both time and accuracy, to modern high-resolution genomic data. We also describe a novel approach to estimate the robustness of results-an equivalent to the bootstrapping analysis used in sequence-based phylogenetic reconstruction. We present the results of extensive testing on both simulated and real data showing that our algorithm returns very accurate results, while scaling linearly with the size of the genomes and cubically with their number. We also present extensive experimental results showing that our approach to robustness testing provides excellent estimates of confidence, which, moreover, can be tuned to trade off thresholds between false positives and false negatives. Together, these two novel approaches enable us to attack heretofore intractable problems, such as phylogenetic inference for high-resolution vertebrate genomes, as we demonstrate on a set of six vertebrate genomes with 8,380 syntenic blocks. A copy of the software is available on demand.  相似文献   

2.

Background  

In recent years, gene order data has attracted increasing attention from both biologists and computer scientists as a new type of data for phylogenetic analysis. If gene orders are viewed as one character with a large number of states, traditional bootstrap procedures cannot be applied. Researchers began to use a jackknife resampling method to assess the quality of gene order phylogenies.  相似文献   

3.
Gene content has been shown to contain a strong phylogenetic signal, yet its usage for phylogenetic questions is hampered by horizontal gene transfer and parallel gene loss and until now required completely sequenced genomes. Here, we introduce an approach that allows the phylogenetic signal in gene content to be applied to any set of sequences, using signature genes for phylogenetic classification. The hundreds of publicly available genomes allow us to identify signature genes at various taxonomic depths, and we show how the presence of signature genes in an unspecified sample can be used to characterize its taxonomic composition. We identify 8,362 signature genes specific for 112 prokaryotic taxa. We show that these signature genes can be used to address phylogenetic questions on the basis of gene content in cases where classic gene content or sequence analyses provide an ambiguous answer, such as for Nanoarchaeum equitans, and even in cases where complete genomes are not available, such as for metagenomics data. Cross-validation experiments leaving out up to 30% of the species show that approximately 92% of the signature genes correctly place the species in a related clade. Analyses of metagenomics data sets with the signature gene approach are in good agreement with the previously reported species distributions based on phylogenetic analysis of marker genes. Summarizing, signature genes can complement traditional sequence-based methods in addressing taxonomic questions.  相似文献   

4.
Use of whole genome sequence data to infer baculovirus phylogeny   总被引:18,自引:0,他引:18       下载免费PDF全文
Several phylogenetic methods based on whole genome sequence data were evaluated using data from nine complete baculovirus genomes. The utility of three independent character sets was assessed. The first data set comprised the sequences of the 63 genes common to these viruses. The second set of characters was based on gene order, and phylogenies were inferred using both breakpoint distance analysis and a novel method developed here, termed neighbor pair analysis. The third set recorded gene content by scoring gene presence or absence in each genome. All three data sets yielded phylogenies supporting the separation of the Nucleopolyhedrovirus (NPV) and Granulovirus (GV) genera, the division of the NPVs into groups I and II, and species relationships within group I NPVs. Generation of phylogenies based on the combined sequences of all 63 shared genes proved to be the most effective approach to resolving the relationships among the group II NPVs and the GVs. The history of gene acquisitions and losses that have accompanied baculovirus diversification was visualized by mapping the gene content data onto the phylogenetic tree. This analysis highlighted the fluid nature of baculovirus genomes, with evidence of frequent genome rearrangements and multiple gene content changes during their evolution. Of more than 416 genes identified in the genomes analyzed, only 63 are present in all nine genomes, and 200 genes are found only in a single genome. Despite this fluidity, the whole genome-based methods we describe are sufficiently powerful to recover the underlying phylogeny of the viruses.  相似文献   

5.
A number of recent papers have suggested that gene family content can be used to resolve phylogenies, particularly in the case of prokaryotes, in which extensive horizontal gene transfer means that individual gene phylogenies may not mirror the organismal phylogeny. However, no study has yet examined how sensitive such analyses are to the criterion of homology assessment used to assemble multigene families. Using data from 99 completely sequenced prokaryotic genomes, we examined the effect of homology criteria in phylogenetic analyses wherein presence or absence of each family in the genome was used as a cladistic character. Different criteria resulted in evidence for contradictory tree topologies, sometimes with high bootstrap support. A moderately strict criterion seemed best for assembling multigene families in a biologically meaningful way, but it was not necessarily preferable for phylogenetic analysis. Instead, a very strict criterion, which broke up gene families into smaller subfamilies, seemed to have advantages for phylogenetic purposes. The poor performance of gene family content-based phylogenetic analysis in the case of prokaryotes appears to reflect high levels of homoplasy resulting not only from horizontal gene transfer but also, more importantly, from extensive parallel loss of gene families in certain bacteria genomes.  相似文献   

6.
Mitochondrial genomes provide a valuable dataset for phylogenetic studies, in particular of metazoan phylogeny because of the extensive taxon sample that is available. Beyond the traditional sequence-based analysis it is possible to extract phylogenetic information from the gene order. Here we present a novel approach utilizing these data based on cyclic list alignments of the gene orders. A progressive alignment approach is used to combine pairwise list alignments into a multiple alignment of gene orders. Parsimony methods are used to reconstruct phylogenetic trees, ancestral gene orders, and consensus patterns in a straightforward approach. We apply this method to study the phylogeny of protostomes based exclusively on mitochondrial genome arrangements. We, furthermore, demonstrate that our approach is also applicable to the much larger genomes of chloroplasts.  相似文献   

7.
CLADISTICS: WHAT'S IN A WORD?   总被引:10,自引:0,他引:10  
Abstract— Cladistics has changed considerably with the availability of new methods and sources of data, and the increasing realization that cladograms are relevant to all manner of historical questions. Criticisms of, and justifications for, consensus hypotheses in phylogenetic inference are reviewed. The conclusion is overwhelmingly against taxonomic congruence which deliberately seeks consensus propositions. The total evidence approach is not so burdened. A preference for suboptimal cladograms is also critized, as is the protocol for mapping characters of special interest onto a phylogenetic hypothesis derived from other evidence. The bootstrap and jackknife resampling techniques are questioned because their underlying assumptions are violated and they are sensitive to character frequencies. These findings suggest that cladistics is being redefined in ways that contradict the practices and principles responsible for its pre-eminence in phylogenetic inference.  相似文献   

8.
MOTIVATION: Phylogenomics integrates the vast amount of phylogenetic information contained in complete genome sequences, and is rapidly becoming the standard for reliably inferring species phylogenies. There are, however, fundamental differences between the ways in which phylogenomic approaches like gene content, superalignment, superdistance and supertree integrate the phylogenetic information from separate orthologous groups. Furthermore, they all depend on the method by which the orthologous groups are initially determined. Here, we systematically compare these four phylogenomic approaches, in parallel with three approaches for large-scale orthology determination: pairwise orthology, cluster orthology and tree-based orthology. RESULTS: Including various phylogenetic methods, we apply a total of 54 fully automated phylogenomic procedures to the fungi, the eukaryotic clade with the largest number of sequenced genomes, for which we retrieved a golden standard phylogeny from the literature. Phylogenomic trees based on gene content show, relative to the other methods, a bias in the tree topology that parallels convergence in lifestyle among the species compared, indicating convergence in gene content. CONCLUSIONS: Complete genomes are no guarantee for good or even consistent phylogenies. However, the large amounts of data in genomes enable us to carefully select the data most suitable for phylogenomic inference. In terms of performance, the superalignment approach, combined with restrictive orthology, is the most successful in recovering a fungal phylogeny that agrees with current taxonomic views, and allows us to obtain a high-resolution phylogeny. We provide solid support for what has grown to be a common practice in phylogenomics during its advance in recent years. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

9.
Quantifying branch support using the bootstrap and/or jackknife is generally considered to be an essential component of rigorous parsimony and maximum likelihood phylogenetic analyses. Previous authors have described how application of the frequency-within-replicates approach to treating multiple equally optimal trees found in a given bootstrap pseudoreplicate can provide apparent support for otherwise unsupported clades. We demonstrate how a similar problem may occur when a non-representative subset of equally optimal trees are held per pseudoreplicate, which we term the undersampling-within-replicates artifact. We illustrate the frequency-within-replicates and undersampling-within-replicates bootstrap and jackknife artifacts using both contrived and empirical examples, demonstrate that the artifacts can occur in both parsimony and likelihood analyses, and show that the artifacts occur in outputs from multiple different phylogenetic-inference programs. Based on our results, we make the following five recommendations, which are particularly relevant to supermatrix analyses, but apply to all phylogenetic analyses. First, when two or more optimal trees are found in a given pseudoreplicate they should be summarized using the strict-consensus rather than frequency-within-replicates approach. Second jackknife resampling should be used rather than bootstrap resampling. Third, multiple tree searches while holding multiple trees per search should be conducted in each pseudoreplicate rather than conducting only a single search and holding only a single tree. Fourth, branches with a minimum possible optimized length of zero should be collapsed within each tree search rather than collapsing branches only if their maximum possible optimized length is zero. Fifth, resampling values should be mapped onto the strict consensus of all optimal trees found rather than simply presenting the ≥ 50% bootstrap or jackknife tree or mapping the resampling values onto a single optimal tree.  相似文献   

10.
Phylogenies involving nonmodel species are based on a few genes, mostly chosen following historical or practical criteria. Because gene trees are sometimes incongruent with species trees, the resulting phylogenies may not accurately reflect the evolutionary relationships among species. The increase in availability of genome sequences now provides large numbers of genes that could be used for building phylogenies. However, for practical reasons only a few genes can be sequenced for a wide range of species. Here we asked whether we can identify a few genes, among the single-copy genes common to most fungal genomes, that are sufficient for recovering accurate and well-supported phylogenies. Fungi represent a model group for phylogenomics because many complete fungal genomes are available. An automated procedure was developed to extract single-copy orthologous genes from complete fungal genomes using a Markov Clustering Algorithm (Tribe-MCL). Using 21 complete, publicly available fungal genomes with reliable protein predictions, 246 single-copy orthologous gene clusters were identified. We inferred the maximum likelihood trees using the individual orthologous sequences and constructed a reference tree from concatenated protein alignments. The topologies of the individual gene trees were compared to that of the reference tree using three different methods. The performance of individual genes in recovering the reference tree was highly variable. Gene size and the number of variable sites were highly correlated and significantly affected the performance of the genes, but the average substitution rate did not. Two genes recovered exactly the same topology as the reference tree, and when concatenated provided high bootstrap values. The genes typically used for fungal phylogenies did not perform well, which suggests that current fungal phylogenies based on these genes may not accurately reflect the evolutionary relationships among species. Analyses on subsets of species showed that the phylogenetic performance did not seem to depend strongly on the sample. We expect that the best-performing genes identified here will be very useful for phylogenetic studies of fungi, at least at a large taxonomic scale. Furthermore, we compare the method developed here for finding genes for building robust phylogenies with previous ones and we advocate that our method could be applied to other groups of organisms when more complete genomes are available.  相似文献   

11.
In this study, we used an empirical example based on 100 mitochondrial genomes from higher teleost fishes to compare the accuracy of parsimony-based jackknife values with Bayesian support values. Phylogenetic analyses of 366 partitions, using differential taxon and character sampling from the entire data matrix of 100 taxa and 7,990 characters, were performed for both phylogenetic methods. The tree topology and branch-support values from each partition were compared with the tree inferred from all taxa and characters. Using this approach, we quantified the accuracy of the branch-support values assigned by the jackknife and Bayesian methods, with respect to each of 15 basal clades. In comparing the jackknife and Bayesian methods, we found that (1) both measures of support differ significantly from an ideal support index; (2) the jackknife underestimated support values; (3) the Bayesian method consistently overestimated support; (4) the magnitude by which Bayesian values overestimate support exceeds the magnitude by which the jackknife underestimates support; and (5) both methods performed poorly when taxon sampling was increased and character sampling was not increases. These results indicate that (1) the higher Bayesian support values are inappropriate (in magnitude), and (2) Bayesian support values should not be interpreted as probabilities that clades are correctly resolved. We advocate the continued use of the relatively conservative bootstrap and jackknife approaches to estimating branch support rather than the more extreme overestimates provided by the Markov Chain Monte Carlo-based Bayesian methods.  相似文献   

12.
The multispecies coalescent (MSC) is a statistical framework that models how gene genealogies grow within the branches of a species tree. The field of computational phylogenetics has witnessed an explosion in the development of methods for species tree inference under MSC, owing mainly to the accumulating evidence of incomplete lineage sorting in phylogenomic analyses. However, the evolutionary history of a set of genomes, or species, could be reticulate due to the occurrence of evolutionary processes such as hybridization or horizontal gene transfer. We report on a novel method for Bayesian inference of genome and species phylogenies under the multispecies network coalescent (MSNC). This framework models gene evolution within the branches of a phylogenetic network, thus incorporating reticulate evolutionary processes, such as hybridization, in addition to incomplete lineage sorting. As phylogenetic networks with different numbers of reticulation events correspond to points of different dimensions in the space of models, we devise a reversible-jump Markov chain Monte Carlo (RJMCMC) technique for sampling the posterior distribution of phylogenetic networks under MSNC. We implemented the methods in the publicly available, open-source software package PhyloNet and studied their performance on simulated and biological data. The work extends the reach of Bayesian inference to phylogenetic networks and enables new evolutionary analyses that account for reticulation.  相似文献   

13.
Composition vector trees(CVTrees)are inferred from whole-genome data by an alignment-free and parameter-free method.The agreement of these trees with the corresponding taxonomy provides an objective justification of the inferred phylogeny.In this work,we show the stability and self-consistency of CVTrees by performing bootstrap and jackknife re-sampling tests adapted to this alignment-free approach.Our ultimate goal is to advocate the viewpoint that time-consuming statistical re-sampling tests can be avoided at all in using this alignment-free approach.Agreement with taxonomy should be taken as a major criterion to estimate prokaryotic phylogenetic trees.  相似文献   

14.
Evolution operates on whole genomes through direct rearrangements of genes, such as inversions, transpositions, and inverted transpositions, as well as through operations, such as duplications, losses, and transfers, that also affect the gene content of the genomes. Because these events are rare relative to nucleotide substitutions, gene order data offer the possibility of resolving ancient branches in the tree of life; the combination of gene order data with sequence data also has the potential to provide more robust phylogenetic reconstructions, since each can elucidate evolution at different time scales. Distance corrections greatly improve the accuracy of phylogeny reconstructions from DNA sequences, enabling distance-based methods to approach the accuracy of the more elaborate methods based on parsimony or likelihood at a fraction of the computational cost. This paper focuses on developing distance correction methods for phylogeny reconstruction from whole genomes. The main question we investigate is how to estimate evolutionary histories from whole genomes with equal gene content, and we present a technique, the empirically derived estimator (EDE), that we have developed for this purpose. We study the use of EDE on whole genomes with identical gene content, and we explore the accuracy of phylogenies inferred using EDE with the neighbor joining and minimum evolution methods under a wide range of model conditions. Our study shows that tree reconstruction under these two methods is much more accurate when based on EDE distances than when based on other distances previously suggested for whole genomes. Electronic Supplementary Material Electronic Supplementary material is available for this article at and accessible for authorised users. [Reviewing Editor: Dr. Martin Kreitman]  相似文献   

15.
16.
Understanding the evolutionary history of species is at the core of molecular evolution and is done using several inference methods. The critical issue is to quantify the uncertainty of the inference. The posterior probabilities in Bayesian phylogenetic inference and the bootstrap values in frequentist approaches measure the variability of the estimates due to the sampling of sites from genes and the sampling of genes from genomes. However, they do not measure the uncertainty due to taxon sampling. Taxa that experienced molecular homoplasy, recent selection, a spur of evolution, and so forth may disrupt the inference and cause incongruences in the estimated phylogeny. We define a taxon influence index to assess the influence of each taxon on the phylogeny. We found that although most taxa have a weak influence on the phylogeny, a small fraction of influential taxa strongly alter it even in clades only loosely related to them. We conclude that highly influential taxa should be given special attention and sampling them more thoroughly can lead to more dependable phylogenies.  相似文献   

17.
The application of phylogenetic inference methods, to data for a set of independent genes sampled randomly throughout the genome, often results in substantial incongruence in the single-gene phylogenetic estimates. Among the processes known to produce discord between single-gene phylogenies, two of the best studied in a phylogenetic context are hybridization and incomplete lineage sorting. Much recent attention has focused on the development of methods for estimating species phylogenies in the presence of incomplete lineage sorting, but phylogenetic models that allow for hybridization have been more limited. Here we propose a model that allows incongruence in single-gene phylogenies to be due to both hybridization and incomplete lineage sorting, with the goal of determining the contribution of hybridization to observed gene tree incongruence in the presence of incomplete lineage sorting. Using our model, we propose methods for estimating the extent of the role of hybridization in both a likelihood and a Bayesian framework. The performance of our methods is examined using both simulated and empirical data.  相似文献   

18.
The evolutionary relationships of extant great apes and humans have been largely resolved by molecular studies, yet morphology-based phylogenetic analyses continue to provide conflicting results. In order to further investigate this discrepancy we present bootstrap clade support of morphological data based on two quantitative datasets, one dataset consisting of linear measurements of the whole skull from 5 hominoid genera and the second dataset consisting of 3D landmark data from the temporal bone of 5 hominoid genera, including 11 sub-species. Using similar protocols for both datasets, we were able to 1) compare distance-based phylogenetic methods to cladistic parsimony of quantitative data converted into discrete character states, 2) vary outgroup choice to observe its effect on phylogenetic inference, and 3) analyse male and female data separately to observe the effect of sexual dimorphism on phylogenies. Phylogenetic analysis was sensitive to methodological decisions, particularly outgroup selection, where designation of Pongo as an outgroup and removal of Hylobates resulted in greater congruence with the proposed molecular phylogeny. The performance of distance-based methods also justifies their use in phylogenetic analysis of morphological data. It is clear from our analyses that hominoid phylogenetics ought not to be used as an example of conflict between the morphological and molecular, but as an example of how outgroup and methodological choices can affect the outcome of phylogenetic analysis.  相似文献   

19.
Elongation factor 1 alpha (EF-1 alpha) is a highly conserved ubiquitous protein involved in translation that has been suggested to have desirable properties for phylogenetic inference. To examine the utility of EF-1 alpha as a phylogenetic marker for eukaryotes, we studied three properties of EF-1 alpha trees: congruency with other phyogenetic markers, the impact of species sampling, and the degree of substitutional saturation occurring between taxa. Our analyses indicate that the EF-1 alpha tree is congruent with some other molecular phylogenies in identifying both the deepest branches and some recent relationships in the eukaryotic line of descent. However, the topology of the intermediate portion of the EF-1 alpha tree, occupied by most of the protist lineages, differs for different phylogenetic methods, and bootstrap values for branches are low. Most problematic in this region is the failure of all phylogenetic methods to resolve the monophyly of two higher-order protistan taxa, the Ciliophora and the Alveolata. JACKMONO analyses indicated that the impact of species sampling on bootstrap support for most internal nodes of the eukaryotic EF-1 alpha tree is extreme. Furthermore, a comparison of observed versus inferred numbers of substitutions indicates that multiple overlapping substitutions have occurred, especially on the branch separating the Eukaryota from the Archaebacteria, suggesting that the rooting of the eukaryotic tree on the diplomonad lineage should be treated with caution. Overall, these results suggest that the phylogenies obtained from EF-1 alpha are congruent with other molecular phylogenies in recovering the monophyly of groups such as the Metazoa, Fungi, Magnoliophyta, and Euglenozoa. However, the interrelationships between these and other protist lineages are not well resolved. This lack of resolution may result from the combined effects of poor taxonomic sampling, relatively few informative positions, large numbers of overlapping substitutions that obscure phylogenetic signal, and lineage-specific rate increases in the EF-1 alpha data set. It is also consistent with the nearly simultaneous diversification of major eukaryotic lineages implied by the "big-bang" hypothesis of eukaryote evolution.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号