Similar literature
 20 similar articles found (search time: 15 ms)
1.
Gene duplication and gene loss as well as other biological events can result in multiple copies of genes in a given species. Because of these gene duplication and loss dynamics, in addition to variation in sequence evolution and other sources of uncertainty, different gene trees ultimately present different evolutionary histories. All of this together results in gene trees that give different topologies from each other, making consensus species trees ambiguous in places. Other sources of data to generate species trees are also unable to provide completely resolved binary species trees. However, in addition to gene duplication events, speciation events have provided some underlying phylogenetic signal, enabling development of algorithms to characterize these processes. Therefore, a soft parsimony algorithm has been developed that enables the mapping of gene trees onto species trees and modification of uncertain or weakly supported branches based on minimizing the number of gene duplication and loss events implied by the tree. The algorithm also allows for rooting of unrooted trees and for removal of in-paralogues (lineage-specific duplicates and redundant sequences masquerading as such). The algorithm has also been made available for download as a software package, Softparsmap.

2.
When gene copies are sampled from various species, the resulting gene tree might disagree with the containing species tree. The primary causes of gene tree and species tree discord include incomplete lineage sorting, horizontal gene transfer, and gene duplication and loss. Each of these events yields a different parsimony criterion for inferring the (containing) species tree from gene trees. With incomplete lineage sorting, the goal of species tree inference is to find the tree minimizing the extra gene lineages that had to coexist along species lineages; with gene duplication, the goal is to find the tree minimizing gene duplications and/or losses. In this paper, we present the following results: 1) The deep coalescence cost is equal to the number of gene losses minus two times the gene duplication cost in the reconciliation of a uniquely leaf-labeled gene tree and a species tree. The deep coalescence cost can be computed in linear time for an arbitrary gene tree and species tree. 2) The deep coalescence cost is never less than the gene duplication cost in the reconciliation of an arbitrary gene tree and a species tree. 3) Species tree inference by minimizing deep coalescence events is NP-hard.
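The relationship in result 1 can be checked on small trees. The following is a minimal sketch, not the paper's algorithm: it assumes binary trees written as nested tuples, a gene tree whose leaves are exactly the species names (one gene per species), and the usual LCA mapping. The function names are hypothetical, and the naive LCA walk used here is not linear-time.

```python
from itertools import count


def index_species_tree(tree):
    """Assign integer ids to species-tree nodes; return parent, depth and leaf lookup."""
    parent, depth, leaf_node = {}, {}, {}
    ids = count()

    def walk(node, par, d):
        nid = next(ids)
        parent[nid], depth[nid] = par, d
        if isinstance(node, str):
            leaf_node[node] = nid
        else:
            for child in node:
                walk(child, nid, d + 1)

    walk(tree, None, 0)
    return parent, depth, leaf_node


def lca(a, b, parent, depth):
    """Naive lowest common ancestor: repeatedly lift the deeper node."""
    while a != b:
        if depth[a] < depth[b]:
            a, b = b, a
        a = parent[a]
    return a


def reconciliation_costs(gene_tree, species_tree):
    """Return (duplications, losses, deep coalescence) under the LCA mapping."""
    parent, depth, leaf_node = index_species_tree(species_tree)
    dups = losses = 0
    lineages = {}  # species node -> number of gene lineages on the edge above it

    def climb(lower, upper):
        # Record one gene lineage on every species edge between the two images.
        steps, node = 0, lower
        while node != upper:
            lineages[node] = lineages.get(node, 0) + 1
            node = parent[node]
            steps += 1
        return steps

    def walk(g):
        nonlocal dups, losses
        if isinstance(g, str):
            return leaf_node[g]
        left, right = (walk(child) for child in g)
        m = lca(left, right, parent, depth)
        is_dup = m in (left, right)  # image coincides with a child's image
        dups += is_dup
        for child_image in (left, right):
            delta = climb(child_image, m)
            # Below a duplication every skipped species node costs one loss;
            # below a speciation the first step down is free.
            losses += delta if is_dup else delta - 1
        return m

    walk(gene_tree)
    deep = sum(n - 1 for n in lineages.values())  # extra lineages per species edge
    return dups, losses, deep


species = (("A", "B"), ("C", "D"))
gene = (("A", "C"), ("B", "D"))
d, l, dc = reconciliation_costs(gene, species)
print(d, l, dc)
```

On this pair of trees the sketch reports one duplication, four losses and a deep coalescence cost of two, which agrees with losses minus twice the duplications.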

3.
Plasmodium falciparum is the parasite responsible for the most acute form of malaria in humans. Recently, the serine repeat antigen (SERA) in P. falciparum has attracted attention as a potential vaccine and drug target, and it has been shown to be a member of a large gene family. To clarify the relationships among the numerous P. falciparum SERAs and to identify orthologs to SERA5 and SERA6 in Plasmodium species affecting rodents, gene trees were inferred from nucleotide and amino acid sequence data for 33 putative SERA homologs in seven different species. (A distance method for nucleotide sequences that is specifically designed to accommodate differing GC content yielded results that were largely compatible with the amino acid tree. Standard-distance and maximum-likelihood methods for nucleotide sequences, on the other hand, yielded gene trees that differed in important respects.) To infer the pattern of duplication, speciation, and gene loss events in the SERA gene family history, the resulting gene trees were then "reconciled" with two competing Plasmodium species tree topologies that have been identified by previous phylogenetic studies. Parsimony of reconciliation was used as a criterion for selecting a gene tree/species tree pair and provided (1) support for one of the two species trees and for the core topology of the amino acid-derived gene tree, (2) a basis for critiquing fine detail in a poorly resolved region of the gene tree, (3) a set of predicted "missing genes" in some species, (4) clarification of the relationship among the P. falciparum SERAs, and (5) some information about SERA5 and SERA6 orthologs in the rodent malaria parasites. Parsimony of reconciliation and a second criterion (the implied mutational pattern at two key active sites in the SERA proteins) were also seen to be useful supplements to standard "bootstrap" analysis for inferred topologies.

4.
Arabidopsis thaliana is believed to have experienced at least two and possibly three whole-genome duplication events in its evolutionary history. In order to investigate the evolutionary relationships between these duplication events and diversification of disease resistance (R) genes, segmental-duplication events containing R genes belonging to the nucleotide binding-leucine rich repeat (NB-LRR) class were identified. Of 153 segmental-duplication events containing NB-LRR genes, only 22 contained NB-LRR genes in both members of the duplication pair, indicating a high frequency of NB-LRR gene loss after whole-genome duplication. The relative age of the duplication events was estimated based on the average synonymous substitution rate of the duplicated gene pairs in the segments. These data were combined with phylogenetic analyses. NB-LRR genes present in segment pairs derived from the most recent whole-genome duplication event, estimated to have occurred only 20 to 40 million years ago, occupy very distant branches of the NB-LRR phylogenetic tree. These data suggest that when NB-LRR clusters are duplicated as part of a whole-genome duplication, homoeologous NB-LRR genes are preferentially lost, either by eliminating one copy of the cluster or by eliminating individual genes such that only paralogous NB-LRR genes are maintained.

5.

Background

The abundance of new genomic data provides the opportunity to map the location of gene duplication and loss events on a species phylogeny. The first methods for mapping gene duplications and losses were based on a parsimony criterion, finding the mapping that minimizes the number of duplication and loss events. Probabilistic modeling of gene duplication and loss is relatively new and has largely focused on birth-death processes.

Results

We introduce a new maximum likelihood model that estimates the speciation and gene duplication and loss events in a gene tree within a species tree with branch lengths. We also provide an, in practice, efficient algorithm that computes optimal evolutionary scenarios for this model. We implemented the algorithm in the program DrML and verified its performance with empirical and simulated data.

Conclusions

In test data sets, DrML finds optimal gene duplication and loss scenarios within minutes, even when the gene trees contain sequences from several hundred species. In many cases, these optimal scenarios differ from the lca-mapping that results from a parsimony gene tree reconciliation. Thus, DrML provides a new, practical statistical framework on which to study gene duplication.

6.
Yu Y  Degnan JH  Nakhleh L 《PLoS genetics》2012,8(4):e1002660
Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa.

7.

Background  

The ever-increasing wealth of genomic sequence information provides an unprecedented opportunity for large-scale phylogenetic analysis. However, species phylogeny inference is obfuscated by incongruence among gene trees due to evolutionary events such as gene duplication and loss, incomplete lineage sorting (deep coalescence), and horizontal gene transfer. Gene tree parsimony (GTP) addresses this issue by seeking a species tree that requires the minimum number of evolutionary events to reconcile a given set of incongruent gene trees. Despite its promise, the use of gene tree parsimony has been limited by the fact that existing software is either not fast enough to tackle large data sets or is restricted in the range of evolutionary events it can handle.

8.
Phylogenetic analyses using genome-scale data sets must confront incongruence among gene trees, which in plants is exacerbated by frequent gene duplications and losses. Gene tree parsimony (GTP) is a phylogenetic optimization criterion in which a species tree that minimizes the number of gene duplications induced among a set of gene trees is selected. The run time performance of previous implementations has limited its use on large-scale data sets. We used new software that incorporates recent algorithmic advances to examine the performance of GTP on a plant data set consisting of 18,896 gene trees containing 510,922 protein sequences from 136 plant taxa (giving a combined alignment length of >2.9 million characters). The relationships inferred from the GTP analysis were largely consistent with previous large-scale studies of backbone plant phylogeny and resolved some controversial nodes. The placement of taxa that were present in few gene trees generally varied the most among GTP bootstrap replicates. Excluding these taxa either before or after the GTP analysis revealed high levels of phylogenetic support across plants. The analyses supported magnoliids sister to a eudicot + monocot clade and did not support the eurosid I and II clades. This study presents a nuclear genomic perspective on the broad-scale phylogenetic relationships among plants, and it demonstrates that nuclear genes with a history of duplication and loss can be phylogenetically informative for resolving the plant tree of life.

9.
Gene family evolution is determined by microevolutionary processes (e.g., point mutations) and macroevolutionary processes (e.g., gene duplication and loss), yet macroevolutionary considerations are rarely incorporated into gene phylogeny reconstruction methods. We present a dynamic program to find the most parsimonious gene family tree with respect to a macroevolutionary optimization criterion, the weighted sum of the number of gene duplications and losses. The existence of a polynomial delay algorithm for duplication/loss phylogeny reconstruction stands in contrast to most formulations of phylogeny reconstruction, which are NP-complete. We next extend this result to obtain a two-phase method for gene tree reconstruction that takes both micro- and macroevolution into account. In the first phase, a gene tree is constructed from sequence data, using any of the previously known algorithms for gene phylogeny construction. In the second phase, the tree is refined by rearranging regions of the tree that do not have strong support in the sequence data to minimize the duplication/loss cost. Components of the tree with strong support are left intact. This hybrid approach incorporates both micro- and macroevolutionary considerations, yet its computational requirements are modest in practice because the two-phase approach constrains the search space. Our hybrid algorithm can also be used to resolve nonbinary nodes in a multifurcating gene tree. We have implemented these algorithms in a software tool, NOTUNG 2.0, that can be used as a unified framework for gene tree reconstruction or as an exploratory analysis tool that can be applied post hoc to any rooted tree with bootstrap values. The NOTUNG 2.0 graphical user interface can be used to visualize alternate duplication/loss histories, root trees according to duplication and loss parsimony, manipulate and annotate gene trees, and estimate gene duplication times. It also offers a command line option that enables high-throughput analysis of a large number of trees.

10.
The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch (1977). Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR, TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree Pruning and Regrafting) rearrangement to valid duplication trees allows exploring the whole duplication tree space. We use these restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to tandemly repeated human zinc finger genes and observe that a much better duplication tree is obtained by our method than using any other program.

11.
The root of the angiosperm tree has not yet been established. Major morphological and molecular differences between angiosperms and other seed plants have introduced ambiguities and possibly spurious results. Because it is unlikely that extant species more closely related to angiosperms will be discovered, and because relevant fossils will almost certainly not yield molecular data, the use of duplicate genes for rooting purposes may provide the best hope of a solution. Simultaneous analysis of the genes resulting from a gene duplication event along the branch subtending angiosperms would yield an unrooted network, wherein two congruent gene trees should be connected by a single branch. In these circumstances the best rooted species tree is the one that corresponds to the two gene trees when the network is rooted along the connecting branch. In general, this approach can be viewed as choosing among rooted species trees by minimizing hypothesized events such as gene duplication, gene loss, lineage sorting, and lateral transfer. Of those gene families that are potentially relevant to the angiosperm problem, phytochrome genes warrant special attention. Phylogenetic analysis of a sample of complete phytochrome (PHY) sequences implies that an initial duplication event preceded (or occurred early within) the radiation of seed plants and that each of the two resulting copies duplicated again. In one of these cases, leading to the PHYA and PHYC lineages, duplication appears to have occurred before the diversification of angiosperms. Duplicate gene trees are congruent in these broad analyses, but the sample of sequences is too limited to provide much insight into the rooting question. Preliminary analyses of partial PHYA and PHYC sequences from several presumably basal angiosperm lineages are promising, but more data are needed to critically evaluate the power of these genes to resolve the angiosperm radiation.

12.
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user’s query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.

13.
We investigated the evolutionary dynamics of duplicated copies of the granule-bound starch synthase I gene (GBSSI or Waxy) within polyploid Spartina species. Molecular cloning, sequencing, and phylogenetic analyses revealed incongruences between the expected species phylogeny and the inferred gene trees. Some genes within species were more divergent than expected from ploidy level alone, suggesting the existence of paralogous sets of Waxy loci in Spartina. Phylogenetic analyses indicate that this paralogy originated from a duplication that occurred prior to the divergence of Spartina from other Chloridoideae. Gene tree topologies revealed three divergent homoeologous sequences in the hexaploid S. alterniflora that are consistent with the proposal of an allopolyploid origin of the hexaploid clade. Waxy sequences differ in insertion–deletion events in introns, which may be used to diagnose gene copies. Both paralogous and homoeologous coding regions appear to be evolving under selective constraints.

14.

Background  

The shape of phylogenetic trees has been used to make inferences about the evolutionary process by comparing the shapes of actual phylogenies with those expected under simple models of the speciation process. Previous studies have focused on speciation events, but gene duplication is another lineage splitting event, analogous to speciation, and gene loss or deletion is analogous to extinction. Measures of the shape of gene family phylogenies can thus be used to investigate the processes of gene duplication and loss. We make the first systematic attempt to use tree shape to study gene duplication using human gene phylogenies.
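As an illustration of what a tree-shape measure looks like, the sketch below computes the Colless imbalance index, a widely used statistic that sums, over all internal nodes, the absolute difference in leaf counts between the two subtrees. The abstract does not say which measures the authors used, so the choice of statistic, the nested-tuple tree encoding and the function name are all assumptions made for this example.

```python
def colless(tree):
    """Return (leaf_count, colless_index) for a binary tree given as nested tuples."""
    if isinstance(tree, str):
        return 1, 0  # a single leaf is perfectly balanced
    (n_left, i_left), (n_right, i_right) = (colless(child) for child in tree)
    # Add this node's imbalance to the imbalance accumulated in both subtrees.
    return n_left + n_right, i_left + i_right + abs(n_left - n_right)


balanced = (("g1", "g2"), ("g3", "g4"))
caterpillar = ((("g1", "g2"), "g3"), "g4")
print(colless(balanced)[1], colless(caterpillar)[1])
```

A fully balanced four-leaf tree scores 0 and the fully unbalanced (caterpillar) tree scores 3; observed gene family trees would be compared against the distribution expected under a null model of duplication and loss.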

15.
Fourfold paralogy regions in the human genome have been considered historical remnants of whole-genome duplication events predicted to have occurred early in vertebrate evolution. Taking advantage of the well-annotated and high-quality human genomic sequence map as well as the ever-increasing accessibility of large-scale genomic sequence data from a diverse range of animal species, we investigated the prediction that the ancestral vertebrate genome was shaped by two rapid rounds of whole-genome duplication within a period of 10 million years. Both the map self-comparison approach and a phylogenetic analysis revealed that gene families identified as tetralogous on human chromosomes 1/2/8/20 arose by small-scale duplication events that occurred at widely different time points in animal evolution. Furthermore, the data discount the likelihood that tree topologies of the form ((A,B)(C,D)) are best explained by the octoploidy hypothesis. We instead propose that such symmetrical tree patterns are also consistent with local duplications and rearrangement events.

16.
MOTIVATION: When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation ('phylogenomics') is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Our goal is to automate phylogenomics using explicit phylogenetic inference. A necessary component is an algorithm to infer speciation and duplication events in a given gene tree. RESULTS: We give an algorithm to infer speciation and duplication events on a gene tree by comparison to a trusted species tree. This algorithm has a worst-case running time of O(n^2), which is inferior to two previous algorithms that are approximately O(n) for a gene tree of n sequences. However, our algorithm is extremely simple, and its asymptotic worst case behavior is only realized on pathological data sets. We show empirically, using 1750 gene trees constructed from the Pfam protein family database, that it appears to be a practical (and often superior) algorithm for analyzing real gene trees. AVAILABILITY: http://www.genetics.wustl.edu/eddy/forester.
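The idea of labeling gene-tree nodes by comparison with a species tree can be sketched as follows. This is not the authors' implementation: it is a deliberately simple quadratic-time illustration of the standard criterion that a node is a duplication when it maps to the same species-tree node as one of its children. The `gene@species` leaf format, the nested-tuple trees and all function names are assumptions made for the example.

```python
def clades(tree):
    """Return the leaf set (as a frozenset) of every node in the species tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def smallest_clade(species_set, all_clades):
    """The smallest species-tree clade containing the set, i.e. its LCA."""
    return min((c for c in all_clades if species_set <= c), key=len)


def label_events(gene_tree, species_tree):
    """Return [(sorted gene names under node, 'speciation' | 'duplication'), ...]."""
    all_clades = clades(species_tree)
    events = []

    def walk(node):
        if isinstance(node, str):
            species = node.split("@")[1]
            return frozenset([node]), smallest_clade(frozenset([species]), all_clades)
        results = [walk(child) for child in node]
        genes = frozenset().union(*(g for g, _ in results))
        mapped = smallest_clade(frozenset().union(*(m for _, m in results)), all_clades)
        # Duplication: the node maps to the same species clade as a child.
        is_dup = any(m == mapped for _, m in results)
        events.append((sorted(genes), "duplication" if is_dup else "speciation"))
        return genes, mapped

    walk(gene_tree)
    return events


species_tree = (("human", "mouse"), "zebrafish")
gene_tree = ((("a@human", "a@mouse"), ("b@human", "b@mouse")), "a@zebrafish")
for genes, kind in label_events(gene_tree, species_tree):
    print(kind, genes)
```

In this toy family the node joining the `a` and `b` copies is labeled a duplication, so `a@human` and `b@mouse` would be treated as paralogs, while `a@human` and `a@mouse` are separated by a speciation and would be treated as orthologs.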

17.
MOTIVATION: Comparative sequence analysis is widely used to study genome function and evolution. This approach first requires the identification of homologous genes and then the interpretation of their homology relationships (orthology or paralogy). To provide help in this complex task, we developed three databases of homologous genes containing sequences, multiple alignments and phylogenetic trees: HOBACGEN, HOVERGEN and HOGENOM. In this paper, we present two new tools for automating the search for orthologs or paralogs in these databases. RESULTS: First, we have developed and implemented an algorithm to infer speciation and duplication events by comparison of gene and species trees (tree reconciliation). Second, we have developed a general method to search our databases for the gene families whose tree topology matches a particular tree pattern. This algorithm of unordered tree pattern matching has been implemented in the FamFetch graphical interface. With the help of a graphical editor, the user can specify the topology of the tree pattern, and set constraints on its nodes and leaves. Then, this pattern is compared with all the phylogenetic trees of the database, to retrieve the families in which one or several occurrences of this pattern are found. By specifying ad hoc patterns, it is therefore possible to identify orthologs in our databases.

18.
Van de Peer Y  Frickey T  Taylor J  Meyer A 《Gene》2002,295(2):205-211
The ray-finned fishes (Actinopterygii) seem to have two copies of many tetrapod (Sarcopterygii) genes. The origin of these duplicate fish genes is the subject of some controversy. One explanation for the existence of these extra fish genes could be an increase in the rate of independent gene duplications in fishes. Alternatively, gene duplicates in fish may have been formed in the ancestor of all or most Actinopterygii during a complete genome duplication event. A third possibility is that tetrapods have lost more genes than fish after gene or genome duplication events in the common ancestor of both lineages. These three hypotheses can be tested by phylogenetic reconstruction. Previously, we found that a large number of anciently duplicated genes of zebrafish are sister sequences in evolutionary trees suggesting that they were produced in Actinopterygii after the divergence of Sarcopterygii [Phil. Trans. R. Soc. Lond. B 356 (2001) 119]. On the other hand, several well-supported trees showed one of the two fish genes as the sister sequence to a monophyletic clade that included the second fish gene and genes from frog, chicken, mouse and human. These so-called outgroup topologies suggest that the origin of many fish duplicates predates the divergence of the Sarcopterygii and Actinopterygii and support the hypothesis that tetrapods have lost duplicates that have been retained in fish. Here we show that many of these 'outgroup' tree topologies are erroneous and can be corrected when mutational saturation is taken into account. To this end, a Java-based application has been developed to visualize the amount of saturation in amino acid sequences. The program graphically displays the number of observed frequent and rare amino acid replacements between pairs of sequences against their overall evolutionary distance. Discrimination between frequent and rare amino acid replacements is based on substitution probability matrices (e.g. PAM and BLOSUM). Evolutionary distances between sequences can be computed from the fraction of unsaturated sites only and evolutionary trees inferred by pairwise distance methods. When trees are computed by omitting the saturated fraction of sites, most fish duplicates are sister sequences.

19.
We studied the phylogenetic relationships among Japanese Leptocarabus ground beetles, which show extensive trans-species polymorphisms in mitochondrial gene genealogies. Simultaneous analysis of combined nuclear data with partial sequences from the long-wavelength rhodopsin, wingless, phosphoenolpyruvate carboxykinase, and 28S rRNA genes resolved the relationships among the five species, although separate analyses of these genes provided topologies with low resolution. For both the nuclear gene tree resulting from the combined data from four genes and a mitochondrial cytochrome oxidase subunit I (COI) gene tree, we applied a Bayesian divergence time estimation using a common calibration method to identify mitochondrial introgression events that occurred after speciation. Three mitochondrial lineages shared by two or three species were likely subject to introgression due to interspecific hybridization because the coalescent times for these lineages were much shorter than the corresponding speciation times estimated from nuclear gene sequences. We demonstrated that when species phylogeny is fully resolved with nuclear gene sequence data, comparative analysis of nuclear and mitochondrial gene trees can be used to infer introgressive hybridization events that might cause trans-species polymorphisms in mitochondrial gene trees.

20.
Large-scale gene amplifications may have facilitated the evolution of morphological innovations that accompanied the origin of vertebrates. This hypothesis predicts that the genomes of extant jawless fish, scions of deeply branching vertebrate lineages, should bear a record of these events. Previous work suggests that nonvertebrate chordates have a single Hox cluster, but that gnathostome vertebrates have four or more Hox clusters. Did the duplication events that produced multiple vertebrate Hox clusters occur before or after the divergence of agnathan and gnathostome lineages? Can investigation of lamprey Hox clusters illuminate the origins of the four gnathostome Hox clusters? To approach these questions, we cloned and sequenced 13 Hox cluster genes from cDNA and genomic libraries in the lamprey, Petromyzon marinus. The results suggest that the lamprey has at least four Hox clusters and support the model that gnathostome Hox clusters arose by a two-round-no-cluster-loss mechanism, with tree topology [(AB)(CD)]. A three-round model, however, is not rigorously excluded by the data and, for this model, the tree topologies [(D(C(AB)))] and [(C(D(AB)))] are most parsimonious. Gene phylogenies suggest that at least one Hox cluster duplication occurred in the lamprey lineage after it diverged from the gnathostome lineage. The results argue against two or more rounds of duplication before the divergence of agnathan and gnathostome vertebrates. If Hox clusters were duplicated in whole-genome duplication events, then these data suggest that, at most, one whole genome duplication occurred before the evolution of vertebrate developmental innovations.
