Similar literature
 20 similar articles found (search time: 521 ms)
1.
Clustering of main orthologs for multiple genomes   Total citations: 1, self-citations: 0, citations by others: 1
The identification of orthologous genes shared by multiple genomes is critical for both functional and evolutionary studies in comparative genomics. While it is usually done by sequence similarity search and reconciled tree construction in practice, recently a new combinatorial approach and high-throughput system MSOAR for ortholog identification between closely related genomes based on genome rearrangement and gene duplication has been proposed in Fu et al. MSOAR assumes that orthologous genes correspond to each other in the most parsimonious evolutionary scenario, minimizing the number of genome rearrangement and (postspeciation) gene duplication events. However, the parsimony approach used by MSOAR limits it to pairwise genome comparisons. In this paper, we extend MSOAR to multiple (closely related) genomes and propose an ortholog clustering method, called MultiMSOAR, to infer main orthologs in multiple genomes. As a preliminary experiment, we apply MultiMSOAR to rat, mouse, and human genomes, and validate our results using gene annotations and gene function classifications in the public databases. We further compare our results to the ortholog clusters predicted by MultiParanoid, which is an extension of the well-known program InParanoid for pairwise genome comparisons. The comparison reveals that MultiMSOAR gives more detailed and accurate orthology information, since it can effectively distinguish main orthologs from inparalogs.

2.
Assignment of orthologous genes via genome rearrangement   Total citations: 1, self-citations: 0, citations by others: 1
The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes of the same families. In this paper, we present a new approach to ortholog assignment that takes into account both sequence similarity and evolutionary events at a genome level, where orthologous genes are assumed to correspond to each other in the most parsimonious evolving scenario under genome rearrangement. First, the problem is formulated as that of computing the signed reversal distance with duplicates between the two genomes of interest. Then, the problem is decomposed into two new optimization problems, called minimum common partition and maximum cycle decomposition, for which efficient heuristic algorithms are given. Following this approach, we have implemented a high-throughput system for assigning orthologs on a genome scale, called SOAR, and tested it on both simulated data and real genome sequence data. Compared to a recent ortholog assignment method based entirely on homology search (called INPARANOID), SOAR shows a marginally better performance in terms of sensitivity on the real data set because it is able to identify several correct orthologous pairs that are missed by INPARANOID. The simulation results demonstrate that SOAR, in general, performs better than the iterated exemplar algorithm in terms of computing the reversal distance and assigning correct orthologs.

3.
Shi G, Peng MC, Jiang T. PLoS ONE, 2011, 6(6): e20892
The identification of orthologous genes shared by multiple genomes plays an important role in evolutionary studies and gene functional analyses. Based on a recently developed accurate tool, called MSOAR 2.0, for ortholog assignment between a pair of closely related genomes based on genome rearrangement, we present a new system MultiMSOAR 2.0, to identify ortholog groups among multiple genomes in this paper. In the system, we construct gene families for all the genomes using sequence similarity search and clustering, run MSOAR 2.0 for all pairs of genomes to obtain the pairwise orthology relationship, and partition each gene family into a set of disjoint sets of orthologous genes (called super ortholog groups or SOGs) such that each SOG contains at most one gene from each genome. For each such SOG, we label the leaves of the species tree using 1 or 0 to indicate if the SOG contains a gene from the corresponding species or not. The resulting tree is called a tree of ortholog groups (or TOGs). We then label the internal nodes of each TOG based on the parsimony principle and some biological constraints. Ortholog groups are finally identified from each fully labeled TOG. In comparison with a popular tool MultiParanoid on simulated data, MultiMSOAR 2.0 shows significantly higher prediction accuracy. It also outperforms MultiParanoid, the Roundup multi-ortholog repository and the Ensembl ortholog database in real data experiments using gene symbols as a validation tool. In addition to ortholog group identification, MultiMSOAR 2.0 also provides information about gene births, duplications and losses in evolution, which may be of independent biological interest. Our experiments on simulated data demonstrate that MultiMSOAR 2.0 is able to infer these evolutionary events much more accurately than a well-known software tool Notung. The software MultiMSOAR 2.0 is available to the public for free.
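The leaf-labeling and parsimony step described in this abstract can be illustrated with a minimal sketch. It assumes a rooted binary species tree written as nested tuples and uses plain Fitch parsimony on the 0/1 presence labels; the "biological constraints" mentioned in the abstract are not modeled, and the function name `fitch`, the toy tree, and the presence map are hypothetical rather than taken from MultiMSOAR 2.0.

```python
# Illustrative sketch only: plain Fitch parsimony on a presence/absence label.
# This is NOT the MultiMSOAR 2.0 implementation; tree and names are assumptions.

def fitch(tree, presence):
    """Return (candidate states at this node, minimum state changes below it)."""
    if isinstance(tree, str):
        # A leaf is a species name; its label is 1 if the SOG has a gene there.
        return {presence[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, presence)
    right_states, right_cost = fitch(right, presence)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change is needed.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change (gain or loss).
    return left_states | right_states, left_cost + right_cost + 1


# Toy species tree ((mouse, rat), human) and one SOG missing a human gene.
species_tree = (("mouse", "rat"), "human")
sog_presence = {"mouse": 1, "rat": 1, "human": 0}

root_states, changes = fitch(species_tree, sog_presence)
print(root_states, changes)
```

In this toy case the root label is ambiguous and one event is required; a full method would need a top-down pass, plus whatever additional constraints it imposes, to decide between a gene birth in the rodent ancestor and a loss in human.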

4.

Background  

Ortholog assignment is a critical and fundamental problem in comparative genomics, since orthologs are considered to be functional counterparts in different species and can be used to infer molecular functions of one species from those of other species. MSOAR is a recently developed high-throughput system for assigning one-to-one orthologs between closely related species on a genome scale. It attempts to reconstruct the evolutionary history of input genomes in terms of genome rearrangement and gene duplication events. It assumes that a gene duplication event inserts a duplicated gene into the genome of interest at a random location (i.e., the random duplication model). However, in practice, biologists believe that genes are often duplicated by tandem duplications, where a duplicated gene is located next to the original copy (i.e., the tandem duplication model).

5.
6.
A key complication in comparative genomics for reliable gene function prediction is the existence of duplicated genes. To study the effect of gene duplication on function prediction, we analyze orthologs between pairs of genomes where in one genome the orthologous gene has duplicated after the speciation of the two genomes (i.e. inparalogs). For these duplicated genes we investigate whether the gene that is most similar on the sequence level is also the gene that has retained the ancestral gene-neighborhood. Although the majority of investigated cases show a consistent pattern between sequence similarity and gene-neighborhood conservation, a substantial fraction, 29–38%, is inconsistent. The observation of inconsistency is not the result of a chance outcome owing to a lack of divergence time between inparalogs, but rather it seems to be the result of a chance outcome caused by very similar rates of sequence evolution of both inparalogs relative to their ortholog. If one-to-one orthologous relationships are required, it is advisable to combine contextual information (i.e. gene-neighborhood in prokaryotes and co-expression in eukaryotes) with protein sequence information to predict the most probable functional equivalent ortholog in the presence of inparalogs.

7.

Background  

Gene duplication and gene loss during the evolution of eukaryotes have hindered attempts to estimate phylogenies and divergence times of species. Although current methods that identify clusters of orthologous genes in complete genomes have helped to investigate gene function and gene content, they have not been optimized for evolutionary sequence analyses requiring strict orthology and complete gene matrices. Here we adopt a relatively simple and fast genome comparison approach designed to assemble orthologs for evolutionary analysis. Our approach identifies single-copy genes representing only species divergences (panorthologs) in order to minimize potential errors caused by gene duplication. We apply this approach to complete sets of proteins from published eukaryote genomes specifically for phylogeny and time estimation.
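The single-copy criterion can be made concrete with a small sketch. The abstract does not say how gene clusters are built, so the example below simply assumes that clusters of homologous genes already exist and keeps those with exactly one gene from every species; the function name, cluster identifiers, and gene names are hypothetical.

```python
from collections import Counter

# Illustrative sketch only: filter pre-computed clusters down to single-copy,
# complete clusters. How the clusters are produced is an assumption left open.

def single_copy_clusters(clusters, species):
    """Keep clusters that contain exactly one gene from every listed species."""
    required = set(species)
    kept = {}
    for cluster_id, members in clusters.items():
        counts = Counter(sp for sp, _gene in members)
        complete = set(counts) == required
        single_copy = all(n == 1 for n in counts.values())
        if complete and single_copy:
            kept[cluster_id] = members
    return kept


clusters = {
    "c1": [("human", "hA"), ("mouse", "mA"), ("yeast", "yA")],
    # c2 has a duplicated human gene, so it is not single-copy.
    "c2": [("human", "hB1"), ("human", "hB2"), ("mouse", "mB"), ("yeast", "yB")],
    # c3 lacks a yeast gene, so the gene matrix would be incomplete.
    "c3": [("human", "hC"), ("mouse", "mC")],
}

panortholog_candidates = single_copy_clusters(clusters, ["human", "mouse", "yeast"])
print(list(panortholog_candidates))
```

Requiring both completeness and single-copy status is what yields the "complete gene matrices" the abstract asks for, at the cost of discarding many gene families.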

8.
Gene duplication and divergence is a major evolutionary force. Despite the growing number of fully sequenced genomes, methods for investigating these events on a genome-wide scale are still in their infancy. Here, we present SYNERGY, a novel and scalable algorithm that uses sequence similarity and a given species phylogeny to reconstruct the underlying evolutionary history of all genes in a large group of species. In doing so, SYNERGY resolves homology relations and accurately distinguishes orthologs from paralogs. We applied our approach to a set of nine fully sequenced fungal genomes spanning 150 million years, generating a genome-wide catalog of orthologous groups and corresponding gene trees. Our results are highly accurate when compared to a manually curated gold standard, and are robust to the quality of input according to a novel jackknife confidence scoring. The reconstructed gene trees provide a comprehensive view of gene evolution on a genomic scale. Our approach can be applied to any set of sequenced eukaryotic species with a known phylogeny, and opens the way to systematic studies of the evolution of individual genes, molecular systems and whole genomes. Supplementary information: Supplementary data are available at Bioinformatics online.

9.
Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, availability of complex genomes containing large gene families with prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where at each iteration the best gene assignments are identified, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.
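The idea of a sort-join over k-mers can be sketched in a few lines. This is an illustration of the general technique under simple assumptions (distinct k-mers per sequence, a raw shared-k-mer count as the score), not the afree implementation; the function name and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations, groupby

# Illustrative sketch only: count shared k-mers between all sequence pairs
# using a sort followed by a join on equal k-mers. Not the afree code.

def shared_kmer_counts(seqs, k=3):
    # One (k-mer, sequence id) record per distinct k-mer in each sequence.
    records = sorted(
        (kmer, seq_id)
        for seq_id, seq in seqs.items()
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}
    )
    counts = Counter()
    # After sorting, records with the same k-mer are adjacent, so each group
    # lists exactly the sequences that share that k-mer.
    for _kmer, group in groupby(records, key=lambda record: record[0]):
        ids = [seq_id for _, seq_id in group]
        for a, b in combinations(ids, 2):
            counts[(a, b)] += 1
    return counts


proteins = {"p1": "MKVLAA", "p2": "MKVLGG", "p3": "GGGGGG"}
pairs = shared_kmer_counts(proteins, k=3)
print(pairs)
```

Only pairs that share at least one k-mer are ever touched, which is why this style of join avoids an explicit all-against-all alignment; very common k-mers would still produce large groups and would need special handling in practice.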

10.
The identification of orthologs to a set of known genes is often the starting point for evolutionary studies focused on gene families of interest. To date, the existing orthology detection tools (COG, InParanoid, OrthoMCL, etc.) are aimed at genome-wide ortholog identification and lack flexibility for the purposes of case studies. We developed a program OrthoFocus, which employs an extended reciprocal best hit approach to quickly search for orthologs in a pair of genomes. A group of paralogs from the input genome is used as the start for the forward search and the criterion for the reverse search, which allows handling many-to-one and many-to-many relationships. By pairwise comparison of genomes with the input species genome, OrthoFocus enables quick identification of orthologs in multiple genomes and generates a multiple alignment of orthologs so that it can further be used in phylogenetic analysis. The program is available at http://www.lipidomics.ru/.
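For readers unfamiliar with the baseline that OrthoFocus extends, the plain reciprocal best hit (RBH) rule can be sketched as follows. This shows only the classical one-to-one rule, not the extended paralog-group version described in the abstract; the score tables and gene names are made-up inputs, and in practice the scores would come from a similarity search tool.

```python
# Illustrative sketch only: classical reciprocal best hits between genomes A and B.
# The extended, paralog-aware search used by OrthoFocus is not reproduced here.

def best_hits(scores):
    """Map each query gene to its highest-scoring target gene."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = target
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical similarity scores; a2 and a3 compete for b2.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 200.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a3"): 205.0, ("b2", "a2"): 175.0}

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

Here `a2` is dropped even though its best hit is `b2`, which illustrates why a strict RBH rule misses many-to-one relationships and why an extension that starts from a group of paralogs is useful.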

11.
The ortholog conjecture posits that orthologous genes are functionally more similar than paralogous genes. This conjecture is a cornerstone of phylogenomics and is used daily by both computational and experimental biologists in predicting, interpreting, and understanding gene functions. A recent study, however, challenged the ortholog conjecture on the basis of experimentally derived Gene Ontology (GO) annotations and microarray gene expression data in human and mouse. It instead proposed that the functional similarity of homologous genes is primarily determined by the cellular context in which the genes act, explaining why a greater functional similarity of (within-species) paralogs than (between-species) orthologs was observed. Here we show that GO-based functional similarity between human and mouse orthologs, relative to that between paralogs, has been increasing in the last five years. Further, compared with paralogs, orthologs are less likely to be included in the same study, causing an underestimation in their functional similarity. A close examination of functional studies of homologs with identical protein sequences reveals experimental biases, annotation errors, and homology-based functional inferences that are labeled in GO as experimental. These problems and the temporary nature of the GO-based finding make the current GO inappropriate for testing the ortholog conjecture. RNA sequencing (RNA-Seq) is known to be superior to microarray for comparing the expressions of different genes or in different species. Our analysis of a large RNA-Seq dataset of multiple tissues from eight mammals and the chicken shows that the expression similarity between orthologs is significantly higher than that between within-species paralogs, supporting the ortholog conjecture and refuting the cellular context hypothesis for gene expression. 
We conclude that the ortholog conjecture remains largely valid to the extent that it has been tested, but further scrutiny using more and better functional data is needed.

12.
An automated comparative analysis of 17 complete microbial genomes   Total citations: 3, self-citations: 0, citations by others: 3
MOTIVATION: As sequenced genomes become larger and sequencing becomes faster, there is a need to develop accurate automated genome comparison techniques and databases to facilitate derivation of genome functionality; identification of enzymes, putative operons and metabolic pathways; and to derive phylogenetic classification of microbes. RESULTS: This paper extends an automated pair-wise genome comparison technique (Bansal et al., Math. Model. Sci. Comput., 9, 1-23, 1998, Bansal and Bork, in First International Workshop of Declarative Languages, Springer, pp. 275-289, 1999) used to identify orthologs and gene groups to derive orthologous genes in a group of genomes and to identify genes with conserved functionality. Seventeen microbial genomes archived at ftp://ncbi.nlm.nih.gov/genbank/genomes have been compared using the automated technique. Data related to orthologs, gene groups, gene duplication, gene fusion, orthologs with conserved functionality, and genes specifically orthologous to Escherichia coli and pathogens has been presented and analyzed. AVAILABILITY: A prototype database is available at ftp://www.mcs.kent.edu/arvind/intellibio/orthos.html. The software is free for academic research under an academic license. The detailed database for every microbial genome in NCBI is commercially available through intellibio software and consultancy corporation (Web site: http://www.mcs.kent.edu/?rvind/intellibio.html). CONTACT: arvind@mcs.kent.edu.

13.
A novel mouse Siglec (mSiglec-F) belonging to the subfamily of Siglec-3-related Siglecs has been cloned and characterized. Unlike most human Siglec-3 (hSiglec-3)-related Siglecs with promiscuous linkage specificity, mSiglec-F shows a strong preference for alpha2-3-linked sialic acids. It is predominantly expressed in immature cells of the myelomonocytic lineage and in a subset of CD11b (Mac-1)-positive cells in some tissues. As with previously cloned Siglec-3-related mSiglecs, the lack of strong sequence similarity to a singular hSiglec made identification of the human ortholog difficult. We therefore conducted a comprehensive comparison of Siglecs between the human and mouse genomes. The mouse genome contains eight Siglec genes, whereas the human genome contains 11 Siglec genes and a Siglec-like gene. Although a one-to-one orthologous correspondence between human and mouse Siglecs 1, 2, and 4 is confirmed, the Siglec-3-related Siglecs showed marked differences between human and mouse. We found only four Siglec genes and two pseudogenes in the mouse chromosome 7 region syntenic to the Siglec-3-related gene cluster on human chromosome 19, which, in contrast, contains seven Siglec genes, a Siglec-like gene, and thirteen pseudogenes. Although analysis of gene maps and exon structures allows tentative assignments of mouse-human Siglec ortholog pairs, the possibility of unequal genetic recombination makes the assignments inconclusive. We therefore support a temporary lettered nomenclature for additional mouse Siglecs. Current information suggests that mSiglec-F is likely a hSiglec-5 ortholog. The previously reported mSiglec-3/CD33 and mSiglec-E/MIS are likely orthologs of hSiglec-3 and hSiglec-9, respectively. The other Siglec-3-like gene in the cluster (mSiglec-G) is probably a hSiglec-10 ortholog. Another mouse gene (mSiglec-H), without an apparent human ortholog, lies outside of the cluster. 
Thus, although some duplications of Siglec-3-related genes predated separation of the primate and rodent lineages (about 80-100 million years ago), this gene cluster underwent extensive duplications in the primate lineage thereafter.

14.
15.
The widely accepted notion that two whole-genome duplications occurred during early vertebrate evolution (the 2R hypothesis) stems from the fact that vertebrates often possess several genes corresponding to a single invertebrate homolog. However, the number of genes predicted by the Human Genome Project is less than twice as many as in the Drosophila melanogaster or Caenorhabditis elegans genomes. This ratio could be explained by two rounds of genome duplication followed by extensive gene loss, by a single genome duplication, by sequential local duplications, or by a combination of any of the above. The traditional method used to distinguish between these possibilities is to reconstruct the phylogenetic relationships of vertebrate genes to their invertebrate orthologs; ratios of invertebrate-to-vertebrate counterparts are then used to infer the number of gene duplication events. The lancelet, amphioxus, is the closest living invertebrate relative of the vertebrates, and unlike protostomes such as flies or nematodes, is therefore the most appropriate outgroup for understanding the genomic composition of the last common ancestor of all vertebrates. We analyzed the relationships of all available amphioxus genes to their vertebrate homologs. In most cases, one to three vertebrate genes are orthologous to each amphioxus gene (median number=2). Clearly this result, and those of previous studies using this approach, cannot distinguish between alternative scenarios of chordate genome expansion. We conclude that phylogenetic analyses alone will never be sufficient to determine whether genome duplication(s) occurred during early chordate evolution, and argue that a "phylogenomic" approach, which compares paralogous clusters of linked genes from complete amphioxus and human genome sequences, will be required if the pattern and process of early chordate genome evolution is ever to be reconstructed.

16.
A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the "functional similarity" between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the "ortholog conjecture" (or, more properly, the "ortholog functional conservation hypothesis"). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. 
Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an "open world assumption" (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis.

17.
We describe a simple theoretical framework for identifying orthologous sets of genes that deviate from a clock-like model of evolution. The approach used is based on comparing the evolutionary distances within a set of orthologs to a standard intergenomic distance, which was defined as the median of the distribution of the distances between all one-to-one orthologs. Under the clock-like model, the points on a plot of intergenic distances versus intergenomic distances are expected to fit a straight line. A statistical technique to identify significant deviations from the clock-like behavior is described. For several hundred analyzed orthologous sets representing three well-defined bacterial lineages, the alpha-Proteobacteria, the gamma-Proteobacteria, and the Bacillus-Clostridium group, the clock-like null hypothesis could not be rejected for approximately 70% of the sets, whereas the rest showed substantial anomalies. Subsequent detailed phylogenetic analysis of the genes with the strongest deviations indicated that over one-half of these genes probably underwent a distinct form of horizontal gene transfer, xenologous gene displacement, in which a gene is displaced by an ortholog from a different lineage. The remaining deviations from the clock-like model could be explained by lineage-specific acceleration of evolution. The results indicate that although xenologous gene displacement is a major force in bacterial evolution, a significant majority of orthologous gene sets in three major bacterial lineages evolved in accordance with the clock-like model. The approach described here allows rapid detection of deviations from this mode of evolution on the genome scale.
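The two quantities in this framework can be illustrated with a short sketch: the intergenomic distance for a genome pair is the median of the distances between its one-to-one orthologs, and the intergenic distances of one orthologous set are then compared with it across genome pairs. The abstract does not specify the statistical test, so the example only fits a least-squares line and reports residuals; fitting through the origin is an assumption, and all names and numbers are hypothetical.

```python
from statistics import median

# Illustrative sketch only: median-based intergenomic distances and a simple
# straight-line fit. The paper's actual significance test is not reproduced.

def intergenomic_distance(ortholog_distances):
    """Median distance over all one-to-one orthologs of one genome pair."""
    return median(ortholog_distances)


def fit_through_origin(xs, ys):
    """Least-squares slope of y = slope * x, plus the residual of each point."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    residuals = [y - slope * x for x, y in zip(xs, ys)]
    return slope, residuals


# Hypothetical distances between one-to-one orthologs for three genome pairs.
genome_pair_distances = {
    ("A", "B"): [0.10, 0.12, 0.08],
    ("A", "C"): [0.30, 0.28, 0.35],
    ("B", "C"): [0.25, 0.20, 0.30],
}
x = [intergenomic_distance(d) for d in genome_pair_distances.values()]

# Intergenic distances for one orthologous set, in the same genome-pair order.
# This toy set evolves twice as fast as the genome median, but still clock-like.
y = [0.20, 0.60, 0.50]

slope, residuals = fit_through_origin(x, y)
print(slope, residuals)
```

A set that evolves uniformly faster or slower than the median still lies on a line; it is a large residual for particular genome pairs that would flag a candidate deviation, such as displacement by a gene from another lineage.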

18.
Thomas, James W. Mammalian Genome, 2003, 14(10): 673-678
Comparative mapping and sequencing of the mouse and human genomes have defined large, conserved chromosomal segments in which gene content and order are highly conserved. These regions span megabase-sized intervals and together comprise the vast majority of both genomes. However, the evolutionary relationships among the small remaining portions of these genomes are not as well characterized. Here we describe the sequencing and annotation of a 341-kb region of mouse Chr 2 containing nine genes, including biliverdin reductase A (Blvra), and its comparison with the orthologous regions of the human and rat genomes. These analyses reveal that the known conserved synteny between mouse Chromosome (Chr) 2 and human Chr 7 reflects an interval containing one gene (Blvra/BLVRA) that is, at most, just 34 kb in the mouse genome. In the mouse, this segment is flanked proximally by genes orthologous to human chromosome 15q21 and distally by genes orthologous to human Chr 2q11. The observed differences between the human and mouse genomes likely resulted from one or more rearrangements in the rodent lineage. In addition to the resulting changes in gene order and location, these rearrangements also appear to have included genomic deletions that led to the loss of at least one gene in the rodent lineage. Finally, we also have identified a recent mouse-specific segmental duplication. These findings illustrate that small genomic regions outside the large mouse–human conserved segments can contain a single gene as well as sequences that are apparently unique to one genome. The nucleotide sequence data reported in this paper have been submitted to GenBank and assigned the accession numbers AC074224 and AC074041.

19.
Despite the great morphological diversity of early embryos, the underlying mechanisms of gastrulation are known to be broadly conserved in vertebrates. However, a number of genes characterized as fulfilling an essential function in this process in several model organisms display no clear ortholog in mammalian genomes. We have devised an in silico phylogenomic approach, based on exhaustive similarity searches in vertebrate genomes and subsequent Bayesian phylogenetic analyses, to identify such missing genes, presumed to be highly divergent. This approach has been used to identify mammalian orthologs of Not, a homeodomain-containing gene previously characterized in Xenopus, chick and zebrafish as playing a critical role in the formation of the notochord. This attempt led to the identification of a highly divergent mammalian Not-related gene in the mouse, human and rat. The results from phylogenetic reconstructions, synteny analyses, expression pattern analyses in wild-type and mutant mouse embryos, and overexpression experiments in Xenopus embryos converge to confirm these genes as representatives of the Not family in mammals. The identification of the mammalian Not gene delivers an important component for the understanding of the genetics underlying notochord formation in mammals and its evolution among vertebrates. The phylogenomic method used to retrieve this gene thus provides a tool that can complement or validate genome annotations in situations when they are weakly supported.

20.
We previously reported two graph algorithms for analysis of genomic information: a graph comparison algorithm to detect locally similar regions called correlated clusters and an algorithm to find a graph feature called P-quasi complete linkage. Based on these algorithms we have developed an automatic procedure to detect conserved gene clusters and align orthologous gene orders in multiple genomes. In the first step, the graph comparison is applied to pairwise genome comparisons, where the genome is considered as a one-dimensionally connected graph with genes as its nodes, and correlated clusters of genes that share sequence similarities are identified. In the next step, the P-quasi complete linkage analysis is applied to grouping of related clusters and conserved gene clusters in multiple genomes are identified. In the last step, orthologous relations of genes are established among each conserved cluster. We analyzed 17 completely sequenced microbial genomes and obtained 2313 clusters when the completeness parameter P was 40%. About one quarter contained at least two genes that appeared in the metabolic and regulatory pathways in the KEGG database. This collection of conserved gene clusters is used to refine and augment ortholog group tables in KEGG and also to define ortholog identifiers as an extension of EC numbers.
