首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
MOTIVATION: Gene duplications and losses (GDLs) are important events in genome evolution. They result in expansion or contraction of gene families, with a likely role in phenotypic evolution. As more genomes become available and their annotations are improved, software programs capable of rapidly and accurately identifying the content of ancestral genomes and the timings of GDLs become necessary to understand the unique evolution of each lineage. RESULTS: We report EvolMAP, a new algorithm and software that utilizes a species tree-based gene clustering method to join all-to-all symmetrical similarity comparisons of multiple gene sets in order to infer the gene composition of multiple ancestral genomes. The algorithm further uses Dollo parsimony-based comparison of the inferred ancestral genes to pinpoint the timings of GDLs onto evolutionary intervals marked by speciation events. Using EvolMAP, first we analyzed the expansion of four families of G-protein coupled receptors (GPCRs) within animal lineages. Additional to demonstrating the unique expansion tree for each family, results also show that the ancestral eumetazoan genome contained many fewer GPCRs than modern animals, and these families expanded through concurrent lineage-specific duplications. Second, we analyzed the history of GDLs in mammalian genomes by comparing seven proteomes. In agreement with previous studies, we report that the mammalian gene family sizes have changed drastically through their evolution. Interestingly, although we identified a potential source of duplication for 75% of the gained genes, remaining 25% did not have clear-cut sources, revealing thousands of genes that have likely gained their distinct sequence identities within the descent of mammals. AVAILABILITY: Query server, source code and executable are available at http://kosik-web.mcdb.ucsb.edu/evolmap/index.htm .  相似文献   

2.
A major goal of structural genomics is the provision of a structural template for a large fraction of protein domains. The magnitude of this task depends on the number and nature of protein sequence families. With a large number of bacterial genomes now fully sequenced, it is possible to obtain improved estimates of the number and diversity of families in that kingdom. We have used an automated clustering procedure to group all sequences in a set of genomes into protein families. Bench-marking shows the clustering method is sensitive at detecting remote family members, and has a low level of false positives. This comprehensive protein family set has been used to address the following questions. (1) What is the structure coverage for currently known families? (2) How will the number of known apparent families grow as more genomes are sequenced? (3) What is a practical strategy for maximizing structure coverage in future? Our study indicates that approximately 20% of known families with three or more members currently have a representative structure. The study indicates also that the number of apparent protein families will be considerably larger than previously thought: We estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes have been sequenced. However, the vast majority of these families will be small, and it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families.  相似文献   

3.
Thirty-two genome sequences of various Vibrionaceae members are compared, with emphasis on what makes V. cholerae unique. As few as 1,000 gene families are conserved across all the Vibrionaceae genomes analysed; this fraction roughly doubles for gene families conserved within the species V. cholerae. Of these, approximately 200 gene families that cluster on various locations of the genome are not found in other sequenced Vibrionaceae; these are possibly unique to the V. cholerae species. By comparing gene family content of the analysed genomes, the relatedness to a particular species is identified for two unspeciated genomes. Conversely, two genomes presumably belonging to the same species have suspiciously dissimilar gene family content. We are able to identify a number of genes that are conserved in, and unique to, V. cholerae. Some of these genes may be crucial to the niche adaptation of this species.  相似文献   

4.
MCScan is an algorithm able to scan multiple genomes or subgenomes in order to identify putative homologous chromosomal regions, and align these regions using genes as anchors. The MCScanX toolkit implements an adjusted MCScan algorithm for detection of synteny and collinearity that extends the original software by incorporating 14 utility programs for visualization of results and additional downstream analyses. Applications of MCScanX to several sequenced plant genomes and gene families are shown as examples. MCScanX can be used to effectively analyze chromosome structural changes, and reveal the history of gene family expansions that might contribute to the adaptation of lineages and taxa. An integrated view of various modes of gene duplication can supplement the traditional gene tree analysis in specific families. The source code and documentation of MCScanX are freely available at http://chibba.pgml.uga.edu/mcscan2/.  相似文献   

5.
We developed a novel efficient scheme, DEFOG (for "deciphering families of genes"), for determining sequences of numerous genes from a family of interest. The scheme provides a powerful means to obtain a gene family composition in species for which high-throughput genomic sequencing data are not available. DEFOG uses two key procedures. The first is a novel algorithm for designing highly degenerate primers based on a set of known genes from the family of interest. These primers are used in PCR reactions to amplify the members of the gene family. The second combines oligofingerprinting of the cloned PCR products with clustering of the clones based on their fingerprints. By selecting members from each cluster, a low-redundancy clone subset is chosen for sequencing. We applied the scheme to the human olfactory receptor (OR) genes. OR genes constitute the largest gene superfamily in the human genome, as well as in the genomes of other vertebrate species. DEFOG almost tripled the size of the initial repertoire of human ORs in a single experiment, and only 7% of the PCR clones had to be sequenced. Extremely high degeneracies, reaching over a billion combinations of distinct PCR primer pairs, proved to be very effective and yielded only 0.4% nonspecific products.  相似文献   

6.
Bacteriophage genomes show pervasive mosaicism, indicating the importance of horizontal gene exchange in their evolution. Phage genomes represent unique combinations of modules, each of them with a different phylogenetic history. The traditional classification, based on a variety of criteria such as nucleic acid type (single/double-stranded DNA/RNA), morphology, and host range, appeared inconsistent with sequence analyses. With the genomic era, an ever increasing number of sequenced phages cannot be classified, in part due to a lack of morphological information and in part to the intrinsic incapability of tree-based methods to efficiently deal with mosaicism. This problem led some virologists to call for a moratorium on the creation of additional taxa in the order Caudovirales, in order to let virologists discuss classification schemes that might better suit phage evolution. In this context, we propose a framework for a reticulate classification of phages based on gene content. Starting from gene families, we built a weighted graph, where nodes represent phages and edges represent phage-phage similarities in terms of shared genes. We then apply various measures of graph topology to analyze the resulting graph. Most double-stranded DNA phages are found in a single component. The values of the clustering coefficient and closeness distinguish temperate from virulent phages, whereas chimeric phages are characterized by a high betweenness coefficient. We apply a 2-step clustering method to this graph to generate a reticulate classification of phages: Each phage is associated with a membership vector, which quantitatively characterizes its membership to the set of clusters. Furthermore, we cluster genes based on their "phylogenetic profiles" to define "evolutionary cohesive modules." In virulent phages, evolutionary modules span several functional categories, whereas in temperate phages they correspond better to functional modules. Moreover, despite the fact that modules only cover a fraction of all phage genes, phage groups can be distinguished by their different combination of modules, serving the bases for a higher level reticulate classification. These 2 classification schemes provide an automatic and dynamic way of representing the relationships within the phage population and can be extended to include newly sequenced phage genomes, as well as other types of genetic elements.  相似文献   

7.
Predicting protein-coding genes still remains a significant challenge. Although a variety of computational programs that use commonly machine learning methods have emerged, the accuracy of predictions remains a low level when implementing in large genomic sequences. Moreover, computational gene finding in newly se- quenced genomes is especially a difficult task due to the absence of a training set of abundant validated genes. Here we present a new gene-finding program, SCGPred, to improve the accuracy of prediction by combining multiple sources of evidence. SCGPred can perform both supervised method in previously well-studied genomes and unsupervised one in novel genomes. By testing with datasets composed of large DNA sequences from human and a novel genome of Ustilago maydi, SCGPred gains a significant improvement in comparison to the popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, which is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.  相似文献   

8.
Hahn MW  Han MV  Han SG 《PLoS genetics》2007,3(11):e197
Comparison of whole genomes has revealed large and frequent changes in the size of gene families. These changes occur because of high rates of both gene gain (via duplication) and loss (via deletion or pseudogenization), as well as the evolution of entirely new genes. Here we use the genomes of 12 fully sequenced Drosophila species to study the gain and loss of genes at unprecedented resolution. We find large numbers of both gains and losses, with over 40% of all gene families differing in size among the Drosophila. Approximately 17 genes are estimated to be duplicated and fixed in a genome every million years, a rate on par with that previously found in both yeast and mammals. We find many instances of extreme expansions or contractions in the size of gene families, including the expansion of several sex- and spermatogenesis-related families in D. melanogaster that also evolve under positive selection at the nucleotide level. Newly evolved gene families in our dataset are associated with a class of testes-expressed genes known to have evolved de novo in a number of cases. Gene family comparisons also allow us to identify a number of annotated D. melanogaster genes that are unlikely to encode functional proteins, as well as to identify dozens of previously unannotated D. melanogaster genes with conserved homologs in the other Drosophila. Taken together, our results demonstrate that the apparent stasis in total gene number among species has masked rapid turnover in individual gene gain and loss. It is likely that this genomic revolving door has played a large role in shaping the morphological, physiological, and metabolic differences among species.  相似文献   

9.
D Frishman  A Mironov  M Gelfand 《Gene》1999,234(2):257-265
Exact mapping of gene starts is an important problem in the computer-assisted functional analysis of newly sequenced prokaryotic genomes. We describe an algorithm for finding ribosomal binding sites without a learning sample. This algorithm is particularly useful for analysis of genomes with little or no experimentally mapped genes. There is a clear correlation between the ribosomal binding site (RBS) properties of a given genome and the potential gene start prediction accuracy. This correlation is of considerable predictive power and may be useful for estimating the expected success of future genome analysis efforts. We also demonstrate that the RBS properties depend on the phylogenetic position of a genome.  相似文献   

10.
We present an annotation pipeline that accurately predicts exon–intron structures and protein-coding sequences (CDSs) on the basis of full-length cDNAs (FLcDNAs). This annotation pipeline was used to identify genes in 10 plant genomes. In particular, we show that interspecies mapping of FLcDNAs to genomes is of great value in fully utilizing FLcDNA resources whose availability is limited to several species. Because low sequence conservation at 5′- and 3′-ends of FLcDNAs between different species tends to result in truncated CDSs, we developed an improved algorithm to identify complete CDSs by the extension of both ends of truncated CDSs. Interspecies mapping of 71 801 monocot FLcDNAs to the Oryza sativa genome led to the detection of 22 142 protein-coding regions. Moreover, in comparing two mapping programs and three ab initio prediction programs, we found that our pipeline was more capable of identifying complete CDSs. As demonstrated by monocot interspecies mapping, in which nucleotide identity between FLcDNAs and the genome was ∼80%, the resultant inferred CDSs were sufficiently accurate. Finally, we applied both inter- and intraspecies mapping to 10 monocot and dicot genomes and identified genes in 210 551 loci. Interspecies mapping of FLcDNAs is expected to effectively predict genes and CDSs in newly sequenced genomes.  相似文献   

11.
MBGD is a workbench system for comparative analysis of completely sequenced microbial genomes. The central function of MBGD is to create an orthologous gene classification table using precomputed all-against-all similarity relationships among genes in multiple genomes. In MBGD, an automated classification algorithm has been implemented so that users can create their own classification table by specifying a set of organisms and parameters. This feature is especially useful when the user's interest is focused on some taxonomically related organisms. The created classification table is stored into the database and can be explored combining with the data of individual genomes as well as similarity relationships among genomes. Using these data, users can carry out comparative analyses from various points of view, such as phylogenetic pattern analysis, gene order comparison and detailed gene structure comparison. MBGD is accessible at http://mbgd.genome.ad.jp/.  相似文献   

12.
The identification of single copy (1-to-1) orthologs in any group of organisms is important for functional classification and phylogenetic studies. The Metazoa are no exception, but only recently has there been a wide-enough distribution of taxa with sufficiently high quality sequenced genomes to gain confidence in the wide-spread single copy status of a gene.Here, we present a phylogenetic approach for identifying overlooked single copy orthologs from multigene families and apply it to the Metazoa. Using 18 sequenced metazoan genomes of high quality we identified a robust set of 1,126 orthologous groups that have been retained in single copy since the last common ancestor of Metazoa. We found that the use of the phylogenetic procedure increased the number of single copy orthologs found by over a third more than standard taxon-count approaches. The orthologs represented a wide range of functional categories, expression profiles and levels of divergence.To demonstrate the value of our set of single copy orthologs, we used them to assess the completeness of 24 currently published metazoan genomes and 62 EST datasets. We found that the annotated genes in published genomes vary in coverage from 79% (Ciona intestinalis) to 99.8% (human) with an average of 92%, suggesting a value for the underlying error rate in genome annotation, and a strategy for identifying single copy orthologs in larger datasets. In contrast, the vast majority of EST datasets with no corresponding genome sequence available are largely under-sampled and probably do not accurately represent the actual genomic complement of the organisms from which they are derived.  相似文献   

13.
14.
Gene duplication and divergence is a major evolutionary force. Despite the growing number of fully sequenced genomes, methods for investigating these events on a genome-wide scale are still in their infancy. Here, we present SYNERGY, a novel and scalable algorithm that uses sequence similarity and a given species phylogeny to reconstruct the underlying evolutionary history of all genes in a large group of species. In doing so, SYNERGY resolves homology relations and accurately distinguishes orthologs from paralogs. We applied our approach to a set of nine fully sequenced fungal genomes spanning 150 million years, generating a genome-wide catalog of orthologous groups and corresponding gene trees. Our results are highly accurate when compared to a manually curated gold standard, and are robust to the quality of input according to a novel jackknife confidence scoring. The reconstructed gene trees provide a comprehensive view of gene evolution on a genomic scale. Our approach can be applied to any set of sequenced eukaryotic species with a known phylogeny, and opens the way to systematic studies of the evolution of individual genes, molecular systems and whole genomes. Supplementary information: Supplementary data are available at Bioinformatics online.  相似文献   

15.
MOTIVATION: Microarrays have become a central tool in biological research. Their applications range from functional annotation to tissue classification and genetic network inference. A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns. This translates to the algorithmic problem of clustering genes based on their expression patterns. RESULTS: We present a novel clustering algorithm, called CLICK, and its applications to gene expression analysis. The algorithm utilizes graph-theoretic and statistical techniques to identify tight groups (kernels) of highly similar elements, which are likely to belong to the same true cluster. Several heuristic procedures are then used to expand the kernels into the full clusters. We report on the application of CLICK to a variety of gene expression data sets. In all those applications it outperformed extant algorithms according to several common figures of merit. We also point out that CLICK can be successfully used for the identification of common regulatory motifs in the upstream regions of co-regulated genes. Furthermore, we demonstrate how CLICK can be used to accurately classify tissue samples into disease types, based on their expression profiles. Finally, we present a new java-based graphical tool, called EXPANDER, for gene expression analysis and visualization, which incorporates CLICK and several other popular clustering algorithms. AVAILABILITY: http://www.cs.tau.ac.il/~rshamir/expander/expander.html  相似文献   

16.

Background  

We present here the PhIGs database, a phylogenomic resource for sequenced genomes. Although many methods exist for clustering gene families, very few attempt to create truly orthologous clusters sharing descent from a single ancestral gene across a range of evolutionary depths. Although these non-phylogenetic gene family clusters have been used broadly for gene annotation, errors are known to be introduced by the artifactual association of slowly evolving paralogs and lack of annotation for those more rapidly evolving. A full phylogenetic framework is necessary for accurate inference of function and for many studies that address pattern and mechanism of the evolution of the genome. The automated generation of evolutionary gene clusters, creation of gene trees, determination of orthology and paralogy relationships, and the correlation of this information with gene annotations, expression information, and genomic context is an important resource to the scientific community.  相似文献   

17.
T-box genes are defined by the presence of a conserved sequence, the so-called T-box; this codes for the T-domain, which is involved in DNA-binding and protein dimerisation. Members of this gene family have been found in all metazoans, from diploblasts to humans, and mutations in T-box gene family members in humans have been linked to several congenital disorders. Sequencing of the complete genomes of a range of invertebrate and vertebrate species has allowed the classification of individual T-box genes into five subfamilies: Brachyury, T-brain1, Tbx1, Tbx2 and Tbx6. This review will largely focus on T-box genes identified in organisms whose genomes have been fully sequenced, emphasising how comparative studies of the T-box gene family will help to reveal the roles of these genes during development and in the adult.  相似文献   

18.
Phylogenies involving nonmodel species are based on a few genes, mostly chosen following historical or practical criteria. Because gene trees are sometimes incongruent with species trees, the resulting phylogenies may not accurately reflect the evolutionary relationships among species. The increase in availability of genome sequences now provides large numbers of genes that could be used for building phylogenies. However, for practical reasons only a few genes can be sequenced for a wide range of species. Here we asked whether we can identify a few genes, among the single-copy genes common to most fungal genomes, that are sufficient for recovering accurate and well-supported phylogenies. Fungi represent a model group for phylogenomics because many complete fungal genomes are available. An automated procedure was developed to extract single-copy orthologous genes from complete fungal genomes using a Markov Clustering Algorithm (Tribe-MCL). Using 21 complete, publicly available fungal genomes with reliable protein predictions, 246 single-copy orthologous gene clusters were identified. We inferred the maximum likelihood trees using the individual orthologous sequences and constructed a reference tree from concatenated protein alignments. The topologies of the individual gene trees were compared to that of the reference tree using three different methods. The performance of individual genes in recovering the reference tree was highly variable. Gene size and the number of variable sites were highly correlated and significantly affected the performance of the genes, but the average substitution rate did not. Two genes recovered exactly the same topology as the reference tree, and when concatenated provided high bootstrap values. The genes typically used for fungal phylogenies did not perform well, which suggests that current fungal phylogenies based on these genes may not accurately reflect the evolutionary relationships among species. Analyses on subsets of species showed that the phylogenetic performance did not seem to depend strongly on the sample. We expect that the best-performing genes identified here will be very useful for phylogenetic studies of fungi, at least at a large taxonomic scale. Furthermore, we compare the method developed here for finding genes for building robust phylogenies with previous ones and we advocate that our method could be applied to other groups of organisms when more complete genomes are available.  相似文献   

19.
We explored the biological diversity of hemiascomycetous yeasts using a set of 22000 newly identified genes in 13 species through BLASTX searches. Genes without clear homologue in Saccharomyces cerevisiae appeared to be conserved in several species, suggesting that they were recently lost by S. cerevisiae. They often identified well-known species-specific traits. Cases of gene acquisition through horizontal transfer appeared to occur very rarely if at all. All identified genes were ascribed to functional classes. Functional classes were differently represented among species. Species classification by functional clustering roughly paralleled rDNA phylogeny. Unequal distribution of rapidly evolving, ascomycete-specific, genes among species and functions was shown to contribute strongly to this clustering. A few cases of gene family amplification were documented, but no general correlation could be observed between functional differentiation of yeast species and variations of gene family sizes. Yeast biological diversity seems thus to result from limited species-specific gene losses or duplications, and for a large part from rapid evolution of genes and regulatory factors dedicated to specific functions.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号