Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
As the number of protein folds is quite limited, a mode of analysis that will be increasingly common in the future, especially with the advent of structural genomics, is to survey and re-survey the finite parts list of folds from an expanding number of perspectives. We have developed a new resource, called PartsList, that lets one dynamically perform these comparative fold surveys. It is available on the web at http://bioinfo.mbb.yale.edu/partslist and http://www.partslist.org. The system is based on the existing fold classifications and functions as a form of companion annotation for them, providing 'global views' of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the approximately 420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm versus yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein-protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein-protein interaction data with structural information is a particularly novel feature of our system. 
We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons and a numerical rankings correlator. These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to V^(-b), for attribute value V and constant exponent b), with a few folds having large values and most having small values.
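
The idea of comparing very different attributes through ranks can be illustrated with a small sketch. This is not the PartsList implementation: the function names (`ranks`, `rank_correlation`) and the tie-handling choice (tied values share their average rank) are assumptions made for illustration. Each attribute is converted to ranks, and two attributes are then compared with a Pearson correlation computed on those ranks (i.e. a Spearman-style rank correlation).

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks (1 = largest value); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend the group while the next value ties with the current one.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_correlation(attr_a, attr_b):
    """Compare two attributes of the same folds in the uniform format of ranks."""
    return pearson(ranks(attr_a), ranks(attr_b))
```

For example, `rank_correlation(genome_occurrence, expression_level)` would take two hypothetical per-fold lists measured in unrelated units and return a value between -1 and 1, since only the ordering of the folds enters the comparison.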

2.
Hegyi H, Lin J, Greenbaum D, Gerstein M. Proteins, 2002, 47(2): 126-141
We conducted a structural genomics analysis of the folds and structural superfamilies in the first 20 completely sequenced genomes by focusing on the patterns of fold usage and trying to identify structural characteristics of typical and atypical folds. We assigned folds to sequences using PSI-BLAST, run with a systematic protocol to reduce the amount of computational overhead. On average, folds could be assigned to about a fourth of the ORFs in the genomes and about a fifth of the amino acids in the proteomes. More than 80% of all the folds in the SCOP structural classification were identified in at least one of the 20 organisms, with worm and E. coli having the largest number of distinct folds. Folds are particularly effective at comprehensively measuring levels of gene duplication, because they group together even very remote homologues. Using folds, we find the average level of duplication varies depending on the complexity of the organism, ranging from 2.4 in M. genitalium to 32 for the worm, values significantly higher than those observed based purely on sequence similarity. We rank the common folds in the 20 organisms, finding that the top three are the P-loop NTP hydrolase, the ferredoxin fold, and the TIM-barrel, and discuss in detail the many factors that affect and bias these rankings. We also identify atypical folds that are "unique" to one of the organisms in our study and compare the characteristics of these folds with the most common ones. We find that common folds tend to be more multifunctional and associated with more regular, "symmetrical" structures than the unique ones. In addition, many of the unique folds are associated with proteins involved in cell defense (e.g., toxins). We analyze specific patterns of fold occurrence in the genomes by associating some of them with instances of horizontal transfer and others with gene loss. In particular, we find three possible examples of transfer between archaea and bacteria and six between eukarya and bacteria. We make available our detailed results at http://genecensus.org/20.

3.

Background

As tertiary structure is currently available only for a fraction of known protein families, it is important to assess what parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structure and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomic initiatives (SGI) provide such a sample? What are approximate total numbers of structure-based superfamilies and folds among soluble globular domains?

Results

To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring dynamics of the assigned structure set in time, with the accumulation of experimentally solved structures. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias suggesting that target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To account for the observed bias, we propose a new non-parametric approach to the estimation of the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database.

Conclusion

The set of currently solved protein structures allows for structure prediction in approximately a third of sequence-based domain families. The choice of targets for structure determination is biased towards domains with many sequence-based homologs. The growing SGI output in the future should further contribute to the reduction of this bias. The total numbers of structural superfamilies and folds in the COG database are estimated as ~4000 and ~1700, respectively. These numbers are respectively four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.

4.
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in some respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we describe how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being composed of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias. One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.

5.
Evolution of protein superfamilies and bacterial genome size (total citations: 1; self-citations: 0; citations by others: 1)
We present the structural annotation of 56 different bacterial species based on the assignment of genes to 816 evolutionary superfamilies in the CATH domain structure database. These assignments have enabled us to analyse the recurrence of specific superfamilies within and across the genomes. We have selected the superfamilies that have a very broad representation and therefore appear to be universally distributed in a significant number of bacterial lineages. Occurrence profiles of these universally distributed superfamilies are compared with genome size in order to estimate the correlation between superfamily duplication and the increase in proteome size. This distinguishes between those size-dependent superfamilies where frequency of occurrence is highly correlated with increase in genome size, and size-independent superfamilies where no correlation is observed. Consideration of the size correlation and the ratio between the mean and the standard deviation for all the superfamily profiles allows more detailed subdivision and classification of superfamilies. For example, within the size-independent superfamilies, we distinguished a group that is distributed evenly amongst all the genomes. Within the size-dependent superfamilies we differentiated two groups: linearly distributed and non-linearly distributed. Functional annotation using the COG database was performed for all superfamilies in each of these groups, and this revealed significant differences amongst the three sets of superfamilies. Evenly distributed, size-independent domains are shown to be involved primarily in protein translation and biosynthesis. For the size-dependent superfamilies, linearly distributed superfamilies are involved mainly in metabolism, and non-linearly distributed superfamily domains are involved principally in gene regulation.
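
The two quantities this abstract relies on, the correlation of a superfamily's occurrence profile with genome size and the ratio of the profile's mean to its standard deviation, can be sketched as follows. This is only an illustration under stated assumptions: the function name `classify_superfamily`, the use of a Pearson correlation, and the cut-off `r_threshold=0.7` are hypothetical and are not taken from the paper, which does not state its thresholds here.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def classify_superfamily(occurrences, genome_sizes, r_threshold=0.7):
    """Label one occurrence profile (one count per genome).

    r_threshold is a hypothetical cut-off chosen for illustration only.
    Returns (label, correlation, mean-to-standard-deviation ratio).
    """
    r = pearson(occurrences, genome_sizes)
    sd = pstdev(occurrences)
    # A perfectly even profile has zero spread, so the ratio is unbounded.
    ratio = mean(occurrences) / sd if sd > 0 else float("inf")
    label = "size-dependent" if r >= r_threshold else "size-independent"
    return label, r, ratio
```

A high mean-to-standard-deviation ratio combined with a low correlation would correspond to the evenly distributed, size-independent group described above; separating linear from non-linear size dependence would need a further step that this sketch does not attempt.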

6.
We present a prototype of a new database tool, GeneCensus, which focuses on comparing genomes globally, in terms of the collective properties of many genes, rather than in terms of the attributes of a single gene (e.g. sequence similarity for a particular ortholog). The comparisons are presented in a visual fashion over the web at GeneCensus.org. The system concentrates on two types of comparisons: (i) trees based on the sharing of generalized protein families between genomes, and (ii) whole pathway analysis in terms of activity levels. For the trees, we have developed a module (TreeViewer) that clusters genomes in terms of the folds, superfamilies or orthologs they share (all of which can be considered generalized ‘families’ or ‘protein parts’), and compares the resulting trees side-by-side with those built from sequence similarity of individual genes (e.g. a traditional tree built on ribosomal similarity). We also include comparisons to trees built on whole-genome dinucleotide or codon composition. For pathway comparisons, we have implemented a module (PathwayPainter) that graphically depicts, in selected metabolic pathways, the fluxes or expression levels of the associated enzymes (i.e. generalized ‘activities’). One can, consequently, compare organisms (and organism states) in terms of representations of these systemic quantities. Development of this module involved compiling, calculating and standardizing flux and expression information from many different sources. We illustrate pathway analysis for enzymes involved in central metabolism. We are able to show that, to some degree, flux and expression fluctuations have characteristic values in different sections of the central metabolism and that control points in this system (e.g. hexokinase, pyruvate kinase, phosphofructokinase, isocitrate dehydrogenase and citrate synthase) tend to be especially variable in flux and expression. Both the TreeViewer and PathwayPainter modules connect to other information sources related to individual-gene or organism properties (e.g. a single-gene structural annotation viewer).

7.
The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often-stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds, respectively. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes. From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.

8.
Analyses of genome sequences have revealed a surprisingly variable distribution of genes, reflecting the generation of novel genes, lateral gene transfer and gene loss. The impact of gene loss on organisms has been difficult to examine, but the loss of protein coding genes, the loss of domains within proteins and the divergence of genes have made surprising contributions to the differences among organisms. This paper reviews surveys of gene loss and divergence in fungal and archaeal genomes that indicate suites of functionally related genes tend to undergo loss and divergence. Instances of fungal gene loss highlighted here suggest that specific cellular systems have changed, such as Ca2+ biology in Saccharomyces cerevisiae and peroxisome function in Schizosaccharomyces pombe. Analyses of loss and divergence can provide specific predictions regarding protein-protein interactions, and the relationship between networks of protein interactions and loss may form a part of a parametric model of genome evolution.

9.
The age of genomics has given us a wealth of information and the tools to study whole genomes. This, in turn, has facilitated genome-wide studies among organisms that were relatively less studied in the pre-genomic era or are non-model organisms. This paves the way to the discovery of interesting evolutionary patterns, which are brought to light by genome-wide surveys of protein superfamilies. Phosphorylation is a post-translational modification that is utilised across all clades of life, and acts as an important signalling switch, regulating several cellular processes. Tyrosine phosphatases, which are found predominantly in eukaryotes, act on phosphorylated tyrosine residues and sometimes on other substrates. Extending our previous effort to look for tyrosine phosphatases in the human genome, we have looked for sequences of the cysteine-based tyrosine phosphatase superfamily in thirty mammalian genomes from all across Mammalia and validated the sequences by the presence of the signature catalytic motif. Domain architecture annotation, followed by in-depth analysis, revealed interesting taxon-specific patterns such as subtle differences between the protein families in marsupials and early mammals versus placental mammals. Finally, we discuss an interesting case of loss of the tyrosine phosphatase domain from a gene product in the course of eutherian evolution.

10.
11.

Background  

Sequence-related families of genes and proteins are common in bacterial genomes. In Escherichia coli they constitute over half of the genome. The presence of families and superfamilies of proteins suggests a history of gene duplication and divergence during evolution. Genome-encoded protein families, their size and functional composition, reflect the metabolic potential of the organisms they are found in. Comparing protein families of different organisms gives insight into functional differences and similarities.

12.
In this work we develop a microscopic physical model of early evolution where phenotype (organism life expectancy) is directly related to genotype (the stability of its proteins in their native conformations), which can be determined exactly in the model. Simulating the model on a computer, we consistently observe the “Big Bang” scenario whereby exponential population growth ensues as soon as favorable sequence-structure combinations (precursors of stable proteins) are discovered. Upon that, random diversity of the structural space abruptly collapses into a small set of preferred proteins. We observe that protein folds remain stable and abundant in the population at timescales much greater than mutation or organism lifetime, and the distribution of the lifetimes of dominant folds in a population approximately follows a power law. The separation of evolutionary timescales between discovery of new folds and generation of new sequences gives rise to emergence of protein families and superfamilies whose sizes are power-law distributed, closely matching the same distributions for real proteins. On the population level we observe emergence of species, i.e. subpopulations that carry similar genomes. Further, we present a simple theory that relates stability of evolving proteins to the sizes of emerging genomes. Together, these results provide a microscopic first-principles picture of how first-gene families developed in the course of early evolution.

13.
During the past decade, ancient gene duplications were recognized as one of the main forces in the generation of diverse gene families and the creation of new functional capabilities. New tools developed to search data banks for homologous sequences, and an increased availability of reliable three-dimensional structural information, led to the recognition that proteins with diverse functions can belong to the same superfamily. Analyses of the evolution of these superfamilies promise to provide insights into early evolution but are complicated by several important evolutionary processes. Horizontal transfer of genes can lead to a vertical spread of innovations among organisms; therefore, finding a certain property in some descendants of an ancestor does not guarantee that it was present in that ancestor. Complete or partial gene conversion between duplicated genes can yield phylogenetic trees with several, apparently independent gene duplications, suggesting an often surprising parallelism in the evolution of independent lineages. Additionally, the breakup of domains within a protein and the fusion of domains into multifunctional proteins make the delineation of superfamilies a task that remains difficult to automate.

14.
Genome sequencing has revealed that horizontal gene transfer (HGT) is a major evolutionary process in bacteria. Although it is generally assumed that closely related organisms engage in genetic exchange more frequently than distantly related ones, the frequency of HGT among distantly related organisms and the effect of ecological relatedness on the frequency have not been rigorously assessed. Here, we devised a novel bioinformatic pipeline, which minimized the effect of over-representation of specific taxa in the available databases and other limitations of homology-based approaches by analyzing genomes in standardized triplets, to quantify gene exchange between bacterial genomes representing different phyla. Our analysis revealed the existence of networks of genetic exchange between organisms with overlapping ecological niches, with mesophilic anaerobic organisms showing the highest frequency of exchange and engaging in HGT twice as frequently as their aerobic counterparts. Examination of individual cases suggested that inter-phylum HGT is more pronounced than previously thought, affecting up to ~16% of the total genes and ~35% of the metabolic genes in some genomes (conservative estimation). In contrast, ribosomal and other universal protein-coding genes were subjected to HGT at least 150 times less frequently than genes encoding the most promiscuous metabolic functions (for example, various dehydrogenases and ABC transport systems), suggesting that the species tree based on the former genes may be reliable. These results indicated that the metabolic diversity of microbial communities within most habitats has been largely assembled from preexisting genetic diversity through HGT and that HGT accounts for the functional redundancy among phyla.

15.
In addition to the nuclear genome, organisms have organelle genomes. Most of the DNA present in eukaryotic organisms is located in the cell nucleus. Chloroplasts have independent genomes which are inherited from the mother. Duplicated genes are common in the genomes of all organisms. It is believed that gene duplication is the most important step for the origin of genetic variation, leading to the creation of new genes and new gene functions. Although extensive gene duplications are rare in the chloroplast genome, gene duplication in the chloroplast genome is an essential source of new genetic functions and a mechanism of neo-evolution. The events of gene transfer between the chloroplast genome and nuclear genome via duplication and subsequent recombination are important processes in evolution. Duplicated genes and genomes in the nucleus have been the subject of several recent reviews. In this review, we will briefly summarize gene duplication and evolution in the chloroplast genome. Also, we will provide an overview of gene transfer events between chloroplast and nuclear genomes.

16.
With the advent of larger genome databases, detection of horizontal gene transfer events has become an increasingly important issue. Here we present a simple theoretical analysis based on the in silico artificial addition of known foreign genes from different prokaryotic groups into the genome of Escherichia coli K12 MG1655. Using this dataset as a control, we have tested the efficiency of four methodologies commonly employed to detect horizontally transferred genes (HTG), which are based on (a) the codon adaptation index, codon usage, and GC percentage (CAI/GC); (b) a distributional profile (DP) approach made by a gene search in the closely related phylogenetic genomes; (c) a Bayesian model (BM); and (d) a first-order Markov model (MM). All methods exhibit limitations although, as shown here, the BM and the MM are better approximations. Moreover, the MM has demonstrated a more accurate rate of detection when genes from closely related organisms are evaluated. The application of the MM to detect recently transferred genes in the genomes of E. coli strains K12 MG1655, O157 EDL933, and Salmonella typhimurium shows that these organisms have undergone a rather significant amount of HTG, most of which appear to be pseudogenes. Few of these sequences that have undergone HGT appear to have well-defined functions and may be involved in the organism's adaptation.
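
To make the idea of a first-order Markov model for compositional screening more concrete, here is a minimal sketch. It is not the model used in the study: the function names (`train_markov`, `mean_log_likelihood`), the pseudocount of 1, and the use of a per-transition average log-likelihood as the score are all assumptions for illustration. Transition probabilities between consecutive nucleotides are estimated from a host sequence, and a candidate gene that scores much lower than typical host genes would be flagged as compositionally atypical.

```python
from math import log

ALPHABET = "ACGT"


def train_markov(sequence, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | previous).

    The pseudocount (an illustrative choice) keeps unseen transitions
    from receiving zero probability.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for prev, cur in zip(sequence, sequence[1:]):
        if prev in counts and cur in counts[prev]:
            counts[prev][cur] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def mean_log_likelihood(sequence, model):
    """Average log-probability per transition; symbols outside ACGT are skipped."""
    pairs = [(p, c) for p, c in zip(sequence, sequence[1:])
             if p in model and c in model[p]]
    if not pairs:
        raise ValueError("sequence has no scorable transitions")
    return sum(log(model[p][c]) for p, c in pairs) / len(pairs)
```

In practice one would train on the whole host genome and compare each gene's score with the distribution of scores over all host genes; where to place the cut-off is a separate decision that this sketch leaves open.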

17.
Cao YQ, Ma C, Chen JY, Yang DR. BMC Genomics, 2012, 13(1): 276
Background

Lepidoptera encompasses more than 160,000 described species that have been classified into 45-48 superfamilies. The previously determined Lepidoptera mitochondrial genomes (mitogenomes) are limited to six superfamilies of the most derived lepidopteran lineage Ditrysia. Compared with the ancestral insect gene order, these mitogenomes all contain a tRNA rearrangement. To gain new insights into Lepidoptera mitogenome evolution, we sequenced the mitogenomes of two ghost moths that belong to primitive lepidopteran lineages and conducted a comparative mitogenomic analysis across Lepidoptera.

Results

The mitogenomes of Thitarodes renzhiensis and T. yunnanensis are 16,173 bp and 15,814 bp long with an A+T content of 81.28% and 82.33%, respectively. Different tandem repeats in the A+T-rich region mainly account for the size difference between the two mitogenomes. Both mitogenomes include 13 protein-coding genes, 22 transfer RNA genes, and 2 ribosomal RNA genes. The 1,584-bp sequence from rrnS to nad2 was also determined for Thitarodes sp.QL, which has no repetitive sequence in the A+T-rich region. All three Thitarodes species possess the ancestral gene order with trnI-trnQ-trnM located between the A+T-rich region and nad2, which is different from the gene order trnM-trnI-trnQ in all previously sequenced Lepidoptera species. The formerly identified conserved elements of Lepidoptera mitogenomes (i.e. the motif 'ATAGA' and poly-T stretch in the A+T-rich region and the long intergenic spacer upstream of nad2) are absent in the Thitarodes mitogenomes. The phylogenetic analysis supports a basal position for Hepialoidea, represented by T. renzhiensis and T. yunnanensis, among the seven superfamilies currently sampled. The relationships of the other six superfamilies are (((((Bombycoidea + Geometroidea) + Noctuoidea) + Pyraloidea) + Papilionoidea) + Tortricoidea).

Conclusion

The mitogenomes of T. renzhiensis and T. yunnanensis exhibit unusual features compared with the previously determined Lepidoptera mitogenomes. Their ancestral gene order indicates that the tRNA rearrangement event occurred after Lepidoptera diverged from other holometabolous insect orders. Phylogenetic analysis based on mitogenome sequences is a powerful tool for addressing phylogenetic relationships among major Lepidoptera superfamilies. Characterization of the two ghost moth mitogenomes has enriched our knowledge of Lepidoptera mitogenomes and contributed to our understanding of the mechanisms underlying mitogenome evolution, especially gene rearrangements.

18.
While it is well accepted that horizontal gene transfer plays an important role in the evolution and the diversification of prokaryotic genomes, many questions remain open regarding its functional mechanisms of action and its interplay with the extant genome. This study addresses the relationship between proteome innovation by horizontal gene transfer and genome content in Proteobacteria. We characterize the transferred genes, focusing on the protein domain compositions and their relationships with the existing protein domain superfamilies in the genome. In agreement with previous observations, we find that the protein domain architectures of horizontally transferred genes are significantly shorter than the genomic average. Furthermore, protein domains that are more common in the total pool of genomes appear to have a proportionally higher chance of being transferred. This suggests that transfer events behave as if they were drawn randomly from a cross-genomic community gene pool, much as gene duplicates are drawn from a genomic gene pool. Finally, horizontally transferred genes carry domains of exogenous families less frequently in larger genomes, although they may do so more often than expected by chance.

19.
20.