Similar articles
 20 similar articles found (search time: 15 ms)
1.
Domains are the building blocks of all globular proteins, and are units of compact three-dimensional structure as well as evolutionary units. There is a limited repertoire of domain families, so that these domain families are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products. The processes that produce new genes are duplication and recombination as well as gene fusion and fission. We attempt to gain an overview of these processes by studying the structural domains in the proteins of seven genomes from the three kingdoms of life: Eubacteria, Archaea and Eukaryota. We use here the domain and superfamily definitions in the Structural Classification of Proteins (SCOP) database in order to map pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 624 out of the 764 superfamilies in SCOP in these genomes, and the 624 families occur in 585 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour. This type of pattern can be described by a scale-free network. Finally, we study domain repeats and we compare the set of the domain combinations in the genomes to those in PDB, and discuss the implications for structural genomics.
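The mapping of adjacent domain pairs to superfamily combinations described in this abstract can be illustrated with a minimal sketch. It assumes each protein is given as an ordered list of superfamily names, treats pairs as undirected, and skips tandem repeats of the same superfamily; these choices, the function names and the toy data are illustrative assumptions, not the authors' protocol.

```python
from collections import defaultdict


def adjacent_pairs(architecture):
    """Return the pairs of neighbouring superfamilies along one protein."""
    return list(zip(architecture, architecture[1:]))


def partner_counts(proteins):
    """Map each superfamily to the number of distinct superfamilies it neighbours."""
    partners = defaultdict(set)
    for architecture in proteins:
        for left, right in adjacent_pairs(architecture):
            if left != right:  # assumption: repeats of one superfamily are not a combination
                partners[left].add(right)
                partners[right].add(left)
    return {family: len(neighbours) for family, neighbours in partners.items()}


# Hypothetical toy proteome: each protein is an ordered list of superfamily labels.
toy_proteins = [
    ["P-loop", "Rossmann"],
    ["P-loop", "SH3"],
    ["P-loop", "Rossmann", "Rossmann"],
    ["SH2", "SH3", "Kinase"],
]
counts = partner_counts(toy_proteins)
```

In a scale-free pattern of the kind the abstract reports, most entries of `counts` would be 1 or 2 while a few superfamilies would have many partners.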

2.
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. 
Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.

3.
There is a limited repertoire of domain families in nature that are duplicated and combined in different ways to form the set of proteins in a genome. Most proteins in both prokaryote and eukaryote genomes consist of two or more domains, and we show that the family size distribution of multi-domain protein families follows a power law like that of individual families. Most domain pairs occur in four to six different domain architectures: in isolation and in combinations with different partners. We showed previously that within the set of all pairwise domain combinations, most small and medium-sized families are observed in combination with one or two other families, while a few large families are very versatile and combine with many different partners. Though this may appear to be a stochastic pattern, in which large families have more combination partners by virtue of their size, we establish here that all the domain families with more than three members in genomes are duplicated more frequently than would be expected by chance considering their number of neighbouring domains. This duplication of domain pairs is statistically significant for between one and three quarters of all families with seven or more members. For the majority of pairwise domain combinations, there is no known three-dimensional structure of the two domains together, and we term these novel combinations. Novel domain combinations are interesting and important targets for structural elucidation, as the geometry and interaction between the domains will help understand the function and evolution of multi-domain proteins. Of particular interest are those combinations that occur in the largest number of multi-domain proteins, and several of these frequent novel combinations contain DNA-binding domains.
Abbreviations: SCOP, Structural Classification of Proteins database; PDB, Protein Data Bank; HMM, hidden Markov model.

4.
Lee D, Grant A, Marsden RL, Orengo C. Proteins 2005, 59(3):603-615
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g. COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structural genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.

5.
Abeln S, Deane CM. Proteins 2005, 60(4):690-700
We review fold usage on completed genomes to explore protein structure evolution. The patterns of presence or absence of folds on genomes give us insights into the relationships between folds, the age of different folds and how we have arrived at the set of folds we see today. We examine the relationships between different measures which describe protein fold usage, such as the number of copies of a fold per genome, the number of families per fold, and the number of genomes a fold occurs on. We obtained these measures of fold usage by searching for the structural domains on 157 completed genome sequences from all three kingdoms of life. In our comparisons of these measures we found that bacteria have relatively more distinct folds on their genomes than archaea. Eukaryotes were found to have many more copies of a fold on their genomes. If we separate out the different fold classes, the alpha/beta class has relatively fewer distinct folds on large genomes, more copies of a fold on bacteria and more folds occurring in all three kingdoms simultaneously. These results possibly indicate that most alpha/beta folds originated earlier than other folds. The expected power law distribution is observed for copies of a fold per genome and we found a similar distribution for the number of families per fold. However, a more complicated distribution appears for fold occurrence across genomes, which strongly depends on fold class and kingdom. We also show that there is not a clear relationship between the three measures of fold usage. A fold which occurs on many genomes does not necessarily have many copies on each genome. Similarly, folds with many copies do not necessarily have many families or vice versa.

6.
We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.

7.
There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (approximately 13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence-based protocols employing Hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single-seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D-HMM library, CATH-ISL increased the coverage to 86%. The single-seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss-Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.

8.
Protein translations of over 100 complete genomes are now available. About half of these sequences can be provided with structural annotation, thereby enabling some profound insights into protein and pathway evolution. Whereas the major domain structure families are common to all kingdoms of life, these are combined in different ways in multidomain proteins to give various domain architectures that are specific to kingdoms or individual genomes, and contribute to the diverse phenotypes observed. These data argue for more targets in structural genomics initiatives and particularly for the selection of different domain architectures to gain better insights into protein functions.

9.
Convergent evolution of domain architectures (is rare)   Cited by: 4 (self-citations: 0, citations by others: 4)
MOTIVATION: In this paper, we shall examine the evolution of domain architectures across 62 genomes of known phylogeny including all kingdoms of life. We look in particular at the possibility of convergent evolution, with a view to determining the extent to which the architectures observed in the genomes are due to functional necessity or evolutionary descent. We used domains of known structure, because from this and other information we know their evolutionary relationships. We use a range of methods including phylogenetic grouping, sequence similarity/alignment, mutation rates and comparative genomics to approach this difficult problem from several angles. RESULTS: Although we do not claim an exhaustive analysis, we conclude that between 0.4 and 4% of sequences are involved in convergent evolution of domain architectures, and expect the actual number to be close to the lower bound. We also made two incidental observations, albeit on a small sample: the events leading to convergent evolution appear to be random with no functional or structural preferences, and changes in the number of tandem repeat domains occur more readily than changes which alter the domain composition. CONCLUSION: The principal conclusion is that the observed domain architectures of the sequences in the genomes are driven by evolutionary descent rather than functional necessity. CONTACT: gough@supfam.org.

10.
Domains are the evolutionary units that comprise proteins, and most proteins are built from more than one domain. Domains can be shuffled by recombination to create proteins with new arrangements of domains. Using structural domain assignments, we examined the combinations of domains in the proteins of 131 completely sequenced organisms. We found two-domain and three-domain combinations that recur in different protein contexts with different partner domains. The domains within these combinations have a particular functional and spatial relationship. These units are larger than individual domains and we term them "supra-domains". Amongst the supra-domains, we identified some 1400 (1203 two-domain and 166 three-domain) combinations that are statistically significantly over-represented relative to the occurrence and versatility of the individual component domains. Over one-third of all structurally assigned multi-domain proteins contain these over-represented supra-domains. This means that investigation of the structural and functional relationships of the domains forming these popular combinations would be particularly useful for an understanding of multi-domain protein function and evolution as well as for genome annotation. These and other supra-domains were analysed for their versatility, duplication, their distribution across the three kingdoms of life and their functional classes. By examining the three-dimensional structures of several examples of supra-domains in different biological processes, we identify two basic types of spatial relationships between the component domains: the combined function of the two domains is such that either the geometry of the two domains is crucial and there is a tight constraint on the interface, or the precise orientation of the domains is less important and they are spatially separate. Frequently, the role of the supra-domain becomes clear only once the three-dimensional structure is known. 
Since this is the case for only a quarter of the supra-domains, we provide a list of the most important unknown supra-domains as potential targets for structural genomics projects.

11.
Evolutionary innovation in eukaryotes and especially animals is at least partially driven by genome rearrangements and the resulting emergence of proteins with new domain combinations, and thus potentially novel functionality. Given the random nature of such rearrangements, one could expect that proteins with particularly useful multidomain combinations may have been rediscovered multiple times by parallel evolution. However, existing reports suggest a minimal role of this phenomenon in the overall evolution of eukaryotic proteomes. We assembled a collection of 172 complete eukaryotic genomes that is not only the largest, but also the most phylogenetically complete set of genomes analyzed so far. By employing a maximum parsimony approach to compare repertoires of Pfam domains and their combinations, we show that independent evolution of domain combinations is significantly more prevalent than previously thought. Our results indicate that about 25% of all currently observed domain combinations have evolved multiple times. Interestingly, this percentage is even higher for sets of domain combinations in individual species, with, for instance, 70% of the domain combinations found in the human genome having evolved independently at least once in other species. We also show that previous, much lower estimates of this rate are most likely due to the small number and biased phylogenetic distribution of the genomes analyzed. The process of independent emergence of identical domain combinations is widespread and is not limited to domains with specific functional categories. Besides data from large-scale analyses, we also present individual examples of independent domain combination evolution. The surprisingly large contribution of parallel evolution to the development of the domain combination repertoire in extant genomes has profound consequences for our understanding of the evolution of pathways and cellular processes in eukaryotes and for comparative functional genomics.

12.
A role for tRNA modifications in genome structure and codon usage   Cited by: 1 (self-citations: 0, citations by others: 1)
Transfer RNA (tRNA) gene content is a differentiating feature of genomes that contributes to the efficiency of the translational apparatus, but the principles shaping tRNA gene copy number and codon composition are poorly understood. Here, we report that the emergence of two specific tRNA modifications shaped the structure and composition of all extant genomes. Through the analysis of more than 500 genomes, we identify two kingdom-specific tRNA modifications as major contributors that separated archaeal, bacterial, and eukaryal genomes in terms of their tRNA gene composition. We show that, contrary to prior observations, genomic codon usage and tRNA gene frequencies correlate in all kingdoms if these two modifications are taken into account and that presence or absence of these modifications explains patterns of gene expression observed in previous studies. Finally, we experimentally demonstrate that human gene expression levels correlate well with genomic codon composition if these identified modifications are considered.

13.
The Solute Carriers (SLCs) are membrane proteins that regulate transport of many types of substances over the cell membrane. The SLCs are found in at least 46 gene families in the human genome. Here, we performed the first evolutionary analysis of the entire SLC family based on whole genome sequences. We systematically mined and analyzed the genomes of 17 species to identify SLC genes. In all, we identified 4,813 SLC sequences in these genomes, and we delineated the evolutionary history of each of the subgroups. Moreover, we also identified ten new human sequences not previously classified as SLCs, which most likely belong to the SLC family. We found that 43 of the 46 SLC families found in Homo sapiens were also found in Caenorhabditis elegans, whereas 42 of them were also found in insects. Mammals have a higher number of SLC genes in most families, perhaps reflecting important roles for these in central nervous system functions. This study provides a systematic analysis of the evolutionary history of the SLC families in Eukaryotes showing that the SLC superfamily is ancient with multiple branches that were present before early divergence of Bilateria. The results provide a foundation for overall classification of SLC genes and are valuable for annotation and prediction of substrates for the many SLCs that have not been tested in experimental transport assays.

14.
Availability of the human genome data has enabled the exploration of a huge amount of biological information encoded in it. There are extensive ongoing experimental efforts to understand the biological functions of the gene products encoded in the human genome. However, computational analysis can aid immensely in the interpretation of biological function by associating known functional/structural domains to the human proteins. In this article we have discussed the implications of such associations. The association of structural domains to human proteins could help in prioritizing the targets for structure determination in the structural genomics initiatives. The protein kinase family is one of the most frequently occurring protein domain families in the human proteome while P-loop hydrolase, which comprises many GTPases and ATPases, is a highly represented superfamily. Using the superfamily relationships between families of unknown and known structures we could increase structural information content of the human genome by about 5%. We could also make new associations of domain families to 33 human proteins that are potentially linked to genetically inherited diseases.

15.
Genomic trees have been constructed based on the presence and absence of families of protein-encoding genes observed in 27 complete genomes, including genomes of 15 free-living organisms. This method does not rely on the identification of suspected orthologs in each genome, nor the specific alignment used to compare gene sequences because the protein-encoding gene families are formed by grouping any protein with a pairwise similarity score greater than a preset value. Because of this all-inclusive grouping, this method is resilient to some effects of lateral gene transfer because transfers of genes are masked when the recipient genome already has a homolog (not necessarily an ortholog) of the incoming gene. Of 71 genes suspected to have been laterally transferred to the genome of Aeropyrum pernix, only approximately 7 to 15 represent genes where a lateral gene transfer appears to have generated homoplasy in our character dataset. The genomic tree of the 15 free-living taxa includes six different bacterial orders, six different archaeal orders, and two different eukaryotic kingdoms. The results are remarkably similar to results obtained by analysis of rRNA. Inclusion of the other 12 genomes resulted in a tree only broadly similar to that suggested by rRNA with at least some of the differences due to artifacts caused by the small genome size of many of these species. Very small genomes, such as those of the two Mycoplasma genomes included, fall to the base of the Bacterial domain, a result expected due to the substantial gene loss inherent to these lineages. Finally, artificial "partial genomes" were generated by randomly selecting ORFs from the complete genomes in order to test our ability to recover the tree generated by the whole genome sequences when only partial data are available. The results indicated that partial genomic data, when sampled randomly, could robustly recover the tree generated by the whole genome sequences. 
Received: 30 May 2001 / Accepted: 10 October 2001
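The presence/absence characters and the random "partial genomes" mentioned in this abstract can be sketched as follows. The abstract does not state how the trees were built, so this sketch stops at a character matrix and a simple mismatch count between genomes; the function names, the toy genomes and the use of a Hamming-style count are assumptions made for illustration only.

```python
import random


def presence_profile(genome_families, all_families):
    """Encode one genome as 0/1 characters, one per gene family."""
    return [1 if family in genome_families else 0 for family in all_families]


def mismatches(profile_a, profile_b):
    """Count the families present in exactly one of two genomes."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def partial_genome(orfs, fraction, rng):
    """Randomly keep a fraction of ORFs to mimic an incompletely sequenced genome."""
    keep = max(1, round(len(orfs) * fraction))
    return rng.sample(orfs, keep)


# Hypothetical toy data: genome name -> set of gene-family labels.
toy_genomes = {
    "A": {"f1", "f2", "f3"},
    "B": {"f1", "f2"},
    "C": {"f3", "f4"},
}
all_families = sorted(set().union(*toy_genomes.values()))
profiles = {name: presence_profile(fams, all_families) for name, fams in toy_genomes.items()}

# Hypothetical ORF identifiers for one genome, subsampled to half its size.
subset = partial_genome(["orf1", "orf2", "orf3", "orf4"], 0.5, random.Random(1))
```

A tree-building method of the reader's choice could then be applied to `profiles`, once for the full genomes and once for families recovered from the subsampled ORFs, to compare the two topologies.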

16.
Evolution of protein superfamilies and bacterial genome size   Cited by: 1 (self-citations: 0, citations by others: 1)
We present the structural annotation of 56 different bacterial species based on the assignment of genes to 816 evolutionary superfamilies in the CATH domain structure database. These assignments have enabled us to analyse the recurrence of specific superfamilies within and across the genomes. We have selected the superfamilies that have a very broad representation and therefore appear to be universally distributed in a significant number of bacterial lineages. Occurrence profiles of these universally distributed superfamilies are compared with genome size in order to estimate the correlation between superfamily duplication and the increase in proteome size. This distinguishes between those size-dependent superfamilies where frequency of occurrence is highly correlated with increase in genome size, and size-independent superfamilies where no correlation is observed. Consideration of the size correlation and the ratio between the mean and the standard deviations for all the superfamily profiles allows more detailed subdivisions and classification of superfamilies. For example, within the size-independent superfamilies, we distinguished a group that are distributed evenly amongst all the genomes. Within the size-dependent superfamilies we differentiated two groups: linearly distributed and non-linearly distributed. Functional annotation using the COG database was performed for all superfamilies in each of these groups, and this revealed significant differences amongst the three sets of superfamilies. Evenly distributed, size-independent domains are shown to be involved primarily in protein translation and biosynthesis. For the size-dependent superfamilies, linearly distributed superfamilies are involved mainly in metabolism, and non-linearly distributed superfamily domains are involved principally in gene regulation.
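The comparison of occurrence profiles with genome size can be illustrated with a Pearson correlation. The abstract does not name the correlation measure or any cut-off, so Pearson's r, the function name and the toy numbers below are assumptions chosen only to show the distinction between size-dependent and size-independent profiles.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / sqrt(var_x * var_y)


# Hypothetical genome sizes (number of genes) and per-genome domain counts.
genome_sizes = [1000, 2000, 3000, 4000]
size_dependent_profile = [10, 20, 30, 40]   # grows with genome size
size_independent_profile = [5, 6, 6, 5]     # roughly constant across genomes

r_dependent = pearson(genome_sizes, size_dependent_profile)
r_independent = pearson(genome_sizes, size_independent_profile)
```

With these toy profiles the first correlation is close to 1 and the second close to 0, which mirrors the two classes described in the abstract.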

17.
Analysis of the evolution of paralogous genes in a genome is central to our understanding of genome evolution. Comparison of closely related bacterial genomes, which has provided clues as to how genome sequences evolve under natural conditions, would help in such an analysis. For the species Staphylococcus aureus, whole-genome sequences have been determined for seven strains. We compared their DNA sequences to detect large genome polymorphisms and to deduce mechanisms of genome rearrangements that have formed each of them. We first compared strains N315 and Mu50, which form one of the most closely related strain pairs, at single-nucleotide resolution to catalogue all the middle-sized (more than 10 bp) to large genome polymorphisms such as indels and substitutions. These polymorphisms include two paralogous gene sets, one in a tandem paralogue gene cluster for toxins in a genomic island and the other in a ribosomal RNA operon. We also focused on two other tandem paralogue gene clusters and type I restriction-modification (RM) genes on the genomic islands. Then we reconstructed rearrangement events responsible for these polymorphisms, in the paralogous genes and the others, with reference to the other five genomes. For the tandem paralogue gene clusters, we were able to infer sequences for homologous recombination generating the change in the repeat number. These sequences were conserved among the repeated paralogous units, likely because of their functional importance. The sequence specificity (S) subunit of type I RM systems showed recombination, likely at the homology of a conserved region, between the two variable regions for sequence specificity. We also noticed novel alleles in the ribosomal RNA operons and suggested a role for illegitimate recombination in their formation. These results reveal the importance of recombination involving long conserved sequences in the evolution of paralogous genes in the genome.

18.
MOTIVATION: The analysis of repeated elements in genomes is a fascinating domain of research that is lacking relevant tools for transposable elements (TEs), the most complex ones. The dynamics of TEs, which provides the main mechanism of mutation in some genomes, is an essential component of genome evolution. In this study we introduce a new concept of domain, a segmentation unit useful for describing the architecture of different copies of TEs. Our method extracts occurrences of a terminus-defined family of TEs, aligns the sequences, finds the domains in the alignment and searches the distribution of each domain in sequences. After a classification step relative to the presence or the absence of domains, the method results in a graphical view of sequences segmented into domains. RESULTS: Analysis of the new non-autonomous TE AtREP21 in the model plant Arabidopsis thaliana reveals copies of very different sizes and various combinations of domains which show the potential of our method. AVAILABILITY: DomainOrganizer web page is available at www.irisa.fr/symbiose/DomainOrganizer/.

19.
Protein domains represent the basic evolutionary units that form proteins. Domain duplication and shuffling by recombination are probably the most important forces driving protein evolution and hence the complexity of the proteome. While the duplication of whole genes as well as domain-encoding exons increases the abundance of domains in the proteome, domain shuffling increases versatility, i.e. the number of distinct contexts in which a domain can occur. Here, we describe a comprehensive, genome-wide analysis of the relationship between these two processes. We observe a strong and robust correlation between domain versatility and abundance: domains that occur more often also have many different combination partners. This supports the view that domain recombination occurs in a random way. However, we do not observe all the different combinations that are expected from a simple random recombination scenario, and this is due to frequent duplication of specific domain combinations. When we simulate the evolution of the protein repertoire considering stochastic recombination of domains followed by extensive duplication of the combinations, we approximate the observed data well. Our analyses are consistent with a stochastic process that governs domain recombination and thus protein divergence with respect to domains within a polypeptide chain. At the same time, they support a scenario in which domain combinations are formed only once during the evolution of the protein repertoire, and are then duplicated to various extents. The extent of duplication of different combinations varies widely and, in nature, will depend on selection for the domain combination based on its function. Some of the pair-wise domain combinations that are highly duplicated also recur frequently with other partner domains, and thus represent evolutionary units larger than single protein domains, which we term "supra-domains".
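The idea of stochastic recombination followed by duplication, and the two quantities the abstract relates (abundance and versatility), can be illustrated with a toy model. This is not the authors' simulation: the two-domain proteins, the parameter `p_new`, the uniform choice of domains and all names are assumptions, and unlike the scenario in the abstract this toy can form the same pair more than once by chance.

```python
import random
from collections import Counter


def simulate(n_domains, n_steps, p_new, rng):
    """Toy model: each step either forms a random two-domain combination
    (probability p_new) or duplicates a combination that already exists."""
    proteins = []  # list of (domain_a, domain_b) pairs, duplicates allowed
    for _ in range(n_steps):
        if not proteins or rng.random() < p_new:
            pair = tuple(sorted(rng.sample(range(n_domains), 2)))
        else:
            pair = rng.choice(proteins)  # frequent pairs are duplicated more often
        proteins.append(pair)
    return proteins


def abundance_and_versatility(proteins):
    """Abundance = occurrences of a domain; versatility = number of distinct partners."""
    abundance = Counter()
    partners = {}
    for a, b in proteins:
        abundance[a] += 1
        abundance[b] += 1
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    versatility = {domain: len(found) for domain, found in partners.items()}
    return abundance, versatility


toy_proteins = simulate(n_domains=20, n_steps=200, p_new=0.2, rng=random.Random(0))
abundance, versatility = abundance_and_versatility(toy_proteins)
```

Plotting `versatility` against `abundance` for such runs is one way to explore how duplication alone changes the relationship between the two measures.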

20.
The Adhesion G-protein-coupled receptors (GPCRs) are the most complex gene family among GPCRs with large genomic size, multiple introns, and a fascinating flora of functional domains, though the evolutionary origin of this family has been obscure. Here we studied the evolution of all class B (7tm2)-related genes, including the Adhesion, Secretin, and Methuselah families of GPCRs with a focus on nine genomes. We found that the cnidarian genome of Nematostella vectensis has a remarkably rich set of Adhesion GPCRs with a broad repertoire of N-terminal domains although this genome did not have any Secretin GPCRs. Moreover, the single-celled and colony-forming eukaryotes Monosiga brevicollis and Dictyostelium discoideum contain Adhesion-like GPCRs although these genomes do not have any Secretin GPCRs suggesting that the Adhesion types of GPCRs are the most ancient among class B GPCRs. Phylogenetic analysis found Adhesion group V (that contains GPR133 and GPR144) to be the closest relative to the Secretin family in the Adhesion family. Moreover, Adhesion group V sequences in N. vectensis share the same splice site setup as the Secretin GPCRs. Additionally, one of the most conserved motifs in the entire Secretin family is only found in group V of the Adhesion family. We suggest therefore that the Secretin family of GPCRs could have descended from group V Adhesion GPCRs. We found a set of unique Adhesion-like GPCRs in N. vectensis that have long N-termini containing one Somatomedin B domain each, which is a domain configuration similar to that of a set of Adhesion-like GPCRs found in Branchiostoma floridae. These sequences show slight similarities to Methuselah sequences found in insects. The extended class B GPCRs have a very complex evolutionary history with several species-specific expansions, and we identified at least 31 unique N-terminal domains originating from other protein classes. 
The overall N-terminal domain structure, however, concurs with the phylogenetic analysis of the transmembrane domains, thus enabling us to track the origin of most of the subgroups.
