Similar literature
1.
Here, we present an automatic assignment of potential cognate ligands to domains of enzymes in the CATH and SCOP protein domain classifications on the basis of structural data available in the wwPDB. This procedure involves two steps: first, we assign the binding of particular ligands to particular domains; second, we compare the chemical similarity of the PDB ligands to ligands in KEGG in order to assign cognate ligands. We find that use of the Enzyme Commission (EC) numbers is necessary to enable efficient and accurate cognate ligand assignment. The PROCOGNATE database currently has cognate ligand mapping for 3277 (4118) protein structures and 351 (302) superfamilies, as described by the CATH (and SCOP) databases, respectively. We find that just under half of all ligands are only and always bound by a single domain, with 16% bound by more than one domain and the remainder of the ligands showing a variety of binding modes. This finding has implications for domain recombination and the evolution of new protein functions. Domain architecture or context is also found to affect substrate specificity of particular domains, and we discuss example cases. The most popular PDB ligands are all found to be generic components of crystallisation buffers, highlighting the non-cognate ligand problem inherent in the PDB. In contrast, the most popular cognate ligands are all found to be universal cellular currencies of reducing power and energy such as NADH, FADH2 and ATP, respectively, reflecting the fact that the vast majority of enzymatic reactions utilise one of these popular co-factors. These ligands all share a common adenine ribonucleotide moiety, suggesting that many different domain superfamilies have converged to bind this chemical framework.
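
The second step described above (matching a PDB ligand to the most chemically similar KEGG ligand) can be illustrated with a minimal sketch. The abstract does not say which similarity measure was used; the Tanimoto coefficient over sets of structural features is an assumption here, and the function names and feature labels are hypothetical. The candidate dictionary is assumed to have been restricted beforehand to ligands of reactions with the enzyme's EC number.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(a | b)


def best_cognate(pdb_ligand, candidates):
    """Return (name, score) for the candidate most similar to pdb_ligand.

    candidates: dict mapping a ligand name to its feature set
    (hypothetical stand-in for KEGG ligands filtered by EC number).
    """
    scored = ((name, tanimoto(pdb_ligand, feats))
              for name, feats in candidates.items())
    return max(scored, key=lambda item: item[1])


# Toy feature sets, invented purely for illustration.
bound = {"adenine", "ribose", "phosphate"}
kegg = {
    "ATP": {"adenine", "ribose", "phosphate", "triphosphate"},
    "glucose": {"hexose"},
}
print(best_cognate(bound, kegg))
```

An empty candidate dictionary makes `max` raise `ValueError`, so a caller would need to handle enzymes with no candidate ligands separately.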

2.
Domains are the building blocks of all globular proteins, and are units of compact three-dimensional structure as well as evolutionary units. There is a limited repertoire of domain families, so that these domain families are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products. The processes that produce new genes are duplication and recombination as well as gene fusion and fission. We attempt to gain an overview of these processes by studying the structural domains in the proteins of seven genomes from the three kingdoms of life: Eubacteria, Archaea and Eukaryota. We use here the domain and superfamily definitions in Structural Classification of Proteins Database (SCOP) in order to map pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 624 out of the 764 superfamilies in SCOP in these genomes, and the 624 families occur in 585 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour. This type of pattern can be described by a scale-free network. Finally, we study domain repeats and we compare the set of the domain combinations in the genomes to those in PDB, and discuss the implications for structural genomics.

3.
4.
Many protein classification systems capture homologous relationships by grouping domains into families and superfamilies on the basis of sequence similarity. Superfamilies with similar 3D structures are further grouped into folds. In the absence of discernable sequence similarity, these structural similarities were long thought to have originated independently, by convergent evolution. However, the growth of databases and advances in sequence comparison methods have led to the discovery of many distant evolutionary relationships that transcend the boundaries of superfamilies and folds. To investigate the contributions of convergent versus divergent evolution in the origin of protein folds, we clustered representative domains of known structure by their sequence similarity, treating them as point masses in a virtual 2D space which attract or repel each other depending on their pairwise sequence similarities. As expected, families in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some links connect superfamilies with different folds. They arise from modular peptide fragments of between 20 and 40 residues that co‐occur in the connected folds in disparate structural contexts. These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in the RNA world and from which the first folded proteins arose by amplification and recombination. Our galaxy of folds summarizes, in a single image, most known and many yet undescribed homologous relationships between protein superfamilies, providing new insights into the evolution of protein domains.

5.
The repertoire of naturally occurring protein structures is usually characterised in structural terms at the domain level by their constituent folds. As structure is acknowledged to be an important stepping stone to the understanding of protein function, an appreciation of how individual domain interactions are built to form complete, functional protein structures is essential. A comprehensive study of protein domain interactions has been undertaken, covering all those observed in known structures, as well as those predicted to occur in 46 completed genome sequences from all three domains of life. In particular, we examine the promiscuity of protein domains characterised by SCOP superfamilies in terms of their interacting partners, the surface they use to form these interactions, and the relative orientations of their domain partners. Protein domains are shown to display a variety of behaviours, ranging from high promiscuity to absolute monogamy of domain surface employed, with both multiple and single domain partners. In addition, the conservation of sequence and volume at domain interface surfaces is observed to be significantly higher than at accessible surface in general, acting as a powerful potential predictor for domain interactions. We also examine the separation of interacting domains in protein sequence, showing that standard thresholds of 30 amino acid residues lead to a significant false positive rate, and an even more significant false negative rate of approximately 40%. These data suggest that there may be many more than the 2000 domain-domain interactions that have not yet been observed structurally, and we provide a top 30 hit-list of putative domain interactions which should be targeted.

6.
Using a data set of aligned protein domain superfamilies of known three-dimensional structure, we compared the location of interdomain interfaces on the tertiary folds between members of distantly related protein domain superfamilies. The data set analyzed is comprised of interdomain interfaces, with domains occurring within a polypeptide chain and those between two polypeptide chains. We observe that, in general, the interfaces between protein domains are formed entirely in different locations on the tertiary folds in such pairs. This variation in the location of interface happens in protein domains involved in a wide range of functions, such as enzymes, adapters, and domains that bind protein ligands, or cofactors. While basic biochemical functionality is preserved at the domain superfamily level, the effect of biochemical function on protein assemblies is different in these protein domains related by superfamily. The divergence between proteins, in most cases, is coupled with domain recruitment, with different modes of interaction with the recruited domain. This is in complete contrast to the observation that in closely related homologous protein domains, almost always the interaction interfaces are topologically equivalent. In a small subset of interacting domains within proteins related by remote homology, we observe that the relative positioning of domains with respect to one another is preserved. Based on the analysis of multidomain proteins of known or unknown structure, we suggest that variation in protein-protein interactions in members within a superfamily could serve as diverging points in otherwise parallel metabolic or signaling pathways. We discuss a few representative cases of diverging pathways involving domains in a superfamily.

7.
Superfamily classifications are based variably on similarity of sequences, global folds, local structures, or functions. We have examined the possibility of defining superfamilies purely from the viewpoint of the global fold/function relationship. For this purpose, we first classified protein domains according to the beta-sheet topology. We then introduced the concept of kinship relations among the classified beta-sheet topology by assuming that the major elementary event leading to creation of a new beta-sheet topology is either an addition or deletion of one beta-strand at the edge of an existing beta-sheet during the molecular evolution. Based on this kinship relation, a network of protein domains was constructed so that the distance between a pair of domains represents the number of evolutionary events that lead one from the other domain. We then mapped on it all known domains with a specific core chemical function (here taken, as an example, that involving ATP or its analogs). Careful analyses revealed that the domains are found distributed on the network as >20 mutually disjointed clusters. The proteins in each cluster are defined to form a fold-based superfamily. The results indicate that >20 ATP-binding protein superfamilies have been invented independently in the process of molecular evolution, and the conservative evolutionary diffusion of global folds and functions is the origin of the relationship between them.

8.
Fliess A, Motro B, Unger R. Proteins 2002, 48(2):377-387
An important question in protein evolution is to what extent proteins may have undergone swaps (switches of domain or fragment order) during evolution. Such events might have occurred in several forms: swaps of short fragments, swaps of structural and functional motifs, or recombination of domains in multidomain proteins. This question is important for the theoretical understanding of the evolution of proteins, and has practical implications for using swaps as a design tool in protein engineering. In order to analyze the question systematically, we conducted a large-scale survey of possible swaps and permutations among all pairs of proteins from the Swissprot database. A swap is defined as a specific kind of sequence mutation between two proteins in which two fragments that appear in both sequences have different relative order in the two sequences. For example, aXbYc and dYeXf are defined as a swap, where X and Y represent sequence fragments that switched their order. Identifying such swaps is difficult using standard sequence comparison packages. One of the main problems in the analysis stems from the fact that many sequences contain repeats, which may be identified as false-positive swaps. We have used two different approaches to detect pairs of proteins with swaps. The first approach is based on the predefined list of domains in Pfam. We identified all the proteins that share at least two domains and analyzed their relative order, looking for pairs in which the order of these domains was switched. We designed an algorithm to distinguish between real swaps and duplications. In the second approach, we used Blast to detect pairs of proteins that share several fragments. Then, we used an automatic procedure to select pairs that are likely to contain swaps. Those pairs were analyzed visually, using a graphical tool, to eliminate duplications.
Combining these approaches, about 140 different cases of swaps in the Swissprot database were found (after eliminating multiple pairs within the same family). Some of the cases have been described in the literature, but many are novel examples. Although each new example identified may be interesting to analyze, our main conclusion is that cases of swaps are rare in protein evolution. This observation is at odds with the common view that proteins are very modular to the point that modules (e.g., domains) can be shuffled between proteins with minimal constraints. Our study suggests that sequential constraints, i.e., the relative order between domains, are highly conserved.
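
The swap definition in this abstract (aXbYc versus dYeXf) can be made concrete with a small sketch that works on ordered lists of domain identifiers, in the spirit of the Pfam-based approach. This is not the authors' algorithm: the function name is hypothetical, and the way repeats are handled (simply ignoring any domain that occurs more than once in either protein) is a simplifying assumption standing in for their procedure for separating real swaps from duplications.

```python
from itertools import combinations


def find_swaps(domains_a, domains_b):
    """Return pairs (X, Y) ordered X..Y in protein A but Y..X in protein B.

    domains_a, domains_b: domain identifiers in N- to C-terminal order.
    Domains repeated within either protein are skipped (assumption),
    because repeats can look like swaps when they are duplications.
    """
    unique_a = {d for d in domains_a if domains_a.count(d) == 1}
    unique_b = {d for d in domains_b if domains_b.count(d) == 1}
    shared = unique_a & unique_b

    pos_a = {d: i for i, d in enumerate(domains_a) if d in shared}
    pos_b = {d: i for i, d in enumerate(domains_b) if d in shared}

    swaps = []
    # Iterate over shared domains in their order of appearance in A,
    # so x always precedes y in A.
    for x, y in combinations(sorted(shared, key=pos_a.get), 2):
        if pos_b[x] > pos_b[y]:
            swaps.append((x, y))
    return swaps


# The example from the abstract: X and Y switch their relative order.
print(find_swaps(["a", "X", "b", "Y", "c"], ["d", "Y", "e", "X", "f"]))
```

Dropping repeated domains avoids the false positives the abstract warns about, at the cost of missing genuine swaps that involve a repeated domain.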

9.
A procedure for detecting structural domains in proteins.
A procedure is described for detecting domains in proteins of known structure. The method is based on the intuitively simple idea that each domain should contain an identifiable hydrophobic core. By applying the algorithm described in the companion paper (Swindells MB, 1995, Protein Sci 4:93-102) to identify distinct cores in multi-domain proteins, one can use this information to determine both the number and the location of the constituent domains. Tests have shown the procedure to be effective on a number of examples, even when the domains are discontinuous along the sequence. However, deficiencies also occur when hydrophobic cores from different domains continue through the interface region and join one another.

10.
WW and SH3 domains, two different scaffolds to recognize proline-rich ligands
WW domains are small protein modules composed of approximately 40 amino acids. These domains fold as a stable, triple stranded beta-sheet and recognize proline-containing ligands. WW domains are found in many different signaling and structural proteins, often localized in the cytoplasm as well as in the cell nucleus. Based on analyses of seven structures of WW domains, we discuss their diverse binding preferences and sequence conservation patterns. While modeling WW domains for which structures have not been determined we uncovered a case of potential molecular and functional convergence between WW and SH3 domains. The binding surface of the modeled WW domain of Npw38 protein shows a remarkable similarity to the SH3 domain of Sem5 protein, confirming biochemical data on similar binding predilections of both domains.

11.
There is a limited repertoire of domain families that are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products, and at the level of genes, duplication, recombination, fusion and fission are the processes that produce new genes. We attempt to gain an overview of these processes by studying the evolutionary units in proteins, domains, in the protein sequences of 40 genomes. The domain and superfamily definitions in the Structural Classification of Proteins Database are used, so that we can view all pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 783 out of the 859 superfamilies in SCOP in these genomes, and the 783 families occur in 1307 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour; 209 families do not make combinations with other families. This type of pattern can be described as a scale-free network. We also study the N to C-terminal orientation of domain pairs and domain repeats. The phylogenetic distribution of domain combinations is surveyed, to establish the extent of common and kingdom-specific combinations. Of the kingdom-specific combinations, significantly more combinations consist of families present in all three kingdoms than of families present in one or two kingdoms. Hence, we are led to conclude that recombination between common families, as compared to the invention of new families and recombination among these, has also been a major contribution to the evolution of kingdom-specific and species-specific functions in organisms in all three kingdoms. Finally, we compare the set of the domain combinations in the genomes to those in the RCSB Protein Data Bank, and discuss the implications for structural genomics.
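
The counting behind statements such as "the 783 families occur in 1307 pairwise combinations" can be sketched as follows. This is only an illustration under stated assumptions: proteins are given as lists of superfamily labels in sequence order, pairs are treated as unordered (the abstract studies N- to C-terminal orientation separately), and adjacent repeats of the same superfamily are not counted as combinations. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def adjacent_combinations(proteins):
    """Map each superfamily to the set of other superfamilies adjacent to it.

    proteins: iterable of lists of superfamily labels in sequence order.
    """
    partners = defaultdict(set)
    for domains in proteins:
        for label in domains:
            partners[label]  # register the family even if it has no partner
        for left, right in zip(domains, domains[1:]):
            if left != right:  # adjacent repeats are not combinations here
                partners[left].add(right)
                partners[right].add(left)
    return dict(partners)


# Toy genome, invented for illustration.
toy = [["A", "B"], ["B", "C"], ["A", "A"], ["D"]]
partners = adjacent_combinations(toy)

# Distinct unordered pairwise combinations.
pairs = {frozenset((f, p)) for f, ps in partners.items() for p in ps}
# Families that never combine with another family.
solitary = sorted(f for f, ps in partners.items() if not ps)

print(len(pairs), solitary)
```

The size of each partner set is the family's degree in the combination network, so a histogram of those sizes is what one would inspect for the scale-free pattern the abstract describes.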

12.
Protein domains exist by themselves or in combination with other domains to form complex multidomain proteins. Defining domain boundaries in proteins is essential for understanding their evolution and function but is not trivial. More specifically, partitioning domains that interact by forming a single β-sheet is known to be particularly troublesome for automatic structure-based domain decomposition pipelines. Here, we study edge-to-edge β-strand interactions between domains in a protein chain, to help define the boundaries for some more difficult cases where a single β-sheet spanning over two domains gives an appearance of one. We give a number of examples where β-strands belonging to a single β-sheet do not belong to a single domain and highlight the difficulties of automatic domain parsers on these examples. This work can be used as a baseline for defining domain boundaries in homologous proteins or proteins with similar domain interactions in the future.

13.
Fibrinogen‐related domains (FReDs) are found in a variety of animal proteins with widely different functions, ranging from non‐self recognition to clot formation. All appear to have a common surface where binding of one sort or other occurs. An examination of 19 completed animal genomes (including a sponge and sea anemone, six protostomes, and 11 deuterostomes) has allowed phylogenies to be constructed that show where various types of FReP (proteins containing FReDs) first made their appearance. Comparisons of sequences and structures also reveal particular features that correlate with function, including the influence of neighbor‐domains. A particular set of insertions in the carboxyl‐terminal subdomain was involved in the transition from structures known to bind sugars to those known to bind amino‐terminal peptides. Perhaps not unexpectedly, FReDs with different functions have changed at different rates, with ficolins by far the fastest changing group. Significantly, the greatest amount of change in ficolin FReDs occurs in the third subdomain (“P domain”), the very opposite of the situation in most other vertebrate FReDs. The unbalanced style of change was also observed in FReDs from non‐chordates, many of which have been implicated in innate immunity.

14.
Toward consistent assignment of structural domains in proteins
The assignment of protein domains from three-dimensional structure is critically important in understanding protein evolution and function, yet little quality assurance has been performed. Here, the differences in the assignment of structural domains are evaluated using six common assignment methods. Three human expert methods (AUTHORS (authors' annotation), CATH and SCOP) and three fully automated methods (DALI, DomainParser and PDP) are investigated by analysis of individual methods against the authors' assignment as well as analysis based on the consensus among groups of methods (only expert, only automatic, combined). The results demonstrate that caution is recommended in using current domain assignments, and indicate where additional work is needed. Specifically, the major factors responsible for conflicting domain assignments between methods, both expert and automatic, are: (1) the definition of very small domains; (2) splitting secondary structures between domains; (3) the size and number of discontinuous domains; (4) closely packed or convoluted domain-domain interfaces; (5) structures with large and complex architectures; and (6) the level of significance placed upon structural, functional and evolutionary concepts in considering structural domain definitions. A web-based resource that focuses on the results of benchmarking and the analysis of domain assignments is available at

15.
Many proteins consist of subdomains that can fold and function independently. We investigate here the interaction between the two high mobility group (HMG) box subdomains of the nuclear protein rHMG1. An HMG box is a conserved amino acid sequence of approximately 80 amino acids rich in basic, aromatic and proline side chains that is active in binding DNA in a sequence or structure-specific manner. In the case of HMG1, each box can bind structural DNA substrates including four-way junctions (4WJs) and branched or kinked DNA duplexes. Since proteins containing up to six HMG boxes are known, the question arises whether linking subdomains together influences the folding or function of individual boxes. In an effort to understand interactions between individual DNA-binding domains in HMG1, we created new fusion proteins: one is an inversion of the order of the AB di-domain in HMG1 (BA); in the second, we added a third A domain C-terminal to the AB di-domain (ABA). Pairs of boxes, AB or BA, behave similarly and are functionally active. By contrast, the ABA triple subdomain construct is partially unfolded and is less active than individual boxes or di-domains. Thus, long-range inter-domain effects can influence the activity of HMG boxes.

16.

Background

As tertiary structure is currently available only for a fraction of known protein families, it is important to assess what parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structure and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomic initiatives (SGI) provide such a sample? What are approximate total numbers of structure-based superfamilies and folds among soluble globular domains?

Results

To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring dynamics of the assigned structure set in time, with the accumulation of experimentally solved structures. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias suggesting that target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To account for the observed bias, we propose a new non-parametric approach to the estimation of the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database.

Conclusion

The set of currently solved protein structures allows for structure prediction in approximately a third of sequence-based domain families. The choice of targets for structure determination is biased towards domains with many sequence-based homologs. The growing SGI output in the future should further contribute to the reduction of this bias. The total numbers of structural superfamilies and folds in the COG database are estimated as ~4000 and ~1700. These numbers are respectively four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.

17.
An algorithm is presented for the fast and accurate definition of protein structural domains from coordinate data without prior knowledge of the number or type of domains. The algorithm explicitly locates domains that comprise one or two continuous segments of protein chain. Domains that include more than two segments are also located. The algorithm was applied to a nonredundant database of 230 protein structures and the results compared to domain definitions obtained from the literature, or by inspection of the coordinates on molecular graphics. For 70% of the proteins, the derived domains agree with the reference definitions, 18% show minor differences and only 12% (28 proteins) show very different definitions. Three screens were applied to identify the derived domains least likely to agree with the subjective definition set. These screens revealed a set of 173 proteins, 97% of which agree well with the subjective definitions. The algorithm represents a practical domain identification tool that can be run routinely on the entire structural database. Adjustment of parameters also allows smaller compact units to be identified in proteins.

18.
MOTIVATION: A major goal in structural genomics is to enrich the catalogue of proteins whose 3D structures are known. In an attempt to address this problem we mapped over 10 000 proteins with solved structures onto a graph of all Swissprot protein sequences (release 36, approximately 73 000 proteins) provided by ProtoMap, with the goal of sorting proteins according to their likelihood of belonging to new superfamilies. We hypothesized that proteins within neighbouring clusters tend to share common structural superfamilies or folds. If true, the likelihood of finding new superfamilies increases in clusters that are distal from other solved structures within the graph. RESULTS: We defined an order relation between unsolved proteins according to their 'distance' from solved structures in the graph, and sorted approximately 48 000 proteins. Our list can be partitioned into three groups: approximately 35 000 proteins sharing a cluster with at least one known structure; approximately 6500 proteins in clusters with no solved structure but with neighbouring clusters containing known structures; and a third group containing the rest of the proteins, approximately 6100 (in 1274 clusters). We tested the quality of the order relation using thousands of recently solved structures that were not included when the order was defined. The tests show that our order is significantly better (P-value approximately 10^-5) than a random order. More interestingly, the order within the union of the second and third groups, and the order within the third group alone, perform better than random (P-values: 0.0008 and 0.15, respectively) and are better than alternative orders created using PSI-BLAST. Herein, we present a method for selecting targets to be used in structural genomics projects. AVAILABILITY: List of proteins to be used for target selection combined with a set of biological filters for narrowing down potential targets is in http://www.protarget.cs.huji.ac.il.
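
The idea of ordering unsolved proteins by their 'distance' from solved structures in a cluster graph can be sketched with a multi-source breadth-first search. The abstract does not define the distance measure, so the hop count between clusters is an assumption, as are the function names and the toy graph; this is not the ProtoMap-based procedure itself.

```python
from collections import deque


def distance_from_solved(graph, solved):
    """Hops from each reachable cluster to the nearest solved cluster.

    graph: dict mapping a cluster id to a list of neighbouring cluster ids.
    solved: set of cluster ids containing at least one known structure.
    """
    dist = {c: 0 for c in solved}
    queue = deque(solved)
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist


def rank_targets(graph, solved):
    """Unsolved clusters, most distant from any solved cluster first.

    Clusters with no path to a solved cluster are treated as infinitely
    distant and therefore rank first; ties are broken by cluster id.
    """
    dist = distance_from_solved(graph, solved)
    unsolved = [c for c in graph if c not in solved]
    return sorted(unsolved, key=lambda c: (-dist.get(c, float("inf")), c))


# Toy graph, invented for illustration: c1 holds a solved structure.
toy_graph = {"c1": ["c2"], "c2": ["c1", "c3"], "c3": ["c2"], "c4": []}
print(rank_targets(toy_graph, {"c1"}))
```

In this toy case, distance 0 corresponds to the first group in the abstract (sharing a cluster with a known structure), distance 1 to the second group (a neighbouring cluster has a known structure), and larger or infinite distances to the third.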

19.
The extracellular regions of many cell surface proteins of the immune system contain distinct domains that may be linked in many different ways and are often only loosely tethered to the transmembrane segment. In efforts to identify regions critical for binding, molecular models of these domains are used to select residues for mutagenesis and to map binding sites. Many immune cell surface proteins belong to protein superfamilies and display only limited sequence identity compared to proteins of known three-dimensional (3D) structure, often 30% or less. Therefore, detailed 3D structures are difficult to predict, and structure-based sequence analysis and model assessment are particularly important components of the model building process. In some cases, experimentally determined structures have made it possible to assess the accuracy of predictions, which illustrates the opportunities and shortcomings of the approach. Herein the model-based identification of binding sites in cell surface proteins is described and representative examples are discussed.

20.
The immunoglobulin superfamily (IgSF) is a heterogenic group of proteins built on a common fold, called the Ig fold, which is a sandwich of two β sheets. Although members of the IgSF share a similar Ig fold, they differ in their tissue distribution, amino acid composition, and biological role. In this paper we report an up-to-date compilation of the IgSF where all known members of the IgSF are classified on the basis of their common functional role (immune system, antibiotic proteins, enzymes, cytokine receptors, etc.) and their distribution in tissue (neural system, extracellular matrix, tumor marker, muscular proteins, etc.), or in species (vertebrates, invertebrates, bacteria, viruses, fungi, and plants). The members of the family can contain one or many Ig domains, comprising two basic types: the constant domain (C), with seven strands, and the variable domain (V), with eight, nine, or ten strands. The different overviews of the IgSF led to the definition of new domain subtypes, mainly concerning the C type, based on the distribution of strands within the two sheets. The wide occurrence of the Ig fold and the much less conserved sequences could have developed from a common ancestral gene and/or from a convergent evolutionary process. Cell adhesion and pattern recognition seem to be the common feature running through the entire family. Received: 4 June 1997 / Accepted: 15 September 1997
