首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
A unifold, mesofold, and superfold model of protein fold use.   总被引:4,自引:0,他引:4  
As more and more protein structures are determined, there is increasing interest in the question of how many different folds have been used in biology. The history of the rate of discovery of new folds and the distribution of sequence families among known folds provide a means of estimating the underlying distribution of fold use. Previous models exploiting these data have led to rather different conclusions on the total number of folds. We present a new model, based on the notion that the folds used in biology fall naturally into three classes: unifolds, that is, folds found only in a single narrow sequence family; mesofolds, found in an intermediate number of families; and the previously noted superfolds, found in many protein families. We show that this model fits the available data well and has predicted the development of SCOP over the past 2 years. The principle implications of the model are as follows: (1) The vast majority of folds will be found in only a single sequence family; (2) the total number of folds is at least 10,000; and (3) 80% of sequence families have one of about 400 folds, most of which are already known.  相似文献   

2.
Leonov H  Mitchell JS  Arkin IT 《Proteins》2003,51(3):352-359
The estimation of the number of protein folds in nature is a matter of considerable interest. In this study, a Monte Carlo method employing the broken stick model is used to assign a given number of proteins into a given number of folds. Subsequently, random, integer, non-repeating numbers are generated in order to simulate the process of fold discovery. With this conceptual framework at hand, the effects of two factors upon the fold identification process were investigated: (1) the nature of folds distributions and (2) preferential sampling bias of previously identified folds. Depending on the type of distribution, dividing 100,000 proteins into 1,000 folds resulted in 10-30% of the folds having 10 proteins or less per fold, approximately 10% of the folds having 10-20 proteins per fold, 31-45% having 20-100 proteins per fold, and >30% of the folds having more than 100 proteins per fold. After randomly sampling one tenth of the proteins, 68-96% of the folds were identified. These percentages depend both on folds distribution and biased/non-biased sampling. Only upon increasing the sampling bias for previously identified folds to 1,000, did the model result in a reduction of the number of proteins identified by an order of magnitude (approximately 9%). Thus, assuming the structures of one tenth of the population of proteins in nature have been solved, the results of the Monte Carlo simulation are more consistent with recent lower estimates of the number of folds, 相似文献   

3.
It is well known that the structure is currently available only for a small fraction of known protein sequences. It is urgent to discover the important features of known protein sequences based on present protein structures. Here, we report a study on the size distribution of protein families within different types of folds. The fold of a protein means the global arrangement of its main secondary structures, both in terms of their relative orientations and their topological connections, which specify a certain biochemical and biophysical aspect. We first search protein families in the structural database SCOP against the sequence-based database Pfam, and acquire a pool of corresponding Pfam families whose structures can be deemed as known. This pool of Pfam families is called the sample space for short. Then the size distributions of protein families involving the sample space, the Pfam database and the SCOP database are obtained. The results indicate that the size distributions of protein families under different kinds of folds abide by similar power-law. Specially, the largest families scatter evenly in different kinds of folds. This may help better understand the relationship of protein sequence, structure and function. We also show that the total of proteins with known structures can be considered a random sample from the whole space of protein sequences, which is an essential but unsettled assumption for related predictions, such as, estimating the number of protein folds in nature. Finally we conclude that about 2957 folds are needed to cover the total Pfam families by a simple method.  相似文献   

4.
Abeln S  Deane CM 《Proteins》2005,60(4):690-700
We review fold usage on completed genomes to explore protein structure evolution. The patterns of presence or absence of folds on genomes gives us insights into the relationships between folds, the age of different folds and how we have arrived at the set of folds we see today. We examine the relationships between different measures which describe protein fold usage, such as the number of copies of a fold per genome, the number of families per fold, and the number of genomes a fold occurs on. We obtained these measures of fold usage by searching for the structural domains on 157 completed genome sequences from all three kingdoms of life. In our comparisons of these measures we found that bacteria have relatively more distinct folds on their genomes than archaea. Eukaryotes were found to have many more copies of a fold on their genomes. If we separate out the different fold classes, the alpha/beta class has relatively fewer distinct folds on large genomes, more copies of a fold on bacteria and more folds occurring in all three kingdoms simultaneously. These results possibly indicate that most alpha/beta folds originated earlier than other folds. The expected power law distribution is observed for copies of a fold per genome and we found a similar distribution for the number of families per fold. However, a more complicated distribution appears for fold occurrence across genomes, which strongly depends on fold class and kingdom. We also show that there is not a clear relationship between the three measures of fold usage. A fold which occurs on many genomes does not necessarily have many copies on each genome. Similarly, folds with many copies do not necessarily have many families or vice versa.  相似文献   

5.
Garma L  Mukherjee S  Mitra P  Zhang Y 《PloS one》2012,7(6):e38913
"Protein quaternary structure universe" refers to the ensemble of all protein-protein complexes across all organisms in nature. The number of quaternary folds thus corresponds to the number of ways proteins physically interact with other proteins. This study focuses on answering two basic questions: Whether the number of protein-protein interactions is limited and, if yes, how many different quaternary folds exist in nature. By all-to-all sequence and structure comparisons, we grouped the protein complexes in the protein data bank (PDB) into 3,629 families and 1,761 folds. A statistical model was introduced to obtain the quantitative relation between the numbers of quaternary families and quaternary folds in nature. The total number of possible protein-protein interactions was estimated around 4,000, which indicates that the current protein repository contains only 42% of quaternary folds in nature and a full coverage needs approximately a quarter century of experimental effort. The results have important implications to the protein complex structural modeling and the structure genomics of protein-protein interactions.  相似文献   

6.
Animal toxins are small proteins built on the basis of a few disulfide bonded frameworks. Because of their high variability in sequence and biologic function, these proteins are now used as templates for protein engineering. Here we report the extensive characterization of the structure and dynamics of two toxin folds, the "three-finger" fold and the short alpha/beta scorpion fold found in snake and scorpion venoms, respectively. These two folds have a very different architecture; the short alpha/beta scorpion fold is highly compact, whereas the "three-finger" fold is a beta structure presenting large flexible loops. First, the crystal structure of the snake toxin alpha was solved at 1.8-A resolution. Then, long molecular dynamics simulations (10 ns) in water boxes of the snake toxin alpha and the scorpion charybdotoxin were performed, starting either from the crystal or the solution structure. For both proteins, the crystal structure is stabilized by more hydrogen bonds than the solution structure, and the trajectory starting from the X-ray structure is more stable than the trajectory started from the NMR structure. The trajectories started from the X-ray structure are in agreement with the experimental NMR and X-ray data about the protein dynamics. Both proteins exhibit fast motions with an amplitude correlated to their secondary structure. In contrast, slower motions are essentially only observed in toxin alpha. The regions submitted to rare motions during the simulations are those that exhibit millisecond time-scale motions. Lastly, the structural variations within each fold family are described. The localization and the amplitude of these variations suggest that the regions presenting large-scale motions should be those tolerant to large insertions or deletions.  相似文献   

7.
Progress towards mapping the universe of protein folds   总被引:1,自引:0,他引:1       下载免费PDF全文
Although the precise aims differ between the various international structural genomics initiatives currently aiming to illuminate the universe of protein folds, many selectively target protein families for which the fold is unknown. How well can the current set of known protein families and folds be used to estimate the total number of folds in nature, and will structural genomics initiatives yield representatives for all the major protein families within a reasonable time scale?  相似文献   

8.
Fold designability has been estimated by the number of families contained in that fold. Here, we show that among orthologous proteins, sequence divergence is higher for folds with greater numbers of families. Folds with greater numbers of families also tend to have families that appear more often in the proteome and greater promiscuity (the number of unique “partner” folds that the fold is found with within the same protein). We also find that many disease-related proteins have folds with relatively few families. In particular, a number of these proteins are associated with diseases occurring at high frequency. These results suggest that family counts reflect how certain structures are distributed in nature and is an important characteristic associated with many human diseases.  相似文献   

9.
Using the data on proteins encoded in complete genomes, combined with a rigorous theory of the sampling process, we estimate the total number of protein folds and families, as well as the number of folds and families in each genome. The total number of folds in globular, water- soluble proteins is estimated at about 1000, with structural information currently available for about one-third of the number. The sequenced genomes of unicellular organisms encode from approximately 25%, for the minimal genomes of the Mycoplasmas, to 70-80% for larger genomes, such as Escherichia coli and yeast, of the total number of folds. The number of protein families with significant sequence conservation was estimated to be between 4000 and 7000, with structures available for about 20% of these.  相似文献   

10.
One of the goals of structural genomics is to obtain a structural representative of almost every fold in nature. A recent estimate suggests that 70%-80% of soluble protein domains identified in the first 1000 genome sequences should be covered by about 25,000 structures-a reasonably achievable goal. As no current estimates exist for the number of membrane protein families, however, it is not possible to know whether family coverage is a realistic goal for membrane proteins. Here we find that virtually all polytopic helical membrane protein families are present in the already known sequences so we can make an estimate of the total number of families. We find that only approximately 700 polytopic membrane protein families account for 80% of structured residues and approximately 1700 cover 90% of structured residues. While apparently a finite and reachable goal, we estimate that it will likely take more than three decades to obtain the structures needed for 90% residue coverage, if current trends continue.  相似文献   

11.
Recent progress in structure determination techniques has led to a significant growth in the number of known membrane protein structures, and the first structural genomics projects focusing on membrane proteins have been initiated, warranting an investigation of appropriate bioinformatics strategies for optimal structural target selection for these molecules. What determines a membrane protein fold? How many membrane structures need to be solved to provide sufficient structural coverage of the membrane protein sequence space? We present the CAMPS database (Computational Analysis of the Membrane Protein Space) containing almost 45,000 proteins with three or more predicted transmembrane helices (TMH) from 120 bacterial species. This large set of membrane proteins was subjected to single‐linkage clustering using only sequence alignments covering at least 40% of the TMH present in a given family. This process yielded 266 sequence clusters with at least 15 members, roughly corresponding to membrane structural folds, sufficiently structurally homogeneous in terms of the variation of TMH number between individual sequences. These clusters were further subdivided into functionally homogeneous subclusters according to the COG (Clusters of Orthologous Groups) system as well as more stringently defined families sharing at least 30% identity. The CAMPS sequence clusters are thus designed to reflect three main levels of interest for structural genomics: fold, function, and modeling distance. We present a library of Hidden Markov Models (HMM) derived from sequence alignments of TMH at these three levels of sequence similarity. Given that 24 out of 266 clusters corresponding to membrane folds already have associated known structures, we estimate that 242 additional new structures, one for each remaining cluster, would provide structural coverage at the fold level of roughly 70% of prokaryotic membrane proteins belonging to the currently most populated families. Proteins 2006. © 2006 Wiley‐Liss, Inc.  相似文献   

12.
Recognition of protein fold from amino acid sequence is a challenging task. The structure and stability of proteins from different fold are mainly dictated by inter-residue interactions. In our earlier work, we have successfully used the medium- and long-range contacts for predicting the protein folding rates, discriminating globular and membrane proteins and for distinguishing protein structural classes. In this work, we analyze the role of inter-residue interactions in commonly occurring folds of globular proteins in order to understand their folding mechanisms. In the medium-range contacts, the globin fold and four-helical bundle proteins have more contacts than that of DNA-RNA fold although they all belong to all-alpha class. In long-range contacts, only the ribonuclease fold prefers 4-10 range and the other folding types prefer the range 21-30 in alpha/beta class proteins. Further, the preferred residues and residue pairs influenced by these different folds are discussed. The information about the preference of medium- and long-range contacts exhibited by the 20 amino acid residues can be effectively used to predict the folding type of each protein.  相似文献   

13.
Novotny M  Madsen D  Kleywegt GJ 《Proteins》2004,54(2):260-270
When a new protein structure has been determined, comparison with the database of known structures enables classification of its fold as new or belonging to a known class of proteins. This in turn may provide clues about the function of the protein. A large number of fold comparison programs have been developed, but they have never been subjected to a comprehensive and critical comparative analysis. Here we describe an evaluation of 11 publicly available, Web-based servers for automatic fold comparison. Both their functionality (e.g., user interface, presentation, and annotation of results) and their performance (i.e., how well established structural similarities are recognized) were assessed. The servers were subjected to a battery of performance tests covering a broad spectrum of folds as well as special cases, such as multidomain proteins, Calpha-only models, new folds, and NMR-based models. The CATH structural classification system was used as a reference. These tests revealed the strong and weak sides of each server. On the whole, CE, DALI, MATRAS, and VAST showed the best performance, but none of the servers achieved a 100% success rate. Where no structurally similar proteins are found by any individual server, it is recommended to try one or two other servers before any conclusions concerning the novelty of a fold are put on paper.  相似文献   

14.
Many protein classification systems capture homologous relationships by grouping domains into families and superfamilies on the basis of sequence similarity. Superfamilies with similar 3D structures are further grouped into folds. In the absence of discernable sequence similarity, these structural similarities were long thought to have originated independently, by convergent evolution. However, the growth of databases and advances in sequence comparison methods have led to the discovery of many distant evolutionary relationships that transcend the boundaries of superfamilies and folds. To investigate the contributions of convergent versus divergent evolution in the origin of protein folds, we clustered representative domains of known structure by their sequence similarity, treating them as point masses in a virtual 2D space which attract or repel each other depending on their pairwise sequence similarities. As expected, families in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some links connect superfamilies with different folds. They arise from modular peptide fragments of between 20 and 40 residues that co‐occur in the connected folds in disparate structural contexts. These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in the RNA world and from which the first folded proteins arose by amplification and recombination. Our galaxy of folds summarizes, in a single image, most known and many yet undescribed homologous relationships between protein superfamilies, providing new insights into the evolution of protein domains.  相似文献   

15.
Many seemingly unrelated protein families share common folds. Theoretical models based on structure designability have suggested that a few folds should be very common while many others have low probability. In agreement with the predictions of these models, we show that the distribution of observed protein families over different folds can be modeled with a highly-stretched exponential. Our results suggest that there are approximately 4,000 possible folds, some so unlikely that only approximately 2,000 folds existing among naturally-occurring proteins. Due to the large number of extremely rare folds, constructing a comprehensive database of all existent folds would be difficult. Constructing a database of the most-likely folds representing the vast majority of protein families would be considerably easier.  相似文献   

16.
Disulfide-rich domains are small protein domains whose global folds are stabilized primarily by the formation of disulfide bonds and, to a much lesser extent, by secondary structure and hydrophobic interactions. Disulfide-rich domains perform a wide variety of roles functioning as growth factors, toxins, enzyme inhibitors, hormones, pheromones, allergens, etc. These domains are commonly found both as independent (single-domain) proteins and as domains within larger polypeptides. Here, we present a comprehensive structural classification of approximately 3000 small, disulfide-rich protein domains. We find that these domains can be arranged into 41 fold groups on the basis of structural similarity. Our fold groups, which describe broader structural relationships than existing groupings of these domains, bring together representatives with previously unacknowledged similarities; 18 of the 41 fold groups include domains from several SCOP folds. Within the fold groups, the domains are assembled into families of homologs. We define 98 families of disulfide-rich domains, some of which include newly detected homologs, particularly among knottin-like domains. On the basis of this classification, we have examined cases of convergent and divergent evolution of functions performed by disulfide-rich proteins. Disulfide bonding patterns in these domains are also evaluated. Reducible disulfide bonding patterns are much less frequent, while symmetric disulfide bonding patterns are more common than expected from random considerations. Examples of variations in disulfide bonding patterns found within families and fold groups are discussed.  相似文献   

17.
Structure is only the first step in understanding the interactions and functions of proteins. In this paper, we explore the flexibility of proteins across a broad database of over 250 solvated protein molecular dynamics simulations in water for an aggregate simulation time of approximately 6 micros. These simulations are from our Dynameomics project, and these proteins represent approximately 75% of all known protein structures. We employ principal component analysis of the atomic coordinates over time to determine the primary axis and magnitude of the flexibility of each atom in a simulation. This technique gives us both a database of flexibility for many protein fold families and a compact visual representation of a particular protein's native-state conformational space, neither of which are available using experimental methods alone. These tools allow us to better understand the nature of protein motion and to describe its relationship to other structural and dynamical characteristics. In addition to reporting general properties of protein flexibility and detailing many dynamic motifs, we characterize the relationship between protein native-state flexibility and early events in thermal unfolding and show that flexibility predicts how a protein will begin to unfold. We provide evidence that fold families have conserved flexibility patterns, and family members who deviate from the conserved patterns have very low sequence identity. Finally, we examine novel aspects of highly inflexible loops that are as important to structural integrity as conventional secondary structure. These loops, which are difficult if not impossible to locate without dynamic data, may constitute new structural motifs.  相似文献   

18.
Targeting of proteins for structure determination in structural genomic programs often includes the use of threading and fold recognition methods to exclude proteins belonging to well-populated fold families, but such methods can still fail to recognize preexisting folds. The authors illustrate here a method in which limited amounts of structural data are used to improve an initial homology search and the data are subsequently used to produce a structure by data-constrained refinement of an identified structural template. The data used are primarily NMR-based residual dipolar couplings, but they also include additional chemical shift and backbone-nuclear Overhauser effect data. Using this methodology, a backbone structure was efficiently produced for a 10 kDa protein (PF1455) from Pyrococcus furiosus. Its relationship to existing structures and its probable function are discussed.  相似文献   

19.
To understand the molecular basis of glycosyltransferases' (GTFs) catalytic mechanism, extensive structural information is required. Here, fold recognition methods were employed to assign 3D protein shapes (folds) to the currently known GTF sequences, available in public databases such as GenBank and Swissprot. First, GTF sequences were retrieved and classified into clusters, based on sequence similarity only. Intracluster sequence similarity was chosen sufficiently high to ensure that the same fold is found within a given cluster. Then, a representative sequence from each cluster was selected to compose a subset of GTF sequences. The members of this reduced set were processed by three different fold recognition methods: 3D-PSSM, FUGUE, and GeneFold. Finally, the results from different fold recognition methods were analyzed and compared to sequence-similarity search methods (i.e., BLAST and PSI-BLAST). It was established that the folds of about 70% of all currently known GTF sequences can be confidently assigned by fold recognition methods, a value which is higher than the fold identification rate based on sequence comparison alone (48% for BLAST and 64% for PSI-BLAST). The identified folds were submitted to 3D clustering, and we found that most of the GTF sequences adopt the typical GTF A or GTF B folds. Our results indicate a lack of evidence that new GTF folds (i.e., folds other than GTF A and B) exist. Based on cases where fold identification was not possible, we suggest several sequences as the most promising targets for a structural genomics initiative focused on the GTF protein family.  相似文献   

20.
Kinases are a ubiquitous group of enzymes that catalyze the phosphoryl transfer reaction from a phosphate donor (usually ATP) to a receptor substrate. Although all kinases catalyze essentially the same phosphoryl transfer reaction, they display remarkable diversity in their substrate specificity, structure, and the pathways in which they participate. In order to learn the relationship between structural fold and functional specificities in kinases, we have done a comprehensive survey of all available kinase sequences (>17,000) and classified them into 30 distinct families based on sequence similarities. Of these families, 19, covering nearly 98% of all sequences, fall into seven general structural folds for which three-dimensional structures are known. These fold groups include some of the most widespread protein folds, such as Rossmann fold, ferredoxin fold, ribonuclease H fold, and TIM beta/alpha-barrel. On the basis of this classification system, we examined the shared substrate binding and catalytic mechanisms as well as variations of these mechanisms in the same fold groups. Cases of convergent evolution of identical kinase activities occurring in different folds are discussed.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号