Similar articles
 20 similar articles found (search time: 31 ms)
1.
Liu X  Fan K  Wang W 《Proteins》2004,54(3):491-499
Currently, of the 10⁶ known protein sequences, only about 10⁴ structures have been solved. Based on homologies and similarities, proteins are grouped into different families in which each has a structural prototype, namely, the fold, and some share the same folds. However, the total number of folds and families, and furthermore, the distribution of folds over families in nature, are still an enigma. Here, we report a study on the distribution of folds over families and the total number of folds in nature, using a maximum probability principle and the moment method of estimation. A quadratic relation between the numbers of families and folds is found for the number of families in an interval from 6000 to 30,000. For example, about 2700 folds for 23,100 families are obtained, among them about 33 superfolds, including more than 100 families each, and the largest superfold comprises about 800 families. Our results suggest that although the majority of folds have only a single family per fold, a considerably larger number of folds include many more families each than in the database, and the distribution of folds over families in nature differs markedly from the sampled distribution. The long tail of fold distribution is first estimated in this article. The results fit the data for different versions of the structural classification of proteins (SCOP) excellently, and the goodness-of-fit tests strongly support the results. In addition, the method of directly "enlarging" the sample to the population may be useful in inferring distributions of species in different fields.

2.
In this work we develop a microscopic physical model of early evolution where phenotype—organism life expectancy—is directly related to genotype—the stability of its proteins in their native conformations—which can be determined exactly in the model. Simulating the model on a computer, we consistently observe the “Big Bang” scenario whereby exponential population growth ensues as soon as favorable sequence–structure combinations (precursors of stable proteins) are discovered. Upon that, random diversity of the structural space abruptly collapses into a small set of preferred proteins. We observe that protein folds remain stable and abundant in the population at timescales much greater than mutation or organism lifetime, and the distribution of the lifetimes of dominant folds in a population approximately follows a power law. The separation of evolutionary timescales between discovery of new folds and generation of new sequences gives rise to emergence of protein families and superfamilies whose sizes are power-law distributed, closely matching the same distributions for real proteins. On the population level we observe emergence of species—subpopulations that carry similar genomes. Further, we present a simple theory that relates stability of evolving proteins to the sizes of emerging genomes. Together, these results provide a microscopic first-principles picture of how first-gene families developed in the course of early evolution.

3.
A major challenge in designing proteins de novo to bind user-defined ligands with high affinity is finding backbone structures into which a new binding site geometry can be engineered with high precision. Recent advances in methods to generate protein fold families de novo have expanded the space of accessible protein structures, but it is not clear to what extent de novo proteins with diverse geometries also expand the space of designable ligand binding functions. We constructed a library of 25,806 high-quality ligand binding sites and developed a fast protocol to place (“match”) these binding sites into both naturally occurring and de novo protein families with two fold topologies: Rossmann and NTF2. Each matching step involves engineering new binding site residues into each protein “scaffold”, which is distinct from the problem of comparing already existing binding pockets. 5,896 and 7,475 binding sites could be matched to the Rossmann and NTF2 fold families, respectively. De novo designed Rossmann and NTF2 protein families can support 1,791 and 678 binding sites that cannot be matched to naturally existing structures with the same topologies, respectively. While the number of protein residues in ligand binding sites is the major determinant of matching success, ligand size and primary sequence separation of binding site residues also play important roles. The numbers of matched binding sites are power-law functions of the number of members in a fold family. Our results suggest that de novo sampling of geometric variations on diverse fold topologies can significantly expand the space of designable ligand binding sites for a wealth of possible new protein functions.

4.
Abeln S  Deane CM 《Proteins》2005,60(4):690-700
We review fold usage on completed genomes to explore protein structure evolution. The patterns of presence or absence of folds on genomes give us insights into the relationships between folds, the age of different folds and how we have arrived at the set of folds we see today. We examine the relationships between different measures which describe protein fold usage, such as the number of copies of a fold per genome, the number of families per fold, and the number of genomes a fold occurs on. We obtained these measures of fold usage by searching for the structural domains on 157 completed genome sequences from all three kingdoms of life. In our comparisons of these measures we found that bacteria have relatively more distinct folds on their genomes than archaea. Eukaryotes were found to have many more copies of a fold on their genomes. If we separate out the different fold classes, the alpha/beta class has relatively fewer distinct folds on large genomes, more copies of a fold on bacteria and more folds occurring in all three kingdoms simultaneously. These results possibly indicate that most alpha/beta folds originated earlier than other folds. The expected power law distribution is observed for copies of a fold per genome and we found a similar distribution for the number of families per fold. However, a more complicated distribution appears for fold occurrence across genomes, which strongly depends on fold class and kingdom. We also show that there is not a clear relationship between the three measures of fold usage. A fold which occurs on many genomes does not necessarily have many copies on each genome. Similarly, folds with many copies do not necessarily have many families or vice versa.

5.
A unifold, mesofold, and superfold model of protein fold use. (Cited 4 times in total: 0 self-citations, 4 citations by others.)
As more and more protein structures are determined, there is increasing interest in the question of how many different folds have been used in biology. The history of the rate of discovery of new folds and the distribution of sequence families among known folds provide a means of estimating the underlying distribution of fold use. Previous models exploiting these data have led to rather different conclusions on the total number of folds. We present a new model, based on the notion that the folds used in biology fall naturally into three classes: unifolds, that is, folds found only in a single narrow sequence family; mesofolds, found in an intermediate number of families; and the previously noted superfolds, found in many protein families. We show that this model fits the available data well and has predicted the development of SCOP over the past 2 years. The principal implications of the model are as follows: (1) The vast majority of folds will be found in only a single sequence family; (2) the total number of folds is at least 10,000; and (3) 80% of sequence families have one of about 400 folds, most of which are already known.

6.
Immunoglobulin heavy chain-binding protein (BiP) is a member of the hsp70 family of chaperones and one of the most abundant proteins in the ER lumen. It is known to interact transiently with many nascent proteins as they enter the ER and more stably with protein subunits produced in stoichiometric excess or with mutant proteins. However, there also exists a large number of secretory pathway proteins that do not apparently interact with BiP. To begin to understand what controls the likelihood that a nascent protein entering the ER will associate with BiP, we have examined the in vivo folding of a murine λI immunoglobulin (Ig) light chain (LC). This LC is composed of two Ig domains that can fold independently of each other and that each possess multiple potential BiP-binding sequences. To detect BiP binding to the LC during folding, we used BiP ATPase mutants, which bind irreversibly to proteins, as “kinetic traps.” Although both the wild-type and mutant BiP clearly associated with the unoxidized variable region domain, we were unable to detect binding of either BiP protein to the constant region domain. A combination of in vivo and in vitro folding studies revealed that the constant domain folds rapidly and stably even in the absence of an intradomain disulfide bond. Thus, the simple presence of a BiP-binding site on a nascent chain does not ensure that BiP will bind and play a role in its folding. Instead, it appears that the rate and stability of protein folding determine whether or not a particular site is recognized, with BiP preferentially binding to proteins that fold slowly or somewhat unstably.

7.
Extant fold‐switching proteins remodel their secondary structures and change their functions in response to environmental stimuli. These shapeshifting proteins regulate biological processes and are associated with a number of diseases, including tuberculosis, cancer, Alzheimer's, and autoimmune disorders. Thus, predictive methods are needed to identify more fold‐switching proteins, especially since all naturally occurring instances have been discovered by chance. In response to this need, two high‐throughput predictive methods have recently been developed. Here we test them on ORF9b, a newly discovered fold switcher and potential therapeutic target from the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS‐CoV‐2). Promisingly, both methods correctly indicate that ORF9b switches folds. We then tested the same two methods on ORF9b1, the ORF9b homolog from SARS‐CoV‐1. Again, both methods predict that ORF9b1 switches folds, a finding consistent with experimental binding studies. Together, these results (a) demonstrate that protein fold switching can be predicted using high‐throughput computational approaches and (b) suggest that fold switching might be a general characteristic of ORF9b homologs.

8.
A number of structural genomics/proteomics initiatives are focused on bacterial or viral pathogens. In this article, we will review the progress of structural proteomics initiatives targeting the SARS coronavirus (SARS-CoV), the etiological agent of the 2003 worldwide epidemic that culminated in approximately 8,000 cases and 800 deaths. The SARS-CoV genome encodes 28 proteins in three distinct classes, many of them with unknown function and sharing low similarity to other proteins. The structures of 16 SARS-CoV proteins or functional domains have been determined to date. Remarkably, eight of these 16 proteins or functional domains have novel folds, indicating the uniqueness of the coronavirus proteins. The results of SARS-CoV structural proteomics initiatives will have several profound biological impacts, including elucidation of the structure-function relationships of coronavirus proteins; identification of targets for the design of anti-viral compounds against SARS-CoV and other coronaviruses; and addition of new protein folds to the fold space, with further understanding of the structure-function relationships for several new protein families. We discuss the use of structural proteomics in response to emerging infectious diseases such as SARS-CoV and to increase preparedness against future emerging coronaviruses.

9.
Three-dimensional structures have been determined of a large number of proteins characterized by a repetitive fold where each of the repeats (coils) supplies a strand to one or more parallel beta-sheets. Some of these proteins form superfamilies of proteins, which have probably arisen by divergent evolution from a common ancestor. The classical example is the family including four families of pectinases without obviously related primary sequences, the phage P22 tailspike endorhamnosidase, chondroitinase B and possibly pertactin from Bordetella pertussis. These show extensive stacking of similar residues to give aliphatic, aromatic and polar stacks such as the asparagine ladder. This suggests that coils can be added or removed by duplication or deletion of the DNA corresponding to one or more coils and explains how homologous proteins can have different numbers of coils. This process can also account for the evolution of other families of proteins such as the beta-rolls, the leucine-rich repeat proteins, the hexapeptide repeat family, two separate families of beta-helical antifreeze proteins and the spiral folds. These families need not be related to each other but will share features such as relatively untwisted beta-sheets, stacking of similar residues and turns between beta-strands of approximately 90 degrees often stabilized by hydrogen bonding along the direction of the parallel beta-helix. Repetitive folds present special problems in the comparison of structures but offer attractive targets for structure prediction. The stacking of similar residues on a flat parallel beta-sheet may account for the formation of amyloid with beta-strands at right-angles to the fibril axis from many unrelated peptides.

10.
Mark Gerstein 《Proteins》1998,33(4):518-534
Eight microbial genomes are compared in terms of protein structure. Specifically, yeast, H. influenzae, M. genitalium, M. jannaschii, Synechocystis, M. pneumoniae, H. pylori, and E. coli are compared in terms of patterns of fold usage—whether a given fold occurs in a particular organism. Of the ∼340 soluble protein folds currently in the structure databank (PDB), 240 occur in at least one of the eight genomes, and 30 are shared amongst all eight. The shared folds are depleted in all-helical structure and enriched in mixed helix-sheet structure compared to the folds in the PDB. The top-10 most common of the shared 30 are enriched in superfolds, uniting many non-homologous sequence families, and are especially similar in overall architecture—eight having helices packed onto a central sheet. They are also very different from the common folds in the PDB, highlighting databank biases. Folds can be ranked in terms of expression as well as genome duplication. In yeast the top-10 most highly expressed folds are considerably different from the most highly duplicated folds. A tree can be constructed grouping genomes in terms of their shared folds. This has a remarkably similar topology to more conventional classifications, based on very different measures of relatedness. Finally, folds of membrane proteins can be analyzed through transmembrane-helix (TM) prediction. All the genomes appear to have similar usage patterns for these folds, with the occurrence of a particular fold falling off rapidly with increasing numbers of TM-elements, according to a “Zipf-like” law. This implies there are no marked preferences for proteins with particular numbers of TM-helices (e.g. 7-TM) in microbial genomes. Further information pertinent to this analysis is available at http://bioinfo.mbb.yale.edu/genome. Proteins 33:518–534, 1998. © 1998 Wiley-Liss, Inc.

11.
Leonov H  Mitchell JS  Arkin IT 《Proteins》2003,51(3):352-359
The estimation of the number of protein folds in nature is a matter of considerable interest. In this study, a Monte Carlo method employing the broken stick model is used to assign a given number of proteins into a given number of folds. Subsequently, random, integer, non-repeating numbers are generated in order to simulate the process of fold discovery. With this conceptual framework at hand, the effects of two factors upon the fold identification process were investigated: (1) the nature of fold distributions and (2) preferential sampling bias of previously identified folds. Depending on the type of distribution, dividing 100,000 proteins into 1,000 folds resulted in 10-30% of the folds having 10 proteins or fewer per fold, approximately 10% of the folds having 10-20 proteins per fold, 31-45% having 20-100 proteins per fold, and >30% of the folds having more than 100 proteins per fold. After randomly sampling one tenth of the proteins, 68-96% of the folds were identified. These percentages depend both on fold distribution and biased/non-biased sampling. Only upon increasing the sampling bias for previously identified folds to 1,000 did the model result in a reduction of the number of proteins identified by an order of magnitude (approximately 9%). Thus, assuming the structures of one tenth of the population of proteins in nature have been solved, the results of the Monte Carlo simulation are more consistent with recent lower estimates of the number of folds,
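
The kind of simulation this abstract describes can be illustrated with a minimal sketch. It is not the authors' code: it assumes a plain uniform broken-stick split and unbiased sampling only, and the function names `broken_stick_sizes` and `fraction_of_folds_found`, as well as the fixed seed, are hypothetical choices made for illustration. The alternative fold distributions and the sampling-bias factor studied in the paper are not reproduced here.

```python
import random


def broken_stick_sizes(n_proteins, n_folds, rng):
    """Split n_proteins into n_folds non-empty groups at random cut points."""
    # Choosing n_folds - 1 distinct cut points in 1..n_proteins-1 guarantees
    # that every fold receives at least one protein.
    cuts = sorted(rng.sample(range(1, n_proteins), n_folds - 1))
    edges = [0] + cuts + [n_proteins]
    return [b - a for a, b in zip(edges, edges[1:])]


def fraction_of_folds_found(sizes, sample_fraction, rng):
    """Sample proteins uniformly without replacement; return the share of folds seen."""
    # One fold label per protein, so larger folds are proportionally easier to hit.
    labels = [fold for fold, size in enumerate(sizes) for _ in range(size)]
    n_sample = int(len(labels) * sample_fraction)
    seen = {labels[i] for i in rng.sample(range(len(labels)), n_sample)}
    return len(seen) / len(sizes)


rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable
sizes = broken_stick_sizes(100_000, 1_000, rng)
found = fraction_of_folds_found(sizes, 0.1, rng)
print(f"folds identified after sampling one tenth of the proteins: {found:.1%}")
```

Sampling indices without replacement mirrors the "random, integer, non-repeating numbers" mentioned in the abstract; a preferential bias toward already identified folds would require weighted sampling, which this sketch leaves out.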

12.
Holins form pores in the cytoplasmic membranes of bacteria for the primary purpose of releasing endolysins that hydrolyze the cell wall and induce cell death. Holins are encoded within bacteriophage genomes, where they promote cell lysis for virion release, and within bacterial genomes, where they serve a diversity of potential or established functions. These include (i) release of gene transfer agents, (ii) facilitation of programs of differentiation such as those that allow sporulation and spore germination, (iii) contribution to biofilm formation, (iv) promotion of responses to stress conditions, and (v) release of toxins and other proteins. There are currently 58 recognized families of holins and putative holins with members exhibiting between 1 and 4 transmembrane α-helical spanners, but many more families have yet to be discovered. Programmed cell death in animals involves holin-like proteins such as Bax and Bak that may have evolved from bacterial holins. Holin homologues have also been identified in archaea, suggesting that these proteins are ubiquitous throughout the three domains of life. Phage-mediated cell lysis of dual-membrane Gram-negative bacteria also depends on outer membrane-disrupting “spanins” that function independently of, but in conjunction with, holins and endolysins. In this minireview, we provide an overview of their modes of action and the first comprehensive summary of the many currently recognized and postulated functions and uses of these cell lysis systems. It is anticipated that future studies will result in the elucidation of many more such functions and the development of additional applications.

13.
Phylogenomic analysis of the occurrence and abundance of protein domains in proteomes has recently shown that the α/β architecture is probably the oldest fold design. This holds important implications for the origins of biochemistry. Here we explore structure-function relationships addressing the use of chemical mechanisms by ancestral enzymes. We test the hypothesis that the oldest folds used the most mechanisms. We start by tracing biocatalytic mechanisms operating in metabolic enzymes along a phylogenetic timeline of the first appearance of homologous superfamilies of protein domain structures from CATH. A total of 335 enzyme reactions were retrieved from MACiE and were mapped over fold age. We define a mechanistic step type as one of the 51 mechanistic annotations given in MACiE, and each step of each of the 335 mechanisms was described using one or more of these annotations. We find that the first two folds, the P-loop containing nucleotide triphosphate hydrolase and the NAD(P)-binding Rossmann-like homologous superfamilies, were α/β architectures responsible for introducing 35% (18/51) of the known mechanistic step types. We find that these two oldest structures in the phylogenomic analysis of protein domains introduced many mechanistic step types that were later combinatorially spread in catalytic history. The most common mechanistic step types included fundamental building blocks of enzyme chemistry: “Proton transfer,” “Bimolecular nucleophilic addition,” “Bimolecular nucleophilic substitution,” and “Unimolecular elimination by the conjugate base.” They were associated with the most ancestral fold structure typical of P-loop containing nucleotide triphosphate hydrolases. Over half of the mechanistic step types were introduced in the evolutionary timeline before the appearance of structures specific to diversified organisms, during a period of architectural diversification. The other half unfolded gradually after organismal diversification and during a period that spanned ∼2 billion years of evolutionary history.

14.
Garma L  Mukherjee S  Mitra P  Zhang Y 《PloS one》2012,7(6):e38913
"Protein quaternary structure universe" refers to the ensemble of all protein-protein complexes across all organisms in nature. The number of quaternary folds thus corresponds to the number of ways proteins physically interact with other proteins. This study focuses on answering two basic questions: whether the number of protein-protein interactions is limited and, if yes, how many different quaternary folds exist in nature. By all-to-all sequence and structure comparisons, we grouped the protein complexes in the protein data bank (PDB) into 3,629 families and 1,761 folds. A statistical model was introduced to obtain the quantitative relation between the numbers of quaternary families and quaternary folds in nature. The total number of possible protein-protein interactions was estimated at around 4,000, which indicates that the current protein repository contains only 42% of quaternary folds in nature and a full coverage needs approximately a quarter century of experimental effort. The results have important implications for protein complex structural modeling and the structural genomics of protein-protein interactions.

15.
It has been known that topologically different proteins of the same class sometimes share the same spatial arrangement of secondary structure elements (SSEs). However, the frequency by which topologically different structures share the same spatial arrangement of SSEs is unclear. It is important to estimate this frequency because it provides both a deeper understanding of the geometry of protein folds and a valuable suggestion for predicting protein structures with novel folds. Here we clarified the frequency with which protein folds share the same SSE packing arrangement with other folds, the types of spatial arrangement of SSEs that are frequently observed across different folds, and the diversity of protein folds that share the same spatial arrangement of SSEs with a given fold, using a protein structure alignment program MICAN, which we have been developing. By performing comprehensive structural comparison of SCOP fold representatives, we found that approximately 80% of protein folds share the same spatial arrangement of SSEs with other folds. We also observed that many protein pairs that share the same spatial arrangement of SSEs belong to different classes, often with an opposing N- to C-terminal direction of the polypeptide chain. The most frequently observed spatial arrangement of SSEs was the 2-layer α/β packing arrangement and it was dispersed among as many as 27% of SCOP fold representatives. These results suggest that the same spatial arrangements of SSEs are adopted by a wide variety of different folds and that the spatial arrangement of SSEs is highly robust against the N- to C-terminal direction of the polypeptide chain.

16.
17.
Disulfide-rich domains are small protein domains whose global folds are stabilized primarily by the formation of disulfide bonds and, to a much lesser extent, by secondary structure and hydrophobic interactions. Disulfide-rich domains perform a wide variety of roles functioning as growth factors, toxins, enzyme inhibitors, hormones, pheromones, allergens, etc. These domains are commonly found both as independent (single-domain) proteins and as domains within larger polypeptides. Here, we present a comprehensive structural classification of approximately 3000 small, disulfide-rich protein domains. We find that these domains can be arranged into 41 fold groups on the basis of structural similarity. Our fold groups, which describe broader structural relationships than existing groupings of these domains, bring together representatives with previously unacknowledged similarities; 18 of the 41 fold groups include domains from several SCOP folds. Within the fold groups, the domains are assembled into families of homologs. We define 98 families of disulfide-rich domains, some of which include newly detected homologs, particularly among knottin-like domains. On the basis of this classification, we have examined cases of convergent and divergent evolution of functions performed by disulfide-rich proteins. Disulfide bonding patterns in these domains are also evaluated. Reducible disulfide bonding patterns are much less frequent, while symmetric disulfide bonding patterns are more common than expected from random considerations. Examples of variations in disulfide bonding patterns found within families and fold groups are discussed.

18.
Vastly divergent sequences populate a majority of protein folds. In the quest to identify features that are conserved within protein domains belonging to the same fold, we set out to examine the entire protein universe on a fold-by-fold basis. We report that the atomic interaction networks in the solvent-unexposed core of protein domains are fold-conserved, extraordinary sequence divergence notwithstanding. Further, we find that this feature, termed protein core atomic interaction network (or PCAIN), is significantly distinguishable across different folds, thus appearing to be a “signature” of a domain's native fold. As part of this study, we computed the PCAINs for 8698 representative protein domains from families across the 1018 known protein folds to construct our seed database and an automated framework was developed for PCAIN-based characterization of the protein fold universe. A test set of randomly selected domains that are not in the seed database was classified with over 97% accuracy, independent of sequence divergence. As an application of this novel fold signature, a PCAIN-based scoring scheme was developed for comparative (homology-based) structure prediction, with 1–2 angstroms (mean 1.61 Å) Cα RMSD generally observed between computed structures and reference crystal structures. Our results are consistent across the full spectrum of test domains including those from recent CASP experiments and most notably in the ‘twilight’ and ‘midnight’ zones wherein <30% and <10% target-template sequence identity prevails (mean twilight RMSD of 1.69 Å). We further demonstrate the utility of the PCAIN protocol to derive biological insight into protein structure-function relationships, by modeling the structure of the YopM effector novel E3 ligase (NEL) domain from plague-causative bacterium Yersinia pestis and discussing its implications for host adaptive and innate immune modulation by the pathogen. Considering the several high-throughput, sequence-identity-independent applications demonstrated in this work, we suggest that the PCAIN is a fundamental fold feature that could be a valuable addition to the arsenal of protein modeling and analysis tools.

19.
Recent progress in structure determination techniques has led to a significant growth in the number of known membrane protein structures, and the first structural genomics projects focusing on membrane proteins have been initiated, warranting an investigation of appropriate bioinformatics strategies for optimal structural target selection for these molecules. What determines a membrane protein fold? How many membrane structures need to be solved to provide sufficient structural coverage of the membrane protein sequence space? We present the CAMPS database (Computational Analysis of the Membrane Protein Space) containing almost 45,000 proteins with three or more predicted transmembrane helices (TMH) from 120 bacterial species. This large set of membrane proteins was subjected to single‐linkage clustering using only sequence alignments covering at least 40% of the TMH present in a given family. This process yielded 266 sequence clusters with at least 15 members, roughly corresponding to membrane structural folds, sufficiently structurally homogeneous in terms of the variation of TMH number between individual sequences. These clusters were further subdivided into functionally homogeneous subclusters according to the COG (Clusters of Orthologous Groups) system as well as more stringently defined families sharing at least 30% identity. The CAMPS sequence clusters are thus designed to reflect three main levels of interest for structural genomics: fold, function, and modeling distance. We present a library of Hidden Markov Models (HMM) derived from sequence alignments of TMH at these three levels of sequence similarity. Given that 24 out of 266 clusters corresponding to membrane folds already have associated known structures, we estimate that 242 additional new structures, one for each remaining cluster, would provide structural coverage at the fold level of roughly 70% of prokaryotic membrane proteins belonging to the currently most populated families. Proteins 2006. © 2006 Wiley‐Liss, Inc.
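
The single-linkage clustering step mentioned in this abstract can be illustrated with a small sketch. It is a generic union-find version of the technique, not the CAMPS pipeline: the protein identifiers, the pairwise coverage values, and the function name `single_linkage` are all hypothetical, and only the 40% coverage threshold is taken from the abstract.

```python
def single_linkage(items, links):
    """Group items into clusters where any chain of links joins two members."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    # Largest clusters first, ties broken alphabetically, for stable output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


# Hypothetical fraction of transmembrane helices covered by each pairwise alignment.
coverage = {
    ("p1", "p2"): 0.55,
    ("p2", "p3"): 0.42,
    ("p3", "p4"): 0.10,
    ("p4", "p5"): 0.80,
}
proteins = ["p1", "p2", "p3", "p4", "p5"]

# Keep only alignments that cover at least 40% of the helices.
links = [pair for pair, cov in coverage.items() if cov >= 0.40]
clusters = single_linkage(proteins, links)
print(clusters)  # prints [['p1', 'p2', 'p3'], ['p4', 'p5']]
```

In this toy input, p1 and p3 end up in the same cluster without a direct alignment between them, because each is linked to p2. That chaining behaviour is the defining property of single linkage.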

20.
Summary: Retroviruses are an important group of pathogens that cause a variety of diseases in humans and animals. Four human retroviruses are currently known, including human immunodeficiency virus type 1, which causes AIDS, and human T-lymphotropic virus type 1, which causes cancer and inflammatory disease. For many years, there have been sporadic reports of additional human retroviral infections, particularly in cancer and other chronic diseases. Unfortunately, many of these putative viruses remain unproven and controversial, and some retrovirologists have dismissed them as merely “human rumor viruses.” Work in this field was last reviewed in depth in 1984, and since then, the molecular techniques available for identifying and characterizing retroviruses have improved enormously in sensitivity. The advent of PCR in particular has dramatically enhanced our ability to detect novel viral sequences in human tissues. However, DNA amplification techniques have also increased the potential for false-positive detection due to contamination. In addition, the presence of many families of human endogenous retroviruses (HERVs) within our DNA can obstruct attempts to identify and validate novel human retroviruses. Here, we aim to bring together the data on “novel” retroviral infections in humans by critically examining the evidence for those putative viruses that have been linked with disease and the likelihood that they represent genuine human infections. We provide a background to the field and a discussion of potential confounding factors along with some technical guidelines. In addition, some of the difficulties associated with obtaining formal proof of causation for common or ubiquitous agents such as HERVs are discussed.
