首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
Utilizing concepts of protein building blocks, we propose a de novo computational algorithm that is similar to combinatorial shuffling experiments. Our goal is to engineer new naturally occurring folds with low homology to existing proteins. A selected protein is first partitioned into its building blocks based on their compactness, degree of isolation from the rest of the structure, and hydrophobicity. Next, the protein building blocks are substituted by fragments taken from other proteins with overall low sequence identity, but with a similar hydrophobic/hydrophilic pattern and a high structural similarity. These criteria ensure that the designed protein has a similar fold, low sequence identity, and a good hydrophobic core compared with its native counterpart. Here, we have selected two proteins for engineering, protein G B1 domain and ubiquitin. The two engineered proteins share approximately 20% and approximately 25% amino acid sequence identities with their native counterparts, respectively. The stabilities of the engineered proteins are tested by explicit water molecular dynamics simulations. The algorithm implements a strategy of designing a protein using relatively stable fragments, with a high population time. Here, we have selected the fragments by searching for local minima along the polypeptide chain using the protein building block model. Such an approach provides a new method for engineering new proteins with similar folds and low homology.  相似文献   

3.
Finding structural similarities between proteins often helps reveal shared functionality, which otherwise might not be detected by native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures, however, remains a hard problem. An alternative approach is to approximate each three-dimensional protein structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences or structural sequences. In this article, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the best matching fragment to the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments having four residues. We show that the global structural sequences approximate well the native structures of proteins, with an average coordinate root mean square of 0.69 Å over 2225 test proteins. The approximation is best for all α-proteins, while relatively poorer for all β-proteins. We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary-structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Without surprise, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary-structure information or local information from the structural sequence considerably improves the classification accuracy. The two fragment-based sequences perform better than the secondary-structure sequence but not well enough at this stage to be a viable alternative to more computationally intensive methods based on protein structure alignment.  相似文献   

4.
Protein design aims at designing new protein molecules of desired structure and functionality. One of the major obstacles to large-scale protein design are the extensive time and manpower requirements for experimental validation of designed sequences. Recent advances in protein structure prediction have provided potentials for an automated assessment of the designed sequences via folding simulations. We present a new protocol for protein design and validation. The sequence space is initially searched by Monte Carlo sampling guided by a public atomic potential, with candidate sequences selected by the clustering of sequence decoys. The designed sequences are then assessed by I-TASSER folding simulations, which generate full-length atomic structural models by the iterative assembly of threading fragments. The protocol is tested on 52 nonhomologous single-domain proteins, with an average sequence identity of 24% between the designed sequences and the native sequences. Despite this low sequence identity, three-dimensional models predicted for the first designed sequence have an RMSD of < 2 Å to the target structure in 62% of cases. This percentage increases to 77% if we consider the three-dimensional models from the top 10 designed sequences. Such a striking consistency between the target structure and the structural prediction from nonhomologous sequences, despite the fact that the design and folding algorithms adopt completely different force fields, indicates that the design algorithm captures the features essential to the global fold of the target. On average, the designed sequences have a free energy that is 0.39 kcal/(mol residue) lower than in the native sequences, potentially affording a greater stability to synthesized target folds.  相似文献   

5.
Predicting the three-dimensional structure of proteins from their amino acid sequences remains a challenging problem in molecular biology. While the current structural coverage of proteins is almost exclusively provided by template-based techniques, the modeling of the rest of the protein sequences increasingly require template-free methods. However, template-free modeling methods are much less reliable and are usually applicable for smaller proteins, leaving much space for improvement. We present here a novel computational method that uses a library of supersecondary structure fragments, known as Smotifs, to model protein structures. The library of Smotifs has saturated over time, providing a theoretical foundation for efficient modeling. The method relies on weak sequence signals from remotely related protein structures to create a library of Smotif fragments specific to the target protein sequence. This Smotif library is exploited in a fragment assembly protocol to sample decoys, which are assessed by a composite scoring function. Since the Smotif fragments are larger in size compared to the ones used in other fragment-based methods, the proposed modeling algorithm, SmotifTF, can employ an exhaustive sampling during decoy assembly. SmotifTF successfully predicts the overall fold of the target proteins in about 50% of the test cases and performs competitively when compared to other state of the art prediction methods, especially when sequence signal to remote homologs is diminishing. Smotif-based modeling is complementary to current prediction methods and provides a promising direction in addressing the structure prediction problem, especially when targeting larger proteins for modeling.  相似文献   

6.
We present a thorough analysis of the relation between amino acid sequence and local three-dimensional structure in proteins. A library of overlapping local structural prototypes was built using an unsupervised clustering approach called “hybrid protein model” (HPM). The HPM carries out a multiple structural alignment of local folds from a non-redundant protein structure databank encoded into a structural alphabet composed of 16 protein blocks (PBs). Following previous research focusing on the HPM protocol, we have considered gaps in the local structure prototype. This methodology allows to have variable length fragments. Hence, 120 local structure prototypes were obtained. Twenty-five percent of the protein fragments learnt by HPM had gaps.An investigation of tight turns suggested that they are mainly derived from three PB series with precise locations in the HPM. The amino acid information content of the whole conformational classes was tackled by multivariate methods, e.g., canonical correlation analysis. It points out the presence of seven amino acid equivalence classes showing high propensities for preferential local structures. In the same way, definition of “contrast factors” based on sequence-structure properties underline the specificity of certain structural prototypes, e.g., the dependence of Gly or Asn-rich turns to a limited number of PBs, or, the opposition between Pro-rich coils to those enriched in Ser, Thr, Asn and Glu. These results are so useful to analyze the sequence-structure relationships, but could also be used to improve fragment-based method for protein structure prediction from sequence.  相似文献   

7.
Recent progress in structure determination techniques has led to a significant growth in the number of known membrane protein structures, and the first structural genomics projects focusing on membrane proteins have been initiated, warranting an investigation of appropriate bioinformatics strategies for optimal structural target selection for these molecules. What determines a membrane protein fold? How many membrane structures need to be solved to provide sufficient structural coverage of the membrane protein sequence space? We present the CAMPS database (Computational Analysis of the Membrane Protein Space) containing almost 45,000 proteins with three or more predicted transmembrane helices (TMH) from 120 bacterial species. This large set of membrane proteins was subjected to single‐linkage clustering using only sequence alignments covering at least 40% of the TMH present in a given family. This process yielded 266 sequence clusters with at least 15 members, roughly corresponding to membrane structural folds, sufficiently structurally homogeneous in terms of the variation of TMH number between individual sequences. These clusters were further subdivided into functionally homogeneous subclusters according to the COG (Clusters of Orthologous Groups) system as well as more stringently defined families sharing at least 30% identity. The CAMPS sequence clusters are thus designed to reflect three main levels of interest for structural genomics: fold, function, and modeling distance. We present a library of Hidden Markov Models (HMM) derived from sequence alignments of TMH at these three levels of sequence similarity. Given that 24 out of 266 clusters corresponding to membrane folds already have associated known structures, we estimate that 242 additional new structures, one for each remaining cluster, would provide structural coverage at the fold level of roughly 70% of prokaryotic membrane proteins belonging to the currently most populated families. Proteins 2006. © 2006 Wiley‐Liss, Inc.  相似文献   

8.
Alexander PA  Rozak DA  Orban J  Bryan PN 《Biochemistry》2005,44(43):14045-14054
To better understand how amino acid sequences specify unique tertiary folds, we have used random mutagenesis and phage display selection to evolve proteins with a high degree of sequence identity but different tertiary structures (homologous heteromorphs). The starting proteins in this evolutionary process were the IgG binding domains of streptococcal protein G (G(B)) and staphylococcal protein A (A(B)). These nonhomologous domains are similar in size and function but have different folds. G(B) has an alpha/beta fold, and A(B) is a three-helix bundle (3-alpha). IgG binding function is used to select for mutant proteins which retain the correct tertiary structure as the level of sequence identity is increased. A detailed thermodynamic analysis of the folding reactions and binding reactions for a pair of homologous heteromorphs (59% identical) is presented. High-resolution NMR structures of the pair are presented by He et al. [(2005) Biochemistry 44, 14055-14061]. Because the homologous but heteromorphic proteins are identical at most positions in their sequence, their essential folding signals must reside in the positions of nonidentity. Further, the thermodynamic linkage between folding and binding is used to assess the propensity of one sequence to adopt two unique folds.  相似文献   

9.
Silva PJ 《Proteins》2008,70(4):1588-1594
Hydrophobic cluster analysis (HCA) has long been used as a tool to detect distant homologies between protein sequences, and to classify them into different folds. However, it relies on expert human intervention, and is sensitive to subjective interpretations of pattern similarities. In this study, we describe a novel algorithm to assess the similarity of hydrophobic amino acid distributions between two sequences. Our algorithm correctly identifies as misattributions several HCA-based proposals of structural similarity between unrelated proteins present in the literature. We have also used this method to identify the proper fold of a large variety of sequences, and to automatically select the most appropriate structure for homology modeling of several proteins with low sequence identity to any other member of the protein data bank. Automatic modeling of the target proteins based on these templates yielded structures with TM-scores (vs. experimental structures) above 0.60, even without further refinement. Besides enabling a reliable identification of the correct fold of an unknown sequence and the choice of suitable templates, our algorithm also shows that whereas most structural classes of proteins are very homogeneous in hydrophobic cluster composition, a tenth of the described families are compatible with a large variety of hydrophobic patterns. We have built a browsable database of every major representative hydrophobic cluster pattern present in each structural class of proteins, freely available at http://www2.ufp.pt/ pedros/HCA_db/index.htm.  相似文献   

10.
Identifying the fold class of a protein sequence of unknown structure is a fundamental problem in modern biology. We apply a supervised learning algorithm to the classification of protein sequences with low sequence identity from a library of 174 structural classes created with the Combinatorial Extension structural alignment methodology. A class of rules is considered that assigns test sequences to structural classes based on the closest match of an amino acid index profile of the test sequence to a profile centroid for each class. A mathematical optimization procedure is applied to determine an amino acid index of maximal structural discriminatory power by maximizing the ratio of between-class to within-class profile variation. The optimal index is computed as the solution to a generalized eigenvalue problem, and its performance for fold classification is compared to that of other published indices. The optimal index has significantly more structural discriminatory power than all currently known indices, including average surrounding hydrophobicity, which it most closely resembles. It demonstrates >70% classification accuracy over all folds and nearly 100% accuracy on several folds with distinctive conserved structural features. Finally, there is a compelling universality to the optimal index in that it does not appear to depend strongly on the specific structural classes used in its computation.  相似文献   

11.
A novel method has been developed for acquiring the correct alignment of a query sequence against remotely homologous proteins by extracting structural information from profiles of multiple structure alignment. A systematic search algorithm combined with a group of score functions based on sequence information and structural information has been introduced in this procedure. A limited number of top solutions (15,000) with high scores were selected as candidates for further examination. On a test-set comprising 301 proteins from 75 protein families with sequence identity less than 30%, the proportion of proteins with completely correct alignment as first candidate was improved to 39.8% by our method, whereas the typical performance of existing sequence-based alignment methods was only between 16.1% and 22.7%. Furthermore, multiple candidates for possible alignment were provided in our approach, which dramatically increased the possibility of finding correct alignment, such that completely correct alignments were found amongst the top-ranked 1000 candidates in 88.3% of the proteins. With the assistance of a sequence database, completely correct alignment solutions were achieved amongst the top 1000 candidates in 94.3% of the proteins. From such a limited number of candidates, it would become possible to identify more correct alignment using a more time-consuming but more powerful method with more detailed structural information, such as side-chain packing and energy minimization, etc. The results indicate that the novel alignment strategy could be helpful for extending the application of highly reliable methods for fold identification and homology modeling to a huge number of homologous proteins of low sequence similarity. Details of the methods, together with the results and implications for future development are presented.  相似文献   

12.
Locating sequences compatible with a protein structural fold is the well‐known inverse protein‐folding problem. While significant progress has been made, the success rate of protein design remains low. As a result, a library of designed sequences or profile of sequences is currently employed for guiding experimental screening or directed evolution. Sequence profiles can be computationally predicted by iterative mutations of a random sequence to produce energy‐optimized sequences, or by combining sequences of structurally similar fragments in a template library. The latter approach is computationally more efficient but yields less accurate profiles than the former because of lacking tertiary structural information. Here we present a method called SPIN that predicts Sequence Profiles by Integrated Neural network based on fragment‐derived sequence profiles and structure‐derived energy profiles. SPIN improves over the fragment‐derived profile by 6.7% (from 23.6 to 30.3%) in sequence identity between predicted and wild‐type sequences. The method also reduces the number of residues in low complex regions by 15.7% and has a significantly better balance of hydrophilic and hydrophobic residues at protein surface. The accuracy of sequence profiles obtained is comparable to those generated from the protein design program RosettaDesign 3.5. This highly efficient method for predicting sequence profiles from structures will be useful as a single‐body scoring term for improving scoring functions used in protein design and fold recognition. It also complements protein design programs in guiding experimental design of the sequence library for screening and directed evolution of designed sequences. The SPIN server is available at http://sparks‐lab.org . Proteins 2014; 82:2565–2573. © 2014 Wiley Periodicals, Inc.  相似文献   

13.
For many years it has been accepted that the sequence of a protein can specify its three-dimensional structure. However, there has been limited progress in explaining how the sequence dictates its fold and no attempt to do this computationally without the use of specific structural data has ever succeeded for any protein larger than 100 residues. We describe a method that can predict complex folds up to almost 200 residues using only basic principles that do not include any elements of sequence homology. The method does not simulate the folding chain but generates many thousands of models based on an idealized representation of structure. Each rough model is scored and the best are refined. On a set of five proteins, the correct fold score well and when tested on a set of larger proteins, the correct fold was ranked highest for some proteins more than 150 residues, with others being close topological variants. All other methods that approach this level of success rely on the use of templates or fragments of known structures. Our method is unique in using a database of ideal models based on general packing rules that, in spirit, is closer to an ab initio approach.  相似文献   

14.
A significant number of protein sequences in a given proteome have no obvious evolutionarily related protein in the database of solved protein structures, the PDB. Under these conditions, ab initio or template-free modeling methods are the sole means of predicting protein structure. To assess its expected performance on proteomes, the TASSER structure prediction algorithm is benchmarked in the ab initio limit on a representative set of 1129 nonhomologous sequences ranging from 40 to 200 residues that cover the PDB at 30% sequence identity and which adopt alpha, alpha + beta, and beta secondary structures. For sequences in the 40-100 (100-200) residue range, as assessed by their root mean square deviation from native, RMSD, the best of the top five ranked models of TASSER has a global fold that is significantly close to the native structure for 25% (16%) of the sequences, and with a correct identification of the structure of the protein core for 59% (36%). In the absence of a native structure, the structural similarity among the top five ranked models is a moderately reliable predictor of folding accuracy. If we classify the sequences according to their secondary structure content, then 64% (36%) of alpha, 43% (24%) of alpha + beta, and 20% (12%) of beta sequences in the 40-100 (100-200) residue range have a significant TM-score (TM-score > or = 0.4). TASSER performs best on helical proteins because there are less secondary structural elements to arrange in a helical protein than in a beta protein of equal length, since the average length of a helix is longer than that of a strand. In addition, helical proteins have shorter loops and dangling tails. If we exclude these flexible fragments, then TASSER has similar accuracy for sequences containing the same number of secondary structural elements, irrespective of whether they are helices and/or strands. Thus, it is the effective configurational entropy of the protein that dictates the average likelihood of correctly arranging the secondary structure elements.  相似文献   

15.
16.
Metamorphic proteins, including proteins with high levels of sequence identity but different folds, are exceptions to the long-standing rule-of-thumb that proteins with as little as 30% sequence identity adopt the same fold. Which topologies can be bridged by these highly identical sequences remains an open question. Here we bridge two 3-α-helix bundle proteins with two radically different folds. Using a straightforward approach, we engineered the sequences of one subdomain within maltose binding protein (MBP, α/β/α-sandwich) and another within outer surface protein A (OspA, β-sheet) to have high sequence identity (80 and 77%, respectively) with engineered variants of protein G (GA, 3-α-helix bundle). Circular dichroism and nuclear magnetic resonance spectra of all engineered variants demonstrate that they maintain their native conformations despite substantial sequence modification. Furthermore, the MBP variant (80% identical to GA) remained active. Thermodynamic analysis of numerous GA and MBP variants suggests that the key to our approach involved stabilizing the modified MBP and OspA subdomains via external interactions with neighboring substructures, indicating that subdomain interactions can stabilize alternative folds over a broad range of sequence variation. These findings suggest that it is possible to bridge one fold with many other topologies, which has implications for protein folding, evolution, and misfolding diseases.  相似文献   

17.
The question of protein homology versus analogy arises when proteins share a common function or a common structural fold without any statistically significant amino acid sequence similarity. Even though two or more proteins do not have similar sequences but share a common fold and the same or closely related function, they are assumed to be homologs, descendant from a common ancestor. The problem of homolog identification is compounded in the case of proteins of 100 or less amino acids. This is due to a limited number of basic single domain folds and to a likelihood of identifying by chance sequence similarity. The latter arises from two conditions: first, any search of the currently very large protein database is likely to identify short regions of chance match; secondly, a direct sequence comparison among a small set of short proteins sharing a similar fold can detect many similar patterns of hydrophobicity even if proteins do not descend from a common ancestor. In an effort to identify distant homologs of the many ubiquitin proteins, we have developed a combined structure and sequence similarity approach that attempts to overcome the above limitations of homolog identification. This approach results in the identification of 90 probable ubiquitin-related proteins, including examples from the two prokaryotic domains of life, Archaea and Bacteria.  相似文献   

18.
McGuffin LJ  Jones DT 《Proteins》2002,48(1):44-52
The ultimate goal of structural genomics is to obtain the structure of each protein coded by each gene within a genome to determine gene function. Because of cost and time limitations, it remains impractical to solve the structure for every gene product experimentally. Up to a point, reasonably accurate three‐dimensional structures can be deduced for proteins with homologous sequences by using comparative modeling. Beyond this, fold recognition or threading methods can be used for proteins showing little homology to any known fold, although this is relatively time‐consuming and limited by the library of template folds currently available. Therefore, it is appropriate to develop methods that can increase our knowledge base, expanding our fold libraries by earmarking potentially “novel” folds for experimental structure determination. How can we sift through proteomic data rapidly and yet reliably identify novel folds as targets for structural genomics? We have analyzed a number of simple methods that discriminate between “novel” and “known” folds. We propose that simple alignments of secondary structure elements using predicted secondary structure could potentially be a more selective method than both a simple fold recognition method (GenTHREADER) and standard sequence alignment at finding novel folds when sequences show no detectable homology to proteins with known structures. Proteins 2002;48:44–52. © 2002 Wiley‐Liss, Inc.  相似文献   

19.
MOTIVATION: A method for recognizing the three-dimensional fold from the protein amino acid sequence based on a combination of hidden Markov models (HMMs) and secondary structure prediction was recently developed for proteins in the Mainly-Alpha structural class. Here, this methodology is extended to Mainly-Beta and Alpha-Beta class proteins. Compared to other fold recognition methods based on HMMs, this approach is novel in that only secondary structure information is used. Each HMM is trained from known secondary structure sequences of proteins having a similar fold. Secondary structure prediction is performed for the amino acid sequence of a query protein. The predicted fold of a query protein is the fold described by the model fitting the predicted sequence the best. RESULTS: After model cross-validation, the success rate on 44 test proteins covering the three structural classes was found to be 59%. On seven fold predictions performed prior to the publication of experimental structure, the success rate was 71%. In conclusion, this approach manages to capture important information about the fold of a protein embedded in the length and arrangement of the predicted helices, strands and coils along the polypeptide chain. When a more extensive library of HMMs representing the universe of known structural families is available (work in progress), the program will allow rapid screening of genomic databases and sequence annotation when fold similarity is not detectable from the amino acid sequence. AVAILABILITY: FORESST web server at http://absalpha.dcrt.nih.gov:8008/ for the library of HMMs of structural families used in this paper. FORESST web server at http://www.tigr.org/ for a more extensive library of HMMs (work in progress). CONTACT: valedf@tigr.org; munson@helix.nih.gov; garnier@helix.nih.gov  相似文献   

20.
MOTIVATION: Secondary-Structure Guided Superposition tool (SSGS) is a permissive secondary structure-based algorithm for matching of protein structures and in particular their fragments. The algorithm was developed towards protein structure prediction via fragment assembly. RESULTS: In a fragment-based structural prediction scheme, a protein sequence is cut into building blocks (BBs). The BBs are assembled to predict their relative 3D arrangement. Finally, the assemblies are refined. To implement this prediction scheme, a clustered structural library representing sequence patterns for protein fragments is essential. To create a library, BBs generated by cutting proteins from the PDB are compared and structurally similar BBs are clustered. To allow structural comparison and clustering of the BBs, which are often relatively short with flexible loops, we have devised SSGS. SSGS maintains high similarity between cluster members and is highly efficient. When it comes to comparing BBs for clustering purposes, the algorithm obtains better results than other, non-secondary structure guided protein superimposition algorithms.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号