Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Comparative modeling methods can consistently produce reliable structural models for protein sequences with more than 25% sequence identity to proteins with known structure. However, there is a good chance that also sequences with lower sequence identity have their structural components represented in structural databases. To this end, we present a novel fragment-based method using sets of structurally similar local fragments of proteins. The approach differs from other fragment-based methods that use only single backbone fragments. Instead, we use a library of groups containing sets of sequence fragments with geometrically similar local structures and extract sequence related properties to assign these specific geometrical conformations to target sequences. We test the ability of the approach to recognize correct SCOP folds for 273 sequences from the 49 most popular folds. 49% of these sequences have the correct fold as their top prediction, while 82% have the correct fold in one of the top five predictions. Moreover, the approach shows no performance reduction on a subset of sequence targets with less than 10% sequence identity to any protein used to build the library.
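
The top-one and top-five figures quoted above are top-k accuracies. A minimal sketch of how such a number can be computed follows; the function name, fold labels, and rankings are illustrative assumptions, not data or code from the study.

```python
def top_k_accuracy(ranked_predictions, true_folds, k):
    """Fraction of targets whose true fold is among the first k ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_folds)
        if truth in ranked[:k]
    )
    return hits / len(true_folds)


# Made-up rankings for three hypothetical targets (labels are placeholders).
ranked = [
    ["a.1", "b.1", "c.1"],  # true fold ranked first
    ["b.1", "a.1", "c.1"],  # true fold ranked second
    ["c.1", "b.1", "a.1"],  # true fold ranked third
]
truth = ["a.1", "a.1", "a.1"]

top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
```

With these toy rankings, one target in three is correct at rank one and two in three within the top two.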

2.
J Hargbo  A Elofsson 《Proteins》1999,36(1):68-76
There are many proteins that share the same fold but have no clear sequence similarity. To predict the structure of these proteins, so-called "protein fold recognition methods" have been developed. During the last few years, improvements of protein fold recognition methods have been achieved through the use of predicted secondary structures (Rice and Eisenberg, J Mol Biol 1997;267:1026-1038), as well as by using multiple sequence alignments in the form of hidden Markov models (HMM) (Karplus et al., Proteins Suppl 1997;1:134-139). To test the performance of different fold recognition methods, we have developed a rigorous benchmark where representatives for all proteins of known structure are matched against each other. Using this benchmark, we have compared the performance of automatically-created hidden Markov models with standard-sequence-search methods. Further, we combine the use of predicted secondary structures and multiple sequence alignments into a combined method that performs better than methods that do not use this combination of information. Using only single sequences, the correct fold of a protein was detected for 10% of the test cases in our benchmark. Including multiple sequence information increased this number to 16%, and when predicted secondary structure information was included as well, the fold was correctly identified in 20% of the cases. Moreover, if the correct secondary structure was used, 27% of the proteins could be correctly matched to a fold. For comparison, blast2, fasta, and ssearch identify the fold correctly in 13-17% of the cases. Thus, standard pairwise sequence search methods perform almost as well as hidden Markov models in our benchmark. This is probably because the automatically-created multiple sequence alignments used in this study do not contain enough diversity and because the current generation of hidden Markov models does not perform very well when built from a few sequences.

3.
We have collected a set of 44 Arabidopsis proteins with similarity to the USPA (universal stress protein A of Escherichia coli) domain of bacteria. The USPA domain is found either in small proteins, or it makes up the N-terminal portion of a larger protein, usually a protein kinase. Phylogenetic tree analysis based upon a multiple sequence alignment of the USPA domains shows that these domains of protein kinases 1.3.1 and 1.3.2 form distinct groups, as do the protein kinases 1.4.1. This indicates that their USPA domain structures have diverged appreciably and suggests that they may subserve distinct cellular functions. Two USPA fold classes have been proposed: one based on Methanococcus jannaschii MJ0577 (1MJH) that binds ATP, and the other based on the Haemophilus influenzae universal stress protein (1JMV), highly similar to E. coli UspA, which does not bind ATP. A set of common residues involved in ATP binding in 1MJH and conserved in similar bacterial sequences is also found in a distinct cluster of Arabidopsis sequences. Threading analysis, which examines aspects of secondary and tertiary structure, confirms this Arabidopsis sequence cluster as highly similar to 1MJH. This structural approach can distinguish between the characteristic fold differences of 1MJH-like and 1JMV-like bacterial proteins and was used to assign the complete set of candidate Arabidopsis proteins to one of these fold classes. It is clear that all the plant sequences have arisen from a 1MJH-like ancestor.

4.
The question of protein homology versus analogy arises when proteins share a common function or a common structural fold without any statistically significant amino acid sequence similarity. Even though two or more proteins do not have similar sequences but share a common fold and the same or closely related function, they are assumed to be homologs, descended from a common ancestor. The problem of homolog identification is compounded in the case of proteins of 100 or fewer amino acids. This is due to a limited number of basic single domain folds and to a likelihood of identifying by chance sequence similarity. The latter arises from two conditions: first, any search of the currently very large protein database is likely to identify short regions of chance match; secondly, a direct sequence comparison among a small set of short proteins sharing a similar fold can detect many similar patterns of hydrophobicity even if proteins do not descend from a common ancestor. In an effort to identify distant homologs of the many ubiquitin proteins, we have developed a combined structure and sequence similarity approach that attempts to overcome the above limitations of homolog identification. This approach results in the identification of 90 probable ubiquitin-related proteins, including examples from the two prokaryotic domains of life, Archaea and Bacteria.

5.
Structural genomics projects as well as ab initio protein structure prediction methods provide structures of proteins with no sequence or fold similarity to proteins with known functions. These are often low-resolution structures that may only include the positions of C alpha atoms. We present a fast and efficient method to predict DNA-binding proteins from just the amino acid sequences and low-resolution, C alpha-only protein models. The method uses the relative proportions of certain amino acids in the protein sequence, the asymmetry of the spatial distribution of certain other amino acids as well as the dipole moment of the molecule. These quantities are used in a linear formula, with coefficients derived from logistic regression performed on a training set, and DNA-binding is predicted based on whether the result is above a certain threshold. We show that the method is insensitive to errors in the atomic coordinates and provides correct predictions even on inaccurate protein models. We demonstrate that the method is capable of predicting proteins with novel binding site motifs and structures solved in an unbound state. The accuracy of our method is close to that of another published method that uses all-atom structures, time-consuming calculations and information on conserved residues.
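
The linear formula with logistic-regression coefficients and a decision threshold described above can be sketched as follows. The feature values, weights, bias, and threshold here are hypothetical placeholders; the abstract does not give the fitted coefficients or the exact feature definitions.

```python
import math


def dna_binding_probability(features, weights, bias):
    """Logistic model: sigmoid of a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_dna_binding(features, weights, bias, threshold=0.5):
    """Predict DNA binding when the model output reaches the threshold."""
    return dna_binding_probability(features, weights, bias) >= threshold


# Hypothetical coefficients for three illustrative features, e.g. an amino
# acid proportion, a spatial asymmetry measure, and a dipole moment term.
weights = [2.0, 1.5, 0.5]
bias = -2.0

candidate = [0.8, 0.4, 1.0]   # linear score z = 0.7
background = [0.0, 0.0, 0.0]  # linear score z = -2.0

candidate_is_binder = predict_dna_binding(candidate, weights, bias)
background_is_binder = predict_dna_binding(background, weights, bias)
```

Thresholding the probability at 0.5 is equivalent to thresholding the linear score at zero; a different threshold trades sensitivity against specificity.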

6.
Small autonomously folding proteins are of interest as model systems to study protein folding, as the same molecule can be used for both experimental and computational approaches. The question remains as to how well these minimized peptide model systems represent larger native proteins. For example, is the core of a minimized protein tolerant to mutation like larger proteins are? Also, do minimized proteins use special strategies for specifying and stabilizing their folded structure? Here we examine these questions in the 35‐residue autonomously folding villin headpiece subdomain (VHP subdomain). Specifically, we focus on a cluster of three conserved phenylalanine (F) residues, F47, F51, and F58, that form most of the hydrophobic core. These three residues are oriented such that they may provide stabilizing aromatic–aromatic interactions that could be critical for specifying the fold. Circular dichroism and 1D‐NMR spectroscopy show that point mutants in which any one of these three residues is individually replaced with leucine were destabilized, but retained the native VHP subdomain fold. In pair‐wise replacements, the double mutant that retains F58 can adopt the native fold, while the two double mutants that lack F58 cannot. The folding of the double mutant that retains F58 demonstrates that aromatic–aromatic interactions within the aromatic cluster are not essential for specifying the VHP subdomain fold. The ability of the VHP subdomain to tolerate mutations within its hydrophobic core indicates that the information specifying the three dimensional structure is distributed throughout the sequence, as observed in larger proteins. Thus, the VHP subdomain is a legitimate model for larger, native proteins.

7.
Over the next few years, various genome projects will sequence many new genes and yield many new gene products. Many of these products will have no known function and little, if any, sequence homology to existing proteins. There is reason to believe that a rapid determination of a protein fold, even at low resolution, can aid in the identification of function and expedite the determination of structure at higher resolution. Recently devised NMR methods of measuring residual dipolar couplings provide one route to the determination of a fold. They do this by allowing the alignment of previously identified secondary structural elements with respect to each other. When combined with constraints involving loops connecting elements or other short-range experimental distance information, a fold is produced. We illustrate this approach to protein fold determination on (15)N-labeled Escherichia coli acyl carrier protein using a limited set of (15)N-(1)H and (1)H-(1)H dipolar couplings. We also illustrate an approach using a more extended set of heteronuclear couplings on a related protein, (13)C, (15)N-labeled NodF protein from Rhizobium leguminosarum.

8.
The information required to generate a protein structure is contained in its amino acid sequence, but how three-dimensional information is mapped onto a linear sequence is still incompletely understood. Multiple structure alignments of similar protein structures have been used to investigate conserved sequence features but contradictory results have been obtained, due, in large part, to the absence of subjective criteria to be used in the construction of sequence profiles and in the quantitative comparison of alignment results. Here, we report a new procedure for multiple structure alignment and use it to construct structure-based sequence profiles for similar proteins. The definition of "similar" is based on the structural alignment procedure and on the protein structural distance (PSD) described in paper I of this series, which offers an objective measure for protein structure relationships. Our approach is tested in two well-studied groups of proteins; serine proteases and Ig-like proteins. It is demonstrated that the quality of a sequence profile generated by a multiple structure alignment is quite sensitive to the PSD used as a threshold for the inclusion of proteins in the alignment. Specifically, if the proteins included in the aligned set are too distant in structure from one another, there will be a dilution of information and patterns that are relevant to a subset of the proteins are likely to be lost. In order to understand better how the same three-dimensional information can be encoded in seemingly unrelated sequences, structure-based sequence profiles are constructed for subsets of proteins belonging to nine superfolds. We identify patterns of relatively conserved residues in each subset of proteins. It is demonstrated that the most conserved residues are generally located in the regions where tertiary interactions occur and that are relatively conserved in structure. 
Nevertheless, the conservation patterns are relatively weak in all cases studied, indicating that structure-determining factors that do not require a particular sequential arrangement of amino acids, such as secondary structure propensities and hydrophobic interactions, are important in encoding protein fold information. In general, we find that similar structures can fold without having a set of highly conserved residue clusters or a well-conserved sequence profile; indeed, in some cases there is no apparent conservation pattern common to structures with the same fold. Thus, when a group of proteins exhibits a common and well-defined sequence pattern, it is more likely that these sequences have a close evolutionary relationship rather than the similarities having arisen from the structural requirements of a given fold.

9.
Apgar JR  Gutwin KN  Keating AE 《Proteins》2008,72(3):1048-1065
The alpha-helical coiled coil is a structurally simple protein oligomerization or interaction motif consisting of two or more alpha helices twisted into a supercoiled bundle. Coiled coils can differ in their stoichiometry, helix orientation, and axial alignment. Because of the near degeneracy of many of these variants, coiled coils pose a challenge to fold recognition methods for structure prediction. Whereas distinctions between some protein folds can be discriminated on the basis of hydrophobic/polar patterning or secondary structure propensities, the sequence differences that encode important details of coiled-coil structure can be subtle. This is emblematic of a larger problem in the field of protein structure and interaction prediction: that of establishing specificity between closely similar structures. We tested the behavior of different computational models on the problem of recognizing the correct orientation--parallel vs. antiparallel--of pairs of alpha helices that can form a dimeric coiled coil. For each of 131 examples of known structure, we constructed a large number of both parallel and antiparallel structural models and used these to assess the ability of five energy functions to recognize the correct fold. We also developed and tested three sequence-based approaches that make use of varying degrees of implicit structural information. The best structural methods performed similarly to the best sequence methods, correctly categorizing approximately 81% of dimers. Steric compatibility with the fold was important for some coiled coils we investigated. For many examples, the correct orientation was determined by smaller energy differences between parallel and antiparallel structures distributed over many residues and energy components. Prediction methods that used structure but incorporated varying approximations and assumptions showed quite different behaviors when used to investigate energetic contributions to orientation preference. 
Sequence-based methods were sensitive to the choice of residue-pair interactions scored.

10.
11.
One of the main barriers to accurate computational protein structure prediction is searching the vast space of protein conformations. Distance restraints or inter‐residue contacts have been used to reduce this search space, easing the discovery of the correct folded state. It has been suggested that about 1 contact for every 12 residues may be sufficient to predict structure at fold level accuracy. Here, we use coarse‐grained structure‐based models in conjunction with molecular dynamics simulations to examine this empirical prediction. We generate sparse contact maps for 15 proteins of varying sequence lengths and topologies and find that given perfect secondary‐structural information, a small fraction of the native contact map (5%‐10%) suffices to fold proteins to their correct native states. We also find that different sparse maps are not equivalent and we make several observations about the type of maps that are successful at such structure prediction. Long range contacts are found to encode more information than shorter range ones, especially for α and αβ‐proteins. However, this distinction reduces for β‐proteins. Choosing contacts that are a consensus from successful maps gives predictive sparse maps as does choosing contacts that are well spread out over the protein structure. Additionally, the folding of proteins can also be used to choose predictive sparse maps. Overall, we conclude that structure‐based models can be used to understand the efficacy of structure‐prediction restraints and could, in future, be tuned to include specific force‐field interactions, secondary structure errors and noise in the sparse maps.
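
To make the idea of a sparse contact map concrete, the sketch below builds a native contact list from C-alpha coordinates and keeps a random 10% of it. The 8 Å cutoff, the minimum sequence separation of three residues, the random (rather than consensus or well-spread) selection, and the idealised helix coordinates are all assumptions made for illustration; the abstract does not specify the authors' contact definition or selection procedure.

```python
import math
import random


def toy_helix(n_residues):
    """Idealised helix-like C-alpha trace: 100 degrees and 1.5 Å rise per residue."""
    return [
        (
            2.3 * math.cos(math.radians(100 * i)),
            2.3 * math.sin(math.radians(100 * i)),
            1.5 * i,
        )
        for i in range(n_residues)
    ]


def native_contacts(ca_coords, cutoff=8.0, min_separation=3):
    """Residue pairs (i, j) with j - i >= min_separation and distance <= cutoff."""
    contacts = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts


def sparse_map(contacts, fraction, seed=0):
    """Keep a random subset containing roughly `fraction` of the contacts."""
    rng = random.Random(seed)
    n_keep = max(1, round(fraction * len(contacts)))
    return sorted(rng.sample(contacts, n_keep))


coords = toy_helix(24)
contacts = native_contacts(coords)
sparse = sparse_map(contacts, fraction=0.10)
```

Random selection is the simplest choice; the abstract notes that sparse maps are not equivalent, so a real protocol would compare several selections rather than rely on one draw.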

12.
Reva B  Finkelstein A  Topiol S 《Proteins》2002,47(2):180-193
We present a new method for more accurate modeling of protein structure, called threading with chemostructural restrictions. This method addresses those cases in which a target sequence has only remote homologues of known structure for which sequence comparison methods cannot provide accurate alignments. Although remote homologues cannot provide an accurate model for the whole chain, they can be used in constructing practically useful models for the most conserved, and often the most interesting, part of the structure. For many proteins of interest, one can suggest certain chemostructural patterns for the native structure based on the available information on the structural superfamily of the protein, the type of activity, the sequence location of the functionally significant residues, and other factors. We use such patterns to restrict (1) the number of possible templates, and (2) the number of allowed chain conformations on a template. The latter restrictions are imposed in the form of additional template potentials (including terms acting as sequence anchors) that act on certain residues. This approach is tested on remote homologues of alpha/beta-hydrolases that have significant structural similarity in the positions of their catalytic triads. The study shows that, in spite of significant deviations between the model and the native structures, the surroundings of the catalytic triad (positions of C(alpha) atoms of 20-30 nearby residues) can be reproduced with an accuracy of 2-3 Å. We then apply the approach to predict the structure of dipeptidylpeptidase IV (DPP-IV). Using experimentally available data identifying the catalytic triad residues of DPP-IV (David et al., J Biol Chem 1993;268:17247-17252), we predict a model structure of the catalytic domain of DPP-IV based on the 3D fold of prolyl oligopeptidase (Fulop et al., Cell 1998;94:161-170) and use this structure for modeling the interaction of DPP-IV with inhibitor.
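
An accuracy figure of this kind for C(alpha) positions is commonly reported as a root-mean-square deviation (RMSD). The sketch below computes a C(alpha) RMSD under the assumption that the model and native coordinates are already superimposed and listed in matching residue order; it performs no optimal superposition, and the coordinates are invented for illustration.

```python
import math


def ca_rmsd(model, native):
    """RMSD between two equally long, pre-superimposed lists of (x, y, z) points."""
    if len(model) != len(native):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model, native))
    return math.sqrt(squared / len(model))


# Invented coordinates: the model is the native trace shifted by 2 Å along x.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 2.0, y, z) for x, y, z in native]

rmsd = ca_rmsd(model, native)
```

Because every atom is displaced by the same 2 Å, the RMSD of this toy pair is 2 Å.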

13.

Background  

Recent approaches for predicting the three-dimensional (3D) structure of proteins such as de novo or fold recognition methods mostly rely on simplified energy potential functions and a reduced representation of the polypeptide chain. These simplifications facilitate the exploration of the protein conformational space but do not fully capture the subtle relationship that exists between the amino acid sequence and its native structure. It has been proposed that physics-based energy functions together with techniques for sampling the conformational space, e.g., Monte Carlo or molecular dynamics (MD) simulations, are better suited to the task of modelling proteins at higher resolutions than those of models obtained with the former type of methods. In this study we monitor different protein structural properties along MD trajectories to discriminate correct from erroneous models. These models are based on the sequence-structure alignments provided by our fold recognition method, FROST. We define correct models as being built from alignments of sequences with structures similar to their native structures and erroneous models from alignments of sequences with structures unrelated to their native structures.

14.
Bhardwaj N  Lu H 《FEBS letters》2007,581(5):1058-1066
Protein-DNA interactions are crucial to many cellular activities such as expression-control and DNA-repair. These interactions between amino acids and nucleotides are highly specific and any aberrance at the binding site can render the interaction completely incompetent. In this study, we have three aims focusing on DNA-binding residues on the protein surface: to develop an automated approach for fast and reliable recognition of DNA-binding sites; to improve the prediction by distance-dependent refinement; and to use these predictions to identify DNA-binding proteins. We use a support vector machine (SVM)-based approach to harness the features of the DNA-binding residues to distinguish them from non-binding residues. Features used for distinction include the residue's identity, charge, solvent accessibility, average potential, the secondary structure it is embedded in, neighboring residues, and location in a cationic patch. These features collected from 50 proteins are used to train the SVM. Testing is then performed on another set of 37 proteins, much larger than any testing set used in previous studies. The testing set has no more than 20% sequence identity not only among its pairs, but also with the proteins in the training set, thus removing any undesired redundancy due to homology. This set also has proteins with an unseen DNA-binding structural class not present in the training set. With the above features, an accuracy of 66% with balanced sensitivity and specificity is achieved without relying on homology or evolutionary information. We then develop a post-processing scheme to improve the prediction using the relative location of the predicted residues. Balanced success is then achieved with average sensitivity, specificity and accuracy pegged at 71.3%, 69.3% and 70.5%, respectively. Average net prediction is also around 70%. 
Finally, we show that the number of predicted DNA-binding residues can be used to differentiate DNA-binding proteins from non-DNA-binding proteins with an accuracy of 78%. Results presented here demonstrate that machine-learning can be applied to automated identification of DNA-binding residues and that the success rate can be improved as more features are added. Such functional site prediction protocols can be useful in guiding subsequent work such as site-directed mutagenesis and macromolecular docking.
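
Sensitivity, specificity, and accuracy as reported above follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts are invented for illustration and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true binding residues recovered
    specificity = tn / (tn + fp)  # fraction of non-binding residues rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Invented counts: 100 binding residues and 200 non-binding residues.
sensitivity, specificity, accuracy = binary_metrics(tp=70, fp=60, tn=140, fn=30)
```

Reporting sensitivity and specificity together matters here because binding residues are usually a minority class, so accuracy alone can look high for a predictor that rarely predicts binding.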

15.
Morra G  Colombo G 《Proteins》2008,72(2):660-672
Most proteins must fold to a well-defined structure with a minimal stability to perform their function. Here we use a simple, molecular dynamics-based, energy decomposition approach to map the principal energetic interactions in a set of proteins representative of different folds. This work involves the all-atom simulation and analysis of the native structures and mutants of five different proteins representative of an all-alpha (yACPB, Protein A), all-beta (SH3), and a mixed alpha/beta fold (Proteins G and L). Given a certain structure, a native sequence and a set of mutants, we show that our model discriminates the ability of a mutation to yield a more or less stable protein, in agreement with experimental data, capturing the principal energetic determinants of protein stabilization. Our approach identifies the interaction determinants responsible for defining a fold and shows that mutations can either modulate the strength of pair-wise coupling between residues important for folding, or modify the profile of the principal interactions. Furthermore, we address the question of how to evaluate the fitness of a sequence to a given structure by comparing the information contained in the energy map, which recapitulates the chemistry of the sequence, to that contained in the contact map, which recapitulates the fold topology. The results show that the better fit between the energetic properties of the sequence and the fold topology corresponds to a higher stabilization of the protein. We discuss the relevance of these observations to the analysis of protein designability and to the rational evolution of new sequences.

16.
The construction of fitness landscapes has broad implications for understanding molecular evolution, cellular epigenetic state, and protein structures. We studied the problem of constructing the fitness landscape of inverse protein folding, or protein design, with the aim of generating amino acid sequences that would fold into an a priori determined structural fold, which would enable engineering novel or enhanced biochemistry. For this task, an effective fitness function should allow identification of correct sequences that would fold into the desired structure. In this study, we showed that a nonlinear fitness function for protein design can be constructed using a rectangular kernel with a basis set of proteins and decoys chosen a priori. The full landscape for a large number of protein folds can be captured using only 480 native proteins and 3,200 non-protein decoys via a finite Newton method. A blind test of a simplified version of the fitness function for sequence design was carried out to discriminate simultaneously 428 native sequences not homologous to any training proteins from 11 million challenging protein-like decoys. This simplified function correctly classified 408 native sequences (20 misclassifications, 95% correct rate), which outperforms several other statistical linear scoring functions and optimized linear functions. Our results further suggested that for the task of global sequence design of 428 selected proteins, the search space of protein shape and sequence can be effectively parametrized with a carefully chosen basis set of just about 3,680 proteins and decoys, and we showed in addition that the overall landscape is not overly sensitive to the specific choice of this set. Our results can be generalized to construct other types of fitness landscape.

17.
Structural genomics initiatives aim to elucidate representative 3D structures for the majority of protein families over the next decade, but many obstacles must be overcome. The correct design of constructs is extremely important since many proteins will be too large or contain unstructured regions and will not be amenable to crystallization. It is therefore essential to identify regions in protein sequences that are likely to be suitable for structural study. Scooby-Domain is a fast and simple method to identify globular domains in protein sequences. Domains are compact units of protein structure and their correct delineation will aid structural elucidation through a divide-and-conquer approach. Scooby-Domain predictions are based on the observed lengths and hydrophobicities of domains from proteins with known tertiary structure. The prediction method employs an A*-search to identify sequence regions that form a globular structure and those that are unstructured. On a test set of 173 proteins with consensus CATH and SCOP domain definitions, Scooby-Domain has a sensitivity of 50% and an accuracy of 29%, which is better than current state-of-the-art methods. The method does not rely on homology searches and, therefore, can identify previously unknown domains.

18.
Many single-domain proteins with <100 residues fold cooperatively; but the recently designed 92-residue Top7 protein exhibits clearly non-two-state behaviors. In apparent agreement with experiment, we found that coarse-grained, native-centric chain models, including potentials with and without elementary desolvation barriers, predicted that Top7 has a stable intermediate state in which the C-terminal fragment is folded while the rest of the chain remains disordered. We observed noncooperative folding in Top7 models that incorporated nonnative hydrophobic interactions as well. In contrast, free energy profiles deduced from models with desolvation barriers for a set of thirteen natural proteins with similar chain lengths and secondary structure elements suggested that they fold much more cooperatively than Top7. Buttressed by related studies on smaller natural proteins with chain lengths of ∼40 residues, our findings argue that the de novo native topology of Top7 likely imposed a significant restriction on the cooperativity achievable by any design for this target structure.

19.
In this study, I explain the observation that a rather limited number of residues (about 10) establishes the immunoglobulin fold for the sequences of about 100 residues. Immunoglobulin fold proteins (IgF) comprise SCOP protein superfamilies with rather different functions and with less than 10% sequence identity; their alignment can be accomplished only taking into account the 3D structure. Therefore, I believe that discovering the additional common features of the sequences is necessary to explain the existence of a common fold for these SCOP superfamilies. We propose a method for analysis of pair-wise interconnections between residues of the multiple sequence alignment which helps us to reveal the set of mutually correlated positions, inherent to almost every superfamily of this protein fold. Hence, the set of constant positions (comprising the hydrophobic common core) and the set of variable but mutually correlated ones can serve as a basis of having the common 3D structure for rather distinct protein sequences.
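
The abstract does not say which statistic is used to detect mutually correlated alignment positions. As a stand-in, the sketch below uses mutual information between two alignment columns, a common choice for this kind of analysis; the toy alignment and function names are illustrative assumptions, not the author's method.

```python
import math
from collections import Counter


def column(alignment, index):
    """Extract one column from a list of equal-length aligned sequences."""
    return [sequence[index] for sequence in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), joint_count in count_ab.items():
        p_joint = joint_count / n
        p_independent = (count_a[a] / n) * (count_b[b] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


# Toy alignment: positions 0 and 2 vary together, position 1 varies independently.
alignment = ["AKL", "AEL", "GKV", "GEV"]

mi_coupled = mutual_information(column(alignment, 0), column(alignment, 2))
mi_independent = mutual_information(column(alignment, 0), column(alignment, 1))
```

In this toy case the coupled pair carries one bit of mutual information and the independent pair none. With real alignments, small sample sizes and shared ancestry inflate such scores, so a correction or significance test would be needed before interpreting them.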

20.
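Classification and recognition of doubly wound proteins   Total citations: 1, self-citations: 0, citations by others: 1
Protein fold recognition is an important part of protein structure research. The doubly wound fold is a common and structurally typical fold type among α/β proteins. Seventy-nine typical doubly wound proteins from 22 families, with sequence identity below 25%, were selected as the training set and grouped by hierarchical clustering using RMSD as the metric, and a profile hidden Markov model (profile-HMM) based on structural alignment was built for each cluster. With 9,505 samples from Astral 1.65 with sequence identity below 95% as the test set, the overall recognition sensitivity was 93.9%, the specificity 82.1%, and the MCC 0.876. The results show that, for fold types with many members for which a single unified model cannot be built, building a separate model for each cluster can achieve recognition with high accuracy.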
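
The MCC reported above is the Matthews correlation coefficient. The sketch below gives its standard definition from confusion-matrix counts; the counts are invented for illustration, and returning 0.0 when the denominator vanishes is a convention assumed here rather than something stated in the abstract.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Invented counts for illustration only.
mcc_example = matthews_corrcoef(tp=90, fp=20, tn=80, fn=10)
mcc_perfect = matthews_corrcoef(tp=50, fp=0, tn=50, fn=0)
```

Unlike accuracy, the MCC stays informative when the positive and negative classes differ greatly in size, which is typical when one fold is recognised against a large background set.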
