Similar literature
 Found 20 similar articles; search took 15 ms
1.
Protein functional sites control most biological processes and are important targets for drug design and protein engineering. To characterize them, the evolutionary trace (ET) ranks the relative importance of residues according to their evolutionary variations. Generally, top-ranked residues cluster spatially to define evolutionary hotspots that predict functional sites in structures. Here, various functions that measure the physical continuity of ET ranks among neighboring residues in the structure, or in the sequence, are shown to inform sequence selection and to improve functional site resolution. This is shown first, in 110 proteins, for which the overlap between top-ranked residues and actual functional sites rose by 8% in significance. Then, on a structural proteomic scale, optimized ET led to better 3D structure-function motifs (3D templates) and, in turn, to enzyme function prediction by the Evolutionary Trace Annotation (ETA) method with better sensitivity (40% to 53%) and positive predictive value (93% to 94%). This suggests that the similarity of evolutionary importance among neighboring residues in the sequence and in the structure is a universal feature of protein evolution. In practice, this yields a tool for optimizing sequence selections for comparative analysis and, via ET, for better predictions of functional sites and function. This should prove useful for the efficient mutational redesign of protein function and for pharmaceutical targeting.

2.
Functional sites determine the activity and interactions of proteins and as such constitute the targets of most drugs. However, the exponential growth of sequence and structure data far exceeds the ability of experimental techniques to identify their locations and key amino acids. To fill this gap we developed a computational Evolutionary Trace method that ranks the evolutionary importance of amino acids in protein sequences. Studies show that the best-ranked residues form fewer and larger structural clusters than expected by chance and overlap with functional sites, but until now the significance of this overlap has remained qualitative. Here, we use 86 diverse protein structures, including 20 determined by the structural genomics initiative, to show that this overlap is a recurrent and statistically significant feature. An automated ET correctly identifies seven of ten functional sites by the least favorable statistical measure, and nine of ten by the most favorable one. These results quantitatively demonstrate that a large fraction of functional sites in the proteome may be accurately identified from sequence and structure. This should help focus structure-function studies, rational drug design, protein engineering, and functional annotation to the relevant regions of a protein.

3.
Simulating the change of protein sequences over time in a biologically realistic way is fundamental for a broad range of studies with a focus on evolution. It is, thus, problematic that typically simulators evolve individual sites of a sequence identically and independently. More realistic simulations are possible; however, they are often prohibited by limited knowledge concerning site-specific evolutionary constraints or functional dependencies between amino acids. As a consequence, a protein's functional and structural characteristics are rapidly lost in the course of simulated evolution. Here, we present REvolver (www.cibiv.at/software/revolver), a program that simulates protein sequence alteration such that evolutionarily stable sequence characteristics, like functional domains, are maintained. For this purpose, REvolver recruits profile hidden Markov models (pHMMs) for parameterizing site-specific models of sequence evolution in an automated fashion. pHMMs derived from alignments of homologous proteins or protein domains capture information regarding which sequence sites remained conserved over time and where in a sequence insertions or deletions are more likely to occur. Thus, they describe constraints on the evolutionary process acting on these sequences. To demonstrate the performance of REvolver as well as its applicability in large-scale simulation studies, we evolved the entire human proteome up to 1.5 expected substitutions per site. Simultaneously, we analyzed the preservation of Pfam and SMART domains in the simulated sequences over time. REvolver preserved 92% of the Pfam domains originally present in the human sequences. This value drops to 15% when traditional models of amino acid sequence evolution are used. Thus, REvolver represents a significant advance toward a realistic simulation of protein sequence evolution on a proteome-wide scale. Further, REvolver facilitates the simulation of a protein family with a user-defined domain architecture at the root.

4.
5.
The Joint Evolutionary Trees (JET) method detects protein interfaces, the core residues involved in the folding process, and residues susceptible to site-directed mutagenesis and relevant to molecular recognition. The approach, based on the Evolutionary Trace (ET) method, introduces a novel way to treat evolutionary information. Families of homologous sequences are analyzed through a Gibbs-like sampling of distance trees to reduce effects of erroneous multiple alignment and impacts of weakly homologous sequences on distance tree construction. The sampling method makes sequence analysis more sensitive to functional and structural importance of individual residues by avoiding effects of the overrepresentation of highly homologous sequences and improves computational efficiency. A carefully designed clustering method is parametrized on the target structure to detect and extend patches on protein surfaces into predicted interaction sites. Clustering takes into account residues' physical-chemical properties as well as conservation. Large-scale application of JET requires the system to be adjustable for different datasets and to guarantee predictions even if the signal is low. Flexibility was achieved by a careful treatment of the number of retrieved sequences, the amino acid distance between sequences, and the selective thresholds for cluster identification. An iterative version of JET (iJET) that guarantees finding the most likely interface residues is proposed as the appropriate tool for large-scale predictions. Tests are carried out on the Huang database of 62 heterodimer, homodimer, and transient complexes and on 265 interfaces belonging to signal transduction proteins, enzymes, inhibitors, antibodies, antigens, and others. A specific set of proteins chosen for their special functional and structural properties illustrate JET behavior on a large variety of interactions covering proteins, ligands, DNA, and RNA. JET is compared at a large scale to ET and to Consurf, Rate4Site, siteFiNDER|3D, and SCORECONS on specific structures. A significant improvement in performance and computational efficiency is shown.

6.
7.
The rapid increase in the amount of protein sequence data has created a need for automated identification of sites that determine functional specificity among related subfamilies of proteins. A significant fraction of subfamily specific sites are only marginally conserved, which makes it extremely challenging to detect those amino acid changes that lead to functional diversification. To address this critical problem we developed a method named SPEER (specificity prediction using amino acids' properties, entropy and evolution rate) to distinguish specificity determining sites from others. SPEER encodes the conservation patterns of amino acid types using their physico-chemical properties and the heterogeneity of evolutionary changes between and within the subfamilies. To test the method, we compiled a test set containing 13 protein families with known specificity determining sites. Extensive benchmarking by comparing the performance of SPEER with other specificity site prediction algorithms has shown that it performs better in predicting several categories of subfamily specific sites.

8.
MOTIVATION: Evolutionary conservation estimated from a multiple sequence alignment is a powerful indicator of the functional significance of a residue and helps to predict active sites, ligand binding sites, and protein interaction interfaces. Many algorithms that calculate conservation work well, provided an accurate and balanced alignment is used. However, such a strong dependence on the alignment makes the results highly variable. We attempted to improve the conservation prediction algorithm by making it more robust and less sensitive to (1) local alignment errors, (2) overrepresentation of sequences in some branches and (3) occasional presence of unrelated sequences. RESULTS: A novel method is presented for robust constrained Bayesian estimation of evolutionary rates that avoids overfitting independent rates and satisfies the above requirements. The method is evaluated and compared with an entropy-based conservation measure on a set of 1494 protein interfaces. We demonstrated that approximately 62% of the analyzed protein interfaces are more conserved than the remaining surface at the 5% significance level. A consistent method to incorporate alignment reliability is proposed and demonstrated to reduce arbitrary variation of calculated rates upon inclusion of distantly related or unrelated sequences into the alignment.
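
For orientation, the entropy-based conservation measure that this abstract uses as its baseline can be sketched as a per-column Shannon entropy. This is only a minimal illustration under stated assumptions: the abstract does not specify the exact entropy variant, and the function names, the gap handling, and the toy alignment below are hypothetical. It is not the Bayesian rate estimation method the abstract proposes.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of one alignment column.

    Gap characters are ignored (an assumption; other treatments exist).
    Low entropy means the column is highly conserved.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # sum of p * log2(1/p) over the observed residue types
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropies(sequences):
    """Per-column entropies for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]


# Hypothetical toy alignment, for illustration only.
msa = ["MKVLA", "MKILA", "MRVLG", "MKV-A"]
entropies = alignment_entropies(msa)
```

Such a measure weights every sequence equally, which is one reason it is sensitive to overrepresented branches, the problem the abstract sets out to address.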

9.
Given the massive increase in the number of new sequences and structures, a critical problem is how to integrate these raw data into meaningful biological information. One approach, the Evolutionary Trace, or ET, uses phylogenetic information to rank the residues in a protein sequence by evolutionary importance and then maps those ranked at the top onto a representative structure. If these residues form structural clusters, they can identify functional surfaces such as those involved in molecular recognition. Now that a number of examples have shown that ET can identify binding sites and focus mutational studies on their relevant functional determinants, we ask whether the method can be improved so as to be applicable on a large scale. To address this question, we introduce a new treatment of gaps resulting from insertions and deletions, which streamlines the selection of sequences used as input. We also introduce objective statistics to assess the significance of the total number of clusters and of the size of the largest one. As a result of the novel treatment of gaps, ET performance improves measurably. We find evolutionarily privileged clusters that are significant at the 5% level in 45 out of 46 (98%) proteins drawn from a variety of structural classes and biological functions. In 37 of the 38 proteins for which a protein-ligand complex is available, the dominant cluster contacts the ligand. We conclude that spatial clustering of evolutionarily important residues is a general phenomenon, consistent with the cooperative nature of residues that determine structure and function. In practice, these results suggest that ET can be applied on a large scale to identify functional sites in a significant fraction of the structures in the protein databank (PDB). This approach to combining raw sequences and structure to obtain detailed insights into the molecular basis of function should prove valuable in the context of the Structural Genomics Initiative.
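
The two quantities this abstract tests for significance, the total number of clusters and the size of the largest one, can be computed with a simple connected-components pass. The sketch below is an assumption-laden illustration, not the authors' procedure: the 4 Å contact cutoff, the use of one representative point per residue, and the function name are all hypothetical, and the significance test itself (comparison against randomly chosen residues) is not shown.

```python
import math


def cluster_top_residues(coords, cutoff=4.0):
    """Return (number_of_clusters, size_of_largest_cluster).

    coords: one (x, y, z) point per top-ranked residue.
    Two residues are linked when their points lie within `cutoff`
    angstroms (assumed value); clusters are the connected components.
    """
    n = len(coords)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return len(sizes), max(sizes.values(), default=0)


# Hypothetical coordinates: three residues in a chain of contacts, one isolated.
example = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
n_clusters, largest = cluster_top_residues(example)
```

Fewer and larger clusters than those obtained from random residue selections would then be read as evidence of an evolutionarily privileged site.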

10.
MOTIVATION: Large-scale gene discovery projects depend on highly accurate data. The data should not only be trustworthy but should also be correctly annotated for the various features they contain. Sequence errors are inherent in single-pass sequences such as ESTs obtained from automated sequencing. These errors further complicate the automated identification of EST-related sequencing. A tool is required to prepare the data prior to advanced annotation processing and submission to public databases. RESULTS: This paper describes ESTprep, a program designed to preprocess expressed sequence tag (EST) sequences. It identifies the location of features present in ESTs and allows the sequence to pass only if it meets various quality criteria. Use of ESTprep has resulted in substantial improvement in accurate EST feature identification and fidelity of results submitted to GenBank. AVAILABILITY: The program is freely available for download from http://genome.uiowa.edu/pubsoft/software.html

11.
Binding sites in proteins can be either specifically functional binding sites (active sites) that bind specific substrates with high affinity or regulatory binding sites (allosteric sites) that modulate the activity of functional binding sites through effector molecules. Owing to their significance in determining protein function, the identification of protein functional and regulatory binding sites is widely acknowledged as an important biological problem. In this work, we present a novel binding site prediction method, Active and Regulatory site Prediction (AR-Pred), which supplements protein geometry, evolutionary, and physicochemical features with information about protein dynamics to predict putative active and allosteric site residues. As the intrinsic dynamics of globular proteins plays an essential role in controlling binding events, we find it to be an important feature for the identification of protein binding sites. We train and validate our predictive models on multiple balanced training and validation sets with random forest machine learning and obtain an ensemble of discrete models for each prediction type. Our models for active site prediction yield a median area under the curve (AUC) of 91% and Matthews correlation coefficient (MCC) of 0.68, whereas the less well-defined allosteric sites are predicted at a lower level with a median AUC of 80% and MCC of 0.48. When tested on an independent set of proteins, our models for active site prediction show comparable performance to two existing methods and gains compared to two others, while the allosteric site models show gains when tested against three existing prediction methods. AR-Pred is available as a free downloadable package at https://github.com/sambitmishra0628/AR-PRED_source.
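
As a reading aid for the MCC values quoted in this abstract, the standard definition of the Matthews correlation coefficient from a binary confusion matrix can be sketched as follows. The function name and the convention of returning 0.0 when the denominator vanishes are choices made here for illustration; the counts in the example are invented and are not results from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary classifier.

    tp, tn, fp, fn: counts of true positives, true negatives,
    false positives and false negatives. Ranges from -1 to +1.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention assumed here: undefined cases are reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Invented counts, for illustration only.
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
```

Unlike accuracy, this score stays informative when site residues are far rarer than non-site residues, which is the usual situation in binding site prediction.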

12.
A major problem in genome annotation is whether it is valid to transfer the function from a characterised protein to a homologue of unknown activity. Here, we show that one can employ a strategy that uses a structure-based prediction of protein functional sites to assess the reliability of functional inheritance. We have automated and benchmarked a method based on the evolutionary trace approach. Using a multiple sequence alignment, we identified invariant polar residues, which were then mapped onto the protein structure. Spatial clusters of these invariant residues formed the predicted functional site. For 68 of 86 proteins examined, the method yielded information about the observed functional site. This algorithm for functional site prediction was then used to assess the validity of transferring the function between homologues. This procedure was tested on 18 pairs of homologous proteins with unrelated function and 70 pairs of proteins with related function, and was shown to be 94% accurate. This automated method could be linked to schemes for genome annotation. Finally, we examined the use of functional site prediction in protein-protein and protein-DNA docking. The use of predicted functional sites was shown to filter putative docked complexes with a discrimination similar to that obtained by manually including biological information about active sites or DNA-binding residues.

13.
Study of structure/function relationships constitutes an important field of research, especially for modification of protein function and drug design. However, the fact that rational design (i.e. the modification of amino acid sequences by means of directed mutagenesis, based on knowledge of the three-dimensional structure) appears to be much less efficient than irrational design (i.e. random mutagenesis followed by in vitro selection) clearly indicates that we understand little about the relationships between primary sequence, three-dimensional structure and function. The use of evolutionary approaches and concepts will bring insights to this difficult question. The increasing availability of multigene family sequences that has resulted from genome projects has inspired the creation of novel in silico evolutionary methods to predict details of protein function in duplicated (paralogous) proteins. The underlying principle of all such approaches is to compare the evolutionary properties of homologous sequence positions in paralogs. It has been proposed that the positions that show switches in substitution rate over time (i.e., 'heterotachous sites') are good indicators of functional divergence. However, it appears that heterotachy is a much more general process, since most variable sites of homologous proteins with no evidence of functional shift are heterotachous. Similarly, it appears that switches in substitution rate are as frequent when paralogous sequences are compared as when orthologous sequences are compared. Heterotachy, instead of being indicative of functional shift, may more generally reflect a less specific process related to the many intra- and inter-molecular interactions compatible with a range of more or less equally viable protein conformations. These interactions will lead to different constraints on the nature of the primary sequences, consistent with theories suggesting the non-independence of substitutions in proteins. However, a specific type of amino acid variation might constitute a good indicator of functional divergence: substitutions occurring at positions that are generally slowly evolving. Such substitutions at constrained sites are indeed much more frequent soon after gene duplication. The identification and analysis of these sites by complementing structural information with evolutionary data may represent a promising direction for future studies dealing with the functional characterization of an ever increasing number of multi-gene families identified by complete genome analysis.

14.
Function prediction frequently relies on comparing genes or gene products to search for relevant similarities. Because the number of protein structures with unknown function is mushrooming, however, we asked here whether such comparisons could be improved by focusing narrowly on the key functional features of protein structures, as defined by the Evolutionary Trace (ET). Therefore a series of algorithms was built to (a) extract local motifs (3D templates) from protein structures based on ET ranking of residue importance; (b) assess their geometric and evolutionary similarity to other structures; and (c) transfer enzyme annotation whenever a plurality was reached across matches. Whereas a prototype had only been 80% accurate and was not scalable, here a speedy new matching algorithm enabled large-scale searches for reciprocal matches and thus raised annotation specificity to 100% in both positive and negative controls of 49 enzymes and 50 non-enzymes, respectively (in one case even identifying an annotation error), while maintaining sensitivity (approximately 60%). Critically, this Evolutionary Trace Annotation (ETA) pipeline requires no prior knowledge of functional mechanisms. It could thus be applied in a large-scale retrospective study of 1218 structural genomics enzymes and reached 92% accuracy. Likewise, it was applied to all 2935 unannotated structural genomics proteins and predicted enzymatic functions in 320 cases: 258 on first pass and 62 more on second pass. Controls and initial analyses suggest that these predictions are reliable. Thus the large-scale evolutionary integration of sequence-structure-function data, here through reciprocal identification of local, functionally important structural features, may contribute significantly to de-orphaning the structural proteome.

15.
Measuring the accuracy of protein three-dimensional structures is one of the most important problems in protein structure prediction. For structure-based drug design, the accuracy of the binding site is far more important than the accuracy of any other region of the protein. We have developed an automated method for assessing the quality of a protein model by focusing on the set of residues in the small molecule binding site. Small molecule binding sites typically involve multiple regions of the protein coming together in space, and their accuracy has been observed to be sensitive to even small alignment errors. In addition, ligand binding sites contain the critical information required for drug design, making their accuracy particularly important. We analyzed the accuracy of the binding sites on two sets of protein models: the predictions submitted by the top-performing CASP7 groups, and the models generated by four widely used homology modeling packages. The results of our CASP7 analysis significantly differ from the previous findings, implying that the binding site measure does not correlate with the traditional model quality measures used in the structure prediction benchmarks. For the modeling programs, the resolution of binding sites is extremely sensitive to the degree of sequence homology between the query and the template, even when the most accurate alignments are used in the homology modeling process.

16.
Evaluation measures of multiple sequence alignments.   Total citations: 1 (self-citations: 0, citations by others: 1)
Multiple sequence alignments (MSAs) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of the structure, functionality and, ultimately, the evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of an MSA. It is based on the use of a solution to the Traveling Salesman Problem, which identifies a circular tour through an evolutionary tree connecting the sequences in a protein family. With this approach, the calculation of an evolutionary tree and the errors that it would introduce can be avoided altogether. The algorithm gives an upper bound, the best score that can possibly be achieved by any MSA for a given set of protein sequences. Alternatively, if presented with a specific MSA, the algorithm provides a formal score for the MSA, which serves as an absolute measure of the quality of the MSA. The CS measure yields a direct connection between an MSA and the associated evolutionary tree. The measure can be used as a tool for evaluating different methods for producing MSAs. A brief example of the last application is provided. Because it weights all evolutionary events on a tree identically, but does not require the reconstruction of a tree, the CS algorithm has advantages over the frequently used sum-of-pairs measures for scoring MSAs, which weight some evolutionary events more strongly than others. Compared to other weighted sum-of-pairs measures, it has the advantage that no evolutionary tree must be constructed, because we can find a circular tour without knowing the tree.

17.
We describe a novel approach for inferring functional relationships of proteins by detecting sequence and spatial patterns of protein surfaces. Well-formed concave surface regions in the form of pockets and voids are examined to identify similarity relationships that might be directly related to protein function. We first exhaustively identify and measure analytically all 910,379 surface pockets and interior voids on 12,177 protein structures from the Protein Data Bank. The similarity of patterns of residues forming pockets and voids is then assessed in sequence, in spatial arrangement, and in orientational arrangement. Statistical significance in the form of E and p-values is then estimated for each of the three types of similarity measurements. Our method is fully automated without human intervention and can be used without input of query patterns. It does not assume any prior knowledge of functional residues of a protein, and can detect similarity based on surface patterns small and large. It also tolerates, to some extent, conformational flexibility of functional sites. We show with examples that this method can detect functional relationships with specificity for members of the same protein family and superfamily, as well as remotely related functional surfaces from proteins of different fold structures. We envision that this method can be used for discovering novel functional relationships of protein surfaces, for functional annotation of protein structures with unknown biological roles, and for further inquiries on evolutionary origins of structural elements important for protein function.

18.
To improve the utility of increasingly large numbers of available unannotated and initially poorly annotated genomic sequences for proteome analysis, we demonstrate that effective protein identification can be made on a large and unannotated genome. The strategy developed is to translate the unannotated genome sequence into amino acid sequence encoding putative proteins in all six reading frames, to identify peptides by tandem mass spectrometry (MS/MS), to localize them on the genome sequence, and to preliminarily annotate the protein via a similarity search by BLAST. These tasks have been optimized and automated. Optimization to obtain multiple peptide matches in effect extends the searchable region and results in more robust protein identification. The viability of this strategy is demonstrated with the identification of 223 cilia proteins in the unicellular eukaryotic model organism Tetrahymena thermophila, whose initial genomic sequence draft was released in November 2003. To the best of our knowledge, this is the first demonstration of large-scale protein identification based on such a large, unannotated genome. Of the 223 cilia proteins, 84 have no similarity to proteins in NCBI's nonredundant (nr) database. This methodology allows identifying the locations of the genes encoding these novel proteins, which is a necessary first step to downstream functional genomic experimentation.
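
The first step of the strategy in this abstract, translating a nucleotide sequence in all six reading frames, can be sketched as below. This is a minimal illustration and not the authors' pipeline: the function names and frame labels are hypothetical, and the standard genetic code is assumed throughout. Ciliates such as Tetrahymena are generally reported to use a variant genetic code, so a real analysis would substitute the appropriate codon table.

```python
from itertools import product

# Standard genetic code (an assumption; see the note above), with codons
# enumerated in TCAG order so they line up with the amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return a dict mapping frame labels ('+1'..'+3', '-1'..'-3') to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translate("ATGGCCTAA")
```

In a proteomics workflow of the kind described, the translated frames would be split at stop codons and the resulting putative peptides used as the search database for the MS/MS spectra.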

19.
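Eukaryotic translation elongation factor 1A (eEF1A) is a multifunctional protein that, during protein translation in eukaryotes, delivers aminoacyl-tRNA to the ribosomal A site for the polypeptide elongation reaction. In this work, we used several bioinformatics tools to identify the protein sequence of translation elongation factor 1A from the planarian Schmidtea mediterranea (SmEF1A) and to search for eEF1A orthologs. Based on 90 orthologous proteins, we then carried out an evolutionary trace analysis of the eEF1A protein family and a comparative study of the functional sites of SmEF1A. The results show that 338 trace residue positions and 20 trace-residue-enriched regions were identified in the eEF1A family, and that the functional sites of SmEF1A are closely associated with the trace residue positions. Important sites involved in GTP/Mg2+ binding, such as S21, T72, D91 and G94, are all trace residues conserved across the whole family. Among protein modification sites such as N-glycosylation and phosphorylation sites, trace residue positions are often the modified positions themselves or key auxiliary positions for the modification function. The ligand-binding pockets on the molecular surface are closely related to the trace residue clusters that the 20 trace-residue-enriched regions form on the molecular surface. Evolutionary trace analysis of the eEF1A protein family thus provides important information for determining key residues in functionally important regions of eEF1A and for predicting functional sites that are still unknown.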

20.