首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Methods for predicting protein function from structure are becoming more important as the rate at which structures are solved increases more rapidly than experimental knowledge. As a result, protein structures now frequently lack functional annotations. The majority of methods for predicting protein function are reliant upon identifying a similar protein and transferring its annotations to the query protein. This method fails when a similar protein cannot be identified, or when any similar proteins identified also lack reliable annotations. Here, we describe a method that can assign function from structure without the use of algorithms reliant upon alignments. Using simple attributes that can be calculated from any crystal structure, such as secondary structure content, amino acid propensities, surface properties and ligands, we describe each enzyme in a non-redundant set. The set is split according to Enzyme Classification (EC) number. We combine the predictions of one-class versus one-class support vector machine models to make overall assignments of EC number to an accuracy of 35% with the top-ranked prediction, rising to 60% accuracy with the top two ranks. In doing so we demonstrate the utility of simple structural attributes in protein function prediction and shed light on the link between structure and function. We apply our methods to predict the function of every currently unclassified protein in the Protein Data Bank.  相似文献   

2.
There are two main reasons to try to predict an enzyme's function from its sequence. The first is to identify the components and thus the functional capabilities of an organism, the second is to create enzymes with specific properties. Genomics, expression analysis, proteomics and metabonomics are largely directed towards understanding how information flows from DNA sequence to protein functions within an organism. This review focuses on information flow in the opposite direction: the applicability of what is being learned from natural enzymes to improve methods for catalyst design.  相似文献   

3.
Short-range and long-range contacts are important in forming protein structure. The proteins can be grouped into four different structural classes according to the content and topology of alpha-helices and beta-strands, and there are all-alpha, all-beta, alpha/beta and alpha+beta proteins. However, there is much difference in statistical property for those classes of proteins. In this paper, we will discuss protein structure in the view of the relative number of long-range (short-range) contacts for each residue. We find the percentage of residues having a large number of long-range contacts in protein is small in all-alpha class of proteins, and large in all-beta class of proteins. However, the percentage of residues is almost the same in alpha/beta and alpha+beta classes of proteins. We calculate the percentage of residues having the number of long-range contacts greater than or equal to (>/=) N(L)=5, and 7 for 428 proteins. The average percentage is 13.3%, 54.8%, 41.4% and 37.0% for all-alpha, all-beta, alpha/beta and alpha+beta classes of proteins with N(L)=5, respectively. With N(L) increasing, the percentage decreases, especially for all-alpha class of proteins. In the meantime, the percentage of residues having the number of short-range contacts greater than or equal to N(S) (>/=N(S)) in protein samples is large for all-alpha class of proteins, and small for all-beta class of proteins, especially for large N(S). We also investigate the ability of amino residues in forming a large number of long-range and short-range contacts. Cys, Val, Ile, Tyr, Trp and Phe can form a large number of long-range contacts easily, and Glu, Lys, Asp, Gln, Arg and Asn can form a large number of long-range contacts, but with difficulty. We also discuss the relative ability in forming short-range contacts for 20 amino residues. Comparison with Fauchere-Pliska hydrophobicity scale and the percentage of residues having large number of long-range contacts is also made. This investigation can provide some insights into the protein structure.  相似文献   

4.
Protein interactions play a vital part in the function of a cell. As experimental techniques for detection and validation of protein interactions are time consuming, there is a need for computational methods for this task. Protein interactions appear to form a network with a relatively high degree of local clustering. In this paper we exploit this clustering by suggesting a score based on triplets of observed protein interactions. The score utilises both protein characteristics and network properties. Our score based on triplets is shown to complement existing techniques for predicting protein interactions, outperforming them on data sets which display a high degree of clustering. The predicted interactions score highly against test measures for accuracy. Compared to a similar score derived from pairwise interactions only, the triplet score displays higher sensitivity and specificity. By looking at specific examples, we show how an experimental set of interactions can be enriched and validated. As part of this work we also examine the effect of different prior databases upon the accuracy of prediction and find that the interactions from the same kingdom give better results than from across kingdoms, suggesting that there may be fundamental differences between the networks. These results all emphasize that network structure is important and helps in the accurate prediction of protein interactions. The protein interaction data set and the program used in our analysis, and a list of predictions and validations, are available at http://www.stats.ox.ac.uk/bioinfo/resources/PredictingInteractions.  相似文献   

5.
While the number of sequenced genomes continues to grow, experimentally verified functional annotation of whole genomes remains patchy. Structural genomics projects are yielding many protein structures that have unknown function. Nevertheless, subsequent experimental investigation is costly and time-consuming, which makes computational methods for predicting protein function very attractive. There is an increasing number of noteworthy methods for predicting protein function from sequence and structural data alone, many of which are readily available to cell biologists who are aware of the strengths and pitfalls of each available technique.  相似文献   

6.
MOTIVATION: We introduce a new approach to using the information contained in sequence-to-function prediction data in order to recognize protein template classes, a critical step in predicting protein structure. The data on which our method is based comprise probabilities of functional categories; for given query sequences these probabilities are obtained by a neural net that has previously been trained on a variety of functionally important features. On a training set of sequences we assess the relevance of individual functional categories for identifying a given structural family. Using a combination of the most relevant categories, the likelihood of a query sequence to belong to a specific family can be estimated. RESULTS: The performance of the method is evaluated using cross-validation. For a fixed structural family and for every sequence, a score is calculated that measures the evidence for family membership. Even for structural families of small size, family members receive significantly higher scores. For some examples, we show that the relevant functional features identified by this method are biologically meaningful. The proposed approach can be used to improve existing sequence-to-structure prediction methods. AVAILABILITY: Matlab code is available on request from the authors. The data are available at http://www.mpisb.mpg.de/~sommer/Fun2Struc/  相似文献   

7.
The current state of the art in modeling protein structure has been assessed, based on the results of the CASP (Critical Assessment of protein Structure Prediction) experiments. In comparative modeling, improvements have been made in sequence alignment, sidechain orientation and loop building. Refinement of the models remains a serious challenge. Improved sequence profile methods have had a large impact in fold recognition. Although there has been some progress in alignment quality, this factor still limits model usefulness. In ab initio structure prediction, there has been notable progress in building approximately correct structures of 40-60 residue-long protein fragments. There is still a long way to go before the general ab initio prediction problem is solved. Overall, the field is maturing into a practical technology, able to deliver useful models for a large number of sequences.  相似文献   

8.
9.
MOTIVATION: Disulfide bonds are primary covalent crosslinks between two cysteine residues in proteins that play critical roles in stabilizing the protein structures and are commonly found in extracy-toplasmatic or secreted proteins. In protein folding prediction, the localization of disulfide bonds can greatly reduce the search in conformational space. Therefore, there is a great need to develop computational methods capable of accurately predicting disulfide connectivity patterns in proteins that could have potentially important applications. RESULTS: We have developed a novel method to predict disulfide connectivity patterns from protein primary sequence, using a support vector regression (SVR) approach based on multiple sequence feature vectors and predicted secondary structure by the PSIPRED program. The results indicate that our method could achieve a prediction accuracy of 74.4% and 77.9%, respectively, when averaged on proteins with two to five disulfide bridges using 4-fold cross-validation, measured on the protein and cysteine pair on a well-defined non-homologous dataset. We assessed the effects of different sequence encoding schemes on the prediction performance of disulfide connectivity. It has been shown that the sequence encoding scheme based on multiple sequence feature vectors coupled with predicted secondary structure can significantly improve the prediction accuracy, thus enabling our method to outperform most of other currently available predictors. Our work provides a complementary approach to the current algorithms that should be useful in computationally assigning disulfide connectivity patterns and helps in the annotation of protein sequences generated by large-scale whole-genome projects. AVAILABILITY: The prediction web server and Supplementary Material are accessible at http://foo.maths.uq.edu.au/~huber/disulfide  相似文献   

10.

Background  

The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text based method for protein classification is presented, which complements the existing sequence and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects biological function similarities and can be exploited to make classification decisions.  相似文献   

11.
A comparison of neural network methods and Bayesian statistical methods is presented for prediction of the secondary structure of proteins given their primary sequence. The Bayesian method makes the unphysical assumption that the probability of an amino acid occurring in each position in the protein is independent of the amino acids occurring elsewhere. However, we find the predictive accuracy of the Bayesian method to be only minimally less than the accuracy of the most sophisticated methods used to date. We present the relationship of neural network methods to Bayesian statistical methods and show that, in principle, neural methods offer considerable power, although apparently they are not particularly useful for this problem. In the process, we derive a neural formalism in which the output neurons directly represent the conditional probabilities of structure class. The probabilistic formalism allows introduction of a new objective function, the mutual information, which translates the notion of correlation as a measure of predictive accuracy into a useful training measure. Although a similar accuracy to other approaches (utilizing a mean-square error) is achieved using this new measure, the accuracy on the training set is significantly and tantalizingly higher, even though the number of adjustable parameters remains the same. The mutual information measure predicts a greater fraction of helix and sheet structures correctly than the mean-square error measure, at the expense of coil accuracy, precisely as it was designed to do. By combining the two objective functions, we obtain a marginally improved accuracy of 64.4%, with Matthews coefficients C alpha, C beta and Ccoil of 0.40, 0.32 and 0.42, respectively. However, since all methods to date perform only slightly better than the Bayes algorithm, which entails the drastic assumption of independence of amino acids, one is forced to conclude that little progress has been made on this problem, despite the application of a variety of sophisticated algorithms such as neural networks, and that further advances will require a better understanding of the relevant biophysics.  相似文献   

12.
Predicting protein structural class with AdaBoost Learner   总被引:1,自引:0,他引:1  
The structural class is an important feature in characterizing the overall topological folding type of a protein or the domains therein. Prediction of protein structural classification has attracted the attention and efforts from many investigators. In this paper a novel predictor, the AdaBoost Learner, was introduced to deal with this problem. The essence of the AdaBoost Learner is that a combination of many 'weak' learning algorithms, each performing just slightly better than a random guessing algorithm, will generate a 'strong' learning algorithm. Demonstration thru jackknife cross-validation on two working datasets constructed by previous investigators indicated that AdaBoost outperformed other predictors such as SVM (support vector machine), a powerful algorithm widely used in biological literatures. It has not escaped our notice that AdaBoost may hold a high potential for improving the quality in predicting the other protein features as well, such as subcellular location and receptor type, among many others. Or at the very least, it will play a complementary role to many of the existing algorithms in this regard.  相似文献   

13.
Predicting protein functions computationally from massive protein-protein interaction (PPI) data generated by high-throughput technology is one of the challenges and fundamental problems in the post-genomic era. Although there have been many approaches developed for computationally predicting protein functions, the mutual correlations among proteins in terms of protein functions have not been thoroughly investigated and incorporated into existing prediction methods, especially in voting based prediction methods. In this paper, we propose an innovative method to predict protein functions from PPI data by aggregating the functional correlations among relevant proteins using the Choquet-Integral in fuzzy theory. This functional aggregation measures the real impact of each relevant protein function on the final prediction results, and reduces the impact of repeated functional information on the prediction. Accordingly, a new protein similarity and a new iterative prediction algorithm are proposed in this paper. The experimental evaluations on real PPI datasets demonstrate the effectiveness of our method.  相似文献   

14.

Background

As one of the most common protein post-translational modifications, glycosylation is involved in a variety of important biological processes. Computational identification of glycosylation sites in protein sequences becomes increasingly important in the post-genomic era. A new encoding scheme was employed to improve the prediction of mucin-type O-glycosylation sites in mammalian proteins.

Results

A new protein bioinformatics tool, CKSAAP_OGlySite, was developed to predict mucin-type O-glycosylation serine/threonine (S/T) sites in mammalian proteins. Using the composition of k-spaced amino acid pairs (CKSAAP) based encoding scheme, the proposed method was trained and tested in a new and stringent O-glycosylation dataset with the assistance of Support Vector Machine (SVM). When the ratio of O-glycosylation to non-glycosylation sites in training datasets was set as 1:1, 10-fold cross-validation tests showed that the proposed method yielded a high accuracy of 83.1% and 81.4% in predicting O-glycosylated S and T sites, respectively. Based on the same datasets, CKSAAP_OGlySite resulted in a higher accuracy than the conventional binary encoding based method (about +5.0%). When trained and tested in 1:5 datasets, the CKSAAP encoding showed a more significant improvement than the binary encoding. We also merged the training datasets of S and T sites and integrated the prediction of S and T sites into one single predictor (i.e. S+T predictor). Either in 1:1 or 1:5 datasets, the performance of this S+T predictor was always slightly better than those predictors where S and T sites were independently predicted, suggesting that the molecular recognition of O-glycosylated S/T sites seems to be similar and the increase of the S+T predictor's accuracy may be a result of expanded training datasets. Moreover, CKSAAP_OGlySite was also shown to have better performance when benchmarked against two existing predictors.

Conclusion

Because of CKSAAP encoding's ability of reflecting characteristics of the sequences surrounding mucin-type O-glycosylation sites, CKSAAP_ OGlySite has been proved more powerful than the conventional binary encoding based method. This suggests that it can be used as a competitive mucin-type O-glycosylation site predictor to the biological community. CKSAAP_OGlySite is now available at http://bioinformatics.cau.edu.cn/zzd_lab/CKSAAP_OGlySite/.  相似文献   

15.
A measure of protein structure similarity is calculated from the matching of pairs of secondary structure elements between two proteins. The interaction of each pair was estimated from their axial line segments and combined with other geometric features to produce an optimal discrimination between intrafamily and interfamily relationships. The matching used a fast bipartite graph-matching algorithm that avoids the computational complexity of searching for the full subgraph isomorphism between the two sets of interactions. The main algorithm used was the "stable marriage" algorithm, which works on the ranked "preferences" of one interaction for another. The method takes 1/10 of a second for a typical comparison making it suitable as a fast pre-filter for slower, more exhaustive approaches. An application to protein structure classification is described.  相似文献   

16.
17.
Structural class characterizes the overall folding type of a protein or its domain and the prediction of protein structural class has become both an important and a challenging topic in protein science. Moreover, the prediction itself can stimulate the development of novel predictors that may be straightforwardly applied to many other relational areas. In this paper, 10 frequently used sequence-derived structural and physicochemical features, which can be easily computed by the PROFEAT (Protein Features) web server, were taken as inputs of support vector machines to develop statistical learning models for predicting the protein structural class. More importantly, a strategy of merging different features, called best-first search, was developed. It was shown through the rigorous jackknife cross-validation test that the success rates by our method were significantly improved. We anticipate that the present method may also have important impacts on boosting the predictive accuracies for a series of other protein attributes, such as subcellular localization, membrane types, enzyme family and subfamily classes, among many others.  相似文献   

18.
19.
The functional domain composition is introduced to predict the structural class of a protein or domain according to the following classification: all-alpha, all-beta, alpha/beta, alpha+beta, micro (multi-domain), sigma (small protein), and rho (peptide). The advantage by doing so is that both the sequence-order-related features and the function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset that consists of proteins and domains with only less than 20% sequence identity to each other in order to get rid of any homologous bias. The overall success rate thus obtained was 98%. In contrast to this, the corresponding rates obtained by the simple geometry approaches based on the amino acid composition were only 36-39%. This indicates that using the functional domain composition to represent the sample of a protein for statistical prediction is very promising, and that the functional type of a domain is closely correlated with its structural class.  相似文献   

20.
Bayesian segmentation of protein secondary structure.   总被引:12,自引:0,他引:12  
We present a novel method for predicting the secondary structure of a protein from its amino acid sequence. Most existing methods predict each position in turn based on a local window of residues, sliding this window along the length of the sequence. In contrast, we develop a probabilistic model of protein sequence/structure relationships in terms of structural segments, and formulate secondary structure prediction as a general Bayesian inference problem. A distinctive feature of our approach is the ability to develop explicit probabilistic models for alpha-helices, beta-strands, and other classes of secondary structure, incorporating experimentally and empirically observed aspects of protein structure such as helical capping signals, side chain correlations, and segment length distributions. Our model is Markovian in the segments, permitting efficient exact calculation of the posterior probability distribution over all possible segmentations of the sequence using dynamic programming. The optimal segmentation is computed and compared to a predictor based on marginal posterior modes, and the latter is shown to provide significant improvement in predictive accuracy. The marginalization procedure provides exact secondary structure probabilities at each sequence position, which are shown to be reliable estimates of prediction uncertainty. We apply this model to a database of 452 nonhomologous structures, achieving accuracies as high as the best currently available methods. We conclude by discussing an extension of this framework to model nonlocal interactions in protein structures, providing a possible direction for future improvements in secondary structure prediction accuracy.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号