Similar literature
 20 similar documents found; search time 61 ms
1.
MOTIVATION: Prediction of catalytic residues provides useful information for research on enzyme function. Most of the existing prediction methods are based on structural information, which limits their use. We propose a sequence-based catalytic residue predictor that provides predictions with quality comparable to modern structure-based methods and that exceeds the quality of state-of-the-art sequence-based methods. RESULTS: Our method (CRpred) uses sequence-based features and the sequence-derived PSI-BLAST profile. We used feature selection to reduce the dimensionality of the input (and explain the input) to the support vector machine (SVM) classifier that provides predictions. Tests on eight datasets and side-by-side comparison with six modern structure- and sequence-based predictors show that CRpred provides predictions with quality comparable to current structure-based methods and better than sequence-based methods. The proposed method obtains 15-19% precision and 48-58% TP (true positive) rate, depending on the dataset used. CRpred also provides confidence values that allow selecting a subset of predictions with higher precision. The improved quality is due to newly designed features and careful parameterization of the SVM. The features incorporate amino acids characterized by the highest and the lowest propensities to constitute catalytic residues, Gly that provides flexibility for catalytic sites and sequence motifs characteristic of certain catalytic reactions. Our features indicate that catalytic residues are on average more conserved when compared with the general population of residues and that highly conserved amino acids characterized by high catalytic propensity are likely to form catalytic sites. We also show that local (with respect to the sequence) hydrophobicity contributes towards the prediction.

2.
ABSTRACT: BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. RESULTS: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naive Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. 
Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. CONCLUSIONS: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. 
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs, and of including evaluations on independent test sets in the comparisons.
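The performance measures named in this abstract (Sensitivity, Specificity, MCC) can be illustrated with a minimal sketch. It uses the common textbook definitions computed from confusion-matrix counts; the function name `confusion_metrics` and the example counts are illustrative assumptions, and the paper itself may define Specificity differently (some protein-RNA interface studies use the term for what is elsewhere called precision).

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts.

    Textbook definitions (an assumption; the paper may use others):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical, imbalanced residue-level counts (not taken from the paper).
sens, spec, mcc = confusion_metrics(tp=50, fp=50, tn=850, fn=50)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

Because MCC uses all four counts, it is less easily inflated by the large majority of non-interface residues than accuracy is, which is one reason it is commonly reported alongside the other two measures.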

3.
Zhao N, Pang B, Shyu CR, Korkin D. Proteomics. 2011;11(22):4321-4330
Structural knowledge about protein-protein interactions can provide insights into the basic processes underlying cell function. Recent progress in experimental and computational structural biology has led to a rapid growth of experimentally resolved structures and computationally determined near-native models of protein-protein interactions. However, determining whether a protein-protein interaction is physiological or an artifact of an experimental or computational method remains a challenging problem. In this work, we have addressed two related problems. The first problem is distinguishing between experimentally obtained physiological and crystal-packing protein-protein interactions. The second problem is concerned with the classification of near-native and inaccurate docking models. We first defined a universal set of interface features and employed a support vector machine (SVM)-based approach to classify the interactions for both problems, with the accuracy, precision, and recall for the first problem classifier reaching 93%. To improve the classification, we next developed a semi-supervised learning approach for the second problem, using transductive SVM (TSVM). We applied both classifiers to a commonly used protein docking benchmark of 124 complexes. We found that while we reached classification accuracies of 78.9% for the SVM classifier and 80.3% for the TSVM classifier, improving protein-docking methods by model re-ranking remains a challenging problem.

4.

Background  

Prediction of catalytic residues is a major step in characterizing the function of enzymes. In its simpler formulation, the problem can be cast into a binary classification task at the residue level, by predicting whether the residue is directly involved in the catalytic process. The task is quite hard even when structural information is available, due to the rather wide range of roles a functional residue can play and to the large imbalance between the number of catalytic and non-catalytic residues.

5.
Predicting the short protein fragments capable of forming amyloid-like fibril motifs is important for understanding the cause of amyloid illnesses and aids the discovery of sequence-targeted anti-aggregation drugs. Owing to the limitations of molecular techniques for identifying such fragments, computational tools that provide affordable in silico predictions are highly desirable. In this research article, we study, from a machine learning perspective, the performance of several machine learning classifiers that use heterogeneous features based on biochemical and biophysical properties of amino acids to discriminate between amyloidogenic and non-amyloidogenic regions in peptides. Four conventional machine learning classifiers, namely Support Vector Machine, Neural network, Decision tree and Random forest, were trained and tested to find the classifier that best fits the problem domain. Prior to classification, novel implementations of two biologically inspired feature optimization techniques, based on evolutionary algorithms and on methodologies that mimic social life, and a multivariate method based on projection are used to remove unimportant and uninformative features. Among the dimensionality reduction algorithms considered in the study, prediction results show that algorithms based on evolutionary computation are the most effective. Among the classifiers considered, SVM best fits the problem domain. The best classifier is also compared with an online predictor to demonstrate the balance maintained between true positive rates and false positive rates in the proposed classifier. This exploratory study suggests that these methods are promising for amyloidogenicity prediction and may be further extended to large-scale proteomic studies.

6.
SUMMARY: Several papers have been published where nonlinear machine learning algorithms, e.g. artificial neural networks, support vector machines and decision trees, have been used to model the specificity of the HIV-1 protease and extract specificity rules. We show that the dataset used in these studies is linearly separable and that it is a misuse of nonlinear classifiers to apply them to this problem. The best solution on this dataset is achieved using a linear classifier like the simple perceptron or the linear support vector machine, and it is straightforward to extract rules from these linear models. We identify key residues in peptides that are efficiently cleaved by the HIV-1 protease and list the most prominent rules, relating them to experimental results for the HIV-1 protease. MOTIVATION: Understanding HIV-1 protease specificity is important when designing HIV inhibitors and several different machine learning algorithms have been applied to the problem. However, little progress has been made in understanding the specificity because nonlinear and overly complex models have been used. RESULTS: We show that the problem is much easier than what has previously been reported and that linear classifiers like the simple perceptron or linear support vector machines are at least as good predictors as nonlinear algorithms. We also show how sets of specificity rules can be generated from the resulting linear classifiers. AVAILABILITY: The datasets used are available at http://www.hh.se/staff/bioinf/
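The link between linear separability and the simple perceptron can be illustrated with a small sketch. This is not the authors' code or data: the function name `train_perceptron` and the two toy datasets are assumptions chosen only to show that the perceptron reaches an error-free pass exactly when a separating hyperplane exists.

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Classic perceptron rule. Labels must be +1 or -1.

    Returns (weights, bias, converged). `converged` is True only if a
    full pass over the data produced no misclassification, which can
    happen only when the data are linearly separable.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified or on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                errors += 1
        if errors == 0:
            return weights, bias, True
    return weights, bias, False


points = [[0, 0], [0, 1], [1, 0], [1, 1]]
# AND-like labels: linearly separable, so the perceptron converges.
w_and, b_and, ok_and = train_perceptron(points, [-1, -1, -1, 1])
# XOR-like labels: not linearly separable, so it never converges.
w_xor, b_xor, ok_xor = train_perceptron(points, [-1, 1, 1, -1])
print(ok_and, ok_xor)
```

In a converged linear model each weight is tied to a single input feature, so reading off the largest positive and negative weights is one simple way to obtain human-readable rules of the kind the abstract mentions.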

7.
Gaucher disease, the most common lysosomal storage disease, is caused by mutations in the gene that encodes acid-β-glucosidase (GlcCerase). Type 1 is characterized by hepatosplenomegaly, and types 2 and 3 by early or chronic onset of severe neurological symptoms. No clear correlation exists between the ~200 GlcCerase mutations and disease severity, although homozygosity for the common mutations N370S and L444P is associated with non-neuronopathic and neuronopathic disease, respectively. We report the X-ray structure of GlcCerase at 2.0 Å resolution. The catalytic domain consists of a (β/α)8 TIM barrel, as expected for a member of the glucosidase hydrolase A clan. The distance between the catalytic residues E235 and E340 is consistent with a catalytic mechanism of retention. N370 is located on the longest α-helix (helix 7), which has several other mutations of residues that point into the TIM barrel. Helix 7 is at the interface between the TIM barrel and a separate immunoglobulin-like domain on which L444 is located, suggesting an important regulatory or structural role for this non-catalytic domain. The structure provides the possibility of engineering improved GlcCerase for enzyme-replacement therapy, and for designing structure-based drugs aimed at restoring the activity of defective GlcCerase.

8.
Structural genomics projects are determining the three-dimensional structure of proteins without full characterization of their function. A critical part of the annotation process involves appropriate knowledge representation and prediction of functionally important residue environments. We have developed a method to extract features from sequence, sequence alignments, three-dimensional structure, and structural environment conservation, and used support vector machines to annotate homologous and nonhomologous residue positions based on a specific training set of residue functions. In order to evaluate this pipeline for automated protein annotation, we applied it to the challenging problem of prediction of catalytic residues in enzymes. We also ranked the features based on their ability to discriminate catalytic from noncatalytic residues. When applying our method to a well-annotated set of protein structures, we found that top-ranked features were a measure of sequence conservation, a measure of structural conservation, a degree of uniqueness of a residue's structural environment, solvent accessibility, and residue hydrophobicity. We also found that features based on structural conservation were complementary to those based on sequence conservation and that they were capable of increasing predictor performance. Using a family nonredundant version of the ASTRAL 40 v1.65 data set, we estimated that the true catalytic residues were correctly predicted in 57.0% of the cases, with a precision of 18.5%. When testing on proteins containing novel folds not used in training, the best features were highly correlated with the training on families, thus validating the approach to nonhomologous catalytic residue prediction in general. We then applied the method to 2781 coordinate files from the structural genomics target pipeline and identified both highly ranked and highly clustered groups of predicted catalytic residues.

9.
Structural information can help engineer enzymes. Usually, specific amino acids in particular regions are targeted for functional reconstruction to enhance the catalytic performance, including activity, stereoselectivity, and thermostability. Appropriate selection of target sites is the key to structure-based design, which requires elucidation of the structure–function relationships. Here, we summarize how mutations of residues in different specific regions, including the active center, access tunnels, and flexible loops, fine-tune the catalytic performance of enzymes, and discuss the effects of altering the local structural environment on function. In addition, we review recent progress in structure-based approaches for enzyme engineering, aiming to provide some guidance on how to take advantage of structural information.

10.
Calculations of charge interactions complement analysis of a characterised active site, rationalising pH-dependence of activity and transition state stabilisation. Prediction of active site location through large ΔpKa values or electrostatic strain is relevant for structural genomics. We report a study of ionisable groups in a set of 20 enzymes, finding that false positives obscure predictive potential. In a larger set of 156 enzymes, peaks in solvent-space electrostatic properties are calculated. Both electric field and potential match well to active site location. The best correlation is found with electrostatic potential calculated from uniform charge density over enzyme volume, rather than from assignment of a standard atom-specific charge set. Studying a shell around each molecule, for 77% of enzymes the potential peak is within that 5% of the shell closest to the active site centre, and 86% within 10%. Active site identification by largest cleft, also with projection onto a shell, gives 58% of enzymes for which the centre of the largest cleft lies within 5% of the active site, and 70% within 10%. Dielectric boundary conditions emphasise clefts in the uniform charge density method, which is suited to recognition of binding pockets embedded within larger clefts. The variation of peak potential with distance from active site, and comparison between enzyme and non-enzyme sets, gives an optimal threshold distinguishing enzyme from non-enzyme. We find that 87% of the enzyme set exceeds the threshold as compared to 29% of the non-enzyme set. Enzyme/non-enzyme homologues, "structural genomics" annotated proteins and catalytic/non-catalytic RNAs are studied in this context.

11.
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs"--short regions of the original profile that contribute almost all the weight of the SVM classification score--and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.

12.
Mismatch string kernels for discriminative protein classification
MOTIVATION: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. RESULTS: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.
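The idea of comparing sequences by shared fixed-length patterns while allowing mutations can be sketched naively as follows. This is an illustrative brute-force version, not the mismatch tree algorithm described in the abstract: every k-mer in a sequence contributes a count to each pattern within Hamming distance m of it, and the kernel value is the dot product of the two resulting count vectors. The function names and the toy two-letter alphabet are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(kmer, m, alphabet):
    """All strings over `alphabet` within Hamming distance m of `kmer`."""
    neighbors = set()
    for d in range(m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))
    return neighbors


def mismatch_features(seq, k, m, alphabet):
    """Sparse feature vector: pattern -> number of k-mers of `seq` near it."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for pattern in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[pattern] += 1
    return features


def mismatch_kernel(seq_a, seq_b, k, m, alphabet):
    """Dot product of the two mismatch feature vectors."""
    fa = mismatch_features(seq_a, k, m, alphabet)
    fb = mismatch_features(seq_b, k, m, alphabet)
    return sum(count * fb[pattern] for pattern, count in fa.items())


# With no mismatches allowed, "AB" and "BA" share no 2-mer; allowing one
# mismatch makes them similar through the shared patterns "AA" and "BB".
print(mismatch_kernel("AB", "BA", k=2, m=0, alphabet="AB"))
print(mismatch_kernel("AB", "BA", k=2, m=1, alphabet="AB"))
```

The brute-force enumeration grows quickly with k, m, and alphabet size, which is the cost that an efficient structure such as the mismatch tree mentioned in the abstract is designed to avoid.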

13.
The process of deducing the catalytic mechanism of an enzyme from its structure is highly complex and requires extensive experimental work to validate a proposed mechanism. As one step towards improving the reliability of this process, we have gathered statistics describing the typical geometry of catalytic residues with regard to the substrate and one another. In order to analyse residue-substrate interactions, we have assembled a dataset of structures of enzymes of known mechanism bound to substrate, product, or a substrate analogue. Despite the challenges presented in obtaining such experimental data, we were able to include 42 enzyme structures. We have also assembled a separate dataset of catalytic residues which act upon other catalytic residues, using a set of 60 enzyme structures. For both datasets, we have extracted the distances between residues with a given catalytic function and their target moieties. The geometry of residues whose function involves the transfer or sharing of hydrogens (either with substrate or another residue) was analysed more closely. The results showed that the geometry for such productive interactions (prior to the transition state) closely resembles that seen in non-catalytic hydrogen bonds, with distances and angles in the normal expected range. Such statistics provide limits on "expected geometries" for catalytic residues, which will help to identify these residues and elucidate enzyme mechanisms.

14.
Identification of catalytic residues can help unveil interesting attributes of enzyme function for various therapeutic and industrial applications. Based on their biochemical roles, the number of catalytic residues and sequence lengths of enzymes vary. This article describes a prediction approach (PINGU) for such a scenario. It uses models trained using physicochemical properties and evolutionary information of 650 non-redundant enzymes (2136 catalytic residues) in a support vector machine architecture. Independent testing on 200 non-redundant enzymes (683 catalytic residues) in predefined prediction settings, i.e., with the number of non-catalytic residues per catalytic residue ranging from 1 to 30, suggested that the prediction approach was highly sensitive and specific, i.e., 80% or above, over the incremental challenges. To learn more about the discriminatory power of PINGU in real scenarios, where the prediction challenge is variable and susceptible to high false positives, the best model from independent testing was used on 60 diverse enzymes. Results suggested that PINGU was able to identify most catalytic residues and non-catalytic residues properly with 80% or above accuracy, sensitivity and specificity. The effect of false positives on precision was addressed in this study by application of predicted ligand-binding residue information as a post-processing filter. An overall improvement of 20% in F-measure and 0.138 in Correlation Coefficient with 16% enhanced precision could be achieved. On account of its encouraging performance, PINGU is hoped to have eventual applications in boosting enzyme engineering and novel drug discovery.

15.
To solve the class imbalance problem in the classification of pre-miRNAs with the ab initio method, we developed a novel sample selection method according to the characteristics of pre-miRNAs. Real/pseudo pre-miRNAs are clustered based on their stem similarity and their distribution in high dimensional sample space, respectively. The training samples are selected according to the sample density of each cluster. Experimental results are validated by cross-validation and by other testing datasets composed of human real/pseudo pre-miRNAs. When compared with the previous method, microPred, our classifier miRNAPred is nearly 12% more accurate. The selected training samples could also be used to train other SVM classifiers, such as triplet-SVM, MiPred, miPred, and microPred, to improve their classification performance. The sample selection algorithm is useful for constructing a more efficient classifier for the classification of real pre-miRNAs and pseudo hairpin sequences.

16.
In optical printed Chinese character recognition (OPCCR), many classifiers have been proposed. Among them, the support vector machine (SVM) might be the best classifier. However, SVM is a two-class classifier, and when it is applied to the multi-class problem in OPCCR its computation is time-consuming. We therefore propose a neighbor-classes-based SVM (NC-SVM) to reduce the computational cost of SVM. Experiments on NC-SVM classification for OPCCR show that the proposed NC-SVM can effectively reduce the computation time in OPCCR.

17.
Journal of Molecular Biology. 2019;431(19):3860-3870
Enzymes exhibit a strong long-range evolutionary constraint that extends from their catalytic site and affects even distant sites, where site-specific evolutionary rate increases monotonically with distance. While protein–protein sites in enzymes were previously shown to induce only a weak conservation gradient, a comprehensive relationship between different types of functional sites in proteins and the magnitude of evolutionary rate gradients they induce has yet to be established. Here, we systematically calculate the evolutionary rate (dN/dS) of sites as a function of distance from different types of binding sites in enzymes and other proteins: catalytic sites, non-catalytic ligand binding sites, allosteric binding sites, and protein–protein interaction sites. We show that catalytic sites indeed induce a significantly stronger evolutionary rate gradient than all other types of non-catalytic binding sites. In addition, catalytic sites in enzymes with no known allosteric function still induce strong long-range conservation gradients. Notably, the weak long-range conservation gradients induced by non-catalytic binding sites in enzymes are nearly identical in magnitude to those induced by ligand binding sites in non-enzymes. Finally, we show that structural determinants such as local solvent exposure of sites cannot explain the observed difference between catalytic and non-catalytic functional sites. Our results suggest that enzymes and non-enzymes share similar evolutionary constraints only when examined from the perspective of non-catalytic functional sites. Hence, the unique evolutionary rate gradient from catalytic sites in enzymes is likely driven by the optimization of catalysis rather than ligand binding and allosteric functions.

18.
19.
Identification of catalytic residues can provide valuable insights into protein function. With the increasing number of protein 3D structures having been solved by X-ray crystallography and NMR techniques, it is highly desirable to develop an efficient method to identify their catalytic sites. In this paper, we present an SVM method for the identification of catalytic residues using sequence and structural features. The algorithm was applied to the 2096 catalytic residues derived from the Catalytic Site Atlas database. We obtained an overall prediction accuracy of 88.6% from 10-fold cross validation and 95.76% from the resubstitution test. Testing on the 254 catalytic residues shows our method can correctly predict all 254 residues. This result suggests the usefulness of our approach for facilitating the identification of catalytic residues from protein structures.

20.
Huang HL, Chang FL. Bio Systems. 2007;90(2):516-528
An optimal design of support vector machine (SVM)-based classifiers for prediction aims to optimize the combination of feature selection, parameter setting of SVM, and cross-validation methods. However, SVMs do not offer a mechanism for automatic internal detection of relevant features. The appropriate setting of their control parameters is often treated as another independent problem. This paper proposes an evolutionary approach to designing an SVM-based classifier (named ESVM) by simultaneous optimization of automatic feature selection and parameter tuning using an intelligent genetic algorithm, combined with k-fold cross-validation regarded as an estimator of generalization ability. To illustrate and evaluate the efficiency of ESVM, a typical application to microarray classification using 11 multi-class datasets is adopted. By considering model uncertainty, a frequency-based technique that votes on multiple sets of potentially informative features is used to identify the most effective subset of genes. It is shown that ESVM obtains, on average across the 11 datasets, a high accuracy of 96.88% with a small number (10.0) of selected genes using 10-fold cross-validation. The merits of ESVM are three-fold: (1) automatic feature selection and parameter setting embedded into ESVM can advance prediction abilities, compared to traditional SVMs; (2) ESVM can serve not only as an accurate classifier but also as an adaptive feature extractor; (3) ESVM is developed as an efficient tool so that various SVMs can be used conveniently as the core of ESVM for bioinformatics problems.
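The k-fold cross-validation used here as an estimator of generalization ability can be sketched in a few lines. This is a generic illustration, not the ESVM procedure: the helper name `k_fold_indices` is an assumption, the folds are contiguous and unshuffled for simplicity, and the genetic-algorithm search that the abstract wraps around cross-validation is not shown.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds.

    The first n_samples % k folds get one extra sample, so every sample
    is used for testing exactly once across the k folds.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Each candidate (feature subset, SVM parameters) would be scored by
# training on train_idx and evaluating on test_idx for every fold,
# then averaging the k scores.
for train_idx, test_idx in k_fold_indices(n_samples=10, k=3):
    print(len(train_idx), test_idx)
```

In practice samples are usually shuffled (and often stratified by class) before splitting, so that each fold reflects the overall class proportions.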
