首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Naveed M  Khan A  Khan AU 《Amino acids》2012,42(5):1809-1823
G protein-coupled receptors (GPCRs) are transmembrane proteins, which transduce signals from extracellular ligands to intracellular G protein. Automatic classification of GPCRs can provide important information for the development of novel drugs in pharmaceutical industry. In this paper, we propose an evolutionary approach, GPCR-MPredictor, which combines individual classifiers for predicting GPCRs. GPCR-MPredictor is a web predictor that can efficiently predict GPCRs at five levels. The first level determines whether a protein sequence is a GPCR or a non-GPCR. If the predicted sequence is a GPCR, then it is further classified into family, subfamily, sub-subfamily, and subtype levels. In this work, our aim is to analyze the discriminative power of different feature extraction and classification strategies in case of GPCRs prediction and then to use an evolutionary ensemble approach for enhanced prediction performance. Features are extracted using amino acid composition, pseudo amino acid composition, and dipeptide composition of protein sequences. Different classification approaches, such as k-nearest neighbor (KNN), support vector machine (SVM), probabilistic neural networks (PNN), J48, Adaboost, and Naives Bayes, have been used to classify GPCRs. The proposed hierarchical GA-based ensemble classifier exploits the prediction results of SVM, KNN, PNN, and J48 at each level. The GA-based ensemble yields an accuracy of 99.75, 92.45, 87.80, 83.57, and 96.17% at the five levels, on the first dataset. We further perform predictions on a dataset consisting of 8,000 GPCRs at the family, subfamily, and sub-subfamily level, and on two other datasets of 365 and 167 GPCRs at the second and fourth levels, respectively. In comparison with the existing methods, the results demonstrate the effectiveness of our proposed GPCR-MPredictor in classifying GPCRs families. It is accessible at .  相似文献   

2.
Lo SL  Cai CZ  Chen YZ  Chung MC 《Proteomics》2005,5(4):876-884
Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparison of the results derived from the use of real protein sequences with that derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9% compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises from the expected higher level of difficulty for separating two sets of real protein sequences than that for separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training a SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems for predicting putative protein partners of a set of thioredoxin related proteins. The prediction results are consistent with observations, suggesting that real sequence is more practically useful in development of SVM classification system for facilitating protein-protein interaction prediction.  相似文献   

3.
Yao Y  Zhang T  Xiong Y  Li L  Huo J  Wei DQ 《Biotechnology journal》2011,6(11):1367-1376
The support vector machine (SVM), an effective statistical learning method, has been widely used in mutation prediction. Two factors, i.e., feature selection and parameter setting, have shown great influence on the efficiency and accuracy of SVM classification. In this study, according to the principles of a genetic algorithm (GA) and SVM, we developed a GA-SVM program and applied it to human cytochrome P450s (CYP450s), which are important monooxygenases in phase I drug metabolism. The program optimizes features and parameters simultaneously, and hence fewer features are used and the overall prediction accuracy is improved. We focus on the mutation of non-synonymous single nucleotide polymorphisms (nsSNPs) in protein sequences that appear to exhibit significant influences on drug metabolism. The final predictive model has a quite satisfactory performance, with the prediction accuracy of 61% and cross-validation accuracy of 73%. The results indicate that the GA-SVM program is a powerful tool in optimizing mutation predictive models of nsSNPs of human CYP450s.  相似文献   

4.
Elucidation of the interaction of proteins with different molecules is of significance in the understanding of cellular processes. Computational methods have been developed for the prediction of protein-protein interactions. But insufficient attention has been paid to the prediction of protein-RNA interactions, which play central roles in regulating gene expression and certain RNA-mediated enzymatic processes. This work explored the use of a machine learning method, support vector machines (SVM), for the prediction of RNA-binding proteins directly from their primary sequence. Based on the knowledge of known RNA-binding and non-RNA-binding proteins, an SVM system was trained to recognize RNA-binding proteins. A total of 4011 RNA-binding and 9781 non-RNA-binding proteins was used to train and test the SVM classification system, and an independent set of 447 RNA-binding and 4881 non-RNA-binding proteins was used to evaluate the classification accuracy. Testing results using this independent evaluation set show a prediction accuracy of 94.1%, 79.3%, and 94.1% for rRNA-, mRNA-, and tRNA-binding proteins, and 98.7%, 96.5%, and 99.9% for non-rRNA-, non-mRNA-, and non-tRNA-binding proteins, respectively. The SVM classification system was further tested on a small class of snRNA-binding proteins with only 60 available sequences. The prediction accuracy is 40.0% and 99.9% for snRNA-binding and non-snRNA-binding proteins, indicating a need for a sufficient number of proteins to train SVM. The SVM classification systems trained in this work were added to our Web-based protein functional classification software SVMProt, at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Our study suggests the potential of SVM as a useful tool for facilitating the prediction of protein-RNA interactions.  相似文献   

5.
Li Z  Zhou X  Dai Z  Zou X 《Amino acids》2012,43(2):793-804
The coupling between G protein-coupled receptors (GPCRs) and guanine nucleotide-binding proteins (G proteins) regulates various signal transductions from extracellular space into the cell. However, the coupling mechanism between GPCRs and G proteins is still unknown, and experimental determination of their coupling specificity and function is both expensive and time consuming. Therefore, it is significant to develop a theoretical method to predict the coupling specificity between GPCRs and G proteins as well as their function using their primary sequences. In this study, a novel four-layer predictor (GPCRsG_CWTIT) based on support vector machine (SVM), continuous wavelet transform (CWT) and information theory (IT) is developed to classify G proteins and predict the coupling specificity between GPCRs and G proteins. SVM is used for construction of models. CWT and IT are used to characterize the primary structure of protein. Performance of GPCRsG_CWTIT is evaluated with cross-validation test on various working dataset. The overall accuracy of the G proteins at the levels of class and family is 98.23 and 85.42%, respectively. The accuracy of the coupling specificity prediction varies from 74.60 to 94.30%. These results indicate that the proposed predictor is an effective and feasible tool to predict the coupling specificity between GPCRs and G proteins as well as their functions using only the protein full sequence. The establishment of such an accurate prediction method will facilitate drug discovery by improving the ability to identify and predict protein-protein interactions. GPCRsG_CWTIT and dataset can be acquired freely on request from the authors.  相似文献   

6.
Guo Y  Li M  Lu M  Wen Z  Huang Z 《Proteins》2006,65(1):55-60
Determining G-protein coupled receptors (GPCRs) coupling specificity is very important for further understanding the functions of receptors. A successful method in this area will benefit both basic research and drug discovery practice. Previously published methods rely on the transmembrane topology prediction at training step, even at prediction step. However, the transmembrane topology predicted by even the best algorithm is not of high accuracy. In this study, we developed a new method, autocross-covariance (ACC) transform based support vector machine (SVM), to predict coupling specificity between GPCRs and G-proteins. The primary amino acid sequences are translated into vectors based on the principal physicochemical properties of the amino acids and the data are transformed into a uniform matrix by applying ACC transform. SVMs for nonpromiscuous coupled GPCRs and promiscuous coupled GPCRs were trained and validated by jackknife test and the results thus obtained are very promising. All classifiers were also evaluated by the test datasets with good performance. Besides the high prediction accuracy, the most important feature of this method is that it does not require any transmembrane topology prediction at either training or prediction step but only the primary sequences of proteins. The results indicate that this relatively simple method is applicable. Academic users can freely download the prediction program at http://www.scucic.net/group/database/Service.asp.  相似文献   

7.
We have introduced a new method of protein secondary structure prediction which is based on the theory of support vector machine (SVM). SVM represents a new approach to supervised pattern classification which has been successfully applied to a wide range of pattern recognition problems, including object recognition, speaker identification, gene function prediction with microarray expression profile, etc. In these cases, the performance of SVM either matches or is significantly better than that of traditional machine learning approaches, including neural networks.The first use of the SVM approach to predict protein secondary structure is described here. Unlike the previous studies, we first constructed several binary classifiers, then assembled a tertiary classifier for three secondary structure states (helix, sheet and coil) based on these binary classifiers. The SVM method achieved a good performance of segment overlap accuracy SOV=76.2 % through sevenfold cross validation on a database of 513 non-homologous protein chains with multiple sequence alignments, which out-performs existing methods. Meanwhile three-state overall per-residue accuracy Q(3) achieved 73.5 %, which is at least comparable to existing single prediction methods. Furthermore a useful "reliability index" for the predictions was developed. In addition, SVM has many attractive features, including effective avoidance of overfitting, the ability to handle large feature spaces, information condensing of the given data set, etc. The SVM method is conveniently applied to many other pattern classification tasks in biology.  相似文献   

8.
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.  相似文献   

9.
10.
We have developed an alignment-independent method for classification of G-protein coupled receptors (GPCRs) according to the principal chemical properties of their amino acid sequences. The method relies on a multivariate approach where the primary amino acid sequences are translated into vectors based on the principal physicochemical properties of the amino acids and transformation of the data into a uniform matrix by applying a modified autocross-covariance transform. The application of principal component analysis to a data set of 929 class A GPCRs showed a clear separation of the major classes of GPCRs. The application of partial least squares projection to latent structures created a highly valid model (cross-validated correlation coefficient, Q(2) = 0.895) that gave unambiguous classification of the GPCRs in the training set according to their ligand binding class. The model was further validated by external prediction of 535 novel GPCRs not included in the training set. Of the latter, only 14 sequences, confined in rapidly expanding GPCR classes, were mispredicted. Moreover, 90 orphan GPCRs out of 165 were tentatively identified to GPCR ligand binding class. The alignment-independent method could be used to assess the importance of the principal chemical properties of every single amino acid in the protein sequences for their contributions in explaining GPCR family membership. It was then revealed that all amino acids in the unaligned sequences contributed to the classifications, albeit to varying extent; the most important amino acids being those that could also be determined to be conserved by using traditional alignment-based methods.  相似文献   

11.
12.
13.
The group of 2502 transmembrane (TM) protein sequences with seven TM segments (7-tms) registered in SWISS-PROT 46.0 contains 2200 G-protein-coupled receptors (GPCRs), indicating that GPCR candidates can be detected with a reliability of 87.9% in the eukaryotic genomes merely by correctly predicting the number of TM segments as 7-tms. The predictive accuracies of TM topology-prediction methods proposed so far are not as high as expected; even the best method, HMMTOP 2.0, can only achieve a capture rate of 7-tms sequences of 77.6%. It is necessary to improve this performance as much as possible, even if by only a few percentage points, in order to identify as many novel GPCR candidate genes as possible among the increasing number of newly sequenced genomes. In this study, we propose a simple but useful prediction method for detecting as many 7-tms TM protein sequences as GPCR candidates in eukaryotic genomes as possible. This is achieved by employing a two-step prediction procedure. The first step involves collecting 7-tms sequences by the best prediction method (HMMTOP 2.0), and the second involves picking up the remaining 7-tms sequences by the second-best method (TMHMM 2.0). By this procedure, the capture rate of 7-tms TM protein sequences in SWISS-PROT can be improved considerably from 77.6% to 84.5%, and the number of GPCR candidate sequences predicted as 7-tms in the human genome (Build 35) is increased from 790 (by HMMTOP 2.0) to 903. These 790 and 903 candidate sequences include, respectively, 587 and 636 of the known human GPCRs of the 717 registered in SWISS-PROT 46.0, demonstrating that the proposed combinatorial method is effective in detecting GPCR candidate genes in eukaryotic genomes.  相似文献   

14.
Functional annotation of protein sequences with low similarity to well characterized protein sequences is a major challenge of computational biology in the post genomic era. The cyclin protein family is once such important family of proteins which consists of sequences with low sequence similarity making discovery of novel cyclins and establishing orthologous relationships amongst the cyclins, a difficult task. The currently identified cyclin motifs and cyclin associated domains do not represent all of the identified and characterized cyclin sequences. We describe a Support Vector Machine (SVM) based classifier, CyclinPred, which can predict cyclin sequences with high efficiency. The SVM classifier was trained with features of selected cyclin and non cyclin protein sequences. The training features of the protein sequences include amino acid composition, dipeptide composition, secondary structure composition and PSI-BLAST generated Position Specific Scoring Matrix (PSSM) profiles. Results obtained from Leave-One-Out cross validation or jackknife test, self consistency and holdout tests prove that the SVM classifier trained with features of PSSM profile was more accurate than the classifiers based on either of the other features alone or hybrids of these features. A cyclin prediction server--CyclinPred has been setup based on SVM model trained with PSSM profiles. CyclinPred prediction results prove that the method may be used as a cyclin prediction tool, complementing conventional cyclin prediction methods.  相似文献   

15.
G-protein coupled receptor (GPCR) is a membrane protein family, which serves as an interface between cell and the outside world. They are involved in various physiological processes and are the targets of more than 50% of the marketed drugs. The function of GPCRs can be known by conducting Biological experiments. However, the rapid increase of GPCR sequences entering into databanks, it is very time consuming and expensive to determine their function based only on experimental techniques. Hence, the computational prediction of GPCRs is very much demanding for both pharmaceutical and educational research. Feature extraction of GPCRs in the proposed research is performed using three techniques i.e. Pseudo amino acid composition, Wavelet based multi-scale energy and Evolutionary information based feature extraction by utilizing the position specific scoring matrices. For classification purpose, a majority voting based ensemble method is used; whose weights are optimized using genetic algorithm. Four classifiers are used in the ensemble i.e. Nearest Neighbor, Probabilistic Neural Network, Support Vector Machine and Grey Incidence Degree. The performance of the proposed method is assessed using Jackknife test for a number of datasets. First, the individual performances of classifiers are assessed for each dataset using Jackknife test. After that, the performance for each dataset is improved by using weighted ensemble classification. The weights of ensemble are optimized using various runs of Genetic Algorithm. We have compared our method with various other methods. The significance in performance of the proposed method depicts it to be useful for GPCRs classification.  相似文献   

16.
The knowledge collated from the known protein structures has revealed that the proteins are usually folded into the four structural classes: all-α, all-β, α/β and α + β. A number of methods have been proposed to predict the protein's structural class from its primary structure; however, it has been observed that these methods fail or perform poorly in the cases of distantly related sequences. In this paper, we propose a new method for protein structural class prediction using low homology (twilight-zone) protein sequences dataset. Since protein structural class prediction is a typical classification problem, we have developed a Support Vector Machine (SVM)-based method for protein structural class prediction that uses features derived from the predicted secondary structure and predicted burial information of amino acid residues. The examination of different individual as well as feature combinations revealed that the combination of secondary structural content, secondary structural and solvent accessibility state frequencies of amino acids gave rise to the best leave-one-out cross-validation accuracy of ~81% which is comparable to the best accuracy reported in the literature so far.  相似文献   

17.
Wen Z  Li M  Li Y  Guo Y  Wang K 《Amino acids》2007,32(2):277-283
As an important transmembrane protein family in eukaryon, G-protein coupled receptors (GPCRs) play a significant role in cellular signal transduction and are important targets for drug design. However, it is very difficult to resolve their tertiary structure by X-ray crystallography. In this study, we have developed a Delaunay model, which constructs a series of simplexes with latent variables to classify the families of GPCRs and projects unknown sequences to principle component space (PC-space) to predict their topology. Computational results show that, for the classification of GPCRs, the method achieves the accuracy of 91.0 and 87.6% for Class A, more than 80% for the other three classes in differentiating GPCRs from non-GPCRs and 70% for discriminating between four major classes of GPCR, respectively. When recognizing the structure of GPCRs, all the N-terminals of sequences can be determined correctly. The maximum accuracy of predicting transmembrane segments is achieved in the 7th transmembrane segment of Rhodopsin, which is 99.4%, and the average error is 2.1 amino acids, which is the lowest in all of the segments prediction. This method could provide structural information of a novel GPCR as a tool for experiments and other algorithms of structure prediction of GPCRs. Academic users should send their request for the MATLAB program for classifying GPCRs and predicting the topology of them at liml@scu.edu.cn .  相似文献   

18.
Prognostic prediction is important in medical domain, because it can be used to select an appropriate treatment for a patient by predicting the patient's clinical outcomes. For high-dimensional data, a normal prognostic method undergoes two steps: feature selection and prognosis analysis. Recently, the L?-L?-norm Support Vector Machine (L?-L? SVM) has been developed as an effective classification technique and shown good classification performance with automatic feature selection. In this paper, we extend L?-L? SVM for regression analysis with automatic feature selection. We further improve the L?-L? SVM for prognostic prediction by utilizing the information of censored data as constraints. We design an efficient solution to the new optimization problem. The proposed method is compared with other seven prognostic prediction methods on three realworld data sets. The experimental results show that the proposed method performs consistently better than the medium performance. It is more efficient than other algorithms with the similar performance.  相似文献   

19.
20.
Cai CZ  Han LY  Ji ZL  Chen YZ 《Proteins》2004,55(1):66-76
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, Support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100% respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号