Similar literature
Found 20 similar documents (search time: 31 ms)
1.
Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. Nucleic Acids Research 2003, 31(13): 3692-3697
Prediction of protein function is of significance in studying biological processes. One approach to function prediction is to classify a protein into a functional family. Support vector machine (SVM) is a useful method for such classification, which may involve proteins with diverse sequence distribution. We have developed web-based software, SVMProt, for SVM classification of a protein into a functional family from its primary sequence. The SVMProt classification system is trained on representative proteins of a number of functional families and seed proteins of Pfam curated protein families. It currently covers 54 functional families, and additional families will be added in the near future. The computed accuracy for protein family classification is found to be in the range of 69.1-99.6%. SVMProt shows a certain degree of capability for the classification of distantly related proteins and homologous proteins of different function and thus may be used as a protein function prediction tool that complements sequence alignment methods. SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.

2.
Protein function classification via support vector machine approach   Total citations: 2 (self-citations: 0, citations by others: 2)
Support vector machine (SVM) is introduced as a method for the classification of proteins into functionally distinguished classes. Studies are conducted on a number of protein classes including RNA-binding proteins, protein homodimers, proteins responsible for drug absorption, proteins involved in drug distribution and excretion, and drug metabolizing enzymes. Testing accuracy for the classification of these protein classes is found to be in the range of 84-96%. This suggests the usefulness of SVM in the classification of protein functional classes and its potential application in protein function prediction.

3.
Han LY, Cai CZ, Ji ZL, Cao ZW, Cui J, Chen YZ. Nucleic Acids Research 2004, 32(21): 6437-6444
The function of a protein that has no sequence homolog of known function is difficult to assign on the basis of sequence similarity. The same problem may arise for homologous proteins of different functions if one is newly discovered and the other is the only known protein of similar sequence. It is desirable to explore methods that are not based on sequence similarity. One approach is to assign a protein to a functional family, which provides a useful hint about its function. Several groups have employed a statistical learning method, support vector machines (SVMs), for predicting protein functional family directly from sequence irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a level useful for functional family assignment. But its capability for assignment of distantly related proteins and homologous proteins of different functions has not been critically and adequately assessed. Here SVM is tested for functional family assignment of two groups of enzymes. One consists of 50 enzymes that have no homolog of known function from a PSI-BLAST search of protein databases. The other contains eight pairs of homologous enzymes of different families. SVM correctly assigns 72% of the enzymes in the first group and 62% of the enzyme pairs in the second group, suggesting that it is potentially useful for facilitating functional study of novel proteins. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.

4.
5.
With the rapid growth of protein sequence data, it is indispensable to develop automated and reliable predictive methods for protein function annotation. One approach for facilitating protein function prediction is to classify proteins into functional families from primary sequence. Because enzymes are the most important group of all proteins, accurate prediction of enzyme family classes and subfamily classes is closely related to understanding their biological functions. In this paper, for the prediction of enzyme subfamily classes, Chou's amphiphilic pseudo-amino acid composition [Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19] has been adopted to represent the protein samples for training the 'one-versus-rest' support vector machine. As a demonstration, the jackknife test was performed on the dataset that contains 2640 oxidoreductase sequences classified into 16 subfamily classes [Chou, K.C., Elrod, D.W., 2003. Prediction of enzyme family classes. J. Proteome Res. 2, 183-190]. The overall accuracy thus obtained was 80.87%. The significant enhancement in the accuracy indicates that the current method might play a complementary role to the existing methods.
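The jackknife test mentioned in this abstract holds out each sample once, trains on all the others, and scores the held-out sample. The sketch below illustrates only that evaluation loop. The nearest-centroid classifier is a stand-in chosen to keep the example dependency-free, not the one-versus-rest SVM used in the paper, and the feature vectors and function names are hypothetical.

```python
from collections import defaultdict


def nearest_centroid_predict(train, query):
    """Stand-in classifier (NOT an SVM): return the label of the closest class centroid."""
    sums, counts = {}, defaultdict(int)
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        centroid = [s / counts[label] for s in total]
        dist = sum((c - q) ** 2 for c, q in zip(centroid, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(samples):
    """Leave each sample out once, train on the rest, and score the held-out sample."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if nearest_centroid_predict(train, features) == label:
            correct += 1
    return correct / len(samples)


# Toy two-class data set (hypothetical feature vectors, not real sequence features).
toy = [([0.0, 0.1], "A"), ([0.1, 0.0], "A"), ([0.2, 0.1], "A"),
       ([1.0, 0.9], "B"), ([0.9, 1.0], "B"), ([1.1, 1.0], "B")]
print(jackknife_accuracy(toy))
```

Because every sample is tested exactly once and never appears in its own training set, the jackknife estimate has no dependence on a random split, which is one reason it is popular for small benchmark sets.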

6.
Lin HH, Han LY, Cai CZ, Ji ZL, Chen YZ. Proteins 2006, 62(1): 218-231
Transporters play key roles in cellular transport and metabolic processes, and in facilitating drug delivery and excretion. These proteins are classified into families based on the transporter classification (TC) system. Determination of the TC family of transporters facilitates the study of their cellular and pharmacological functions. Methods for predicting TC family without sequence alignments or clustering are particularly useful for studying novel transporters whose function cannot be determined by sequence similarity. This work explores the use of a machine learning method, support vector machines (SVMs), for predicting the family of transporters from their sequence without the use of sequence similarity. A total of 10,636 transporters in 13 TC subclasses, 1914 transporters in eight TC families, and 168,341 nontransporter proteins are used to train and test the SVM prediction system. Testing results by using a separate set of 4351 transporters and 83,151 nontransporter proteins show that the overall accuracy for predicting members of these TC subclasses and families is 83.4% and 88.0%, respectively, and that of nonmembers is 99.3% and 96.6%, respectively. The accuracies for predicting members and nonmembers of individual TC subclasses are in the range of 70.7-96.1% and 97.6-99.9%, respectively, and those of individual TC families are in the range of 60.6-97.1% and 91.5-99.4%, respectively. A further test by using 26,139 transmembrane proteins outside each of the 13 TC subclasses shows that 90.4-99.6% of these are correctly predicted. Our study suggests that the SVM is potentially useful for facilitating functional study of transporters irrespective of sequence similarity.

7.
Elucidation of the interaction of proteins with different molecules is of significance in the understanding of cellular processes. Computational methods have been developed for the prediction of protein-protein interactions. But insufficient attention has been paid to the prediction of protein-RNA interactions, which play central roles in regulating gene expression and certain RNA-mediated enzymatic processes. This work explored the use of a machine learning method, support vector machines (SVM), for the prediction of RNA-binding proteins directly from their primary sequence. Based on the knowledge of known RNA-binding and non-RNA-binding proteins, an SVM system was trained to recognize RNA-binding proteins. A total of 4011 RNA-binding and 9781 non-RNA-binding proteins were used to train and test the SVM classification system, and an independent set of 447 RNA-binding and 4881 non-RNA-binding proteins was used to evaluate the classification accuracy. Testing results using this independent evaluation set show a prediction accuracy of 94.1%, 79.3%, and 94.1% for rRNA-, mRNA-, and tRNA-binding proteins, and 98.7%, 96.5%, and 99.9% for non-rRNA-, non-mRNA-, and non-tRNA-binding proteins, respectively. The SVM classification system was further tested on a small class of snRNA-binding proteins with only 60 available sequences. The prediction accuracy is 40.0% and 99.9% for snRNA-binding and non-snRNA-binding proteins, respectively, indicating that a sufficient number of proteins is needed to train an SVM. The SVM classification systems trained in this work were added to our Web-based protein functional classification software SVMProt, at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Our study suggests the potential of SVM as a useful tool for facilitating the prediction of protein-RNA interactions.

8.
Enzyme function conservation has been used to derive the threshold of sequence identity necessary to transfer function from a protein of known function to an unknown protein. Using pairwise sequence comparison, several studies suggested that when the sequence identity is above 40%, enzyme function is well conserved. In contrast, Rost argued that because of database bias, the results from such simple pairwise comparisons might be misleading. Thus, by grouping enzyme sequences into families based on sequence similarity and selecting representative sequences for comparison, he showed that enzyme function starts to diverge quickly when the sequence identity is below 70%. Here, we employ a strategy similar to Rost's to reduce the database bias; however, we classify enzyme families based not only on sequence similarity, but also on functional similarity, i.e. sequences in each family must have the same four digits or the same first three digits of the enzyme commission (EC) number. Furthermore, instead of selecting representative sequences for comparison, we calculate the function conservation of each enzyme family and then average the degree of enzyme function conservation across all enzyme families. Our analysis suggests that for functional transferability, 40% sequence identity can still be used as a confident threshold to transfer the first three digits of an EC number; however, to transfer all four digits of an EC number, above 60% sequence identity is needed to have at least 90% accuracy. Moreover, when PSI-BLAST is used, the magnitude of the E-value is found to be weakly correlated with the extent of enzyme function conservation in the third iteration of PSI-BLAST. As a result, functional annotation based on the E-values from PSI-BLAST should be used with caution. 
We also show that by employing an enzyme family-specific sequence identity threshold above which 100% functional conservation is required, functional inference of unknown sequences can be accurately accomplished. However, this comes at a cost: those true positive sequences below this threshold cannot be uniquely identified.
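The comparison of "the same four digits or the same first three digits" of an EC number can be made concrete with a small helper. This is a minimal sketch; the function name `shared_ec_digits` and the handling of `-` as an unassigned field are assumptions for illustration, not taken from the paper.

```python
def shared_ec_digits(ec_a, ec_b):
    """Count how many leading EC fields two EC numbers share (0 to 4).

    A "-" field is treated as unassigned and stops the comparison.
    """
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        shared += 1
    return shared


print(shared_ec_digits("1.1.1.1", "1.1.1.2"))   # same first three fields
print(shared_ec_digits("2.7.11.1", "2.7.1.1"))  # diverges at the third field
```

Under the thresholds reported in the abstract, a pair scoring 3 here corresponds to the level of annotation that about 40% sequence identity is said to support, while a score of 4 corresponds to the level for which more than 60% identity is needed.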

9.
In the post-genome era, the prediction of protein function is one of the most demanding tasks in bioinformatics. Machine learning methods, such as support vector machines (SVMs), greatly help to improve the classification of protein function. In this work, we integrated SVMs, protein sequence amino acid composition, and associated physicochemical properties into the prediction of nucleic-acid-binding proteins. We developed binary classifiers for rRNA-, RNA-, and DNA-binding proteins, which play an important role in the control of many cell processes. Each SVM predicts whether a protein belongs to the rRNA-, RNA-, or DNA-binding protein class. Self-consistency and jackknife tests were performed on protein data sets in which the sequence identity was below 25%. Test results show that the accuracies of the rRNA-, RNA-, and DNA-binding SVM predictions are approximately 84%, 78%, and 72%, respectively. Predictions were also performed on the ambiguous and negative data sets. The results demonstrate that the scores predicted by the RNA- and DNA-binding SVM models for proteins in the ambiguous data set were distributed around zero, while most proteins in the negative data set received negative scores from all three SVMs. The score distributions agree well with the prior knowledge of those proteins and show the effectiveness of sequence-associated physicochemical properties in protein function prediction. The software is available from the author upon request.
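Amino acid composition, one of the sequence features named in this abstract, is simply the fraction of each of the 20 standard residues in a sequence. The sketch below shows only that feature; the physicochemical descriptors the authors combine with it are not specified in the abstract and are not reproduced here. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return a 20-element vector of residue fractions, ordered as AMINO_ACIDS.

    Non-standard letters (for example X or B) are ignored.
    """
    seq = sequence.upper()
    counted = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counted)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counted]


features = amino_acid_composition("AACD")
print(features[:3])  # fractions of A, C, and D
```

A vector like this has a fixed length regardless of sequence length, which is what lets an SVM compare proteins without aligning them.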

10.
The function of a protein is primarily dictated by its structure. It is therefore logical to look for functional clues in the protein's overall three-dimensional fold, or global structure. In this paper, we have developed a novel support vector machine (SVM) based model for functional classification and prediction of proteins using features extracted from global structure based on fragment libraries. Fragment libraries have previously been used for ab initio modelling of proteins and for protein structure comparisons. The query protein structure is broken down into a collection of short contiguous backbone fragments, and this collection is discretized using a library of fragments. The input feature vector is a frequency vector that counts the occurrences of each library fragment in the collection, determined by all-to-all fragment comparisons. SVM models were trained and optimised to obtain the best 10-fold cross-validation (CV) accuracy for classification. As an example, this method was applied to the prediction and classification of cell adhesion molecules (CAMs). Thirty-four different fragment libraries with sizes ranging from 4 to 400 and fragment lengths ranging from 4 to 12 were used to obtain the best prediction model. The best 10-fold CV accuracy of 95.25% was obtained for a library of 400 fragments of length 10. An accuracy of 87.5% was obtained on an unseen test dataset consisting of 20 CAMs and 20 non-CAMs. This shows that protein structure can be accurately and uniquely described using 400 representative fragments of length 10.

11.
This paper explores the use of support vector machine (SVM) for protein function prediction. Studies are conducted on several groups of proteins with different functions including DNA-binding proteins, RNA-binding proteins, G-protein coupled receptors, drug absorption proteins, drug metabolizing enzymes, drug distribution and excretion proteins. The computed accuracy for the prediction of these proteins is found to be in the range of 82.32% to 99.7%, which illustrates the potential of SVM in facilitating protein function prediction.

12.
Matrix metalloproteinases (MMPs) and disintegrin and metalloproteases (ADAMs) belong to the zinc-dependent metalloproteinase family of proteins. These proteins participate in various physiological and pathological states. Thus, prediction of these proteins from amino acid sequence would be helpful. We have developed a method to predict these proteins based on features derived from Chou's pseudo amino acid composition (PseAAC) server, with a support vector machine (SVM) as the machine learning approach. With this method, for the ADAM and MMP families, an overall accuracy of 95.89% and a Matthews correlation coefficient (MCC) of 0.90 were achieved. Furthermore, the method is able to predict two major subclasses of the MMP family, furin-activated secreted MMPs and type II trans-membrane MMPs, with MCC of 0.89 and 0.91, respectively. The overall accuracy for furin-activated secreted MMPs and type II trans-membrane MMPs was 98.18% and 99.07%, respectively. Our data demonstrate an effective classification of the metalloproteinase family based on the concept of PseAAC and SVM.
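Accuracy and the Matthews correlation coefficient (MCC) reported in this abstract are both computed from the four cells of a binary confusion matrix. Accuracy is a fraction (often quoted as a percentage), whereas MCC is a unitless value between -1 and 1. The sketch below uses the standard definitions; the function names and the example counts are illustrative and are not the paper's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 45 true positives, 45 true negatives, 5 of each error type.
print(accuracy(45, 45, 5, 5), mcc(45, 45, 5, 5))
```

MCC stays informative when the positive and negative classes differ greatly in size, which is why it is commonly reported alongside accuracy for protein family classifiers.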

13.
Automatic methods for predicting functionally important residues   Total citations: 9 (self-citations: 0, citations by others: 9)
Sequence analysis is often the first guide for the prediction of residues in a protein family that may have functional significance. A few methods have been proposed which use the division of protein families into subfamilies in the search for those positions that could have some functional significance for the whole family, but at the same time which exhibit the specificity of each subfamily ("Tree-determinant residues"). However, there are still many unsolved questions like the best division of a protein family into subfamilies, or the accurate detection of sequence variation patterns characteristic of different subfamilies. Here we present a systematic study in a significant number of protein families, testing the statistical meaning of the Tree-determinant residues predicted by three different methods that represent the range of available approaches. The first method takes as a starting point a phylogenetic representation of a protein family and, following the principle of Relative Entropy from Information Theory, automatically searches for the optimal division of the family into subfamilies. The second method looks for positions whose mutational behavior is reminiscent of the mutational behavior of the full-length proteins, by directly comparing the corresponding distance matrices. The third method is an automation of the analysis of distribution of sequences and amino acid positions in the corresponding multidimensional spaces using a vector-based principal component analysis. These three methods have been tested on two non-redundant lists of protein families: one composed of proteins that bind a variety of ligand groups, and the other composed of proteins with annotated functionally relevant sites. In most cases, the residues predicted by the three methods show a clear tendency to be close to bound ligands of biological relevance and to those amino acids described as participants in key aspects of protein function.
These three automatic methods provide a wide range of possibilities for biologists to analyze their families of interest, in a similar way to the one presented here for the family of proteins related to ras-p21.

14.
15.
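Proteins are indispensable compounds in living organisms and play many important roles in life processes; understanding protein function supports research in fields such as medicine and drug development. In addition, the application of enzymes in green synthesis has long attracted attention, but because enzymes are so diverse in type and function, obtaining an enzyme with a specific function is costly, which limits their wider application. At present, the specific function of a protein is determined mainly by experimental characterization, which is laborious and time-consuming. Meanwhile, with the rapid development of bioinformatics and sequencing technology, the number of sequenced proteins far exceeds the number of sequences with functional annotation, so efficient prediction of protein function has become critical. With the rapid advance of computing technology, data-driven machine learning methods have become an effective solution to these challenges. This paper gives an overview of protein function and its annotation methods, as well as the development history and workflow of machine learning, focuses on applications of machine learning in enzyme function prediction, and offers an outlook on future directions for efficient, artificial-intelligence-assisted study of protein function.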

16.
A novel alignment-free method for computing functional similarity of membrane proteins based on features of hydropathy distribution is presented. The features of hydropathy distribution are used to represent protein families as hydropathy profiles. The profiles statistically summarize the hydropathy distribution of member proteins. The summary is built from hydropathy features that numerically represent structurally or functionally significant portions of protein sequences. The hydropathy profiles are numerical vectors, that is, points in a high-dimensional 'hydropathy' space. Their similarities are identified by projecting the space onto principal axes. Here, the approach is applied to the secondary transporters. The analysis is validated against the standard classification of the secondary transporters. It allows prediction of function attributes for proteins of uncharacterized families of secondary transporters, and the results may help to characterize unknown function attributes of secondary transporters. They also show that analysis of hydropathy distribution can be used for function prediction of membrane proteins.
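To give a sense of what a per-sequence hydropathy signal looks like, the sketch below computes a sliding-window average of residue hydropathy. Two assumptions are made that the abstract does not confirm: the Kyte-Doolittle scale is used for residue values, and a plain moving average stands in for the paper's own hydropathy features. The function name and window size are illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the abstract does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=3):
    """Moving average of residue hydropathy over a fixed-size window."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the sequence length")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# A hydrophobic stretch followed by charged residues (toy sequence).
print(hydropathy_profile("AIVLKDE", window=3))
```

Sustained positive stretches in such a signal typically indicate membrane-spanning segments, which is why hydropathy is a natural descriptor for transporters.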

17.
An efficient algorithm for large-scale detection of protein families   Total citations: 6 (self-citations: 0, citations by others: 6)
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
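The Markov cluster (MCL) algorithm that TRIBE-MCL relies on alternates two matrix operations on a column-stochastic similarity matrix: expansion (matrix squaring, which spreads flow along paths) and inflation (element-wise powering followed by renormalization, which strengthens strong flows and weakens weak ones). The sketch below is a simplified pure-Python illustration on a toy matrix. It is not the TRIBE-MCL implementation; the similarity values, the inflation parameter, the fixed iteration count, and the cluster read-out threshold are all assumptions.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50):
    """Simplified MCL: returns clusters as sorted lists of node indices."""
    n = len(similarity)
    # Add self-loops so that every node keeps some flow to itself.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                  # expansion
        m = normalize_columns([[x ** inflation for x in row] for row in m])  # inflation
    # Rows that still carry flow belong to attractors; their nonzero columns form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy similarity matrix: two dense groups joined by one weak link (hypothetical values).
toy_similarity = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
print(mcl(toy_similarity))
```

In practice the similarity matrix would come from precomputed all-against-all sequence comparisons, and a sparse-matrix implementation with a convergence check would replace the dense loops used here.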

18.
The increasing number and diversity of protein sequence families requires new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparison of sub-type specific sequence profiles, and analysis of positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict sub-type for unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues to the sequence for which a prediction is to be made (by a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above, and a sequence identity threshold of 30 %, our best method gives an accuracy of 96 % compared to 80 % obtained for sequence similarity and 74 % for BLAST. We describe the derivation of a set of sub-type groupings derived from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this to perform a large-scale assessment. The best method gives an average accuracy of 94 % compared to 68 % for sequence similarity and 79 % for BLAST. 
We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.
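The idea of scoring alignment positions by relative entropy between sub-type profiles can be sketched as follows: for each column, estimate the residue distribution within one sub-type and within the remaining sequences, then compute the Kullback-Leibler divergence between the two. This is a minimal illustration only. The pseudocount, the one-versus-rest comparison, and the function names are assumptions; the abstract does not give the authors' exact formulation or significance test.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(residues, pseudocount=1.0):
    """Residue frequencies for one alignment column, smoothed with a pseudocount."""
    counts = Counter(r for r in residues if r in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}


def positional_relative_entropy(subtype_seqs, other_seqs):
    """Per-column KL divergence (in bits) of a sub-type's profile from the rest.

    All sequences are assumed to be rows of the same alignment (equal length).
    """
    length = len(subtype_seqs[0])
    scores = []
    for pos in range(length):
        p = column_distribution([s[pos] for s in subtype_seqs])
        q = column_distribution([s[pos] for s in other_seqs])
        scores.append(sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in ALPHABET))
    return scores


# Toy alignment: only the middle column separates the two sub-types.
subtype = ["ACD", "ACE", "ACD"]
others = ["AGD", "AGE", "AGD"]
print(positional_relative_entropy(subtype, others))
```

Columns with high scores are conserved differently in the two groups and are therefore candidates for specificity-determining positions; columns that are conserved identically or vary in the same way score near zero.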

19.
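The Carbohydrate-Active enZYmes database (CAZy) is a database resource on enzymes that synthesize or break down complex carbohydrates and glycoconjugates. It assigns carbohydrate-active enzymes to protein families based on amino acid sequence similarity within protein domains. The CAZy database contains information on the source organism, EC functional classification, gene sequence, protein sequence, and structure of carbohydrate-active enzymes. With the rapid development of metagenomics, the volume of sequence data within CAZy families has grown sharply, which lays the foundation for further subfamily classification within families. Introducing a finer level of classification within protein families can improve the accuracy of function prediction for enzymes in a subfamily, which in turn can guide rational enzyme design and raise the success rate of designing enzyme components with specific functions, thereby advancing the biomass conversion industry.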

20.
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers as complex as SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families, as shown by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.
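Two quantities in this abstract are easy to make concrete: the n-gram counts used as features, and the "reduction in residual error" that compares two accuracies. The sketch below shows both under simple assumptions. The function names are hypothetical, and the chi-square feature selection and the classifiers themselves are not reproduced.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count overlapping peptide n-grams (substrings of length n) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def residual_error_reduction(baseline_accuracy, new_accuracy):
    """Percentage of the baseline's error (100 - accuracy) removed by the new method."""
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (baseline_error - new_error) / baseline_error


print(ngram_counts("MKVMKV", 2))
# Accuracies quoted in the abstract: SVM 88.4% vs Naive Bayes 93.0% (level I),
# and SVM 86.3% vs Naive Bayes 92.4% (level II).
print(round(residual_error_reduction(88.4, 93.0), 1))
print(round(residual_error_reduction(86.3, 92.4), 1))
```

With the accuracies quoted above, the error falls from 11.6 to 7.0 percentage points at level I and from 13.7 to 7.6 at level II, which reproduces the 39.7% and 44.5% reductions stated in the abstract.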
