1.
Given the sequence of a protein, how can we predict whether it is a membrane protein or a non-membrane protein? If it is a membrane protein, which type does it belong to? Because these questions are closely tied to the function of an uncharacterized protein, their importance is self-evident. In particular, with the explosion of protein sequences entering databanks and the much slower progress in determining their functions by biochemical experiments, an automated method that can answer these questions quickly is highly desirable. By hybridizing the functional domain (FunD) and pseudo-amino acid composition (PseAA), a new strategy called the FunD-PseAA predictor was introduced. To test the power of the predictor, a highly non-homologous data set was constructed in which none of the proteins has 25% or higher sequence identity to any other. The overall success rates obtained with the FunD-PseAA predictor on this data set by the jackknife cross-validation test were 85% for distinguishing membrane proteins from non-membrane proteins, and 91% for identifying the membrane protein type among the following 5 categories: (1) type-1 membrane protein, (2) type-2 membrane protein, (3) multipass transmembrane protein, (4) lipid chain-anchored membrane protein, and (5) GPI-anchored membrane protein. These rates are much higher than those obtained by other methods on the same stringent data set, indicating that the FunD-PseAA predictor may become a useful high-throughput tool in bioinformatics and proteomics.
2.
Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition (total citations: 3, self-citations: 0, citations by others: 3)
MOTIVATION: A key goal of genomics is to assign function to genes, especially for orphan sequences. RESULTS: We compared the clustered functional domains in the SBASE database to each protein sequence using BLASTP. This representation of a protein is a vector in which each non-zero entry indicates a significant match between the sequence of interest and an SBASE domain. Two machine learning methods, the nearest neighbour algorithm (NNA) and support vector machines, are used to predict protein functional classes from this information. The best results are obtained using the SBASE-A database and the NNA, namely 72% accuracy for 79% coverage. We also tested function assignment based on searching for InterPro sequence motifs and on taking the most significant BLAST match within the dataset. We applied the functional domain composition method to predict the functional class of 2018 currently unclassified yeast open reading frames. AVAILABILITY: A program for the prediction method that uses the NNA, called Functional Class Prediction based on Functional Domains (FCPFD), is available and can be obtained by contacting Y.D.Cai at y.cai@umist.ac.uk
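The vector representation and nearest-neighbour step described in this abstract can be sketched as follows. This is a minimal illustration, not the FCPFD program: the domain identifiers, the class labels, and the choice of cosine similarity as the closeness measure are all assumptions made for the example, and the BLASTP search that would produce the hits is not shown.

```python
from math import sqrt

def domain_vector(hits, domain_ids):
    """Binary vector: 1 where the protein has a significant hit to that domain."""
    hit_set = set(hits)
    return [1 if d in hit_set else 0 for d in domain_ids]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbour(query, training):
    """training is a list of (vector, label); return the label of the closest vector."""
    _, best_label = max(training, key=lambda item: cosine_similarity(query, item[0]))
    return best_label

# Hypothetical domain identifiers and functional classes, for illustration only.
DOMAINS = ["dom_kinase", "dom_atpase", "dom_zinc_finger", "dom_helicase"]
training = [
    (domain_vector(["dom_kinase", "dom_atpase"], DOMAINS), "signalling"),
    (domain_vector(["dom_zinc_finger"], DOMAINS), "transcription"),
    (domain_vector(["dom_helicase", "dom_atpase"], DOMAINS), "replication"),
]
query = domain_vector(["dom_kinase"], DOMAINS)
predicted = nearest_neighbour(query, training)
```

A protein with no significant domain hit yields an all-zero vector and cannot be meaningfully compared, which is one plausible reason a method of this kind reports coverage below 100%.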
3.
Predicting enzyme subclass by functional domain composition and pseudo amino acid composition (total citations: 3, self-citations: 0, citations by others: 3)
As a continuing effort to use the sequence approach to identify enzymatic function at a deeper level, investigations are extended from the main enzyme classes (Protein Sci. 2004, 13, 2857-2863) to their subclasses. This is indispensable if we wish to understand the molecular mechanism of an enzyme at a deeper level. For each of the 6 main enzyme classes (i.e., oxidoreductase, transferase, hydrolase, lyase, isomerase, and ligase), a subclass training dataset is constructed. To reduce homology bias, a stringent cutoff was imposed so that all entries included in the datasets have less than 40% sequence identity to each other. To capture the core features that are intimately related to biological function, each protein sample is represented by hybridizing the functional domain composition and the pseudo amino acid composition. On the basis of this hybrid representation, the FunD-PseAA predictor is established. Jackknife cross-validation tests demonstrate that the overall success rate in identifying the 21 subclasses of oxidoreductases is above 86%, and the corresponding rates in identifying the subclasses of the other 5 main enzyme classes are 94-97%. The high success rates imply that the FunD-PseAA predictor may become a useful tool in bioinformatics and proteomics of the post-genomic era.
4.
Protein N-glycosylation plays an important role in protein function. Yet, at present, few computational methods are available for the prediction of this protein modification. This prompted our development of a support vector machine (SVM)-based method for this task, as well as a partial least squares (PLS) regression-based prediction method for comparison. A functional domain feature space was used to create SVM and PLS models, which achieved accuracies of 83.91% and 79.89%, respectively, as evaluated by leave-one-out cross-validation. Subsequently, SVM and PLS models were developed based on functional domain and protein secretion information, which yielded accuracies of 89.13% and 86%, respectively. This analysis demonstrates that the protein functional domain and secretion information are both efficient predictors of N-glycosylation.
5.
Background
The number and the arrangement of subunits that form a protein are referred to as quaternary structure. Quaternary structure is an important protein attribute that is closely related to its function. Proteins with quaternary structure are called oligomeric proteins. Oligomeric proteins are involved in various biological processes, such as metabolism, signal transduction, and chromosome replication. Thus, it is highly desirable to develop some computational methods to automatically classify the quaternary structure of proteins from their sequences.
6.
Because a priori knowledge of a protein's structural class can provide useful information about its overall structure, the determination of protein structural class is a meaningful topic in protein science. However, with the rapid increase in newly found protein sequences entering databanks, it is both time-consuming and expensive to do so based solely on experimental techniques. Therefore, it is vitally important to develop a computational method for predicting the protein structural class quickly and accurately. To meet this challenge, this article presents a dual-layer support vector machine (SVM) fusion network that uses a different pseudo-amino acid composition (PseAA). The PseAA here contains much information related to the sequence order of a protein and to the distribution of the hydrophobic amino acids along its chain. As a showcase, the rigorous jackknife cross-validation test was performed on the two benchmark data sets constructed by Zhou. A significant enhancement in success rates was observed, indicating that the current approach may serve as a powerful complementary tool to other existing methods in this area.
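To make the idea of a pseudo-amino acid composition concrete, the sketch below follows the classic formulation usually attributed to Chou (2001): the 20 amino acid frequencies are extended by a few sequence-order correlation factors. It is not the "different" PseAA variant used in this abstract. As simplifying assumptions, it uses a single property (the Kyte-Doolittle hydrophobicity scale, standardised to zero mean and unit variance) rather than several, and the defaults `lam=3` and `weight=0.05` are illustrative choices.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def standardise(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / sd for a in AMINO_ACIDS}

HYDROPHOBICITY = standardise(KYTE_DOOLITTLE)

def pseudo_aa_composition(seq, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional vector for a sequence of standard residues."""
    n = len(seq)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]
    # k-th tier correlation factor: mean squared property difference
    # between residues that are k positions apart.
    thetas = []
    for k in range(1, lam + 1):
        total = sum(
            (HYDROPHOBICITY[seq[i]] - HYDROPHOBICITY[seq[i + k]]) ** 2
            for i in range(n - k)
        )
        thetas.append(total / (n - k))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]

vector = pseudo_aa_composition("MKTAYIAKQRQISFVKSHFSRQ")
```

The first 20 components carry the conventional composition, while the last `lam` components carry order information that the plain composition discards; the components sum to 1 by construction.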
7.
Structural class characterizes the overall folding type of a protein or its domain, and the prediction of protein structural class has become both an important and a challenging topic in protein science. Moreover, the prediction itself can stimulate the development of novel predictors that may be straightforwardly applied to many other related areas. In this paper, 10 frequently used sequence-derived structural and physicochemical features, which can be easily computed by the PROFEAT (Protein Features) web server, were taken as inputs to support vector machines to develop statistical learning models for predicting the protein structural class. More importantly, a strategy for merging different features, called best-first search, was developed. The rigorous jackknife cross-validation test showed that the success rates of our method were significantly improved. We anticipate that the present method may also help boost predictive accuracy for a series of other protein attributes, such as subcellular localization, membrane types, and enzyme family and subfamily classes, among many others.
8.
A novel classifier, the so-called “LogitBoost” classifier, was introduced to predict the structural class of a protein domain from its amino acid sequence. LogitBoost introduces a log-likelihood loss function to reduce sensitivity to noise and outliers, and performs classification by combining many weak classifiers to build a strong and robust classifier. Jackknife cross-validation tests demonstrated that LogitBoost outperformed other classifiers, including the “support vector machine,” a very powerful classifier widely used in the biological literature. It is anticipated that LogitBoost can also become a useful vehicle for classifying other attributes of proteins according to their sequences, such as subcellular localization and enzyme family class, among many others.
9.
Background
A metabolic pathway is a highly regulated network consisting of many metabolic reactions involving substrates, enzymes, and products, where substrates can be transformed into products by particular catalytic enzymes. Since experimental determination of the network of substrate-enzyme-product triads (whether the substrate can be transformed into the product by a given enzyme) is both time-consuming and expensive, it would be very useful to develop a computational approach for predicting the network of substrate-enzyme-product triads.
10.
We evaluated the occurrence frequency of i-peptides in the protein sequences of two datasets, which contain proteins with sequence similarity lower than 25% and 40%, respectively. We worked out a new structural class prediction algorithm using the most frequent i-peptides (with i = 2, 3, 4) that characterize the four structural classes. The best results were obtained with the tri-peptides, which capture much more structural information from sequences than the di-peptides. Compared with other methods similarly based on peptide occurrence frequencies, our method achieves the best prediction accuracy. We also compared it with methods based on more sophisticated computational approaches.
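Counting i-peptide occurrence frequencies, as described above, amounts to sliding a window of length i along each sequence. The sketch below is only an illustration of that counting step with hypothetical function names; how the abstract's algorithm turns the most frequent i-peptides into a class prediction is not specified there and is not reproduced here.

```python
from collections import Counter

def peptide_frequencies(seq, i):
    """Relative frequency of each overlapping i-peptide in one sequence."""
    total = len(seq) - i + 1
    if total <= 0:
        return {}
    counts = Counter(seq[p:p + i] for p in range(total))
    return {pep: c / total for pep, c in counts.items()}

def most_frequent_peptides(sequences, i, top):
    """The `top` most common i-peptides pooled over a set of sequences."""
    pooled = Counter()
    for seq in sequences:
        pooled.update(seq[p:p + i] for p in range(len(seq) - i + 1))
    return [pep for pep, _ in pooled.most_common(top)]

dipeptides = peptide_frequencies("AAAB", 2)
top_dipeptide = most_frequent_peptides(["AAAB", "AAB"], 2, 1)
```

With 20 amino acids there are 400 possible di-peptides, 8,000 tri-peptides, and 160,000 tetra-peptides, which is why restricting attention to the most frequent ones keeps the feature set manageable.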
11.
Determination of protein structural class solely from sequence information is a challenging task. Several attempts to solve this problem using various methods can be found in the literature. We present a support vector machine (SVM) approach in which probability-based decisions are used along with class-wise optimized feature sets. This approach differs from earlier attempts in two respects: (1) it uses class-wise optimized features and (2) the decisions of different SVM classifiers are coupled with probability estimates to make the final prediction. The algorithm was tested on three datasets, containing 498 domains, 1092 domains and 5261 domains. Ten-fold external cross-validation was performed to assess the performance of the algorithm. A high accuracy of 92.89% was obtained for the 498-domain dataset. We achieved 54.67% accuracy for the dataset with 1092 domains, which is better than the previously reported best accuracy of 53.8%. We obtained 59.43% prediction accuracy for the larger and less redundant 5261-domain dataset. We also investigated the advantage of using class-wise features over the union of these features (the conventional approach) in a one-vs.-all SVM framework. Our results clearly show the advantage of using class-wise optimized features. A brief analysis of the selected class-wise features indicates their biological significance.
12.
A new approach for predicting the structural classes of protein domain sequences is presented in this paper. Besides the amino acid composition, the compositions of several dipeptides, tripeptides, tetrapeptides, pentapeptides and hexapeptides are taken into account, selected by stepwise discriminant analysis. The jackknife test shows that this new approach can lead to higher predictive sensitivity and specificity for datasets with reduced sequence similarity. For the dataset PDB40-B constructed by Brenner and colleagues, 75.2% of protein domain sequences are correctly assigned in the jackknife test for the four structural classes: all-alpha, all-beta, alpha/beta and alpha + beta. This is an improvement of 19.4% in the jackknife test and 25.5% in the resubstitution test over the component-coupled algorithm using amino acid composition alone (AAC approach) on the same dataset. In the cross-validation test with dataset PDB40-J constructed by Park and colleagues, more than 80% predictive accuracy is obtained. Furthermore, for the dataset constructed by Chou and Maggiora, accuracies of 100% and 99.7% can be easily achieved in the resubstitution test and in the jackknife test, respectively, merely by taking the composition of dipeptides into account. Therefore, this new method provides an effective tool to extract valuable information from protein sequences, which can be used for the systematic analysis of small or medium-size protein sequences. The computer programs used in this paper are available on request.
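The difference between the resubstitution test and the jackknife test mentioned above can be illustrated with a toy example. In resubstitution the classifier is tested on the same samples it was trained on; in the jackknife (leave-one-out) test each sample is predicted by a model built from all the others. The sketch below uses a 1-nearest-neighbour classifier as a stand-in for the stepwise discriminant analysis of the abstract, and the two-dimensional feature vectors are invented for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict_1nn(query, samples):
    """Label of the sample whose vector is closest to the query."""
    _, label = min(samples, key=lambda s: squared_distance(query, s[0]))
    return label

def resubstitution_accuracy(samples):
    """Test every sample against a model that still contains it."""
    correct = sum(predict_1nn(vec, samples) == label for vec, label in samples)
    return correct / len(samples)

def jackknife_accuracy(samples):
    """Leave each sample out in turn and predict it from the rest."""
    correct = 0
    for i, (vec, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        correct += predict_1nn(vec, rest) == label
    return correct / len(samples)

# Invented toy features; the last point sits closer to the other class.
samples = [
    ([0.9, 0.1], "all-alpha"),
    ([0.8, 0.2], "all-alpha"),
    ([0.1, 0.9], "all-beta"),
    ([0.2, 0.8], "all-beta"),
    ([0.55, 0.45], "all-beta"),
]
resub = resubstitution_accuracy(samples)
jack = jackknife_accuracy(samples)
```

On this toy data the resubstitution accuracy is 1.0 while the jackknife accuracy is 0.8, because the awkward fifth point is only classified correctly when it is allowed to match itself. This is why jackknife figures are generally regarded as the more stringent of the two.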
13.
An optimization approach to predicting protein structural class from amino acid composition. (total citations: 11, self-citations: 0, citations by others: 11)
Proteins are generally classified into four structural classes: all-alpha proteins, all-beta proteins, alpha + beta proteins, and alpha/beta proteins. In this article, a protein is expressed as a vector in a 20-dimensional space, in which its 20 components are defined by the composition of its 20 amino acids. Based on this, a new method, the so-called maximum component coefficient method, is proposed for predicting the structural class of a protein according to its amino acid composition. In comparison with the existing methods, the new method yields a higher overall prediction accuracy. Especially for the all-alpha proteins, the rate of correct prediction obtained by the new method is much higher than that of any existing method. For instance, for the 19 all-alpha proteins investigated previously by P.Y. Chou, the rate of correct prediction by means of his method was 84.2%, but the correct rate when predicted with the new method would be 100%! Furthermore, the new method is characterized by an explicable physical picture. This is reflected by the process in which the vector representing a protein to be predicted is decomposed into four component vectors, each of which corresponds to one of the norms of the four protein structural classes.
14.
Maqsood Hayat 《Journal of theoretical biology》2011,271(1):10-3077
Membrane proteins are a vital type of protein that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides valuable information for predicting novel examples of the membrane protein types. However, classification of membrane protein types can be both time-consuming and susceptible to errors because of the inherent similarity of membrane protein types. In this paper, a neural-network-based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence; it includes seven feature sets: amino acid composition, sequence length, 2-gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In the independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers also exhibit improved performance on other measures such as sensitivity, specificity, Matthews correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor.
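The performance measures named in this abstract have standard definitions in terms of a binary confusion matrix, sketched below. The function name and the example counts are hypothetical; for a multi-class problem such as membrane protein types, one common convention (assumed here, not confirmed by the abstract) is to compute them per class in a one-versus-rest fashion.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, F-measure and Matthews correlation coefficient.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives. Undefined ratios are reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity:
        f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_measure = 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Hypothetical counts for one membrane protein type versus all others.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Unlike plain accuracy, the Matthews correlation coefficient stays informative when one class is much larger than the other, which is typical of membrane protein type datasets.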
15.
Support vector machines for predicting membrane protein types by using functional domain composition (total citations: 9, self-citations: 0, citations by others: 9)
Membrane proteins are generally classified into the following five types: 1) type I membrane proteins; 2) type II membrane proteins; 3) multipass transmembrane proteins; 4) lipid chain-anchored membrane proteins; and 5) GPI-anchored membrane proteins. In this article, based on the concept of using the functional domain composition to define a protein, the Support Vector Machine algorithm is developed for predicting the membrane protein type. High success rates are obtained by both the self-consistency and jackknife tests. The current approach, complemented by the powerful covariant discriminant algorithm based on the pseudo-amino acid composition, which incorporates the quasi-sequence-order effect as recently proposed by K. C. Chou (2001), may become a very useful high-throughput tool in the area of bioinformatics and proteomics.
16.
Jia P Qian Z Zeng Z Cai Y Li Y 《Biochemical and biophysical research communications》2007,357(2):366-370
Assigning subcellular localization (SL) to proteins is one of the major tasks of functional proteomics. Despite the impressive technical advances of the past decades, it is still time-consuming and laborious to determine SL experimentally on a high-throughput scale. Thus, computational prediction is the preferred method for large-scale assignment of protein SL, followed up by experimental studies where appropriate. In this report, using a machine learning approach, the Nearest Neighbor Algorithm (NNA), we developed a prediction system for protein SL that incorporates a protein functional domain profile. The overall accuracy achieved by this system is 93.96%. Furthermore, comparisons with other methods have been conducted to demonstrate the validity and efficiency of our prediction system. We also provide an implementation of our Subcellular Location Prediction System (SLPS), which is available at http://pcal.biosino.org.
17.
In this paper, based on an approach that combines the "functional domain composition" [K.C. Chou, Y. D. Cai, J. Biol. Chem. 277 (2002) 45765] and the pseudo-amino acid composition [K.C. Chou, Proteins Struct. Funct. Genet. 43 (2001) 246; Correction Proteins Struct. Funct. Genet. 2044 (2001) 2060], the Nearest Neighbour Algorithm (NNA) was developed for predicting the protein subcellular location. Very high success rates were observed, suggesting that such a hybrid approach may become a useful high-throughput tool in the area of bioinformatics and proteomics.
18.
19.
Kedarisetti KD Kurgan L Dick S 《Biochemical and biophysical research communications》2006,348(3):981-988
Structural class characterizes the overall folding type of a protein or its domain. A number of computational methods have been proposed to predict structural class from primary sequences; however, the accuracy of these methods is strongly affected by sequence homology. This paper proposes an ensemble classification method and a compact feature-based sequence representation. The method improves prediction accuracy for the four main structural classes compared with competing methods, and provides highly accurate predictions for sequences of widely varying homology. The experimental evaluation of the proposed method shows superior results across sequences that span the entire homology spectrum, ranging from 25% to 90% homology. The error rates were reduced by over 20% compared with using individual prediction methods and the most commonly used composition vector representation of protein sequences. Comparisons with competing methods on three large benchmark datasets consistently show the superiority of the proposed method.
20.
Computational prediction of protein structural class from sequence data remains a challenging problem in current protein science. In this paper, a new feature extraction approach based on relative polypeptide composition is introduced. This approach takes into account the background distribution of a given k-mer under a Markov model of order k-2, and avoids the curse of dimensionality as k increases by using a T-statistic feature selection strategy. The selected features are then fed to a support vector machine to perform the prediction. To verify the performance of our method, jackknife cross-validation tests are performed on four widely used benchmark datasets. Comparison of our results with existing methods shows that our method provides satisfactory performance for structural class prediction.
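One common way to obtain the background expectation of a k-mer under a Markov model of order k-2 is to estimate it from the counts of its two overlapping (k-1)-mers and their shared (k-2)-mer core. The sketch below uses that estimator and defines the relative composition as (observed - expected) / expected; both choices are assumptions for illustration, since the abstract does not give the exact formula, and the T-statistic feature selection step is not shown.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Counts of all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def relative_composition(seq, k):
    """Deviation of each observed k-mer from its order-(k-2) Markov expectation.

    Expected count of w = N(prefix) * N(suffix) / N(core), where prefix and
    suffix are the two (k-1)-mers of w and core is their shared (k-2)-mer.
    """
    if k < 3:
        raise ValueError("k must be at least 3 for an order k-2 model")
    observed = kmer_counts(seq, k)
    shorter = kmer_counts(seq, k - 1)
    core = kmer_counts(seq, k - 2)
    result = {}
    for kmer, obs in observed.items():
        expected = shorter[kmer[:-1]] * shorter[kmer[1:]] / core[kmer[1:-1]]
        result[kmer] = (obs - expected) / expected
    return result

features = relative_composition("ABABAB", 3)
```

A value near zero means the k-mer occurs about as often as its shorter constituents predict, so it carries little extra information; large positive or negative values flag k-mers that are over- or under-represented relative to the background.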