20 similar articles found (search time: 31 ms)
1.
Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition (total citations: 1, self-citations: 0, citations by others: 1)
As more and more genomes have been discovered in recent years, there is an urgent need to develop a reliable method to predict the subcellular localization for the explosion of newly found proteins. However, many well-known prediction methods based on amino acid composition have problems utilizing the sequence-order information. Here, based on the concept of Chou's pseudo amino acid composition (PseAA), a new feature extraction method, the multi-scale energy (MSE) approach, is introduced to incorporate the sequence-order information. First, a protein sequence was mapped to a digital signal using the amino acid index. Then, by wavelet transform, the mapped signal was broken down into several scales in which the energy factors were calculated and further formed into an MSE feature vector. Following this, combining this MSE feature vector with amino acid composition (AA), we constructed a series of MSEPseAA feature vectors to represent the protein subcellular localization sequences. Finally, according to a new kind of normalization approach, the MSEPseAA feature vectors were normalized to form the improved MSEPseAA vectors, named IEPseAA. Using the technique of IEPseAA, C-support vector machine (C-SVM) and three multi-class SVM strategies, quite promising results were obtained, indicating that MSE is quite effective in reflecting the sequence-order effects and might become a useful tool for predicting the other attributes of proteins as well.
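The pipeline described in this abstract (sequence to numeric signal, wavelet decomposition, per-scale energy, concatenation with amino acid composition) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Kyte-Doolittle hydropathy scale stands in for the unspecified amino acid index, a hand-written Haar transform stands in for the unspecified wavelet, "energy" is taken as the mean squared coefficient per scale, and the function names are hypothetical. The paper's normalization step (IEPseAA) is not reproduced.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as one possible amino acid
# index (an assumption; the abstract does not say which index was used).
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def amino_acid_composition(seq):
    """Return the 20 residue frequencies of seq."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        signal = signal + [signal[-1]]  # pad odd lengths by repeating the last value
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def multi_scale_energy(seq, levels=3):
    """Mean squared detail coefficient per scale, plus the final approximation."""
    signal = [HYDROPATHY[a] for a in seq]
    energies = []
    for _ in range(levels):
        signal, detail = haar_step(signal)
        energies.append(sum(d * d for d in detail) / len(detail))
    energies.append(sum(a * a for a in signal) / len(signal))
    return energies


def mse_pseaa(seq, levels=3):
    """Concatenate amino acid composition (20 values) with levels + 1 energies."""
    return amino_acid_composition(seq) + multi_scale_energy(seq, levels)
```

With `levels=3` the sketch yields a 24-dimensional vector per sequence, which could then be passed to any SVM implementation.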
2.
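Classification of protein homo-oligomers based on amino acid composition distribution (total citations: 7, self-citations: 0, citations by others: 7)
Based on a new feature extraction method, amino acid composition distribution, four classes of homo-oligomers were classified from protein primary sequences, using support vector machines as member classifiers and a "one-versus-one" multi-class classification strategy. The results show that, under the 10-CV test, the overall classification accuracy and accuracy index based on amino acid composition distribution reached 86.22% and 67.12%, respectively. These values are 5.74 and 10.03 percentage points higher than those of the conventional feature extraction method based on amino acid composition, and 3.12 and 5.63 percentage points higher than those of the dipeptide composition feature extraction method, indicating that amino acid composition distribution is a very effective feature extraction method for classifying protein homo-oligomers. When amino acid composition distribution was combined with the protein sequence length feature, the overall classification accuracy and accuracy index reached 86.35% and 67.23%, respectively, indicating that the sequence length feature contains some spatial structure information.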
3.
Prior knowledge of a protein's structural class can provide useful information about its overall structure, so quick and accurate
computational determination of protein structural class is very important in protein science. One
key to a computational method is accurate representation of protein samples. Here, based on the concept of Chou’s pseudo-amino
acid composition (AAC, Chou, Proteins: structure, function, and genetics, 43:246–255, 2001), a novel method of feature extraction
that combined continuous wavelet transform (CWT) with principal component analysis (PCA) was introduced for the prediction
of protein structural classes. Firstly, the digital signal was obtained by mapping each amino acid according to various physicochemical
properties. Secondly, CWT was used to extract a new feature vector based on the wavelet power spectrum (WPS), which contains
richer sequence-order information in both the frequency and time domains, and PCA was then used to reorganize the feature
vector to reduce information redundancy and computational complexity. Finally, a pseudo-amino acid composition feature vector
was formed to represent the primary sequence by coupling the AAC vector with a set of new WPS feature vectors in an orthogonal
space by PCA. As a showcase, the rigorous jackknife cross-validation test was performed on the working datasets. The results
indicated that prediction quality has been improved, and the current approach of protein representation may serve as a useful
complementary vehicle in classifying other attributes of proteins, such as enzyme family class, subcellular localization,
membrane protein types and protein secondary structure, etc.
4.
Betsy Sheena Cherian Achuthsankar S. Nair 《Biochemical and biophysical research communications》2010,391(4):1670-1674
The subcellular location of a protein is useful information for determining its function, screening for drug candidates, vaccine design, annotation of gene products and selecting relevant proteins for further studies. Computational prediction of subcellular localization deals with predicting the location of a protein from its amino acid sequence. For a computational localization prediction method to be more accurate, it should exploit all possible relevant biological features that contribute to the subcellular localization. In this work, we extracted the biological features from the full length protein sequence to incorporate more biological information. A new biological feature, the distribution of atomic composition, is used effectively together with multiple physicochemical properties, amino acid composition, three-part amino acid composition, and sequence similarity to predict the subcellular location of the protein. Support Vector Machines are designed for four modules and prediction is made by a weighted voting system. Our system achieves accuracies of 100%, 82.47% and 88.81% for the self-consistency test, jackknife test and independent data test, respectively. Our results provide evidence that prediction based on biological features derived from the full length amino acid sequence gives better accuracy than prediction based on features derived from the N-terminal alone. Treating the features as a distribution over the entire sequence brings out the underlying property distribution in greater detail and enhances prediction accuracy.
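The weighted voting step mentioned in this abstract can be illustrated with a short sketch. The module names, weights, and location labels below are hypothetical placeholders; the abstract does not state how the weights were chosen or how ties were resolved.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-module label predictions by summing module weights per label.

    predictions: dict mapping module name -> predicted location label
    weights:     dict mapping module name -> non-negative weight
    Ties are broken in favour of the label that was seen first.
    """
    scores = defaultdict(float)
    for module, label in predictions.items():
        scores[label] += weights[module]
    return max(scores, key=scores.get)


# Hypothetical outputs of four SVM modules for one protein.
predictions = {
    "atomic_composition": "nucleus",
    "physicochemical": "cytoplasm",
    "amino_acid_composition": "nucleus",
    "sequence_similarity": "cytoplasm",
}
# Hypothetical weights, e.g. derived from each module's validation accuracy.
weights = {
    "atomic_composition": 0.2,
    "physicochemical": 0.3,
    "amino_acid_composition": 0.2,
    "sequence_similarity": 0.4,
}
final_label = weighted_vote(predictions, weights)
```

Here the two modules voting for "cytoplasm" carry more total weight than the two voting for "nucleus", so the weighted vote differs from what a simple majority rule with first-seen tie-breaking would return.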
5.
In this paper, a novel approach, ELM-PCA, is introduced for the first time to predict protein subcellular localization. Firstly, protein samples are represented by the pseudo amino acid composition (PseAAC). Secondly, principal component analysis (PCA) is employed to extract essential features. Finally, the Elman recurrent neural network (RNN) is used as a classifier to identify the protein sequences. The results demonstrate that the proposed approach is effective and practical.
6.
Knowing protein structure and inferring its function from the structure is one of the main issues of computational structural biology, and often the first step is studying protein secondary structure. There have been many attempts to predict protein secondary structure contents. Previous attempts assumed that the content of protein secondary structure can be predicted successfully using the information on the amino acid composition of a protein. Recent methods achieved remarkable prediction accuracy by using the expanded composition information. The overall average error of the most successful method is 3.4%. Here, we demonstrate that even if we use only the simple amino acid composition information, it is possible to improve the prediction accuracy significantly if evolutionary information is included. The idea is motivated by the observation that evolutionarily related proteins share similar structures. After calculating the homolog-averaged amino acid composition of a protein, which can be easily obtained from the multiple sequence alignment by running PSI-BLAST, those 20 numbers are learned by a multiple linear regression, an artificial neural network and a support vector regression. The overall average error of the support vector regression method is 3.3%. It is remarkable that we obtain comparable accuracy without utilizing expanded composition information such as pair-coupled amino acid composition. This work again demonstrates that the amino acid composition is a fundamental characteristic of a protein. It is anticipated that our novel idea can be applied to many areas of protein bioinformatics where the amino acid composition information is utilized, such as subcellular localization prediction, enzyme subclass prediction, domain boundary prediction, signal sequence prediction, and prediction of unfolded segments in a protein sequence, to name a few.
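A homolog-averaged amino acid composition of the kind described here might be computed along the following lines. This is only a sketch: it assumes the aligned homologs are already available as gapped strings (for example, extracted from a PSI-BLAST alignment by some other tool), weights every homolog equally, and ignores gap and non-standard characters. The abstract does not specify the exact averaging scheme, so these choices are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Residue frequencies of one sequence, ignoring gaps and non-standard letters.

    Assumes the sequence contains at least one standard residue.
    """
    residues = [c for c in seq if c in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n for a in AMINO_ACIDS]


def homolog_averaged_composition(aligned_seqs):
    """Average the 20-dimensional composition over all aligned homologs."""
    comps = [composition(s) for s in aligned_seqs]
    return [sum(col) / len(comps) for col in zip(*comps)]


# Toy alignment: the query plus two hypothetical homologs.
alignment = ["AC-D", "ACAD", "A--C"]
features = homolog_averaged_composition(alignment)
```

The resulting 20 numbers play the role of the input features that the abstract says are fed to the regression models.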
7.
Summary. The subnuclear localization of nuclear proteins is very important for an in-depth understanding of the construction and function
of the nucleus. The pseudo amino acid composition (PseAA), as originally introduced by K. C. Chou, can
incorporate much more information about a protein sequence than the classical amino acid composition and so significantly enhances
the power of a discrete model to predict various attributes of a protein. Based on the amino acid and pseudo amino acid composition, an algorithm combining increment of diversity
with improved quadratic discriminant analysis is proposed to predict protein subnuclear location. The overall predictive
success rates and correlation coefficient are 75.4% and 0.629 for 504 single localization proteins in jackknife test, and
80.4% for an independent set of 92 multi-localization proteins, respectively. For 406 single localization nuclear proteins
with ≤25% sequence identity, the results of jackknife test show that the overall accuracy of prediction is 77.1%.
Authors’ address: Qian-Zhong Li, Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology,
Inner Mongolia University, Hohhot 010021, China
8.
Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach (total citations: 1, self-citations: 0, citations by others: 1)
The location of a protein in a cell is closely correlated with its biological function. Based on the concept that the protein subcellular location is mainly determined by its amino acid and pseudo amino acid composition (PseAA), a new algorithm of increment of diversity combined with support vector machine is proposed to predict the protein subcellular location. The subcellular locations of plant and non-plant proteins are investigated by our method. The overall prediction accuracies in jackknife test are 88.3% for the eukaryotic plant proteins and 92.4% for the eukaryotic non-plant proteins, respectively. In order to estimate the effect of the sequence identity on predictive result, the proteins with sequence identity
9.
Identifying a protein's subcellular localization is an important step in understanding its function. However, the experimental work involved is usually laborious, time consuming and costly, so computational prediction is a valuable way to reduce this inefficiency. Here we provide a method to predict protein subcellular localization by using amino acid composition and physicochemical properties. The method concatenates the information extracted from a protein's N-terminal, middle and full sequence. Each part is represented by amino acid composition, weighted amino acid composition, five-level grouping composition and five-level dipeptide composition. We divided our dataset into training and testing sets. The training set is used to determine the best performing amino acid index by five-fold cross validation, whereas the testing set acts as the independent dataset to evaluate the performance of our model. With the novel representation method, we achieve an accuracy of approximately 75% on the independent dataset. We conclude that this new representation indeed performs well and is able to extract the protein sequence information. We have developed a web server for predicting protein subcellular localization. The web server is available at http://aaindexloc.bii.a-star.edu.sg .
10.
Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition (total citations: 1, self-citations: 0, citations by others: 1)
The successful prediction of protein subcellular localization directly from protein primary sequence is useful for protein function prediction and drug discovery. In this paper, by using the concept of pseudo amino acid composition (PseAAC), the mycobacterial proteins are studied and predicted by support vector machine (SVM) and increment of diversity combined with modified Mahalanobis Discriminant (IDQD). The results of jackknife cross-validation for 450 non-redundant proteins show that the overall prediction success rates of SVM and IDQD are 82.2% and 79.1%, respectively. Compared with other existing methods, SVM combined with PseAAC displays higher accuracies.
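For orientation, the pseudo amino acid composition referred to in this and several other entries is commonly written in the following form, after Chou (2001), which is cited elsewhere in this list. The abstract itself does not give the formula, so the symbols here follow that general formulation rather than this paper's specific settings: $N$ is the sequence length, $R_i$ the residue at position $i$, $f_i$ the normalized frequency of amino acid type $i$, $\Theta$ a coupling function between two residues (for example, built from physicochemical property differences), $w$ a weight factor, and $\lambda$ the maximum sequence-order lag.

```latex
\theta_k = \frac{1}{N-k} \sum_{i=1}^{N-k} \Theta\left(R_i, R_{i+k}\right),
\qquad k = 1, 2, \dots, \lambda

x_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20, \\[2ex]
\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 21 \le u \le 20 + \lambda
\end{cases}
```

The first 20 components carry the conventional amino acid composition, and the remaining $\lambda$ components carry sequence-order information; the resulting $(20+\lambda)$-dimensional vector is what a classifier such as an SVM would take as input.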
11.
Information on protein subcellular location plays an important role in molecular cell biology. Prediction of the subcellular location of proteins will help to understand their functions and interactions. In this paper, a different mode of pseudo amino acid composition was proposed to represent protein samples for predicting their subcellular localization via the following procedures: based on the optimal splice site of each protein sequence, we divided a sequence into a sorting signal part and a mature protein part, and extracted sequence features from each part separately. Then, the combined features were fed into the SVM classifier to perform the prediction. By the jackknife test on a benchmark dataset in which none of the included proteins has more than 90% pairwise sequence identity to any other, the overall accuracies achieved by the method are 94.5% and 90.3% for prokaryotic and eukaryotic proteins, respectively. The results indicate that the prediction quality of our method is quite satisfactory. It is anticipated that the current method may serve as an alternative approach to the existing prediction methods.
12.
Prediction of protein subcellular multi-localization based on the general form of Chou's pseudo amino acid composition (total citations: 1, self-citations: 0, citations by others: 1)
Many proteins bear multi-locational characteristics, and this phenomenon is closely related to biological function. However, most of the existing methods can only deal with single-location proteins. Therefore, an automatic and reliable ensemble classifier for protein subcellular multi-localization is needed. We propose a new ensemble classifier combining the KNN (K-nearest neighbour) and SVM (support vector machine) algorithms to predict the subcellular localization of eukaryotic, Gram-negative bacterial and viral proteins based on the general form of Chou's pseudo amino acid composition, i.e., GO (gene ontology) annotations, dipeptide composition and AmPseAAC (Amphiphilic pseudo amino acid composition). This ensemble classifier was developed by fusing many basic individual classifiers through a voting system. The overall prediction accuracies obtained by the KNN-SVM ensemble classifier are 95.22, 93.47 and 80.72% for the eukaryotic, Gram-negative bacterial and viral proteins, respectively. Our prediction accuracies are significantly higher than those by previous methods and reveal that our strategy better predicts subcellular locations of multi-location proteins.
13.
Prediction of the subcellular location of prokaryotic proteins based on a new representation of the amino acid composition (total citations: 3, self-citations: 0, citations by others: 3)
Feng ZP 《Biopolymers》2001,58(5):491-499
A new representation of protein sequence is presented in this paper, in which each protein can be represented by a 20-dimensional (20D) vector of unit length. Inspired by the principle of superposition of states in quantum mechanics, the squares of the 20 components of the vector correspond to the amino acid composition. Using the new representation of the primary sequence and the Bayes Discriminant Algorithm, the subcellular location of prokaryotic proteins was predicted. The overall predictive accuracy in the jackknife test can be 3% higher than the result of using amino acid composition directly for the database with sequence identity less than 90%, and 5% higher when sequence identity is less than 80%. The higher predictive accuracy indicates that the current measure of extracting information from the primary sequence is efficient. Since subcellular location restricts a protein's possible functions, the present method should also be a useful measure for the systematic analysis of genome data. The program used in this paper is available on request.
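The representation described here, a unit-length 20D vector whose squared components equal the amino acid composition, amounts to taking the square root of each residue frequency. A minimal sketch, assuming the sequence contains only the 20 standard residues (the function name is hypothetical):

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def unit_vector_representation(seq):
    """Return the 20D vector whose squared components are the residue frequencies.

    Because the frequencies sum to 1, the vector has unit Euclidean length.
    """
    n = len(seq)
    return [math.sqrt(seq.count(a) / n) for a in AMINO_ACIDS]


vector = unit_vector_representation("AACD")
squared_norm = sum(x * x for x in vector)
```

For the toy sequence "AACD" the component for A is the square root of 0.5, and the squared components sum to 1, as the unit-length property requires.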
14.
Summary. The interaction of non-covalently bound monomeric protein subunits forms oligomers. The oligomeric proteins are superior to
the monomers within the scope of functional evolution of biomacromolecules. Such complexes are involved in various biological
processes, and play an important role. It is highly desirable to predict oligomer types automatically from their sequence.
Here, based on the concept of pseudo amino acid composition, an improved feature extraction method of weighted auto-correlation
function of amino acid residue index and Naive Bayes multi-feature fusion algorithm is proposed and applied to predict protein
homo-oligomer types. We used the support vector machine (SVM) as base classifiers, in order to obtain better results. For
example, the total accuracies of A, B, C, D and E sets based on this improved feature extraction method are 77.63, 77.16,
76.46, 76.70 and 75.06% respectively in the jackknife test, which are 6.39, 5.92, 5.22, 5.46 and 3.82% higher than that of
G set based on conventional amino acid composition method with the same SVM. Compared with Chou’s feature extraction method
of incorporating quasi-sequence-order effect, our method can increase the total accuracy at a level of 3.51 to 1.01%. The
total accuracy improves from 79.66 to 80.83% by using the Naive Bayes Feature Fusion algorithm. These results show: 1) The
improved feature extraction method is effective and feasible, and the feature vectors based on this method may contain more
protein quaternary structure information and appear to capture essential information about the composition and hydrophobicity
of residues in the surface patches that are buried in the interfaces of associated subunits; 2) Naive Bayes Feature Fusion algorithm
and SVM can be regarded as a powerful computational tool for predicting protein homo-oligomer types.
15.
16.
MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition (total citations: 10, self-citations: 0, citations by others: 10)
Höglund A Dönnes P Blum T Adolph HW Kohlbacher O 《Bioinformatics (Oxford, England)》2006,22(10):1158-1165
MOTIVATION: Functional annotation of unknown proteins is a major goal in proteomics. A key annotation is the prediction of a protein's subcellular localization. Numerous prediction techniques have been developed, typically focusing on a single underlying biological aspect or predicting a subset of all possible localizations. An important step is taken towards emulating the protein sorting process by capturing and bringing together biologically relevant information, and addressing the clear need to improve prediction accuracy and localization coverage. RESULTS: Here we present a novel SVM-based approach for predicting subcellular localization, which integrates N-terminal targeting sequences, amino acid composition and protein sequence motifs. We show how this approach improves the prediction based on N-terminal targeting sequences, by comparing our method TargetLoc against existing methods. Furthermore, MultiLoc performs considerably better than comparable methods predicting all major eukaryotic subcellular localizations, and shows better or comparable results to methods that are specialized on fewer localizations or for one organism. AVAILABILITY: http://www-bs.informatik.uni-tuebingen.de/Services/MultiLoc/
17.
Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs (total citations: 14, self-citations: 0, citations by others: 14)
MOTIVATION: The subcellular location of a protein is closely correlated to its function. Thus, computational prediction of subcellular locations from the amino acid sequence information would help annotation and functional prediction of protein coding genes in complete genomes. We have developed a method based on support vector machines (SVMs). RESULTS: We considered 12 subcellular locations in eukaryotic cells: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular medium, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, and vacuole. We constructed a data set of proteins with known locations from the SWISS-PROT database. A set of SVMs was trained to predict the subcellular location of a given protein based on its amino acid, amino acid pair, and gapped amino acid pair compositions. The predictors based on these different compositions were then combined using a voting scheme. Results obtained through 5-fold cross-validation tests showed an improvement in prediction accuracy over the algorithm based on the amino acid composition only. This prediction method is available via the Internet.
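The amino acid pair and gapped amino acid pair compositions named in this abstract can be sketched as 400-dimensional frequency vectors. The exact definition used by the authors (gap sizes, normalization) is not given in the abstract, so the version below is an assumption: `gap=0` counts adjacent pairs, `gap=g` counts pairs separated by `g` intervening residues, and counts are divided by the number of pair positions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order: AA, AC, AD, ...
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def gapped_pair_composition(seq, gap=0):
    """Frequencies of residue pairs (seq[i], seq[i + gap + 1]) over the sequence.

    Pairs containing non-standard letters are skipped; a sequence too short
    to contain any pair yields an all-zero vector.
    """
    step = gap + 1
    total = len(seq) - step
    if total <= 0:
        return [0.0] * len(PAIRS)
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(total):
        pair = seq[i] + seq[i + step]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in PAIRS]


adjacent = gapped_pair_composition("ACAC", gap=0)   # pairs: AC, CA, AC
one_gap = gapped_pair_composition("ACAC", gap=1)    # pairs: AA, CC
```

One classifier could be trained per composition type and the outputs combined by voting, as the abstract describes; that combination step is not shown here.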
18.
Apoptosis proteins are very important for understanding the mechanism of programmed cell death. Obtaining information on subcellular
location of apoptosis proteins is very helpful to understand the apoptosis mechanism. In this paper, based on amino acid substitution
matrix and auto covariance transformation, we introduce a new sequence-based model, which not only quantitatively describes
the differences between amino acids, but also partially incorporates the sequence-order information. This method is applied
to predict the subcellular location of apoptosis proteins in two widely used datasets with a support vector machine classifier.
The results obtained by jackknife test are quite promising, indicating that the proposed method might serve as a potential
and efficient prediction model for apoptosis protein subcellular location prediction.
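The auto covariance transformation mentioned in this abstract turns a variable-length numeric profile of a sequence into a fixed-length vector of lagged covariances. A minimal sketch is given below; it assumes the sequence has already been mapped to a single numeric signal (for instance, one column of a substitution matrix looked up per residue, which is an assumption about how the mapping is done) and uses the usual definition with normalization by the number of terms at each lag.

```python
def auto_covariance(signal, max_lag):
    """Return [AC(1), ..., AC(max_lag)] for a numeric signal.

    AC(lag) = 1 / (n - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean)
    Requires max_lag < len(signal).
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


# A strictly alternating toy signal: neighbours are anti-correlated,
# positions two apart are perfectly correlated.
ac = auto_covariance([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], max_lag=2)
```

Repeating this for each numeric property and concatenating the results gives a fixed-length feature vector regardless of sequence length, which is what allows a standard SVM to be applied.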
19.
20.
Using cellular automata images and pseudo amino acid composition to predict protein subcellular location (total citations: 6, self-citations: 0, citations by others: 6)
Summary. The avalanche of newly found protein sequences in the post-genomic era has motivated and challenged us to develop an automated
method that can rapidly and accurately predict the localization of an uncharacterized protein in cells because the knowledge
thus obtained can greatly speed up the process in finding its biological functions. However, it is very difficult to establish
such a desired predictor by acquiring the key statistical information buried in a pile of extremely complicated and highly
variable sequences. In this paper, based on the concept of the pseudo amino acid composition (Chou, K. C. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246–255), the approach of cellular automata image is introduced to cope with this problem. Many important features,
which are originally hidden in the long amino acid sequences, can be clearly displayed through their cellular automata images.
One remarkable merit of doing so is that many image recognition tools can be straightforwardly applied to the target
aimed here. High success rates were observed through the self-consistency, jackknife, and independent dataset tests, respectively.