Similar articles
 20 similar articles found (search time: 15 ms)
1.
Matrix metalloproteinases (MMPs) and disintegrin and metalloproteases (ADAMs) belong to the zinc-dependent metalloproteinase family of proteins. These proteins participate in various physiological and pathological states, so predicting them from amino acid sequence would be helpful. We have developed a method to predict these proteins based on features derived from Chou's pseudo amino acid composition (PseAAC) server, with a support vector machine (SVM) as the machine learning approach. With this method, an overall accuracy of 95.89% and a Matthews correlation coefficient (MCC) of 0.90 were achieved for the ADAM and MMP families. Furthermore, the method is able to predict two major subclasses of the MMP family, Furin-activated secreted MMPs and Type II trans-membrane, with MCCs of 0.89 and 0.91, respectively. The overall accuracies for Furin-activated secreted MMPs and Type II trans-membrane were 98.18% and 99.07%, respectively. Our data demonstrate an effective classification of the metalloproteinase family based on the concept of PseAAC and SVM.
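The two figures of merit quoted in this abstract, overall accuracy and the Matthews correlation coefficient, can both be computed from a binary confusion matrix. The following is a minimal sketch and not the authors' code; the function names and the counts in the usage lines are hypothetical.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, a value in [-1, 1] (not a percentage)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: report 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    print(accuracy(45, 50, 3, 2), mcc(45, 50, 3, 2))
```

Because MCC is a correlation rather than a rate, it is reported as a plain number such as 0.90, whereas accuracy is usually given as a percentage.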

2.
DNA-binding proteins play an important role in most cellular processes, such as gene regulation, recombination, repair, replication, and DNA modification. In this article, an optimal Chou's pseudo amino acid composition (PseAAC) based on the physicochemical properties of amino acids is proposed to represent proteins for identifying DNA-binding proteins. Six physicochemical properties of amino acids are used to generate the sequence features via the PseAAC web server. The optimal values of two important PseAAC parameters (correlation factor δ and weighting factor w) are determined to obtain an appropriate representation of proteins, which ultimately results in better prediction performance. Experimental results on benchmark datasets using random forest show that the method is promising for predicting DNA-binding proteins and may at least be a useful supplement to existing methods.
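To make the role of the two parameters concrete, here is a minimal sketch of the classic (type 1) PseAAC vector, which appends sequence-order correlation terms to the 20 amino acid frequencies. It is not the PseAAC web server's implementation. The parameter name `lam` (the number of correlation tiers, assumed here to play the role of the correlation factor in the abstract), the default `w=0.05`, and the use of the Kyte-Doolittle hydropathy scale as the single property are all assumptions made for illustration; the abstract uses six properties that it does not list.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used only as a stand-in property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(prop):
    """Rescale a property table to zero mean and unit variance over the 20 residues."""
    values = list(prop.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / std for aa, v in prop.items()}


def pseaac(seq, props, lam, w=0.05):
    """Return a (20 + lam)-dimensional type 1 PseAAC vector."""
    seq = seq.upper()
    if any(r not in AMINO_ACIDS for r in seq):
        raise ValueError("sequence contains non-standard residues")
    n = len(seq)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    norm = [standardize(p) for p in props]
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # k-th tier: mean squared property difference between residues k apart.
        total = 0.0
        for i in range(n - k):
            total += sum((p[seq[i]] - p[seq[i + k]]) ** 2 for p in norm) / len(norm)
        thetas.append(total / (n - k))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

A larger `lam` captures longer-range order at the cost of more dimensions, and `w` controls how much the correlation terms weigh against plain composition, which is why both are tuned.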

3.
4.
5.
The successful prediction of protein subcellular localization directly from the primary sequence is useful for protein function prediction and drug discovery. In this paper, using the concept of pseudo amino acid composition (PseAAC), mycobacterial proteins are studied and predicted by a support vector machine (SVM) and by the increment of diversity combined with the modified Mahalanobis discriminant (IDQD). The results of jackknife cross-validation on 450 non-redundant proteins show that the overall prediction success rates of SVM and IDQD are 82.2% and 79.1%, respectively. Compared with other existing methods, SVM combined with PseAAC displays higher accuracy.

6.
For a protein, an important characteristic is its location or compartment in a cell, because a protein has to be located in its proper position in a cell to perform its biological functions. Therefore, predicting protein subcellular location is an important and challenging task in current molecular and cellular biology. In this paper, a new computational method based on the AdaBoost.ME algorithm and Chou's PseAAC (pseudo amino acid composition) was developed to identify protein subcellular location. AdaBoost.ME is an improved version of the AdaBoost algorithm that directly extends the original algorithm to multi-class cases without the need to reduce them to multiple two-class problems. In some previous studies the conventional amino acid composition was used to represent protein samples. To take sequence-order effects into account, this study uses Chou's PseAAC to represent protein samples. To demonstrate that AdaBoost.ME is a robust and efficient model for predicting protein subcellular locations, the same protein dataset used by Cedano et al. (Journal of Molecular Biology, 1997, 266: 594-600) is adopted in this paper. The computed results show that the accuracy achieved by this method is higher than that of the methods developed by previous investigators.

7.
Apoptosis proteins play an essential role in regulating the balance between cell proliferation and death. The successful prediction of the subcellular localization of apoptosis proteins directly from the primary sequence greatly benefits the understanding of programmed cell death and drug discovery. In this paper, using Chou's pseudo amino acid composition (PseAAC), a total of 317 apoptosis proteins are predicted by a support vector machine (SVM). Jackknife cross-validation is applied to test the predictive capability of the proposed method. The results show that the overall prediction accuracy is 91.1%, which is higher than that of previous methods. Furthermore, another dataset containing 98 apoptosis proteins is examined with the proposed method; the overall prediction success rate is 92.9%.

8.
Prediction of thermophilic and mesophilic proteins plays a crucial role in both biochemistry and bioengineering. In this study, a different mode of pseudo amino acid composition (PseAAC) was proposed to formulate the protein samples by integrating the amino acid composition, the physicochemical features, and the composition, transition, and distribution features, where each protein sample was represented by a numerical vector through the sequence-based approach. Using the support vector machine algorithm, an accurate and reliable classifier was constructed to predict thermophilic and mesophilic proteins. Moreover, three feature reduction algorithms were applied to locate the most vital features and reduce the size of the feature space. Among the three, the genetic algorithm performed best. Finally, with the reduced features extracted by the genetic algorithm, the new classifier achieved a high accuracy of 95.93% on the selected dataset, with a Matthews correlation coefficient of 0.9187.

9.
With the completion of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace of determining their biological attributes is much slower. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. This imbalance, which has critically limited our ability to make timely use of newly discovered proteins for basic research and drug development, has called for computational methods or high-throughput automated tools that quickly and reliably identify various attributes of uncharacterized proteins from their sequence information alone. Indeed, during the last two decades or so, many methods have been established in the hope of bridging this gap. In developing these methods, the following points often needed to be considered: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. This review discusses each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications, and its recent development, particularly how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.

10.
Integral membrane proteins are central to many cellular processes and constitute approximately 50% of potential targets for novel drugs. However, the number of outer membrane proteins (OMPs) present in the public structure database is very limited because of the difficulty of determining structures with experimental methods. Therefore, discriminating OMPs from non-OMPs with computational methods is of medical importance as well as a necessity for genome sequencing. In this study, some sequence-derived structural and physicochemical features of proteins were combined with amino acid composition to discriminate OMPs from non-OMPs using support vector machines. The discrimination performance of the proposed method is evaluated on a benchmark dataset of 208 OMPs, 673 globular proteins, and 206 α-helical membrane proteins. A high overall accuracy of 97.8% was observed in the 5-fold cross-validation test. In addition, the method distinguished OMPs from globular proteins and α-helical membrane proteins with overall accuracies of 98.2% and 96.4%, respectively. The prediction performance is superior to that of the state-of-the-art methods in the literature. It is anticipated that the method might be a powerful tool for the discrimination of OMPs.

11.
The function of a protein is closely correlated with its subcellular localization. Probing the mechanism of protein sorting and predicting protein subcellular location can provide important clues or insights for understanding the function of proteins. In this paper, we introduce a new PseAAC approach to encode the protein sequence based on the physicochemical properties of amino acid residues. Each protein sample was defined as a 146D (dimensional) vector comprising the 20 amino acid composition components and 126 adjacent triune residue contents. To evaluate the effectiveness of this encoding scheme, we performed jackknife tests on three datasets using the support vector machine algorithm. The total prediction accuracies are 84.9%, 91.2%, and 92.6%, respectively. These satisfactory results indicate that the method could be a useful tool in bioinformatics and proteomics.

12.
Ketoacyl synthases are enzymes involved in fatty acid synthesis and can be classified into five families based on primary sequence similarity. Different families have different catalytic mechanisms. Developing cost-effective computational models to identify the family of a ketoacyl synthase will be helpful for enzyme engineering and for understanding the catalytic mechanisms of individual enzymes. In this work, a support vector machine-based method was developed to predict the ketoacyl synthase family using the n-peptide composition of reduced amino acid alphabets. In jackknife cross-validation, the model based on the 2-peptide composition of a reduced amino acid alphabet of size 13 yielded the best overall accuracy of 96.44%, with an average accuracy of 93.36%, which is superior to other state-of-the-art methods. This result suggests that n-peptide compositions of reduced amino acid alphabets provide an efficient means for enzyme family classification and that the proposed model can be used efficiently for ketoacyl synthase family annotation.
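The feature described in this abstract can be sketched as follows: each residue is first mapped to its group in a reduced alphabet, and the frequencies of all length-n group words are then counted. This is a rough illustration, not the authors' implementation. The six-group alphabet `GROUPS` is hypothetical (the abstract does not list its size-13 alphabet), as are the function names.

```python
from itertools import product

# Hypothetical 6-group reduced alphabet, chosen only for illustration.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def build_mapping(groups):
    """Map every residue letter to the index of the group that contains it."""
    mapping = {}
    for index, group in enumerate(groups):
        for residue in group:
            mapping[residue] = index
    return mapping


def npeptide_composition(seq, groups=GROUPS, n=2):
    """Return the frequencies of all len(groups)**n reduced n-peptides."""
    mapping = build_mapping(groups)
    seq = seq.upper()
    if any(r not in mapping for r in seq):
        raise ValueError("sequence contains residues outside the alphabet")
    reduced = [mapping[r] for r in seq]
    windows = len(reduced) - n + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than n")
    keys = list(product(range(len(groups)), repeat=n))
    counts = dict.fromkeys(keys, 0)
    for i in range(windows):
        counts[tuple(reduced[i:i + n])] += 1
    return [counts[key] / windows for key in keys]
```

With an alphabet of size 13 and n = 2, the same function would yield 13 × 13 = 169 features instead of the 400 dipeptide features of the full 20-letter alphabet, which is the dimensionality reduction the approach relies on.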

13.
14.
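Oligomeric proteins participate widely in many life activities, so predicting them is of great significance. Starting from the protein sequence, this article proposes a multi-strategy sliding scalable-window feature extraction method and adopts a "one-versus-one" multi-class classification strategy to predict protein homo-oligomers. The results show that, under the jackknife test with a support vector machine, the weighted feature set combining the multi-strategy sliding scalable-window features with amino acid composition reached a highest overall classification accuracy of 75.37%, which is 10.05% higher than that of the amino acid composition method alone and 3.82% higher than that of BG_Zhang, the best feature in the reference literature. This indicates that the multi-strategy sliding scalable-window method is a very effective feature extraction method for classifying protein homo-oligomers.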

15.
To evaluate the possibility that an unknown protein corresponds to a resistance gene against Xanthomonas oryzae pv. oryzae, a different mode of pseudo amino acid composition (PseAAC) is proposed to formulate the protein samples by integrating the amino acid composition with the chaos game representation (CGR) method. Numerical comparisons of triangle, quadrangle, and 12-vertex polygon CGR are carried out to evaluate the efficiency of using these fractal figures in classifiers. The numerical results show that, among the three polygon methods, the triangle method offers a good fractal visualization and performs best in classifier construction. Using triangle + 12-vertex polygon CGR as the mathematical feature, the classifier achieves 98.13% in the jackknife test, with an MCC of 0.8462.

16.
Translation is a key process for gene expression. Timely identification of the translation initiation site (TIS) is very important for conducting in-depth genome analysis. With the avalanche of genome sequences generated in the postgenomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying TIS. Although some computational methods have been proposed in this regard, none of them considered the global or long-range sequence-order effects of DNA, and hence their prediction quality was limited. To account for such effects, a new predictor, called “iTIS-PseTNC,” was developed by incorporating physicochemical properties into the pseudo trinucleotide composition, quite similar to the PseAAC (pseudo amino acid composition) approach widely used in computational proteomics. A rigorous cross-validation test on the benchmark dataset showed that the overall success rate achieved by the new predictor in identifying TIS locations was over 97%. As a web server, iTIS-PseTNC is freely accessible at http://lin.uestc.edu.cn/server/iTIS-PseTNC. For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web server to obtain the desired results without the need to go through the detailed mathematical equations, which are presented in this paper just for the integrity of the new prediction method.

17.
Proteases are essential to most biological processes, though they themselves remain intact during the processes. In this research, a computational approach was developed for predicting the families of proteases based on their sequences. Following the concept of pseudo amino acid composition, and in order to capture the essential patterns in protease sequences, each protein sample was formulated by a series of its biological features. There were a total of 132 biological features, sourced from various biochemical and physicochemical properties of the constituent amino acids. The importance of these features to the prediction was rated by the Maximum Relevance Minimum Redundancy algorithm, and Incremental Feature Selection was then applied to select an optimal feature set, which was used to construct a predictor through the nearest neighbor algorithm. As a demonstration, the overall success rate of the jackknife test in identifying proteases among their seven families was 92.74%. Further analysis of the optimal feature set revealed that secondary structure and amino acid composition play the key roles in the classification, which is quite consistent with some previous findings. The promising results imply that the predictor presented in this paper may become a useful tool for studying proteases.

18.
Gao QB  Wang ZZ  Yan C  Du YH 《FEBS letters》2005,579(16):3444-3448
To understand the structure and function of a protein, an important task is to know where it occurs in the cell. Thus, a computational method for properly predicting the subcellular location of proteins would be significant in interpreting the original data produced by large-scale genome sequencing projects. The present work explores an effective method for extracting features from the protein primary sequence and a novel measurement of similarity among proteins for classifying a protein to its proper subcellular location. We considered four locations in eukaryotic cells and three locations in prokaryotic cells, which have been investigated by several groups in the past. A combined feature of the primary sequence, defined as a 430D (dimensional) vector, was used to represent a protein; it includes 20 amino acid compositions, 400 dipeptide compositions, and 10 physicochemical properties. To evaluate the prediction performance of this encoding scheme, a jackknife test based on the nearest neighbor algorithm was employed. The prediction accuracies for cytoplasmic, extracellular, mitochondrial, and nuclear proteins in the former dataset were 86.3%, 89.2%, 73.5%, and 89.4%, respectively, and the total prediction accuracy reached 86.3%. For cytoplasmic, extracellular, and periplasmic proteins in the latter dataset, the prediction accuracies were 97.4%, 86.0%, and 79.7%, respectively, and a total prediction accuracy of 92.5% was achieved. The results indicate that this method outperforms some existing approaches based on amino acid composition alone or on amino acid composition and dipeptide composition.
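The encoding and evaluation scheme in this abstract can be illustrated with a minimal sketch: a jackknife (leave-one-out) test around a nearest neighbor classifier. It is not the authors' method. It builds only the 20 + 400 composition features (the 10 physicochemical properties are not specified in the abstract and are omitted), and it substitutes plain Euclidean distance for the paper's similarity measure, which the abstract does not define. All function names are hypothetical.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def encode(seq):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies."""
    seq = seq.upper()
    if len(seq) < 2 or any(r not in AMINO_ACIDS for r in seq):
        raise ValueError("need at least two standard residues")
    n = len(seq)
    aac = [seq.count(aa) / n for aa in AMINO_ACIDS]
    pair_counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n - 1):
        pair_counts[seq[i:i + 2]] += 1
    dpc = [pair_counts[d] / (n - 1) for d in DIPEPTIDES]
    return aac + dpc


def euclidean(u, v):
    """Stand-in distance; the paper's own similarity measure is not given here."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jackknife_accuracy(samples):
    """Leave each (sequence, label) pair out once and predict it by 1-NN."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    encoded = [(encode(seq), label) for seq, label in samples]
    correct = 0
    for i, (features, label) in enumerate(encoded):
        others = encoded[:i] + encoded[i + 1:]
        nearest = min(others, key=lambda item: euclidean(features, item[0]))
        if nearest[1] == label:
            correct += 1
    return correct / len(encoded)
```

In the jackknife test every protein is predicted by a model that never saw it, so the reported accuracy does not depend on how a dataset happens to be split.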

19.
Ma J  Gu H 《BMB reports》2010,43(10):670-676
In this paper, a novel approach, ELM-PCA, is introduced for the first time to predict protein subcellular localization. First, protein samples are represented by the pseudo amino acid composition (PseAAC). Second, principal component analysis (PCA) is employed to extract essential features. Finally, the Elman recurrent neural network (RNN) is used as a classifier to identify the protein sequences. The results demonstrate that the proposed approach is effective and practical.

20.
G protein-coupled receptors (GPCRs) are among the most frequent targets of therapeutic drugs. With the avalanche of newly generated protein sequences in the post-genomic age, it is highly desirable to develop an automated method to rapidly identify GPCRs and their types in order to expedite drug discovery. A new predictor was developed by hybridizing two different modes of pseudo-amino acid composition (PseAAC): the functional domain PseAAC and the low-frequency Fourier spectrum PseAAC. The new predictor is called GPCR-2L, where "2L" means that it is a two-layer predictor: the first-layer prediction engine identifies a query protein as a GPCR or not; if it is, the prediction continues automatically to identify it as belonging to one of the following six types: (1) rhodopsin-like (Class A), (2) secretin-like (Class B), (3) metabotropic glutamate/pheromone (Class C), (4) fungal pheromone (Class D), (5) cAMP receptor (Class E), or (6) frizzled/smoothened family (Class F). The overall success rate of GPCR-2L in identifying proteins as GPCRs or non-GPCRs is over 97.2%, while that in identifying GPCRs among their six types is over 97.8%. These high success rates were derived by rigorous jackknife cross-validation on a stringent benchmark dataset, in which none of the included proteins had ≥40% pairwise sequence identity to any other protein in the same subset. As a user-friendly web server, GPCR-2L is freely accessible to the public at http://icpr.jci.edu.cn/, where one can obtain the two-level results in about 20 s for a query protein sequence of 500 amino acids. The longer the sequence, the more time it usually needs. The high success rates reported here indicate that combining functional domain information with low-frequency Fourier spectrum analysis is quite an effective approach to identifying GPCRs and their types. It is anticipated that GPCR-2L may become a useful tool for both basic research and drug development in the areas related to GPCRs.
