Recent advances in large-scale genome sequencing have led to the rapid accumulation of amino acid sequences of proteins whose functions are unknown. Since the functions of these proteins are closely correlated with their subcellular localizations, many efforts have been made to develop methods for predicting protein subcellular location. In this study, based on the strategy of hybridizing the functional domain composition and the pseudo-amino acid composition (Cai and Chou [2003]: Biochem. Biophys. Res. Commun. 305:407-411), the Intimate Sorting Algorithm (ISort predictor) was developed for predicting protein subcellular location. As a showcase, the same plant and non-plant protein datasets studied by previous investigators were used for demonstration. The overall success rate by the jackknife test was 85.4% for the plant protein dataset and 91.9% for the non-plant protein dataset. These are so far the highest success rates achieved for the two datasets under a rigorous cross-validation test procedure, further confirming that such a hybrid approach may become a very useful high-throughput tool in bioinformatics, proteomics, and molecular cell biology.
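The jackknife test quoted in these abstracts is leave-one-out cross-validation: each protein is held out in turn, the predictor is trained on all remaining proteins, and the held-out protein is then classified. A minimal sketch follows, assuming feature vectors have already been computed. The nearest-neighbour rule and all function names are hypothetical stand-ins chosen for illustration, not the ISort predictor itself.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def nearest_neighbor_predict(train: List[Sample], query: Sequence[float]) -> str:
    """Hypothetical stand-in classifier: return the label of the closest sample."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(train, key=lambda s: sq_dist(s[0], query))[1]


def jackknife_success_rate(
    samples: List[Sample],
    predict: Callable[[List[Sample], Sequence[float]], str] = nearest_neighbor_predict,
) -> float:
    """Leave each sample out once, predict it from the rest, return the hit rate."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # every sample except the i-th
        if predict(train, features) == label:
            correct += 1
    return correct / len(samples)
```

Because every sample is tested exactly once against a model that never saw it, the resulting rate is unique for a given dataset, which is why these papers treat it as the most rigorous of the common test procedures.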

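Antifreeze proteins are a class of proteins that enhance an organism's resistance to freezing. They bind specifically to ice crystals, thereby inhibiting the formation and growth of ice nuclei in body fluids. Bioinformatics research on antifreeze proteins therefore plays an important role in advancing bioengineering and in improving the freezing tolerance of crops. In this paper, a dataset consisting of 400 antifreeze protein sequences and 400 non-antifreeze protein sequences was used, with pseudo amino acid composition as features and a support vector machine as the classification algorithm, to predict antifreeze proteins. The prediction accuracy reached 91.3% on the training set and 78.8% on the test set. These results show that pseudo amino acid composition reflects the characteristics of antifreeze proteins well and can be used to predict them.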

Based on recent developments in the gene ontology and functional domain databases, a new hybridization approach is developed for predicting protein subcellular location by combining the gene product, functional domain, and quasi-sequence-order effects. As a showcase, the same prokaryotic and eukaryotic datasets, which were studied by many previous investigators, are used for demonstration. The overall success rate by the jackknife test is 94.7% for the prokaryotic set and 92.9% for the eukaryotic set. These are so far the highest success rates achieved for the two datasets under a rigorous cross-validation test procedure, suggesting that such a hybrid approach may become a very useful high-throughput tool in bioinformatics, proteomics, and molecular cell biology. The very high success rates also reflect the fact that the subcellular localization of a protein is closely correlated with (1) the biological objective to which the gene or gene product contributes, (2) the biochemical activity of a gene product, and (3) the place in the cell where a gene product is active.

A new method has been developed to predict the enzymatic attribute of proteins by hybridizing the gene product composition and pseudo amino acid composition. As a demonstration, a working dataset was generated with a cutoff of 60% sequence identity to avoid redundancy and bias in statistical prediction. The dataset thus constructed contains 39989 protein sequences, of which 27469 are non-enzymes and 12520 are enzymes. The enzymes were further classified into 6 enzyme family classes according to their 6 main EC (Enzyme Commission) numbers: 2314 oxidoreductases, 3653 transferases, 3246 hydrolases, 1307 lyases, 676 isomerases, and 1324 ligases. The overall success rate by the jackknife test was 94% for discriminating enzymes from non-enzymes and 98% for classification among the 6 enzyme family classes. It is anticipated that, with the rapid increase of protein sequences entering databanks, the current method will become a useful automated tool for identifying the enzymatic attribute of a newly found protein sequence.

According to recent experiments, proteins in budding yeast can be distinctly classified into 22 subcellular locations. Some of these proteins are multi-locational, i.e., they occur in more than one location. However, all existing methods for predicting protein subcellular location were developed to deal only with the mono-locational case, where a query protein is assumed to belong to one, and only one, subcellular location. To stimulate the development of subcellular location prediction, an augmentation procedure is formulated that enables the existing methods to tackle the multi-locational problem as well. A jackknife cross-validation test shows that the success rate obtained by the augmented GO-FunD-PseAA algorithm [BBRC 320 (2004) 1236] is overwhelmingly higher than those obtained by the other augmented methods. It is anticipated that the augmented GO-FunD-PseAA predictor will become a very useful tool for predicting protein subcellular localization in both basic research and practical application.

DNA-binding proteins are crucial for various cellular processes and hence have become an important target for both basic research and drug development. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to establish an automated method for rapidly and accurately identifying DNA-binding proteins based on their sequence information alone. Because all biological species have developed from a very limited number of ancestral species, it is important to take evolutionary information into account in developing such a high-throughput tool. In view of this, a new predictor was proposed that incorporates evolutionary information into the general form of pseudo amino acid composition via the top-n-gram approach. Comparison with existing methods via both the jackknife test and an independent dataset test showed that the new predictor outperformed its counterparts. It is anticipated that the new predictor may become a useful vehicle for identifying DNA-binding proteins. It has not escaped our notice that the novel approach of extracting evolutionary information into the formulation of statistical samples can be used to identify many other protein attributes as well.

Membrane proteins are a vital type of protein that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides valuable information for predicting novel examples of those types. However, classification of membrane protein types can be both time consuming and susceptible to errors because of the inherent similarity of membrane protein types. In this paper, a neural-network-based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence and includes seven feature sets: amino acid composition, sequence length, 2-gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In the independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers also exhibit improved performance on other measures such as sensitivity, specificity, Matthews correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at
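The performance measures named above (sensitivity, specificity, Matthews correlation coefficient, and F-measure) all derive from the four cells of a confusion matrix. The sketch below uses the standard textbook definitions. It assumes a one-class-versus-rest reading of the multi-class problem, which the abstract does not spell out, and the function name is illustrative.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard measures from true/false positive/negative counts.

    Assumes at least one actual positive, one actual negative, and one
    predicted positive, so that no ratio below divides by zero.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }
```

Unlike plain accuracy, the Matthews correlation coefficient stays informative when the classes are unbalanced, which is common for membrane protein type datasets.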

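Membrane proteins are important drug targets, and studying membrane protein types aids successful drug design, so accurate prediction of membrane protein types is essential for drug development. In this paper, a dataset of 274 mycobacterial membrane protein sequences with sequence identity below 40% was used, with optimized pseudo amino acid composition as features and a support vector machine as the classification algorithm, to predict mycobacterial membrane protein types. Under the jackknife test, an overall accuracy of 85.4% and an average accuracy of 72.2% were obtained. The results indicate that the method can be used to identify mycobacterial membrane protein types and will aid the development of anti-mycobacterial drugs.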

Ion channels are integral membrane proteins that control the movement of ions into or out of cells. They are key components in a wide range of biological processes, and different types of ion channels have different biological functions. With the appearance of vast proteomic data, it is highly desirable for both basic research and drug-target discovery to develop a computational method for the reliable prediction of ion channels and their types. In this study, we developed a support vector machine-based method to predict ion channels and their types using primary sequence information. A feature selection technique, analysis of variance (ANOVA), was introduced to remove feature redundancy and find an optimized feature set for improving predictive performance. Jackknife cross-validated results show that the proposed method can discriminate ion channels from non-ion channels with an overall accuracy of 86.6%, classify voltage-gated and ligand-gated ion channels with an overall accuracy of 92.6%, and predict four types (potassium, sodium, calcium and anion) of voltage-gated ion channels with an overall accuracy of 87.8%. These results indicate that the proposed method can correctly identify ion channels and provide important guidance for drug-target discovery. The predictor can be freely downloaded from http://cobi.uestc.edu.cn/people/hlin/tools/IonchanPred/.
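ANOVA-based feature selection scores each feature by the ratio of its between-class variance to its within-class variance (the F statistic) and keeps the highest-scoring features. The sketch below uses the standard one-way ANOVA formula. The function names and the simple ranking step are assumptions for illustration and are not taken from the IonchanPred source.

```python
from typing import Dict, List, Sequence


def anova_f_score(values: Sequence[float], labels: Sequence[str]) -> float:
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups: Dict[str, List[float]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_total, n_groups = len(values), len(groups)
    grand_mean = sum(values) / n_total
    # Between-class mean square: spread of class means around the grand mean.
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    ) / (n_groups - 1)
    # Within-class mean square: spread of samples around their own class mean.
    within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g
    ) / (n_total - n_groups)
    return between / within if within > 0 else float("inf")


def rank_features(matrix: Sequence[Sequence[float]], labels: Sequence[str]) -> List[int]:
    """Return feature (column) indices sorted from most to least discriminative."""
    n_features = len(matrix[0])
    scores = [
        anova_f_score([row[j] for row in matrix], labels) for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

In practice the ranked list would be truncated at the feature count that maximizes cross-validated accuracy. That cutoff search is omitted here.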

Conotoxins are small, disulfide-rich peptides that target ion channels and G protein-coupled receptors. They show promise in treating chronic pain, epilepsy, cardiovascular diseases, and other conditions. Conotoxins may be classified into 11 superfamilies (A, D, I1, I2, J, L, M, O, P, S, and T) according to their disulfide connectivity, highly conserved N-terminal precursor sequence, and similar mode of action. Successful prediction of the superfamily of a mature conotoxin peptide is important for understanding the biological and pharmacological functions of the toxins. In this study, a new algorithm that combines increment of diversity with a modified Mahalanobis discriminant is presented to predict five superfamilies using the pseudo amino acid composition. The results of the jackknife cross-validation test show that the overall prediction sensitivity and specificity are 88% and 91%, respectively. The algorithm is also used to predict three O-conotoxin families, yielding 72% sensitivity and 78% specificity. These results indicate that conotoxin superfamilies correlate with their amino acid compositions.

Outer membrane proteins (OMPs) are β-barrel membrane proteins that perform many biological functions. Discriminating OMPs from non-OMPs is an important task for understanding some biochemical processes. In this study, a method that combines increment of diversity with a modified Mahalanobis discriminant, called IDQD, is presented to predict 208 OMPs, 206 transmembrane helical proteins (TMHPs) and 673 globular proteins (GPs) using Chou's pseudo amino acid composition as parameters. The overall accuracy of jackknife cross-validation is 93.2% for the three-class problem (OMPs, TMHPs and GPs) and 96.1% for the two-class problem (OMPs and non-OMPs). These results suggest that the method can be effectively applied to discriminate OMPs, TMHPs and GPs. They also indicate that the pseudo amino acid composition reflects the core features of membrane proteins better than the classical amino acid composition.
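The increment of diversity (ID) used in IDQD measures how much the diversity of a reference count vector grows when a query's counts are merged into it, so a small ID means the query resembles the reference class. The sketch below follows the commonly quoted definition D(X) = N log N - Σ n_i log n_i with ID(X, S) = D(X + S) - D(X) - D(S). The natural logarithm, the plain 20-letter amino acid counts, and the function names are assumptions here, and the modified Mahalanobis discriminant step of IDQD is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_counts(sequence: str) -> List[int]:
    """Raw counts of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def diversity(counts: Sequence[int]) -> float:
    """D(X) = N log N - sum(n_i log n_i), with 0 log 0 taken as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(query: Sequence[int], reference: Sequence[int]) -> float:
    """ID(X, S) = D(X + S) - D(X) - D(S); zero when both share the same proportions."""
    merged = [q + r for q, r in zip(query, reference)]
    return diversity(merged) - diversity(query) - diversity(reference)
```

A query can be scored against one pooled count vector per class (OMP, TMHP, GP). In the full method those ID values feed the discriminant function rather than being compared directly.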

With the explosive growth of protein sequences entering protein data banks in the post-genomic era, there is high demand for automated methods that rapidly and effectively identify protein–protein binding sites (PPBSs) based on sequence information alone. To address this problem, we proposed a predictor called iPPBS-PseAAC, in which each amino acid residue site of the proteins concerned is treated as a 15-tuple peptide segment generated by sliding a window along the protein chain with its center aligned with the target residue. The working peptide segment is further formulated by a general form of pseudo amino acid composition via the following procedure: (1) it is converted into a numerical series via the physicochemical properties of amino acids; (2) the numerical series is subsequently converted into a 20-D feature vector by means of the stationary wavelet transform technique. The prediction engine is a two-layer ensemble classifier formed by many individual “Random Forest” classifiers, with the 1st layer voting out the best training dataset from many bootstrap systems and the 2nd layer voting out the most relevant one from seven physicochemical properties. Cross-validation tests indicate that the new predictor is very promising, meaning that many important key features, which are deeply hidden in complicated protein sequences, can be extracted via the wavelet transform approach. This is consistent with the fact that many important biological functions of proteins can be elucidated through their low-frequency internal motions. The web server of iPPBS-PseAAC is accessible at http://www.jci-bioinfo.cn/iPPBS-PseAAC, where users can easily obtain their desired results without needing to follow the complicated mathematical equations involved.
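The window step described above can be sketched as follows: every residue becomes the center of a fixed-length peptide segment cut from the chain. How the chain termini are handled is not stated in the abstract. Padding with a dummy residue `X` is an assumption made here, and the function name is illustrative.

```python
from typing import List


def sliding_windows(sequence: str, size: int = 15, pad: str = "X") -> List[str]:
    """Return one window of `size` residues centered on each position.

    Terminal positions are completed with the dummy residue `pad`
    (an assumption; other schemes such as mirroring are possible).
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so that a center residue exists")
    half = size // 2
    padded = pad * half + sequence + pad * half
    # Window i spans padded[i : i + size]; its center is sequence[i].
    return [padded[i:i + size] for i in range(len(sequence))]
```

For example, `sliding_windows("ACDEFG", size=3)` yields `["XAC", "ACD", "CDE", "DEF", "EFG", "FGX"]`. Each window would then be mapped to a numerical series through a physicochemical property before the wavelet transform.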

The nucleus is the brain of the eukaryotic cell, guiding the cell's life processes by issuing key instructions. For an in-depth understanding of the biochemical processes of the nucleus, knowledge of the localization of nuclear proteins is very important. With the avalanche of protein sequences generated in the post-genomic era, it is highly desirable to develop an automated method for rapidly annotating the subnuclear locations of the many newly found nuclear protein sequences so that they can be used in a timely manner for basic research and drug discovery. In view of this, a novel approach is developed for predicting protein subnuclear location. It introduces a powerful classifier, the optimized evidence-theoretic K-nearest neighbor (OET-KNN) classifier, and represents protein samples by the pseudo amino acid composition [K.C. Chou, PROTEINS: Structure, Function, and Genetics, 43 (2001) 246], which can incorporate a considerable amount of sequence-order effects. As a demonstration, identifications were performed for 370 nuclear proteins among the following 9 subnuclear locations: (1) Cajal body, (2) chromatin, (3) heterochromatin, (4) nuclear diffuse, (5) nuclear pore, (6) nuclear speckle, (7) nucleolus, (8) PcG body, and (9) PML body. The overall success rates obtained by both the re-substitution test and the jackknife cross-validation test are significantly higher than those obtained by existing classifiers on the same working dataset. It is anticipated that the approach may also become a useful high-throughput vehicle for bridging the huge gap in the post-genomic era between the number of gene sequences in databases and the number of gene products that have been functionally characterized. The OET-KNN classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.

Integral membrane proteins are central to many cellular processes and constitute approximately 50% of potential targets for novel drugs. However, the number of outer membrane proteins (OMPs) present in the public structure database is very limited because of the difficulty of determining their structures experimentally. Therefore, discriminating OMPs from non-OMPs with computational methods is of medical importance as well as a necessity for genome sequencing. In this study, some sequence-derived structural and physicochemical features of proteins were combined with amino acid composition to discriminate OMPs from non-OMPs using support vector machines. The discrimination performance of the proposed method is evaluated on a benchmark dataset of 208 OMPs, 673 globular proteins, and 206 α-helical membrane proteins. A high overall accuracy of 97.8% was observed in the 5-fold cross-validation test. In addition, the current method distinguished OMPs from globular proteins and from α-helical membrane proteins with overall accuracies of 98.2% and 96.4%, respectively. The prediction performance is superior to that of the state-of-the-art methods in the literature. It is anticipated that the current method might be a powerful tool for the discrimination of OMPs.

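Understanding where proteins localize within the nucleus of eukaryotic cells is important for the functional annotation of newly discovered proteins. With the rapid increase in the number of protein sequences in protein databases, predicting protein subnuclear localization by computational methods has become a hot topic in protein science. Based on the discrete pseudo amino acid composition model proposed by Chou, a new method for predicting protein subnuclear localization is proposed. The approximate entropy of a protein sequence is computed as an additional feature for constructing the pseudo amino acid composition that represents the sequence, and the AdaBoost classification algorithm is used as the prediction tool. Compared with previously reported subnuclear localization prediction methods, this method achieves higher accuracy.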

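Research on protein-protein interactions helps reveal the nature of many life processes, supports disease prevention and diagnosis, and provides a valuable reference for drug development. This article first constructs a protein interaction database and then proposes a segmented amino acid composition feature extraction method for predicting protein-protein interactions. Under 10-fold cross-validation (10CV), the support-vector-machine-based 3-segment amino acid composition method achieved an overall prediction accuracy of 86.2%, 2.31 percentage points higher than the conventional amino acid composition method. Using Guo's database and test procedure, the 3-segment method achieved an overall prediction accuracy of 90.11%, 2.75 percentage points higher than Guo's autocorrelation function feature extraction method. These results show that segmented amino acid composition feature extraction can be applied effectively to the prediction of protein-protein interactions.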
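Segmented amino acid composition can be illustrated by cutting a sequence into consecutive pieces and concatenating the 20-dimensional composition of each piece, which retains coarse positional information that whole-sequence composition discards. The sketch below assumes equal-length segments. The exact segmentation rule of the paper is not given in the abstract, and the function names are illustrative.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(segment: str) -> List[float]:
    """Fraction of each of the 20 standard amino acids in a segment."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / n for aa in AMINO_ACIDS]


def segmented_composition(sequence: str, n_segments: int = 3) -> List[float]:
    """Concatenate the compositions of n (near-)equal consecutive segments."""
    length = len(sequence)
    # Integer boundaries 0 = b_0 <= b_1 <= ... <= b_n = length.
    bounds = [i * length // n_segments for i in range(n_segments + 1)]
    features: List[float] = []
    for start, end in zip(bounds, bounds[1:]):
        features.extend(amino_acid_composition(sequence[start:end]))
    return features
```

With three segments each protein yields a 60-dimensional vector. For an interacting pair, the two vectors could simply be concatenated before being passed to the support vector machine, though that pairing scheme is also an assumption.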

Chou KC, Cai YD. Proteins, 2003, 53(2):282-289
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, that associate through noncovalent interactions and, occasionally, disulfide bonds. With the number of protein sequences entering data banks rapidly increasing, we are confronted with a challenge: how to develop an automated method to identify the quaternary attribute of a new polypeptide chain (i.e., whether it forms a monomer, a dimer, a trimer, or any other oligomer). This is important because the functions of proteins are closely related to their quaternary attribute. For example, some critical ligands bind only to dimers and not to monomers; some marvelous allosteric transitions occur only in tetramers and not in other oligomers; and some ion channels are formed by tetramers, whereas others are formed by pentamers. To explore this problem, we adopted the pseudo amino acid composition originally proposed for improving the prediction of protein subcellular location (Chou, Proteins, 2001; 43:246-255). The advantage of using the pseudo amino acid composition to represent a protein is that it can take into account a considerable amount of sequence-order effects and thereby significantly improve prediction quality. Results obtained by resubstitution, jackknife, and independent dataset tests indicate that the current approach might be quite promising in dealing with such an extremely complicated and difficult problem.
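The pseudo amino acid composition that recurs throughout these abstracts extends the 20 amino acid frequencies with λ sequence-order correlation factors, each averaging a squared property difference between residues k positions apart. The sketch below is a simplified single-property variant. It substitutes the Kyte-Doolittle hydropathy scale for the three properties (hydrophobicity, hydrophilicity, side-chain mass) of the original formulation, and the default weight of 0.05 and the function names are assumptions. The input is assumed to contain only the 20 standard residue letters.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (a swapped-in scale, see lead-in).
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(scale: Dict[str, float]) -> Dict[str, float]:
    """Shift and scale the property to zero mean and unit variance over 20 residues."""
    values = [scale[aa] for aa in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (scale[aa] - mean) / std for aa in AMINO_ACIDS}


def pseudo_aac(sequence: str, lam: int = 2, weight: float = 0.05) -> List[float]:
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    n = len(sequence)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    h = _standardize(KYTE_DOOLITTLE)
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    # theta_k: mean squared property difference between residues k apart.
    thetas = [
        sum((h[sequence[i]] - h[sequence[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

The first 20 components carry the conventional composition and the remaining λ components carry the sequence-order information. The weight controls how much the latter contribute, and the whole vector sums to one by construction.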

