Similar literature
1.
The mitochondrion is a key organelle of the eukaryotic cell that provides the energy for cellular activities. Correctly identifying submitochondria locations of proteins can provide plentiful information for understanding their functions. However, using wet-experimental methods to recognize submitochondria locations of proteins is time-consuming and costly. Thus, it is highly desirable to develop a bioinformatics method to predict the submitochondria locations of mitochondrion proteins. In this work, a novel method based on support vector machine was developed to predict the submitochondria locations of mitochondrion proteins by using over-represented tetrapeptides selected with the binomial distribution. A reliable and rigorous benchmark dataset including 495 mitochondrion proteins with sequence identity ≤25% was constructed for testing and evaluating the proposed model. Jackknife cross-validated results showed that 91.1% of the 495 mitochondrion proteins can be correctly predicted. Subsequently, our model was evaluated on three existing benchmark datasets. The overall accuracies are 94.0, 94.7 and 93.4%, respectively, suggesting that the proposed model is potentially useful in the realm of mitochondrion proteome research. Based on this model, we built a predictor called TetraMito which is freely available at http://lin.uestc.edu.cn/server/TetraMito.
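
The abstract does not give the selection formula, so the following is only a minimal sketch of one common way to score over-representation with a binomial tail probability. It assumes that `q` is the fraction of all tetrapeptide occurrences that fall in the class of interest; the function names are hypothetical and not taken from TetraMito.

```python
from math import comb


def count_tetrapeptides(sequences):
    """Count every overlapping tetrapeptide in a list of protein sequences."""
    counts = {}
    for seq in sequences:
        for i in range(len(seq) - 3):
            kmer = seq[i:i + 4]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def binomial_tail(n_in_class, n_total, q):
    """P(X >= n_in_class) for X ~ Binomial(n_total, q).

    n_total    : occurrences of the tetrapeptide over all classes
    n_in_class : occurrences in the class of interest
    q          : assumed prior probability of landing in that class
    """
    return sum(
        comb(n_total, m) * q ** m * (1 - q) ** (n_total - m)
        for m in range(n_in_class, n_total + 1)
    )


def confidence(n_in_class, n_total, q):
    """Higher values mean the tetrapeptide is less likely to be there by chance."""
    return 1.0 - binomial_tail(n_in_class, n_total, q)
```

A tetrapeptide seen 5 times, all in a class that holds half of the data, gets `confidence(5, 5, 0.5) == 0.96875` under these assumptions; tetrapeptides above a chosen confidence cutoff would then be kept as features.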

2.
Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, where binding residues are far fewer in number than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
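
The abstract does not describe how TargetSOS synthesizes samples. As a rough illustration of the general idea of over-sampling by synthesis, the sketch below uses plain linear interpolation between two randomly chosen minority samples (in the spirit of SMOTE). This is a swapped-in generic technique, not the supervised algorithm of the paper, and `oversample_minority` is a hypothetical name.

```python
import random


def oversample_minority(minority, n_new, seed=0):
    """Create n_new synthetic feature vectors from the minority class.

    Each synthetic point lies on the line segment between two distinct,
    randomly chosen minority samples. Requires at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two different minority samples
        t = rng.random()                 # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

The synthetic vectors would be appended to the binding-residue (minority) class before training, so that the classifier sees a less skewed class ratio.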

3.
The ability to search sequence datasets for membrane spanning proteins is an important requirement for genome annotation. However, the development of algorithms to identify novel types of transmembrane beta-barrel (TMB) protein has proven substantially harder than for transmembrane helical proteins, owing to a shorter TM domain in which only alternate residues are hydrophobic. Although recent reports have described important improvements in the development of such algorithms, there is still concern over their ability to confidently screen genomes. Here we describe a new algorithm combining composition and hidden Markov model topology based classifiers (called TMB-Hunt2), which achieves a crossvalidation accuracy of >95%, with 96.7% precision and 94.2% recall. An overview is given of the algorithm design, with a thorough assessment of performance and application to a number of genomes. Of particular note is that TMB/extracellular protein discrimination is significantly more difficult than TMB/cytoplasmic protein discrimination, with the predictor correctly rejecting just 74% of extracellular proteins, in comparison to 98% of cytoplasmic proteins. Focus is given to directions for further improvements in TMB/non-TMB protein discrimination, with a call for the development of standardized tests and assessments of such algorithms. Tools and datasets are made available through a website called TMB-Web (http://www.bioinformatics.leeds.ac.uk/TMB-Web/TMB-Hunt2).

4.
The outer membrane proteins (OMPs) are β-barrel membrane proteins that perform many biological functions. Discriminating OMPs from non-OMPs is a very important task for understanding some biochemical processes. In this study, a method that combines increment of diversity with a modified Mahalanobis discriminant, called IDQD, is presented to predict 208 OMPs, 206 transmembrane helical proteins (TMHPs) and 673 globular proteins (GPs) by using Chou's pseudo amino acid compositions as parameters. The overall accuracy of jackknife cross-validation is 93.2% and 96.1%, respectively, for the three-class (OMPs, TMHPs and GPs) and two-class (OMPs and non-OMPs) datasets. These predicted results suggest that the method can be effectively applied to discriminate OMPs, TMHPs and GPs. They also indicate that the pseudo amino acid composition can better reflect the core feature of membrane proteins than the classical amino acid composition.
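
For readers unfamiliar with the increment of diversity, here is a minimal sketch using the definition commonly given in the literature: D(X) = N log N - Σ n_i log n_i for a count vector X with total N, and ID(X, Y) = D(X + Y) - D(X) - D(Y). The abstract itself does not spell out the formula, so treat this as an assumption; the modified Mahalanobis discriminant step of IDQD is not shown.

```python
from math import log


def diversity(counts):
    """D(X) = N*log(N) - sum(n_i*log(n_i)), taking 0*log(0) as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * log(total) - sum(n * log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two equal-length count vectors.

    Small values mean the two compositions are similar; larger values
    mean they differ more.
    """
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)
```

Two identical count vectors give an increment of zero, while vectors with no category in common give the largest value for their size. In a scheme of this kind, a query protein's composition is compared against each class's pooled composition.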

5.
6.
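Identifying DNA-binding proteins (DBPs) is of great importance for the functional annotation of genes and proteins in prokaryotes and eukaryotes. This study is the first to combine gapped-dipeptide composition (Gap DPC) with recursive feature elimination (RFE) to identify DBPs. The position-specific scoring matrix (PSSM) of each query protein sequence is obtained first, Gap DPC features are extracted from it, and the optimal features are selected by RFE. A support vector machine (SVM) is then used as the classifier, and jackknife cross-validation is performed on the protein sequence datasets PDB396 and LB1068. On PDB396, the accuracy, Matthews correlation coefficient, sensitivity and specificity of DBP prediction reached 93.43%, 0.86, 89.04% and 96.00%, respectively; on LB1068 they reached 86.33%, 0.73, 86.49% and 86.18%. These results are clearly better than those of related methods reported in the literature and provide a new model for identifying DBPs.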

7.
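Lin Hao 《生物信息学》 (Bioinformatics, Chinese-language journal) 2009,7(4):252-254
Because the subcellular location of a protein is strongly correlated with its primary sequence, the increment of diversity was used to describe the similarity between proteins in amino acid composition and dipeptide composition, and a modified Mahalanobis discriminant (here called the IDQD method) was used to predict the subcellular location of mycobacterial proteins. The jackknife test was applied to protein datasets at different levels of sequence similarity. The results show that when the sequence similarity of the dataset is at most 70%, the prediction accuracy of the algorithm remains stable at about 75%. For the full set of 852 proteins, the prediction success rate reached 87.7%, which is better than the accuracy of existing algorithms and indicates that IDQD is an effective method for predicting the subcellular location of mycobacterial proteins.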

8.
Du P  Wang X  Xu C  Gao Y 《Analytical biochemistry》2012,425(2):117-119
The pseudo-amino acid composition has been widely used to convert complicated protein sequences with various lengths to fixed length digital feature vectors while keeping considerable sequence order information. However, so far the only software available to the public is the web server PseAAC (http://www.csbio.sjtu.edu.cn/bioinf/PseAAC), which has some limitations in dealing with large-scale datasets. Here, we propose a new cross-platform stand-alone software program, called PseAAC-Builder (http://www.pseb.sf.net), which can be used to generate various modes of Chou's pseudo-amino acid composition in a much more efficient and flexible way. It is anticipated that PseAAC-Builder may become a useful tool for studying various protein attributes.

9.

Background

Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated.

Results

In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction.

Conclusions

The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-297) contains supplementary material, which is available to authorized users.

10.

Background

Computational identification of apicoplast-targeted proteins is important in drug target determination for diseases such as malaria. While there are established methods for identifying proteins with a bipartite signal in multiple species of Apicomplexa, not all apicoplast-targeted proteins possess this bipartite signature. The publication of recent experimental findings of apicoplast membrane proteins, called transmembrane proteins, that do not possess a bipartite signal has made it feasible to devise a machine learning approach for identifying this new class of apicoplast-targeted proteins computationally.

Methodology/principal findings

In this work, we develop a method for predicting apicoplast-targeted transmembrane proteins for multiple species of Apicomplexa, whereby several classifiers trained on different feature sets and based on different algorithms are evaluated and combined in an ensemble classification model to obtain the best expected performance. The feature sets considered are the hydrophobicity and composition characteristics of amino acids over transmembrane domains, the existence of short sequence motifs over cytosolically disposed regions, and Gene Ontology (GO) terms associated with given proteins. Our model, ApicoAMP, is an ensemble classification model that combines decisions of classifiers following the majority vote principle. ApicoAMP is trained on a set of proteins from 11 apicomplexan species and achieves 91% overall expected accuracy.

Conclusions/significance

ApicoAMP is the first computational model capable of identifying apicoplast-targeted transmembrane proteins in Apicomplexa. The ApicoAMP prediction software is available at http://code.google.com/p/apicoamp/ and http://bcb.eecs.wsu.edu.
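
The majority vote principle mentioned in the methodology can be illustrated with a short sketch. It assumes each base classifier has already produced one label per protein; the function name and the tie-breaking rule are assumptions for illustration, not details of ApicoAMP.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one ensemble label list.

    predictions: one list of labels per classifier, all of equal length.
    Ties are broken in favour of the label that appears first among the
    classifiers (Counter.most_common keeps first-seen order for equal counts).
    """
    combined = []
    for labels in zip(*predictions):      # labels for one protein
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined
```

With three classifiers voting `[1, 0, 1]`, `[1, 1, 0]` and `[0, 1, 1]` on three proteins, every protein receives two votes for class 1, so the ensemble output is `[1, 1, 1]`. Using an odd number of binary classifiers avoids ties altogether.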

11.
Recent advances in large-scale genome sequencing have led to the rapid accumulation of amino acid sequences of proteins whose functions are unknown. Since the functions of these proteins are closely correlated with their subcellular localizations, many efforts have been made to develop a variety of methods for predicting protein subcellular location. In this study, based on the strategy of hybridizing the functional domain composition and the pseudo-amino acid composition (Cai and Chou [2003]: Biochem. Biophys. Res. Commun. 305:407-411), the Intimate Sorting Algorithm (ISort predictor) was developed for predicting the protein subcellular location. As a showcase, the same plant and non-plant protein datasets as investigated by the previous investigators were used for demonstration. The overall success rate by the jackknife test for the plant protein dataset was 85.4%, and that for the non-plant protein dataset was 91.9%. These are so far the highest success rates achieved for the two datasets by following a rigorous cross validation test procedure, further confirming that such a hybrid approach may become a very useful high-throughput tool in the area of bioinformatics, proteomics, as well as molecular cell biology.

12.
Revealing the subcellular location of newly discovered protein sequences can bring insight to their function and guide research at the cellular level. The rapidly increasing number of sequences entering the genome databanks has called for the development of automated analysis methods. Currently, most existing methods used to predict protein subcellular locations cover only one, or a very limited number of species. Therefore, it is necessary to develop reliable and effective computational approaches to further improve the performance of protein subcellular prediction and, at the same time, cover more species. The current study reports the development of a novel predictor called MSLoc-DT to predict the protein subcellular locations of human, animal, plant, bacteria, virus, fungi, and archaea by introducing a novel feature extraction approach termed Amino Acid Index Distribution (AAID) and then fusing gene ontology information, sequential evolutionary information, and sequence statistical information through four different modes of pseudo amino acid composition (PseAAC) with a decision template rule. Using the jackknife test, MSLoc-DT can achieve 86.5, 98.3, 90.3, 98.5, 95.9, 98.1, and 99.3% overall accuracy for human, animal, plant, bacteria, virus, fungi, and archaea, respectively, on seven stringent benchmark datasets. Compared with other predictors (e.g., Gpos-PLoc, Gneg-PLoc, Virus-PLoc, Plant-PLoc, Plant-mPLoc, ProLoc-Go, Hum-PLoc, GOASVM) on the gram-positive, gram-negative, virus, plant, eukaryotic, and human datasets, the new MSLoc-DT predictor is much more effective and robust. Although the MSLoc-DT predictor is designed to predict the single location of proteins, our method can be extended to multiple locations of proteins by introducing multilabel machine learning approaches, such as the support vector machine and deep learning, as substitutes for the K-nearest neighbor (KNN) method. 
As a user-friendly web server, MSLoc-DT is freely accessible at http://bioinfo.ibp.ac.cn/MSLOC_DT/index.html.

13.

Background  

Feature selection plays an undeniably important role in classification problems involving high dimensional datasets such as microarray datasets. For filter-based feature selection, two well-known criteria used in forming predictor sets are relevance and redundancy. However, there is a third criterion which is at least as important as the other two in affecting the efficacy of the resulting predictor sets. This criterion is the degree of differential prioritization (DDP), which varies the emphases on relevance and redundancy depending on the value of the DDP. Previous empirical works on publicly available microarray datasets have confirmed the effectiveness of the DDP in molecular classification. We now propose to establish the fundamental strengths and merits of the DDP-based feature selection technique. This is to be done through a simulation study which involves vigorous analyses of the characteristics of predictor sets found using different values of the DDP from toy datasets designed to mimic real-life microarray datasets.

14.
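Studies have shown that many neurodegenerative diseases are related to the localization of proteins in the Golgi apparatus, so correctly identifying sub-Golgi proteins can aid the development of drugs for these diseases. In this work, a dataset of two classes of sub-Golgi proteins was built; amino acid composition, conjoint triad, average chemical shift and Gene Ontology annotation features were extracted; and a support vector machine was used for prediction. The overall prediction success rate under 5-fold cross-validation was 87.43%.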

15.
The identification of virulent proteins in any de-novo sequenced genome is useful in estimating its pathogenic ability and understanding the mechanism of pathogenesis. Similarly, the identification of such proteins could be valuable in comparing the metagenome of healthy and diseased individuals and estimating the proportion of pathogenic species. However, the common challenge in both the above tasks is the identification of virulent proteins since a significant proportion of genomic and metagenomic proteins are novel and yet unannotated. The currently available tools which carry out the identification of virulent proteins provide limited accuracy and cannot be used on large datasets. Therefore, we have developed an MP3 standalone tool and web server for the prediction of pathogenic proteins in both genomic and metagenomic datasets. MP3 is developed using an integrated Support Vector Machine (SVM) and Hidden Markov Model (HMM) approach to carry out fast, sensitive and accurate prediction of pathogenic proteins. It displayed Sensitivity, Specificity, MCC and accuracy values of 92%, 100%, 0.92 and 96%, respectively, on a blind dataset constructed using complete proteins. On the two metagenomic blind datasets (Blind A: 51–100 amino acids and Blind B: 30–50 amino acids), it displayed Sensitivity, Specificity, MCC and accuracy values of 82.39%, 97.86%, 0.80 and 89.32% for Blind A and 71.60%, 94.48%, 0.67 and 81.86% for Blind B, respectively. In addition, the performance of MP3 was validated on selected bacterial genomic and real metagenomic datasets. To our knowledge, MP3 is the only program that specializes in fast and accurate identification of partial pathogenic proteins predicted from short (100–150 bp) metagenomic reads and also performs exceptionally well on complete protein sequences. MP3 is publicly available at http://metagenomics.iiserb.ac.in/mp3/index.php.

16.
As a continuing effort to use the sequence approach to identify enzymatic function at a deeper level, investigations are extended from the main enzyme classes (Protein Sci. 2004, 13, 2857-2863) to their subclasses. This is indispensable if we wish to understand the molecular mechanism of an enzyme at a deeper level. For each of the 6 main enzyme classes (i.e., oxidoreductase, transferase, hydrolase, lyase, isomerase, and ligase), a subclass training dataset is constructed. To reduce homologous bias, a stringent cutoff is imposed so that all the entries included in the datasets have less than 40% sequence identity to each other. To catch the core feature that is intimately related to the biological function, the sample of a protein is represented by hybridizing the functional domain composition and pseudo amino acid composition. On the basis of such a hybridization representation, the FunD-PseAA predictor is established. It is demonstrated by the jackknife cross-validation tests that the overall success rate in identifying the 21 subclasses of oxidoreductases is above 86%, and the corresponding rates in identifying the subclasses of the other 5 main enzyme classes are 94-97%. The high success rates imply that the FunD-PseAA predictor may become a useful tool in bioinformatics and proteomics of the post-genomic era.

17.

Background

As one of the most common protein post-translational modifications, glycosylation is involved in a variety of important biological processes. Computational identification of glycosylation sites in protein sequences becomes increasingly important in the post-genomic era. A new encoding scheme was employed to improve the prediction of mucin-type O-glycosylation sites in mammalian proteins.

Results

A new protein bioinformatics tool, CKSAAP_OGlySite, was developed to predict mucin-type O-glycosylation serine/threonine (S/T) sites in mammalian proteins. Using the composition of k-spaced amino acid pairs (CKSAAP) based encoding scheme, the proposed method was trained and tested in a new and stringent O-glycosylation dataset with the assistance of Support Vector Machine (SVM). When the ratio of O-glycosylation to non-glycosylation sites in training datasets was set as 1:1, 10-fold cross-validation tests showed that the proposed method yielded a high accuracy of 83.1% and 81.4% in predicting O-glycosylated S and T sites, respectively. Based on the same datasets, CKSAAP_OGlySite resulted in a higher accuracy than the conventional binary encoding based method (about +5.0%). When trained and tested in 1:5 datasets, the CKSAAP encoding showed a more significant improvement than the binary encoding. We also merged the training datasets of S and T sites and integrated the prediction of S and T sites into one single predictor (i.e. S+T predictor). Either in 1:1 or 1:5 datasets, the performance of this S+T predictor was always slightly better than those predictors where S and T sites were independently predicted, suggesting that the molecular recognition of O-glycosylated S/T sites seems to be similar and the increase of the S+T predictor's accuracy may be a result of expanded training datasets. Moreover, CKSAAP_OGlySite was also shown to have better performance when benchmarked against two existing predictors.

Conclusion

Because CKSAAP encoding can reflect characteristics of the sequences surrounding mucin-type O-glycosylation sites, CKSAAP_OGlySite has proved more powerful than the conventional binary encoding based method. This suggests that it can serve the biological community as a competitive mucin-type O-glycosylation site predictor. CKSAAP_OGlySite is now available at http://bioinformatics.cau.edu.cn/zzd_lab/CKSAAP_OGlySite/.
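
The CKSAAP encoding named in the Results can be sketched as follows. For each gap k, the code counts residue pairs separated by exactly k positions in a sequence window and normalizes by the number of valid pairs. The default `k_max`, the handling of non-standard characters and the normalization are assumptions made for illustration; the abstract does not state the settings CKSAAP_OGlySite uses.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(window, k_max=2):
    """Composition of k-spaced amino acid pairs for k = 0..k_max.

    Returns a flat list of 400 * (k_max + 1) frequencies. Pairs that
    contain a non-standard character (e.g. padding) are skipped.
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = 0
        for i in range(len(window) - k - 1):
            pair = window[i] + window[i + k + 1]   # residues k positions apart
            if pair in counts:
                counts[pair] += 1
                n_pairs += 1
        features.extend(counts[p] / n_pairs if n_pairs else 0.0 for p in PAIRS)
    return features
```

For the toy window `"ASTSA"` with `k_max=1`, the vector has 800 entries; the adjacent pair `AS` has frequency 0.25 and the one-gap pair `SS` has frequency 1/3. Vectors like this, built around each candidate S/T site, would be the input to the SVM.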

18.
Shi SP  Qiu JD  Sun XY  Suo SB  Huang SY  Liang RP 《PloS one》2012,7(6):e38772
Protein methylation is predominantly found on lysine and arginine residues, and carries many important biological functions, including gene regulation and signal transduction. Given their important involvement in gene expression, protein methylation and its regulatory enzymes are implicated in a variety of human disease states such as cancer, coronary heart disease and neurodegenerative disorders. Thus, identification of methylation sites can be very helpful for drug design for various related diseases. In this study, we developed a method called PMeS to improve the prediction of protein methylation sites based on an enhanced feature encoding scheme and support vector machine. The enhanced feature encoding scheme was composed of the sparse property coding, normalized van der Waals volume, position weight amino acid composition and accessible surface area. The PMeS achieved a promising performance with a sensitivity of 92.45%, a specificity of 93.18%, an accuracy of 92.82% and a Matthew's correlation coefficient of 85.69% for arginine as well as a sensitivity of 84.38%, a specificity of 93.94%, an accuracy of 89.16% and a Matthew's correlation coefficient of 78.68% for lysine in 10-fold cross validation. Compared with other existing methods, the PMeS provides better predictive performance and greater robustness. It can be anticipated that the PMeS might be useful to guide future experiments needed to identify potential methylation sites in proteins of interest. The online service is available at http://bioinfo.ncu.edu.cn/inquiries_PMeS.aspx.
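
Several abstracts in this list report sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC). As a reminder of how the four relate, the sketch below computes them from a binary confusion matrix using the standard definitions. The function name is hypothetical, and the example counts are made up for illustration; they are not taken from any of the papers.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes at least one positive and one negative example, so the
    sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

MCC ranges from -1 to 1, so a value reported as a percentage (as in this abstract) corresponds to the same quantity multiplied by 100.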

19.

Background

DNA-binding proteins are vital for the study of cellular processes. In recent genome engineering studies, the identification of proteins with certain functions has become increasingly important and needs to be performed rapidly and efficiently. In previous years, several approaches have been developed to improve the identification of DNA-binding proteins. However, the currently available resources are insufficient to accurately identify these proteins. Because of this, the previous research has been limited by the relatively unbalanced accuracy rate and the low identification success of the current methods.

Results

In this paper, we explored the practicality of modelling DNA binding identification and simultaneously employed an ensemble classifier, and a new predictor (nDNA-Prot) was designed. The presented framework comprises two stages: a 188-dimension feature extraction method to obtain the protein structure and an ensemble classifier designated as imDC. Experiments using different datasets showed that our method is more successful than the traditional methods in identifying DNA-binding proteins. Features were selected using the minimum Redundancy and Maximum Relevance (mRMR) criterion. An accuracy rate of 95.80% and an Area Under the Curve (AUC) value of 0.986 were obtained in cross-validation. On a test dataset, our method achieved 86% accuracy, versus 76% using iDNA-Prot and 68% using DNA-Prot.

Conclusions

Our method can help to accurately identify DNA-binding proteins, and the web server is accessible at http://datamining.xmu.edu.cn/~songli/nDNA. In addition, we also predicted possible DNA-binding protein sequences in all of the sequences from the UniProtKB/Swiss-Prot database.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-298) contains supplementary material, which is available to authorized users.

20.
Intrinsically disordered proteins or regions perform important biological functions through their dynamic conformations during binding. Thus, accurate identification of these disordered regions has significant implications for proper annotation of function, induced fold prediction and drug design to combat critical diseases. We introduce DisPredict, a disorder predictor that employs a single support vector machine with RBF kernel and novel features for reliable characterization of protein structure. DisPredict yields effective performance. In addition to 10-fold cross validation, training and testing of DisPredict was conducted with independent test datasets. The results were consistent, with both the training and test error minimal. The use of multiple data sources makes the predictor generic. The datasets used in developing the model include disordered regions of various lengths, categorized as short and long, with different compositions and different types of disorder, ranging from fully to partially disordered regions, as well as completely ordered regions. Through comparison with other state of the art approaches and case studies, DisPredict is found to be a useful tool with competitive performance. DisPredict is available at https://github.com/tamjidul/DisPredict_v1.0.
