首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 296 毫秒
1.
基于模糊支持向量机的膜蛋白折叠类型预测   总被引:1,自引:0,他引:1  
现有的基于支持向量机(support vector machine,SVM)来预测膜蛋白折叠类型的方法.利用的蛋白质序列特征并不充分.并且在处理多类蛋白质分类问题时存在不可分区域,针对这两类问题.提取蛋白质序列的氨基酸和二肽组成特征,并计算加权的多阶氨基酸残基指数相关系数特征,将3类特征融和作为分类器的输入特征矢量.并采用模糊SVM(fuzzy SVM,FSVM)算法解决对传统SVM不可分数据的分类.在无冗余的数据集上测试结果显示.改进的特征提取方法在相同分类算法下预测性能优于已有的特征提取方法:FSVM在相同特征提取方法下性能优于传统的SVM.二者相结合的分类策略在独立性数据集测试下的预测精度达到96.6%.优于现有的多种预测方法.能够作为预测膜蛋白和其它蛋白质折叠类型的有效工具.  相似文献   

2.
有关蛋白质功能的研究是解析生命奥秘的基础,机器学习技术在该领域已有广泛应用。利用支持向量机(support vectormachine,SVM)方法,构建一个预测蛋白质功能位点的通用平台。该平台先提取非同源蛋白质序列,再对这些序列进行特征编码(包括序列的基本信息、物化特征、结构信息及序列保守性特征等),以编码好的样本作为训练数据,利用SVM进行训练,得到敏感性、特异性、Matthew相关系数、准确率及ROC曲线等评价指标,反复测试,得到评价指标最优的SVM模型后,便可以用来预测蛋白质序列上的功能位点。该平台除了应用在预测蛋白质功能位点之外,还可以应用于疾病相关单核苷酸多态性(SNP)预测分析、预测蛋白质结构域分析、生物分子间的相互作用等。  相似文献   

3.
蛋白质折叠模式识别是一种分析蛋白质结构的重要方法。以序列相似性较低的蛋白质为训练集,提取蛋白质序列信息频数及疏水性等信息作为折叠类型特征,从SCOP数据库中已分类蛋白质构建1 393种折叠模式的数据集,采用SVM预测蛋白质1 393种折叠模式。封闭测试准确率达99.612 2%,基于SCOP的开放测试准确率达79.632 9%。基于另一个权威测试集的开放测试折叠准确率达64.705 9%,SCOP类准确率达76.470 6%,可以有效地对蛋白质折叠模式进行预测,从而为蛋白质从头预测提供参考。  相似文献   

4.
基于最近邻居算法,从蛋白质一级序列出发,利用蛋白质序列氨基酸组成、二肤组成以及混合组成方法对蛋白质单聚体、二聚体、三聚体、四聚体、五聚体、六聚体和八聚体进行分类研究。结果表明:采用二肽组成编码方法的预洲效果最好,Jackknife检验和独立测试集检验的总体预测精度分别达到90.83%和95.48%,比相同数据集上基于伪氨基酸组成和组分耦合预测的方法提高了12和15个百分点;特别是对于五聚体蛋白,预测精度分别提高了90和50个百分点;说明二肽组成对于蛋白质四级结构分类研究是一种非常有效的特征提取方法。  相似文献   

5.
李楠  李春 《生物信息学》2012,10(4):238-240
基于氨基酸的16种分类模型,给出蛋白质序列的派生序列,进而结合加权拟熵和LZ复杂度构造出34维特征向量来表示蛋白质序列。借助于贝叶斯分类器对同源性不超过25%的640数据集进行蛋白质结构类预测,准确度达到71.28%。  相似文献   

6.
DNA结合蛋白(DNA-binding proteins,DBPs)的鉴定在原核和真核生物的基因和蛋白质功能注释研究中具有十分重要的意义.本研究首次运用间隔二肽组分(gapped-dipeptide composition,Gap DPC)结合递归特征消除法(recursive feature elimination,RFE)鉴定DBPs.首先获得待测蛋白质氨基酸序列的位置特异性得分矩阵(position specific scoring matrix,PSSM),在此基础上提取蛋白质的Gap DPC特征,通过RFE法选择最优特征,然后利用支持向量机(support vector machine,SVM)作为分类器,在蛋白质序列数据集PDB396和LB1068中进行夹克刀交叉验证(jackknife cross validation test).研究结果显示,基于PDB396和LB1068数据集,DBPs预测的准确率、Matthews相关系数、敏感性和特异性分别达到93.43%、0.86、89.04%和96.00%,以及86.33%、0.73、86.49%和86.18%,明显优于文献报道中的相关方法,为DBPs的鉴定提供了新的模型.  相似文献   

7.
集成改进KNN算法预测蛋白质亚细胞定位   总被引:1,自引:0,他引:1       下载免费PDF全文
基于Adaboost算法对多个相似性比对K最近邻(K-nearest neighbor,KNN)分类器集成实现蛋白质的亚细胞定位预测。相似性比对KNN算法分别以氨基酸组成、二肽、伪氨基酸组成为蛋白序列特征,在KNN的决策阶段使用Blast比对决定蛋白质的亚细胞定位。在Jackknife检验下,Adaboost集成分类算法提取3种蛋白序列特征,3种特征在数据集CH317和Gram1253的最高预测成功率分别为92.4%和93.1%。结果表明Adaboost集成改进KNN分类预测方法是一种有效的蛋白质亚细胞定位预测方法。  相似文献   

8.
药物研发是非常重要但也十分耗费人力物力的过程。利用计算机辅助预测药物与蛋白质亲和力的方法可以极大地加快药物研发过程。药物靶标亲和力预测的关键在于对药物和蛋白质进行准确详细地信息表征。提出一种基于深度学习与多层次信息融合的药物靶标亲和力的预测模型,试图通过综合药物与蛋白质的多层次信息,来获得更好的预测表现。首先将药物表述成分子图和扩展连接指纹两种形式,分别利用图卷积神经网络模块和全连接层进行学习;其次将蛋白质序列和蛋白质K-mer特征分别输入卷积神经网络模块和全连接层来学习蛋白质潜在特征;随后将4个通道学习到的特征进行融合,再利用全连接层进行预测。在两个基准药物靶标亲和力数据集上验证了所提方法的有效性,并与其他已有模型作对比研究。结果说明提出的模型相比基准模型能得到更好的预测性能,表明提出的综合药物与蛋白质多层次信息的药物靶标亲和力预测策略是有效的。  相似文献   

9.
目的:基于支持向量机建立一个自动化识别新肽链四级结构的方法,提高现有方法的识别精度.方法:改进4种已有的蛋白质一级序列特征值提取方法,采用线性和非线性组合预测方法建立一个有效的组合预测模型.结果:以同源二聚体及非同源二聚体为例.对4种特征值提取方法进行改进后其分类精度均提升了2~3%;进一步实施线性与非线性组合预测后,其分类精度再次提高了2~3%,使独立测试集的分类精度达到了90%以上.结论:4种特征值提取方法均较好地反应出蛋白质一级序列包含四级结构信息,组合预测方法能有效地集多种特征值提取方法优势于一体.  相似文献   

10.
通过研究神经网络权值矩阵的算法,挖掘蛋白质二级结构与氨基酸序列间的内在规律,提高一级序列预测二级结构的准确度。神经网络方法在特征分类方面具有良好表现,经过学习训练后的神经元连接权值矩阵包含样本的内在特征和规律。研究使用神经网络权值矩阵打分预测;采用错位比对方法寻找敏感的氨基酸邻域;分析测试集在不同加窗长度下的共性表现。实验表明,在滑动窗口长度L=7时,预测性能变化显著;邻域位置P=4的氨基酸残基对预测性能有加强作用。该研究方法为基于局部序列特征的蛋白质二级结构预测提供了新的算法设计。  相似文献   

11.
Predict potential drug targets from the ion channel proteins based on SVM   总被引:1,自引:0,他引:1  
The identification of molecular targets is a critical step in the drug discovery and development process. Ion channel proteins represent highly attractive drug targets implicated in a diverse range of disorders, in particular in the cardiovascular and central nervous systems. Due to the limits of experimental technique and low-throughput nature of patch-clamp electrophysiology, they remain a target class waiting to be exploited. In our study, we combined three types of protein features, primary sequence, secondary structure and subcellular localization to predict potential drug targets from ion channel proteins applying classical support vector machine (SVM) method. In addition, our prediction comprised two stages. In stage 1, we predicted ion channel target proteins based on whole-genome target protein characteristics. Firstly, we performed feature selection by Mann-Whitney U test, then made predictions to identify potential ion channel targets by SVM and designed a new evaluating indicator Q to prioritize results. In stage 2, we made a prediction based on known ion channel target protein characteristics. Genetic algorithm was used to select features and SVM was used to predict ion channel targets. Then, we integrated results of two stages, and found that five ion channel proteins appeared in both prediction results including CGMP-gated cation channel beta subunit and Gamma-aminobutyric acid receptor subunit alpha-5, etc., and four of which were relative to some nerve diseases. It suggests that these five proteins are potential targets for drug discovery and our prediction strategies are effective.  相似文献   

12.
We have introduced a new method of protein secondary structure prediction which is based on the theory of support vector machine (SVM). SVM represents a new approach to supervised pattern classification which has been successfully applied to a wide range of pattern recognition problems, including object recognition, speaker identification, gene function prediction with microarray expression profile, etc. In these cases, the performance of SVM either matches or is significantly better than that of traditional machine learning approaches, including neural networks.The first use of the SVM approach to predict protein secondary structure is described here. Unlike the previous studies, we first constructed several binary classifiers, then assembled a tertiary classifier for three secondary structure states (helix, sheet and coil) based on these binary classifiers. The SVM method achieved a good performance of segment overlap accuracy SOV=76.2 % through sevenfold cross validation on a database of 513 non-homologous protein chains with multiple sequence alignments, which out-performs existing methods. Meanwhile three-state overall per-residue accuracy Q(3) achieved 73.5 %, which is at least comparable to existing single prediction methods. Furthermore a useful "reliability index" for the predictions was developed. In addition, SVM has many attractive features, including effective avoidance of overfitting, the ability to handle large feature spaces, information condensing of the given data set, etc. The SVM method is conveniently applied to many other pattern classification tasks in biology.  相似文献   

13.
探讨原发性肝癌患者精确放疗后乙型肝炎病毒(hepatitis b virus,HBV)再激活的危险特征和分类预测模型。提出基于遗传算法的特征选择方法,从原发性肝癌数据的初始特征集中选择HBV再激活的最优特征子集。建立贝叶斯和支持向量机的HBV再激活分类预测模型,并预测最优特征子集和初始特征集的分类性能。实验结果表明,基于遗传算法的特征选择提高了HBV再激活分类性能,最优特征子集的分类性能明显优于初始特征子集的分类性能。影响HBV再激活的最优特征子集包括:HBV DNA水平,肿瘤分期TNM,Child-Pugh,外放边界和全肝最大剂量。贝叶斯的分类准确性最高可达82.89%,支持向量机的分类准确性最高可达83.34%。  相似文献   

14.
Protein function classification via support vector machine approach   总被引:2,自引:0,他引:2  
Support vector machine (SVM) is introduced as a method for the classification of proteins into functionally distinguished classes. Studies are conducted on a number of protein classes including RNA-binding proteins; protein homodimers, proteins responsible for drug absorption, proteins involved in drug distribution and excretion, and drug metabolizing enzymes. Testing accuracy for the classification of these protein classes is found to be in the range of 84-96%. This suggests the usefulness of SVM in the classification of protein functional classes and its potential application in protein function prediction.  相似文献   

15.
Elucidation of the interaction of proteins with different molecules is of significance in the understanding of cellular processes. Computational methods have been developed for the prediction of protein-protein interactions. But insufficient attention has been paid to the prediction of protein-RNA interactions, which play central roles in regulating gene expression and certain RNA-mediated enzymatic processes. This work explored the use of a machine learning method, support vector machines (SVM), for the prediction of RNA-binding proteins directly from their primary sequence. Based on the knowledge of known RNA-binding and non-RNA-binding proteins, an SVM system was trained to recognize RNA-binding proteins. A total of 4011 RNA-binding and 9781 non-RNA-binding proteins was used to train and test the SVM classification system, and an independent set of 447 RNA-binding and 4881 non-RNA-binding proteins was used to evaluate the classification accuracy. Testing results using this independent evaluation set show a prediction accuracy of 94.1%, 79.3%, and 94.1% for rRNA-, mRNA-, and tRNA-binding proteins, and 98.7%, 96.5%, and 99.9% for non-rRNA-, non-mRNA-, and non-tRNA-binding proteins, respectively. The SVM classification system was further tested on a small class of snRNA-binding proteins with only 60 available sequences. The prediction accuracy is 40.0% and 99.9% for snRNA-binding and non-snRNA-binding proteins, indicating a need for a sufficient number of proteins to train SVM. The SVM classification systems trained in this work were added to our Web-based protein functional classification software SVMProt, at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Our study suggests the potential of SVM as a useful tool for facilitating the prediction of protein-RNA interactions.  相似文献   

16.
X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed ‘PredPPCrys’ using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.  相似文献   

17.
Cai CZ  Han LY  Ji ZL  Chen YZ 《Proteins》2004,55(1):66-76
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, Support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100% respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.  相似文献   

18.
Background: Computational tools have been widely used in drug discovery process since they reduce the time and cost. Prediction of whether a protein is druggable is fundamental and crucial for drug research pipeline. Sequence based protein function prediction plays vital roles in many research areas. Training data, protein features selection and machine learning algorithms are three indispensable elements that drive the successfulness of the models. Methods: In this study, we tested the performance of different combinations of protein features and machine learning algorithms, based on FDA-approved small molecules’ targets, in druggable proteins prediction. We also enlarged the dataset to include the targets of small molecules that were in experiment or clinical investigation. Results: We found that although the 146-d vector used by Li et al. with neuron network achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression achieved best prediction accuracy on independent test set (89.55%) and on newly approved-targets. Enlarged dataset with targets of small molecules in experiment and clinical investigation were trained. Unfortunately, the best training accuracy was only 75.48%. In addition, we applied our models to predict potential targets for references in future study. Conclusions: Our study indicates the potential ability of word2vec in the prediction of druggable protein. And the training dataset of druggable protein should not be extended to targets that are lack of verification. The target prediction package could be found on https://github.com/pkumdl/target_prediction.  相似文献   

19.
MOTIVATION: Mutagenicity is among the toxicological end points that pose the highest concern. The accelerated pace of drug discovery has heightened the need for efficient prediction methods. Currently, most available tools fall short of the desired degree of accuracy, and can only provide a binary classification. It is of significance to develop a discriminative and informative model for the mutagenicity prediction. RESULTS: Here we developed a mutagenic probability prediction model addressing the problem, based on datasets covering a large chemical space. A novel molecular electrophilicity vector (MEV) is first devised to represent the structure profile of chemical compounds. An extended support vector machine (SVM) method is then used to derive the posterior probabilistic estimation of mutagenicity from the MEVs of the training set. The results show that our model gives a better performance than TOPKAT (http://www.accelrys.com) and other previously published methods. In addition, a confidence level related to the prediction can be provided, which may help people make more flexible decisions on chemical ordering or synthesis. AVAILABILITY: The binary program (ZGTOX_1.1) based on our model and samples of input datasets on Windows PC are available at http://dddc.ac.cn/adme upon request from the authors.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号