Similar literature
 Found 19 similar articles; search took 171 ms
1.
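Protein fold recognition based on a support vector machine fusion network   Total citations: 11, self-citations: 1, citations by others: 11
Protein fold recognition is an important method for analyzing protein structure without relying on sequence similarity. A three-layer support vector machine fusion network is proposed that recognizes 27 fold classes starting from the amino acid sequence of a protein. The fusion network uses support vector machines as member classifiers and adopts a "many-versus-many" multi-class classification strategy. The six kinds of fold features are divided into primary and secondary features, several diverse fusion schemes are built from them, and the final decision is obtained by dynamic selection among these schemes. When it is hard to determine before classification which combination of feature types will give the best result, the method offers a reliable way to select automatically the combination with the greatest complementarity of feature information, ensuring the best classification result. The recognition system reached an overall classification accuracy of 61.04% on independent test samples. The results and comparisons show that this is an effective method for fold recognition.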

2.
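A protein fold database covering 27 fold types was built from 1895 protein sequences with sequence similarity below 40%. Starting from the protein sequence, a combined vector of motif frequencies, low-frequency power spectral density values, amino acid composition, predicted secondary structure information, and autocorrelation function values was used to represent the sequence information. Using the support vector machine algorithm with an ensemble classification strategy, the fold types of the 27 fold classes were predicted, with an accuracy of 66.67% in the independent test. With the same feature parameters and algorithm, the four structural classes of the 27 fold classes were also predicted, with an independent-test accuracy of 89.24%. Applying the same method to the 27-class fold database used by previous researchers gave better prediction results than those reported previously.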

3.
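Based on the conservation of amino acids in protein folds, and using amino acids together with their polarity, charge, and hydrophilicity-hydrophobicity as parameters, 27 fold classes were recognized starting from the amino acid sequence of a protein. The method adopts a "one-versus-rest" classification strategy, builds scoring matrices, selects amino acid sequence pattern fragments, and applies five similarity scoring functions; the best prediction accuracy reached 83.46%. The results show that scoring matrices are an effective method for predicting multi-class protein folds.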

4.
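Constructing a substitution matrix for all-α proteins based on the folding core   Total citations: 1, self-citations: 0, citations by others: 1
Amino acid residue substitution matrices are an important factor in the quality of multiple sequence alignment, and existing substitution matrices perform poorly when aligning low-similarity sequences. Building on the existing BLOSUM substitution matrix algorithm, sequence-structure blocks based on the folding core structure of proteins are defined, and a new amino acid residue substitution matrix based on the folding core structure of all-α proteins, TOPSSUM25, is proposed to improve the alignment of low-similarity sequences. TOPSSUM25 was imported into a multiple sequence alignment program. Tests on a set of four-helix bundle sequence-structure blocks with similarity below 25% show that multiple sequence alignment based on TOPSSUM25 is clearly better than that based on the BLOSUM30 matrix. An alignment test on a BAliBASE subset further shows that TOPSSUM25 outperforms BLOSUM30 in pairwise sequence alignment of all-α proteins. The results may help to further clarify the sequence-structure-function relationships of low-homology proteins.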
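
The abstract above builds on the BLOSUM algorithm without giving the details of how TOPSSUM25 is derived. As a rough illustration only, the following Python sketch shows the generic BLOSUM-style log-odds step: counting residue pairs in aligned block columns and scoring each pair as twice the base-2 logarithm of its observed over expected frequency. The function names and the toy columns are assumptions made for this example, and the sequence clustering used by the real BLOSUM procedure is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(columns):
    """Count unordered residue pairs within each aligned column."""
    counts = Counter()
    for column in columns:
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(counts):
    """BLOSUM-style scores in half-bit units: round(2 * log2(observed / expected))."""
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), q in observed.items():
        if a == b:
            background[a] += q
        else:
            background[a] += q / 2
            background[b] += q / 2

    scores = {}
    for (a, b), q in observed.items():
        # Expected frequency of an unordered pair under independence.
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(q / expected))
    return scores


# Toy alignment columns (hypothetical data, three sequences deep).
toy_columns = ["AAA", "AAL", "LLL"]
toy_scores = log_odds_scores(pair_counts(toy_columns))
```

Pairs that co-occur more often than chance get positive scores and rarer pairs get negative ones. A structure-based matrix of the kind described above would differ mainly in which blocks supply the columns.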

5.
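Classification method and classification database for protein fold types   Total citations: 1, self-citations: 0, citations by others: 1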
李晓琴  仁文科  刘岳  徐海松  乔辉 《生物信息学》2010,8(3):245-247,253
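The study of protein folding rules is a major frontier topic in the life sciences, and fold classification is the foundation of protein folding research. At present, protein fold types are classified largely by experts, and the classifications differ between databases, so a protein fold type database built on unified principles is urgently needed. Starting from proteins in the ASTRAL-1.65 database with sequence homology below 25% and resolution better than 2.5 Å, and through observation of protein spatial structures and analysis of fold type characteristics, this paper proposes a fold type classification method that is centered on the protein folding core, guided by the principle of topological invariance of protein structure, and based on the composition, connection, and spatial arrangement of the regular structural fragments of the folding core. A low-similarity protein fold classification database, LIFCA, was established, containing 259 protein fold types. The database lays a foundation for further protein fold modeling and data mining, protein fold recognition, and studies of the evolution of protein fold structures.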

6.
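Classification of protein homo-oligomers based on amino acid composition distribution   Total citations: 7, self-citations: 0, citations by others: 7
Based on a new feature extraction method, amino acid composition distribution, four classes of homo-oligomers were classified from the primary sequence of proteins, using support vector machines as member classifiers and a "one-versus-one" multi-class classification strategy. The results show that, under the 10-CV (10-fold cross-validation) test, the overall classification accuracy and the accuracy index based on amino acid composition distribution reached 86.22% and 67.12%, respectively. These values are 5.74 and 10.03 percentage points higher than those of the traditional feature extraction method based on amino acid composition, and 3.12 and 5.63 percentage points higher than those of the dipeptide composition method, indicating that amino acid composition distribution is a very effective feature extraction method for classifying protein homo-oligomers. Combining amino acid composition distribution with the sequence length feature raised the overall classification accuracy and accuracy index to 86.35% and 67.23%, respectively, indicating that sequence length carries some spatial structure information.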
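
The abstract does not define the amino acid composition distribution feature, so it is not reproduced here. The sketch below only illustrates the two baseline features it is compared against: amino acid composition (20 values) and dipeptide composition (400 values). The function names are hypothetical, and dropping non-standard residues before counting is an assumption of this example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def clean(sequence):
    """Keep only the 20 standard residues (an assumption of this sketch)."""
    return "".join(r for r in sequence.upper() if r in AMINO_ACIDS)


def amino_acid_composition(sequence):
    """Return the fraction of each of the 20 standard residues."""
    seq = clean(sequence)
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 ordered adjacent residue pairs."""
    seq = clean(sequence)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n = max(len(seq) - 1, 1)
    return [pairs[a + b] / n for a, b in product(AMINO_ACIDS, repeat=2)]


# Toy sequence (hypothetical): three alanines and one cysteine.
aac = amino_acid_composition("AACA")
dpc = dipeptide_composition("AACA")
```

Either vector could then be passed to a classifier such as an SVM. The dipeptide vector also captures local residue order, which plain composition ignores.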

7.
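Prediction of membrane protein folding types based on fuzzy support vector machines   Total citations: 1, self-citations: 0, citations by others: 1
Existing methods that predict membrane protein folding types with support vector machines (SVM) do not make full use of protein sequence features, and they leave unclassifiable regions when handling multi-class protein classification problems. To address these two problems, the amino acid composition and dipeptide composition features of the protein sequence are extracted, weighted multi-order correlation coefficient features of amino acid residue indices are computed, and the three kinds of features are fused into the input feature vector of the classifier. A fuzzy SVM (FSVM) algorithm is used to classify the data that a traditional SVM cannot separate. Tests on a non-redundant dataset show that the improved feature extraction method outperforms existing feature extraction methods under the same classification algorithm, and that FSVM outperforms the traditional SVM under the same feature extraction method. The classification strategy that combines the two reached a prediction accuracy of 96.6% on an independent test dataset, better than a number of existing prediction methods, and can serve as an effective tool for predicting the folding types of membrane proteins and other proteins.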

8.
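Protein fold pattern recognition is an important method for analyzing protein structure. With proteins of low sequence similarity as the training set, information such as sequence frequency counts and hydrophobicity was extracted as fold type features. A dataset of 1,393 fold patterns was built from classified proteins in the SCOP database, and an SVM was used to predict the 1,393 fold patterns. The accuracy was 99.6122% in the closed test and 79.6329% in the SCOP-based open test. In an open test on another authoritative test set, the fold accuracy was 64.7059% and the SCOP class accuracy was 76.4706%. The method can effectively predict protein fold patterns and thus provide a reference for ab initio protein structure prediction.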

9.
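Taking as the starting point that proteins fold into stable structure types of minimum free energy, and aiming to reveal the dynamical nature of protein spatial folding, a non-homologous protein database was analyzed with the amino acid frequencies and the autocovariance function of the protein sequence as the feature vector. The eigenvalues of the covariance matrix, which characterizes the coupling and cooperative interactions among the components of the feature vector, were computed. Compared with Chou's method, this approach reflects the degeneracy, global nature, and ambiguity of the protein folding code more fully, and it provides a dynamical parameter analysis method for quantitatively characterizing proteins that fold into different structural classes.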
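
The abstract does not state which residue property or which autocovariance definition was used. The following Python sketch assumes one common form: the sequence is first mapped to one numeric value per residue (for example a hydrophobicity index, which is an assumption here), and the autocovariance at lag k is the mean product of mean-centered values k positions apart. The function name and the toy profile are hypothetical.

```python
def autocovariance(values, max_lag):
    """Autocovariance of a numeric per-residue profile for lags 1..max_lag.

    AC(k) = 1 / (N - k) * sum_i (x_i - mean) * (x_{i+k} - mean)
    """
    n = len(values)
    if not 0 < max_lag < n:
        raise ValueError("max_lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


# Toy numeric profile (hypothetical per-residue property values).
profile = [1.0, 2.0, 3.0, 4.0]
features = autocovariance(profile, max_lag=2)
```

Concatenating these lag values with the 20 amino acid frequencies would give one feature vector per protein, from which a covariance matrix across proteins and its eigenvalues could then be computed with a linear algebra library.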

10.
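The study of protein folding rules is one of the important frontier topics in the life sciences, and the classification of protein fold types is the foundation for studying those rules. Based on the fold type classification of the SCOP database, and taking as the object of study the fold types of the α, β, α+β, and α/β classes in the Astral SCOPe 2.05 database with similarity below 40%, this study built templates for 989 protein fold types and assembled them into a template database. A fold type classification method was established on the basis of the fold type design templates, achieving automated classification of SCOP protein fold types. For the family templates, the mean sensitivity, specificity, and MCC were 95.00%, 99.99%, and 0.94 in the self-consistency test and 90.00%, 99.97%, and 0.92 in the independence test. For the fold type templates, the corresponding means were 93.71%, 99.97%, and 0.91 in the self-consistency test and 86.00%, 99.93%, and 0.87 in the independence test. The results show that the template design is sound and can be used effectively to classify proteins of known structure.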
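
For readers less familiar with the three measures reported above, this short Python sketch computes sensitivity, specificity, and the Matthews correlation coefficient (MCC) from the counts of a binary confusion matrix using their standard definitions. The function name and the example counts are made up for illustration and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for one fold type treated as the positive class.
sens, spec, mcc = binary_metrics(tp=90, fp=10, tn=890, fn=10)
```

With many fold types, each type is usually scored one-versus-rest in this way and the values are averaged. Negatives then vastly outnumber positives, so specificity near 100% is expected and MCC is the more informative figure.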

11.
12.
The recognition of protein folds is an important step in the prediction of protein structure and function. Recently, an increasing number of researchers have sought to improve the methods for protein fold recognition. Since Ding and Dubchak constructed a dataset of 27 protein fold classes in 2001, the prediction algorithms, feature parameters, and datasets used for protein fold prediction have all improved. In this study, we reorganized a dataset of 76 fold classes constructed by Liu et al. and used the values of the increment of diversity, average chemical shifts of secondary structure elements, and secondary structure motifs as feature parameters in the recognition of multi-class protein folds. With the combined feature vector as the input to the Random Forests algorithm and an ensemble classification strategy, we propose a novel method to identify the 76 protein fold classes. The overall accuracy on the test dataset in an independent test was 66.69%; when the training and test sets were combined, the overall accuracy with 5-fold cross-validation was 73.43%. This method was further used to predict the test dataset and the corresponding structural classification of the first 27-class protein fold dataset, resulting in overall accuracies of 79.66% and 93.40%, respectively. Moreover, when the training and test sets were combined, the accuracy using 5-fold cross-validation was 81.21%. Additionally, this approach gave improved prediction results on the 27-class protein fold dataset constructed by Ding and Dubchak.

13.
Identification of protein fold types has usually been based on the 27-class fold dataset provided by Ding & Dubchak in 2001. With the avalanche of protein sequences, however, fold data are also expanding, so improving the existing dataset and covering more fold types is an inevitable trend. In this paper, we construct a multi-class protein fold dataset that contains 3,457 protein chains with sequence identity below 35%, classified into 76 fold types. It is 4 times larger than Ding & Dubchak's dataset. Furthermore, we propose a novel support vector machine approach based on optimal features. Combining motif frequency, low-frequency power spectral density, amino acid composition, predicted secondary structure, and auto-correlation function values as the feature parameter set, the method applies the criterion of maximum correlation and minimum redundancy to filter these features and obtain a 95-dimensional optimal feature subset. Based on the ensemble classification strategy, with the 95-dimensional optimal features as the input to the support vector machine, we identify the 76-class protein folds, and the overall accuracy reaches 44.92% in an independent test. In addition, the method was used to identify the upgraded 27-class protein folds, achieving an overall accuracy of 66.56%. Finally, we tested our method on Ding & Dubchak's 27-class fold dataset and obtained better identification results than most previously reported results.

14.
MOTIVATION: What constitutes a baseline level of success for protein fold recognition methods? As fold recognition benchmarks are often presented without any thought to the results that might be expected from a purely random set of predictions, an analysis of fold recognition baselines is long overdue. Given varying amounts of basic information about a protein, ranging from the length of the sequence to a knowledge of its secondary structure, to what extent can the fold be determined by intelligent guesswork? Can simple methods that make use of secondary structure information assign folds more accurately than purely random methods, and could these methods be used to construct viable hierarchical classifications? EXPERIMENTS PERFORMED: A number of rapid automatic methods which score similarities between protein domains were devised and tested. These methods ranged from those that incorporated no secondary structure information, such as measuring absolute differences in sequence lengths, to more complex alignments of secondary structure elements. Each method was assessed for accuracy by comparison with the Class Architecture Topology Homology (CATH) classification. Methods were rated against both a random baseline fold assignment method as a lower control and FSSP as an upper control. Similarity trees were constructed in order to evaluate the accuracy of optimum methods at producing a classification of structure. RESULTS: Using a rigorous comparison of methods with CATH, the random fold assignment method set a lower baseline of 11% true positives allowing for 3% false positives, and FSSP set an upper benchmark of 47% true positives at 3% false positives. The optimum secondary structure alignment method used here achieved 27% true positives at 3% false positives. Using a less rigorous Critical Assessment of Structure Prediction (CASP)-like sensitivity measurement, the random assignment achieved 6%, FSSP 59%, and the optimum secondary structure alignment method 32%. Similarity trees produced by the optimum method illustrate that these methods cannot be used alone to produce a viable protein structural classification system. CONCLUSIONS: Simple methods that use perfect secondary structure information to assign folds cannot produce an accurate protein taxonomy; however, they do provide useful baselines for fold recognition. In terms of a typical CASP assessment, our results suggest that approximately 6% of targets with folds in the databases could be assigned correctly by randomly guessing, and as many as 32% could be recognised by trivial secondary structure comparison methods, given knowledge of their correct secondary structures.

15.
Here we perform a systematic exploration of the use of distance constraints derived from small angle X-ray scattering (SAXS) measurements to filter candidate protein structures for the purpose of protein structure prediction. This is an intrinsically more complex task than that of applying distance constraints derived from NMR data, where the identity of the pair of amino acid residues subject to a given distance constraint is known. SAXS, on the other hand, yields a histogram of pair distances (pair distribution function), but the identities of the pairs contributing to a given bin of the histogram are not known. Our study is based on an extension of the Levitt-Hinds coarse-grained approach to ab initio protein structure prediction to generate a candidate set of C(alpha) backbones. In spite of the lack of specific residue information inherent in the SAXS data, our study shows that the implementation of a SAXS filter is capable of effectively purifying the set of native structure candidates and thus provides a substantial improvement in the reliability of protein structure prediction. We test the quality of our predicted C(alpha) backbones by doing structural homology searches against the Dali domain library, and find that the results are very encouraging. In spite of the lack of local structural details and limited modeling accuracy at the C(alpha) backbone level, we find that useful information about fold classification can be extracted from this procedure. This approach thus provides a way to use a SAXS-based structure prediction algorithm to generate potential structural homologies in cases where lack of sequence homology prevents identification of candidate folds for a given protein. Thus our approach has the potential to help in determination of the biological function of a protein based on structural homology instead of sequence homology.

16.
The quest to order and classify protein structures has led to various classification schemes, focusing mostly on hierarchical relationships between structural domains. At the coarsest classification level, such schemes typically identify hundreds of types of fundamental units called folds. As a result, we picture protein structure space as a collection of isolated fold islands. It is obvious, however, that many protein folds share structural and functional commonalities. Locating those commonalities is important for our understanding of protein structure, function, and evolution. Here, we present an alternative view of the protein fold space, based on an interfold similarity measure that is related to the frequency of fragments shared between folds. In this view, protein structures form a complicated, cross-connected network with very interesting topology. We show that interfold similarity based on sequence/structure fragments correlates well with similarities of functions between protein populations in different folds.

17.
Current classification systems for protein structure show many inconsistencies both within and between systems. The metafold concept was introduced to identify fold similarities by consensus and thus provide a more unified view of fold space. Using cradle-loop barrels as an example, we propose to use the metafold as the next hierarchical level above the fold, encompassing a group of topologically related folds for which a homologous relationship has been substantiated. We see this as an important step on the way to a classification of proteins by natural descent.

18.

Background  

Domain experts manually construct the Structural Classification of Proteins (SCOP) database to categorize and compare protein structures. Although classification in the SCOP database is believed to be more reliable than the results of other methods, it is labor intensive. To mimic human classification processes, we develop an automatic SCOP fold classification system to assign possible known SCOP folds and recognize novel folds for newly discovered proteins.

19.
The problem of protein tertiary structure prediction from primary sequence can be separated into two subproblems: generation of a library of possible folds and specification of a best fold given the library. A distance geometry procedure based on random pairwise metrization with good sampling properties was used to generate a library of 500 possible structures for each of 11 small helical proteins. The input to distance geometry consisted of sets of restraints to enforce predicted helical secondary structure and a generic range of 5 to 11 Å between predicted contact residues on all pairs of helices. For each of the 11 targets, the resulting library contained structures with low RMSD versus the native structure. Near-native sampling was enhanced by at least three orders of magnitude compared to a random sampling of compact folds. All library members were scored with a combination of an all-atom distance-dependent function, a residue pair-potential, and a hydrophobicity function. In six of the 11 cases, the best-ranking fold was considered to be near native. Each library was also reduced to a final ab initio prediction via consensus distance geometry performed over the 50 best-ranking structures from the full set of 500. The consensus results were of generally higher quality, yielding six predictions within 6.5 Å of the native fold. These favorable predictions corresponded to those for which the correlation between the RMSD and the scoring function was highest. The advantage of the reported methodology is its extreme simplicity and potential for including other types of structural restraints.
