Similar documents
 20 similar documents found; search took 953 ms
1.
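Prediction of protein super-secondary structure is a very important intermediate step in tertiary structure prediction. Starting from the primary sequence, this paper predicts four classes of simple super-secondary structures in 5,793 proteins. With position-specific amino acids as parameters and three fragment-extraction schemes, predictions from the increment of diversity algorithm alone were unsatisfactory. Feeding the combined increment of diversity values into a support vector machine as feature parameters gave better results: the mean overall accuracy in 5-fold cross-validation reached 83.0%, and the Matthews correlation coefficients were above 0.71.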

2.
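Based on amino acid sequences, β-hairpin motif information is represented by scoring values, increment of diversity, autocorrelation function values, and distance values. These are fused by quadratic discriminant analysis to predict β-hairpin motifs in the ArchDB40 and EVA databases. The β-hairpin motifs used contain loops of 2 to 10 amino acids. With a sequence pattern length of 17 amino acids, the overall accuracies of 5-fold cross-validation for β-hairpins in the two databases reach 83.1% and 80.7%, with correlation coefficients of 0.59 and 0.61, better than previous results.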

3.
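Based on protein sequence composition information, a model combining increment of diversity with quadratic discriminant analysis (IDQD) is proposed for predicting protein-protein interactions and is applied to human protein interactions. The recognition accuracy in the self-consistency test reaches 75.89%, and the sensitivity and specificity in 3-fold cross-validation are 64.22% and 64.68%, respectively. The results show that the IDQD algorithm can be used to predict protein-protein interactions.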

4.
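Predicting membrane protein types with the random forest method   Total citations: 2, self-citations: 0, citations by others: 2
The type of a membrane protein is closely related to its function, so predicting the type is an important means of studying function, and predicting it from the amino acid sequence is of considerable value. Based on the amino acid sequences of proteins, this work uses combined increment of diversity and pseudo amino acid composition as prediction parameters and a random forest classifier to predict eight types of membrane proteins. The prediction accuracy is 86.3% in the jackknife test and 93.8% in the independent test, better than previous results.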

5.
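A protein fold database covering 27 fold types was built from 1,895 protein sequences with sequence similarity below 40%. Starting from the protein sequence, a combined vector of motif frequencies, low-frequency power spectral density values, amino acid composition, predicted secondary structure information, and autocorrelation function values represents the sequence information. Using a support vector machine with an overall classification strategy, the fold types of the 27 classes of protein folds were predicted, reaching an accuracy of 66.67% in the independent test. With the same feature parameters and algorithm, the four structural classes of the 27 folds were also predicted, reaching an accuracy of 89.24% in the independent test. Applying the same method to the 27-class fold database used by previous authors gave better results than those reported earlier.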

6.
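This paper applies a scoring function based on position weight matrices to protein secondary structure prediction for the first time. The CB513 database serves as the benchmark. First, 11-residue and 21-residue fragments are extracted from the database and divided into three sets according to the secondary structure type of the central residue. Position weight matrices are then built for each of the three sets, both for the 20 amino acids and for 6 reduced amino acid groups defined by hydrophobicity and hydrophilicity. Any query fragment X is compared with the three position weight matrices: the scoring function yields three scores, and X is assigned to the class with the highest score. Finally, the predictions are corrected within the allowed error range, giving a maximum prediction accuracy Q3 of 80%.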
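The scoring step described in this abstract can be sketched as follows. This is a minimal illustration, assuming log-odds position weight matrices with an additive pseudocount and a uniform background; the function names (`build_pwm`, `score`, `classify`), the class labels, and the toy 5-residue fragments are hypothetical and are not the authors' implementation, which uses 11- and 21-residue fragments from CB513.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_pwm(fragments, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length fragments.

    Assumption: uniform background frequency (1/20) and additive pseudocounts.
    """
    length = len(fragments[0])
    background = 1.0 / len(ALPHABET)
    total = len(fragments) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(frag[pos] for frag in fragments)
        column = {
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        pwm.append(column)
    return pwm

def score(pwm, fragment):
    """Sum the per-position log-odds weights for one fragment."""
    return sum(column[aa] for column, aa in zip(pwm, fragment))

def classify(pwms, fragment):
    """Assign the fragment to the class whose matrix scores it highest."""
    return max(pwms, key=lambda label: score(pwms[label], fragment))

# Toy fragments grouped by a made-up class label for the central residue.
training = {
    "H": ["AEELL", "AEKLL", "EELLK"],
    "E": ["VIVTV", "TVIVY", "VYVTV"],
    "C": ["GPGSN", "NGPGS", "PGSGD"],
}
pwms = {label: build_pwm(frags) for label, frags in training.items()}
predicted = classify(pwms, "AEELL")
```

The same functions would apply to a reduced 6-letter alphabet by mapping each residue to its group symbol before building and scoring the matrices.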

7.
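Li Jing, Wang Wei 《中国科学C辑》 (Science in China Series C) 2006, 36(6): 552-562
Sequence alignment is a common way to find structurally conserved regions in proteins, but its accuracy is low when sequence similarity falls below 30%, because different residues with similar structural and functional roles are often mismatched in the alignment. Residues with similar physicochemical properties can be grouped together, and the resulting reduced alphabet lowers the complexity of protein sequences while retaining their main information. Thus, if the 20 natural amino acid residues are grouped correctly, alignment accuracy can be improved. Based on the protein structure alignment database DAPS, this paper proposes a new method for grouping amino acid residues that also yields substitution matrices at different levels of reduction for use in sequence alignment. The soundness of the grouping is confirmed by a mutual entropy method, and the reduced alphabets are applied in sequence alignment to identify structurally conserved regions of proteins. The results show that reducing the residue alphabet to about 9 letters effectively improves alignment accuracy.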

8.
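Using homology-based cloning combined with RACE, the full-length cDNA of a ribosome-inactivating protein (RIP) from chickweed was cloned and named q3 (GenBank accession GQ870262). Sequence analysis shows that the open reading frame (ORF) of q3 is 780 bp long and encodes 259 amino acids. The G+C content of the sequence is 41.5%, close to that of most type I RIP genes. The protein encoded by q3, named Q3, has a theoretical molecular weight of 28.16 kD and a pI of 9.44, both close to those of type I ribosome-inactivating proteins, and it includes a signal peptide of 23 amino acids. Functional domain analysis found 3 protein kinase phosphorylation sites, 4 tyrosine protein kinase phosphorylation sites, and 7 N-myristoylation sites. Tertiary structure prediction found that 35.52% of the amino acid residues take part in α-helices, 24.32% form extended strands, and 40.15% form random coil. A phylogenetic tree built from the amino acid sequences of RIPs from chickweed and related species is largely consistent with the classical taxonomy.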

9.
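Yan Huajun, Zhang Yi 《生物信息学》 (Bioinformatics) 2004, 2(4): 19-24, 41
A BP (backpropagation) network with an added competitive layer was used to study the prediction of domain structural classes from protein secondary structure content. Embedding a competitive layer in the BP network markedly improved prediction performance. With only a small training set and a simple network structure, high accuracies were obtained: 97.62% self-consistency accuracy, 97.62% jackknife accuracy, and 90.74% average extrapolation accuracy. Given a more complete feature vector for domain structural classes and a more representative training set, the method could provide a new benchmark for protein domain structure classification.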

10.
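Lin Hao 《生物信息学》 (Bioinformatics) 2009, 7(4): 252-254
Because the subcellular location of a protein correlates strongly with its primary sequence, the increment of diversity is used to describe the similarity of amino acid composition and dipeptide composition between proteins, and a modified Mahalanobis discriminant (referred to here as the IDQD method) is used to predict the subcellular location of mycobacterial proteins. Jackknife tests were run on protein datasets at different sequence similarity levels. When the sequence similarity of the dataset is at most 70%, the prediction accuracy is stable at about 75%. On the full set of 852 proteins the prediction success rate reaches 87.7%, better than existing algorithms, which indicates that IDQD is an effective method for predicting the subcellular location of mycobacterial proteins.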

11.
Based on a conservation analysis of the 683 most recent experimentally verified sigma(70) promoter sequences of Escherichia coli K-12, conserved hexamer segments at different sites are found to play a key role in promoter regions, and a novel position-correlation scoring matrix (PCSM) algorithm for predicting sigma(70) promoters is presented. The predictive capacity of the algorithm is tested by 10-fold cross-validation. The results show that the overall prediction accuracy (sensitivity) and specificity are 91% and 81%, respectively. With the 683 experimentally verified sigma(70) promoters as the training set, the complete E. coli K-12 sequence (4,639,221 bp) was searched. All 683 experimentally verified sigma(70) promoters were identified, and some possible promoters were predicted.

12.
Identifying local conformational changes induced by subtle differences in amino acid sequences is critical for exploring the functional variations of proteins. In this study, we designed a computational scheme that uses a conditional random field to predict dihedral angle variations for different amino acid sequences. This computational tool achieved accuracies of 87% and 84% for φ and ψ, respectively, in 10-fold cross-validation on a large data set. The prediction accuracies of φ and ψ are positively correlated with each other for most of the 20 types of amino acids. Helical amino acids generally achieve higher prediction accuracy, while amino acids in beta sheets have higher accuracy in specific angular regions. The prediction accuracy of φ is negatively correlated with amino acid flexibility as represented by the Vihinen index. The prediction accuracy of φ can also be negatively correlated with the dispersion of the angle distribution.

13.
14.
Prediction of complex super-secondary structures is a key step in the study of protein tertiary structure. The strand-loop-helix-loop-strand (βαβ) motif is an important complex super-secondary structure in proteins. Many functional sites and active sites occur in polypeptides of βαβ motifs. Therefore, accurate prediction of βαβ motifs is very important for recognizing protein tertiary structure and studying protein function. In this study, the βαβ motif dataset was first constructed using the DSSP package. A statistical analysis was then performed on βαβ motifs and non-βαβ motifs. The target motif was selected, with the loop-α-loop length varying from 10 to 26 amino acids. The ideal fixed-length pattern comprised 32 amino acids. A Support Vector Machine algorithm was developed for predicting βαβ motifs, using sequence information together with predicted structure and function information to express the sequence features. The overall predictive accuracies of 5-fold cross-validation and the independent test were 81.7% and 76.7%, respectively. The Matthews correlation coefficients of the 5-fold cross-validation and the independent test were 0.63 and 0.53, respectively. The results demonstrate that the proposed method is an effective approach for predicting βαβ motifs and can be used for structure and function studies of proteins.
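For reference, the two summary statistics quoted in this abstract, overall accuracy and the Matthews correlation coefficient, can be computed from a binary confusion matrix as sketched below. The counts in the example are made up for illustration and are not taken from the study; returning 0.0 for a zero denominator is a common convention, assumed here.

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Overall accuracy: fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts, for illustration only.
mcc = matthews_cc(tp=80, tn=85, fp=15, fn=20)
acc = accuracy(tp=80, tn=85, fp=15, fn=20)
```

Unlike accuracy, the coefficient stays near zero for a classifier that ignores its input, even when the two classes are very unequal in size.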

15.
Hu LL, Niu S, Huang T, Wang K, Shi XH, Cai YD 《PLoS ONE》 2010, 5(12): e15917

Background

Hydroxylation is an important post-translational modification and closely related to various diseases. Besides the biotechnology experiments, in silico prediction methods are alternative ways to identify the potential hydroxylation sites.

Methodology/Principal Findings

In this study, we developed a novel sequence-based method for identifying the two main types of hydroxylation sites: hydroxyproline and hydroxylysine. First, feature selection was performed on three kinds of features: amino acid indices (AAindex), which cover various physicochemical and biochemical properties of amino acids; Position-Specific Scoring Matrices (PSSM), which represent the evolutionary information of amino acids; and structural disorder of amino acids in a sliding window of 13 amino acids. The prediction models were then built using the incremental feature selection method. As a result, the prediction accuracies are 76.0% and 82.1%, evaluated by jackknife cross-validation on the hydroxyproline dataset and the hydroxylysine dataset, respectively. Feature analysis suggested that the physicochemical properties, biochemical properties, and evolutionary information of amino acids contribute much to the identification of protein hydroxylation sites, while structural disorder had little relation to protein hydroxylation. It was also found that the amino acid adjacent to the hydroxylation site tends to exert more influence on hydroxylation determination than other sites.

Conclusions/Significance

These findings may provide useful insights for exploiting the mechanisms of hydroxylation.

16.
Gao QB, Wang ZZ, Yan C, Du YH 《FEBS Letters》 2005, 579(16): 3444-3448
To understand the structure and function of a protein, an important task is to know where it occurs in the cell. Thus, a computational method for properly predicting the subcellular location of proteins would be significant in interpreting the original data produced by large-scale genome sequencing projects. The present work explores an effective method for extracting features from the protein primary sequence and a novel measure of similarity among proteins for classifying a protein to its proper subcellular location. We considered four locations in eukaryotic cells and three locations in prokaryotic cells, which have been investigated by several groups in the past. A combined feature of the primary sequence, defined as a 430D (dimensional) vector, was used to represent a protein; it includes 20 amino acid compositions, 400 dipeptide compositions, and 10 physicochemical properties. To evaluate the prediction performance of this encoding scheme, a jackknife test based on the nearest neighbor algorithm was employed. The prediction accuracies for cytoplasmic, extracellular, mitochondrial, and nuclear proteins in the former dataset were 86.3%, 89.2%, 73.5%, and 89.4%, respectively, and the total prediction accuracy reached 86.3%. For cytoplasmic, extracellular, and periplasmic proteins in the latter dataset, the prediction accuracies were 97.4%, 86.0%, and 79.7%, respectively, and the total prediction accuracy was 92.5%. The results indicate that this method outperforms some existing approaches based on amino acid composition alone or on amino acid composition and dipeptide composition.
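A minimal sketch of this kind of encoding and evaluation is given below. It builds only the 20 amino acid compositions and the 400 dipeptide compositions (420 of the 430 dimensions); the 10 physicochemical properties are omitted because the abstract does not list them. The Euclidean distance, the function names, and the toy sequences with their labels are assumptions for illustration, since the abstract does not specify the study's own similarity measure.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def encode(sequence):
    """Return a 420-D vector: 20 amino acid + 400 dipeptide frequencies."""
    n = len(sequence)
    aa_freq = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    # All overlapping windows of length 2.
    pairs = [sequence[i:i + 2] for i in range(n - 1)]
    dp_freq = [pairs.count(dp) / max(len(pairs), 1) for dp in DIPEPTIDES]
    return aa_freq + dp_freq

def nearest_neighbor(query, labeled):
    """Label of the training sequence closest to the query (Euclidean distance)."""
    q = encode(query)
    best = min(labeled, key=lambda item: math.dist(q, encode(item[0])))
    return best[1]

def jackknife_accuracy(labeled):
    """Leave-one-out test: predict each sequence from all the others."""
    hits = 0
    for i, (seq, label) in enumerate(labeled):
        rest = labeled[:i] + labeled[i + 1:]
        hits += nearest_neighbor(seq, rest) == label
    return hits / len(labeled)

# Toy sequences with made-up location labels.
data = [
    ("MKKLLKKLLKKAL", "cytoplasmic"),
    ("MKKALKKLLKKLL", "cytoplasmic"),
    ("MDSSGDSSGDSTS", "extracellular"),
    ("MDSTGDSSGDSSG", "extracellular"),
]
```

Calling `jackknife_accuracy(data)` removes each sequence in turn and classifies it from the remaining ones, which is the same leave-one-out procedure the abstract calls a jackknife test.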

17.
The recognition of protein folds is an important step in the prediction of protein structure and function. Recently, an increasing number of researchers have sought to improve the methods for protein fold recognition. Since Ding and Dubchak constructed a dataset of 27 protein fold classes in 2001, prediction algorithms, feature parameters, and new datasets for protein fold prediction have all improved. In this study, we reorganized a dataset of 76 fold classes constructed by Liu et al. and used the values of the increment of diversity, average chemical shifts of secondary structure elements, and secondary structure motifs as feature parameters in the recognition of multi-class protein folds. With the combined feature vector as the input parameter for the Random Forests algorithm and an ensemble classification strategy, we propose a novel method to identify the 76 protein fold classes. The overall accuracy on the test dataset in an independent test was 66.69%; when the training and test sets were combined, the overall accuracy with 5-fold cross-validation was 73.43%. This method was further used to predict the test dataset and the corresponding structural classification of the first 27-class protein fold dataset, resulting in overall accuracies of 79.66% and 93.40%, respectively. Moreover, when the training and test sets were combined, the accuracy using 5-fold cross-validation was 81.21%. Additionally, this approach gave improved prediction results on the 27-class protein fold dataset constructed by Ding and Dubchak.

18.

Background

Genomic selection is a recently developed technology that is beginning to revolutionize animal breeding. The objective of this study was to estimate marker effects to derive prediction equations for direct genomic values for 16 routinely recorded traits of American Angus beef cattle and quantify corresponding accuracies of prediction.

Methods

Deregressed estimated breeding values were used as observations in a weighted analysis to derive direct genomic values for 3570 sires genotyped using the Illumina BovineSNP50 BeadChip. These bulls were clustered into five groups using K-means clustering on pedigree estimates of additive genetic relationships between animals, with the aim of increasing within-group and decreasing between-group relationships. All five combinations of four groups were used for model training, with cross-validation performed in the group not used in training. Bivariate animal models were used for each trait to estimate the genetic correlation between deregressed estimated breeding values and direct genomic values.

Results

Accuracies of direct genomic values ranged from 0.22 to 0.69 for the studied traits, with an average of 0.44. Predictions were more accurate when animals within the validation group were more closely related to animals in the training set. When training and validation sets were formed by random allocation, the accuracies of direct genomic values ranged from 0.38 to 0.85, with an average of 0.65, reflecting the greater relationship between animals in training and validation. The accuracies of direct genomic values obtained from training on older animals and validating in younger animals were intermediate to the accuracies obtained from K-means clustering and random clustering for most traits. The genetic correlation between deregressed estimated breeding values and direct genomic values ranged from 0.15 to 0.80 for the traits studied.

Conclusions

These results suggest that genomic estimates of genetic merit can be produced in beef cattle at a young age but the recurrent inclusion of genotyped sires in retraining analyses will be necessary to routinely produce for the industry the direct genomic values with the highest accuracy.
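The training and validation scheme described in the Methods of this abstract, training on four groups and validating on the fifth, can be sketched as a leave-one-group-out loop. The animal IDs and group labels below are hypothetical stand-ins for the K-means clusters; how the clusters are obtained and which model is trained are outside this sketch.

```python
def leave_one_group_out(group_of):
    """Yield (held_out_group, training_ids, validation_ids) for each group.

    group_of maps an animal ID to its cluster label.
    """
    groups = sorted(set(group_of.values()))
    for held_out in groups:
        train = [a for a, g in group_of.items() if g != held_out]
        valid = [a for a, g in group_of.items() if g == held_out]
        yield held_out, train, valid

# Hypothetical cluster labels for ten animals in five groups.
group_of = {f"sire{i}": i % 5 for i in range(10)}

splits = list(leave_one_group_out(group_of))
for held_out, train, valid in splits:
    # A model would be trained on `train` and evaluated on `valid` here.
    assert not set(train) & set(valid)
```

Because every animal is validated exactly once and never against a model that saw its own group, the resulting accuracies reflect prediction for animals less related to the training set, which is the contrast with random allocation that the Results describe.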

19.

Key message

Compared with independent validation, cross-validation simultaneously sampling genotypes and environments provided similar estimates of accuracy for genomic selection, but inflated estimates for marker-assisted selection.

Abstract

Estimates of prediction accuracy of marker-assisted (MAS) and genomic selection (GS) require validations. The main goal of our study was to compare the prediction accuracies of MAS and GS validated in an independent sample with results obtained from fivefold cross-validation using genomic and phenotypic data for Fusarium head blight resistance in wheat. In addition, the applicability of the reliability criterion, a concept originally developed in the context of classic animal breeding and GS, was explored for MAS. We observed that prediction accuracies of MAS were overestimated by 127% using cross-validation sampling genotype and environments in contrast to independent validation. In contrast, prediction accuracies of GS determined in independent samples are similar to those estimated with cross-validation sampling genotype and environments. This can be explained by small population differentiation between the training and validation sets in our study. For European wheat breeding, which is so far characterized by a slow temporal dynamic in allele frequencies, this assumption seems to be realistic. Thus, GS models used to improve European wheat populations are expected to possess a long-lasting validity. Since quantitative trait loci information can be exploited more precisely if the predicted genotype is more related to the training population, the reliability criterion is also a valuable tool to judge the level of prediction accuracy of individual genotypes in MAS.

20.
In this paper, support vector machines (SVMs) are applied to predict nucleic-acid-binding proteins. We constructed two classifiers to differentiate DNA/RNA-binding proteins from non-nucleic-acid-binding proteins using a conjoint triad feature, which extracts information directly from the amino acid sequence of a protein. Both self-consistency and jackknife tests show promising results on protein datasets in which sequence identity is less than 25%. In the self-consistency test, the predictive accuracy is 90.37% for DNA-binding proteins and 89.70% for RNA-binding proteins. In the jackknife test, the predictive accuracies are 78.93% and 76.75%, respectively. Comparison shows that our method is very competitive, outperforming other previously published sequence-based prediction methods.
