Similar articles
 Found 20 similar articles (search time: 234 ms)
1.
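Advances in large-scale protein function prediction methods   Total citations: 2, self-citations: 0, citations by others: 2
The rapid development of whole-genome sequencing has yielded vast amounts of sequence information and, with it, an urgent need for functional information; bioinformatics methods for large-scale protein function prediction have developed in response to this need. These prediction methods have progressed from approaches based on sequence homology to approaches based on genomic context, which identify functionally related protein pairs. Genomic-context methods include gene fusion, chromosomal proximity, and similar phylogenetic profiles. Because each method has its own biases, the latest trend is to integrate data from multiple methods into a protein interaction network and to predict protein function by analyzing the structure of that network.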

2.
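Research on protein function is fundamental to unraveling the mysteries of life, and machine learning techniques are already widely applied in this field. Using the support vector machine (SVM) method, a general platform for predicting protein functional sites is built. The platform first extracts non-homologous protein sequences and then encodes features for these sequences (including basic sequence information, physicochemical properties, structural information, and sequence conservation). The encoded samples serve as training data for the SVM, and training yields evaluation metrics such as sensitivity, specificity, the Matthews correlation coefficient, accuracy, and the ROC curve. After repeated testing, the SVM model with the best evaluation metrics can be used to predict functional sites on protein sequences. Beyond predicting protein functional sites, the platform can also be applied to the prediction and analysis of disease-associated single nucleotide polymorphisms (SNPs), protein domain prediction, and the analysis of interactions between biomolecules.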
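The evaluation metrics named in this abstract can be illustrated with a short sketch. The helper below is hypothetical (`binary_metrics` is not taken from the paper): it computes sensitivity, specificity, accuracy, and the Matthews correlation coefficient from the four confusion-matrix counts of a binary site predictor. The counts in the usage line are made-up numbers, not results from the platform.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative counts only.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

Reporting the Matthews correlation coefficient next to accuracy is useful when functional sites are rare, because accuracy alone can look high for a predictor that labels almost every residue as negative.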

3.
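Proteins are the most essential and most versatile macromolecules in living organisms, and understanding their functions is critically important for progress in science and agriculture. With the advance of the post-genomic era, large numbers of protein sequences of unknown structure and function have rapidly appeared in the NCBI databases, and these sequences have quickly become a research focus. Over recent decades, methods for protein function prediction have been steadily refined. From the earliest methods, based solely on protein sequence or 3D structure information, newer methods have emerged that are based on sequence similarity, structural motifs, interaction networks, and so on. These methods adopt new algorithms, research approaches, and techniques, aiming for protein function prediction that is both accurate and general enough to be widely applied. This article reviews recent methods for protein function prediction, classifies and summarizes them, and describes the advantages and disadvantages of each class.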

4.
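Predicting membrane protein types with the random forest method   Total citations: 2, self-citations: 0, citations by others: 2
The type of a membrane protein is closely related to its function, so predicting membrane protein type is an important means of studying function, and predicting the type from the amino acid sequence is of great significance. Based on the amino acid sequence of the protein, this article uses a combination of increment of diversity and pseudo amino acid composition information as prediction parameters, together with a random forest classifier, to predict eight types of membrane proteins. The prediction accuracy was 86.3% under the jackknife test and 93.8% on an independent test, better than previously reported results.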

5.
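The subcellular localization of a protein is closely related to its function, and predicting localization helps in understanding protein function. This article proposes a segmented pseudo amino acid composition feature extraction method and uses the support vector machine algorithm to classify two protein subcellular localization datasets constructed by Chou (C2129 and CS2423), evaluating the performance of the prediction system with the overall classification accuracy Q3, the content-balanced accuracy index Q9, and other measures. The results show that feature extraction based on segmented pseudo amino acid composition outperforms pseudo amino acid composition feature extraction based on the complete protein sequence. For example, with the segmented moment-descriptor pseudo amino acid composition method, Q3 and Q9 on dataset C2129 are 84.7% and 60.8%, respectively, which are 1.8 and 2.2 percentage points higher than with the moment-descriptor pseudo amino acid composition method based on the complete sequence, and Q3 is 9.1 percentage points higher than that of the existing method of Xiao et al. The feature vector built by the segmented method contains not only positional information between residues but also coupling information between protein subsequences; in addition, the segmented subsequences may be related to the functional domains of the protein, which allows the method to predict protein subcellular localization effectively.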

6.
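The sequences, structures, and functions of proteins are highly diverse. Many studies have shown that protein structure is related to the order of its amino acid sequence, and that the local amino acid sequence environment has some influence on protein structure. This article proposes a new method for predicting protein structural type based on the statistical torsion-angle preferences of amino acids in 5-mers; the method predicts structural type from the statistical torsion-angle preferences of the central amino acid of 5-mers in the PDB database. The new method can, through computer simu...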

7.
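After more than two years of work, the Center for Computational Biology at the Institute of Basic Medical Sciences, Academy of Military Medical Sciences, has successfully developed BioSun, a software system that assists the design of molecular biology experiments. Its main functions include: a visual sequence editor; a comprehensive sequence database management system; several modes of sequence comparison; analysis of basic protein properties (MW, pI, …); analysis of protein functional sites and pattern specificity; several modes of antigenic epitope prediction; antigen recognition based on random peptide library experimental data; protein secondary structure prediction; protein and DNA composition analysis based on a given word length; DNA pattern analysis; PCR experiment design assistance; transcription factor binding site prediction; restriction site analysis and restriction map construction; RNA secondary structure prediction based on multiple algorithms; prokaryotic-system exogenous…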

8.
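Objective: To predict the protein sequences of influenza A (H1N1) virus in future years using a long-memory model. Methods: Based on time series analysis, a CGR chaos-game walk sequence is first constructed and a model is then fitted. For 70 selected influenza virus protein sequences from 1943 to 2012 with relatively high homology, the chaos-game walk is applied and an ARFIMA(p,d,q) model is then used to fit and predict the first 10 positions. Results: Nearly all position values of the original protein sequences fell within the prediction region (with very few exceptions), indicating that the chosen model is reasonable. Conclusion: The model can be used to predict influenza virus protein sequences in future years and is of significant research value for the prediction and prevention of influenza virus.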

9.
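Protein structural class prediction is an important research area in bioinformatics and protein science. Based on the pseudo amino acid discrete model framework proposed by Chou and starting from the protein sequence, a new pseudo amino acid composition method is designed to represent protein sequence samples. The frequencies of amino acid combinations (10-D) in the sequence and hydrophobic amino acid patterns (6-D) are extracted as additional features and, together with the conventional amino acid composition (20-D), form a 36-dimensional pseudo amino acid composition vector that represents the features of a protein sequence. A genetic algorithm is used to optimize the weight coefficients of the additional features. The pseudo amino acid composition vector serves as the input, and a fuzzy support vector machine serves as the predictor. Three commonly used benchmark datasets are used to validate the performance of the algorithm. Jackknife test results show that the method achieves high accuracy and may become a useful tool for predicting protein function.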

10.
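Based on protein sequence composition information, a model that combines increment of diversity with quadratic discriminant analysis (IDQD) is proposed for predicting protein interactions and is applied to human protein interactions. The recognition accuracy in the self-consistency test reaches 75.89%, and the sensitivity and specificity in 3-fold cross-validation are 64.22% and 64.68%, respectively. The results show that the IDQD algorithm can be used to predict protein interactions.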

11.
Yu C, Zavaljevski N, Desai V, Reifman J. Proteins, 2009, 74(2): 449-460
In this article, we present a new method termed CatFam (Catalytic Families) to automatically infer the functions of catalytic proteins, which account for 20-40% of all proteins in living organisms and play a critical role in a variety of biological processes. CatFam is a sequence-based method that generates sequence profiles to represent and infer protein catalytic functions. CatFam generates profiles through a stepwise procedure that carefully controls profile quality and employs nonenzymes as negative samples to establish profile-specific thresholds associated with a predefined nominal false-positive rate (FPR) of predictions. The adjustable FPR allows for fine precision control of each profile and enables the generation of profile databases that meet different needs: function annotation with high precision and hypothesis generation with moderate precision but better recall. Multiple tests of CatFam databases (generated with distinct nominal FPRs) against enzyme and nonenzyme datasets show that the method's predictions have consistently high precision and recall. For example, a 1% FPR database predicts protein catalytic functions for a dataset of enzymes and nonenzymes with 98.6% precision and 95.0% recall. Comparisons of CatFam databases against other established profile-based methods for the functional annotation of 13 bacterial genomes indicate that CatFam consistently achieves higher precision and (in most cases) higher recall, and that (on average) CatFam provides 21.9% additional catalytic functions not inferred by the other similarly reliable methods. These results strongly suggest that the proposed method provides a valuable contribution to the automated prediction of protein catalytic functions. The CatFam databases and the database search program are freely available at http://www.bhsai.org/downloads/catfam.tar.gz.
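The idea of using negative samples to set a per-profile threshold at a nominal false-positive rate can be sketched as follows. This is only an illustration under stated assumptions, not CatFam's actual procedure: the function names, the "higher score means better match" convention, and the example scores are all hypothetical.

```python
import math


def threshold_for_fpr(negative_scores, nominal_fpr):
    """Pick a score cutoff from negative (nonenzyme) scores.

    A sequence is called a hit only if its score is strictly greater than
    the cutoff, so at most floor(nominal_fpr * N) of the N negatives pass.
    Assumes higher scores indicate a better profile match.
    """
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives allowed to score above the cutoff; the small
    # epsilon guards against floating-point error in the product.
    allowed = math.floor(nominal_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every negative may pass
    return ranked[allowed]


def empirical_fpr(negative_scores, cutoff):
    """Fraction of negatives that would be called hits at this cutoff."""
    hits = sum(1 for s in negative_scores if s > cutoff)
    return hits / len(negative_scores)


if __name__ == "__main__":
    # Made-up profile scores for ten nonenzyme sequences.
    negatives = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    cutoff = threshold_for_fpr(negatives, nominal_fpr=0.2)
    print(cutoff, empirical_fpr(negatives, cutoff))
```

Lowering the nominal FPR raises the cutoff, trading recall for precision, which matches the abstract's distinction between high-precision annotation and higher-recall hypothesis generation.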

12.
PIER: protein interface recognition for structural proteomics   Total citations: 1, self-citations: 0, citations by others: 1
Recent advances in structural proteomics call for development of fast and reliable automatic methods for prediction of functional surfaces of proteins with known three-dimensional structure, including binding sites for known and unknown protein partners as well as oligomerization interfaces. Despite significant progress the problem is still far from being solved. Most existing methods rely, at least partially, on evolutionary information from multiple sequence alignments projected on protein surface. The common drawback of such methods is their limited applicability to the proteins with a sparse set of sequential homologs, as well as inability to detect interfaces in evolutionary variable regions. In this study, the authors developed an improved method for predicting interfaces from a single protein structure, which is based on local statistical properties of the protein surface derived at the level of atomic groups. The proposed Protein IntErface Recognition (PIER) method achieved the overall precision of 60% at the recall threshold of 50% at the residue level on a diverse benchmark of 490 homodimeric, 62 heterodimeric, and 196 transient interfaces (compared with 25% precision at 50% recall expected from random residue function assignment). For 70% of proteins in the benchmark, the binding patch residues were successfully detected with precision exceeding 50% at 50% recall. The calculation only took seconds for an average 300-residue protein. The authors demonstrated that adding the evolutionary conservation signal only marginally influenced the overall prediction performance on the benchmark; moreover, for certain classes of proteins, using this signal actually resulted in a deteriorated prediction. Thorough benchmarking using other datasets from literature showed that PIER yielded improved performance as compared with several alignment-free or alignment-dependent predictions. 
The accuracy, efficiency, and dependence on structure alone make PIER a suitable tool for automated high-throughput annotation of protein structures emerging from structural proteomics projects.

13.
Predicting new protein-protein interactions is important for discovering novel functions of various biological pathways. Predicting these interactions is a crucial and challenging task. Moreover, discovering new protein-protein interactions through biological experiments is still difficult. Therefore, it is increasingly important to discover new protein interactions computationally. Many studies have predicted protein-protein interactions using biological features such as Gene Ontology (GO) functional annotations and structural domains of two proteins. In this paper, we propose an augmented transitive relationships predictor (ATRP), a new method of predicting potential protein interactions using transitive relationships and annotations of protein interactions. In addition, a distillation of virtual direct protein-protein interactions is proposed to deal with the unbalanced distribution of different types of interactions in the existing protein-protein interaction databases. Our results demonstrate that ATRP can effectively predict protein-protein interactions. ATRP achieves, on average, 81% precision, 74% recall and a 77% F-measure in the prediction of direct protein-protein interactions. Using the benchmark datasets generated from KUPS to evaluate all types of protein-protein interactions, ATRP achieved, on average, 93% precision, 49% recall and a 64% F-measure. This article is part of a Special Issue entitled: Computational Methods for Protein Interaction and Structural Prediction.

14.
Lin HN, Wu KP, Chang JM, Sung TY, Hsu WL. Nucleic Acids Research, 2005, 33(14): 4593-4601
NMR data from different experiments often contain errors; thus, automated backbone resonance assignment is a very challenging issue. In this paper, we present a method called GANA that uses a genetic algorithm to automatically perform backbone resonance assignment with a high degree of precision and recall. Precision is the number of correctly assigned residues divided by the number of assigned residues, and recall is the number of correctly assigned residues divided by the number of residues with known human curated answers. GANA takes spin systems as input data and uses two data structures, candidate lists and adjacency lists, to assign the spin systems to each amino acid of a target protein. Using GANA, almost all spin systems can be mapped correctly onto a target protein, even if the data are noisy. We use the BioMagResBank (BMRB) dataset (901 proteins) to test the performance of GANA. To evaluate the robustness of GANA, we generate four additional datasets from the BMRB dataset to simulate data errors of false positives, false negatives and linking errors. We also use a combination of these three error types to examine the fault tolerance of our method. The average precision rates of GANA on BMRB and the four simulated test cases are 99.61, 99.55, 99.34, 99.35 and 98.60%, respectively. The average recall rates of GANA on BMRB and the four simulated test cases are 99.26, 99.19, 98.85, 98.87 and 97.78%, respectively. We also test GANA on two real wet-lab datasets, hbSBD and hbLBD. The precision and recall rates of GANA on hbSBD are 95.12 and 92.86%, respectively, and those of hbLBD are 100 and 97.40%, respectively.
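The two definitions given in this abstract translate directly into code. The sketch below assumes a hypothetical data layout that is not from the paper: an assignment is a dictionary from residue index to spin-system identifier, containing only the residues the method chose to assign, and the curated reference is a dictionary of the same shape.

```python
def assignment_precision_recall(predicted, reference):
    """Precision and recall of a backbone resonance assignment.

    predicted: dict residue -> spin-system id, for assigned residues only.
    reference: dict residue -> curated spin-system id (known answers).
    Precision = correct / assigned; recall = correct / curated.
    """
    correct = sum(
        1
        for residue, spin_system in predicted.items()
        if residue in reference and reference[residue] == spin_system
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example: residue 3 is assigned wrongly, residue 4 is left out.
    predicted = {1: "ss_a", 2: "ss_b", 3: "ss_x"}
    reference = {1: "ss_a", 2: "ss_b", 3: "ss_c", 4: "ss_d"}
    print(assignment_precision_recall(predicted, reference))
```

Because the denominators differ, a method that leaves uncertain residues unassigned can raise precision at the cost of recall, which is why the abstract reports both figures.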

15.
Summary: We have systematically examined how the quality of NMR protein structures depends on (1) the number of NOE distance constraints, (2) their assumed precision, (3) the method of structure calculation and (4) the size of the protein. The test sets of distance constraints have been derived from the crystal structures of crambin (5 kDa) and staphylococcal nuclease (17 kDa). Three methods of structure calculation have been compared: Distance Geometry (DGEOM), Restrained Molecular Dynamics (XPLOR) and the Double Iterated Kalman Filter (DIKF). All three methods can reproduce the general features of the starting structure under all conditions tested. In many instances the apparent precision of the calculated structure (as measured by the RMS dispersion from the average) is greater than its accuracy (as measured by the RMS deviation of the average structure from the starting crystal structure). The global RMS deviations from the reference structures decrease exponentially as the number of constraints is increased, and after using about 30% of all potential constraints, the errors asymptotically approach a limiting value. Increasing the assumed precision of the constraints has the same qualitative effect as increasing the number of constraints. For comparable numbers of constraints/residue, the precision of the calculated structure is less for the larger than for the smaller protein, regardless of the method of calculation. The accuracy of the average structure calculated by Restrained Molecular Dynamics is greater than that of structures obtained by purely geometric methods (DGEOM and DIKF).
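The distinction between precision and accuracy drawn in this abstract can be made concrete with a small sketch. It rests on assumptions that are not in the source: structures are lists of (x, y, z) tuples that have already been superimposed, and "RMS dispersion from the average" is read as the mean RMSD of each ensemble member to the average structure, which is one possible convention. All function names are hypothetical.

```python
import math


def rmsd(a, b):
    """Root-mean-square distance between two equal-length coordinate lists."""
    squared = sum(
        (p - q) ** 2 for atom_a, atom_b in zip(a, b) for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(a))


def mean_structure(ensemble):
    """Atom-wise average of pre-superimposed structures."""
    n = len(ensemble)
    return [
        tuple(sum(structure[i][k] for structure in ensemble) / n for k in range(3))
        for i in range(len(ensemble[0]))
    ]


def precision_and_accuracy(ensemble, reference):
    """Return (dispersion about the average, RMSD of the average to the reference)."""
    average = mean_structure(ensemble)
    precision = sum(rmsd(s, average) for s in ensemble) / len(ensemble)
    accuracy = rmsd(average, reference)
    return precision, accuracy


if __name__ == "__main__":
    # Two identical models, both shifted 1 unit from a made-up reference:
    # zero dispersion (very "precise") but a nonzero error (not accurate).
    model = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(precision_and_accuracy([model, model], reference))
```

The toy ensemble shows how a tightly clustered set of models can still sit away from the reference, the situation the abstract describes as apparent precision exceeding accuracy.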

16.
Li X, Jacobson MP, Friesner RA. Proteins, 2004, 55(2): 368-382
We have developed a new method for predicting helix positions in globular proteins that is intended primarily for comparative modeling and other applications where high precision is required. Unlike helix packing algorithms designed for ab initio folding, we assume that knowledge is available about the qualitative placement of all helices. However, even among homologous proteins, the corresponding helices can demonstrate substantial differences in positions and orientations, and for this reason, improperly positioned helices can contribute significantly to the overall backbone root-mean-square deviation (RMSD) of comparative models. A helix packing algorithm for use in comparative modeling must obtain high precision to be useful, and for this reason we utilize an all-atom protein force field (OPLS) and a Generalized Born continuum solvent model. To reduce the computational expense associated with using a detailed, physics-based energy function, we have developed new hierarchical and multiscale algorithms for sampling the helices and flanking loops. We validate the method using a test suite of 33 cases, which are drawn from a diverse set of high-resolution crystal structures. The helix positions are reproduced with an average backbone RMSD of 0.6 Å, while the average backbone RMSD of the complete loop-helix-loop region (i.e., the helix with the surrounding loops, which are also repredicted) is 1.3 Å.

17.
Theoretical microscopic titration curves (THEMATICS) is a computational method for the identification of active sites in proteins through deviations in computed titration behavior of ionizable residues. While the sensitivity to catalytic sites is high, the previously reported sensitivity to catalytic residues was not as high, about 50%. Here THEMATICS is combined with support vector machines (SVM) to improve sensitivity for catalytic residue prediction from protein 3D structure alone. For a test set of 64 proteins taken from the Catalytic Site Atlas (CSA), the average recall rate for annotated catalytic residues is 61%; good precision is maintained, with only 4% of all residues selected. The average false positive rate, using the CSA annotations, is only 3.2%, far lower than other 3D-structure-based methods. THEMATICS-SVM returns higher precision, lower false positive rate, and better overall performance, compared with other 3D-structure-based methods. Comparison is also made with the latest machine learning methods that are based on both sequence alignments and 3D structures. For annotated sets of well-characterized enzymes, THEMATICS-SVM performance compares very favorably with methods that utilize sequence homology. However, since THEMATICS depends only on the 3D structure of the query protein, no decline in performance is expected when applied to novel folds, proteins with few sequence homologues, or even orphan sequences. An extension of the method to predict non-ionizable catalytic residues is also presented. THEMATICS-SVM predicts a local network of ionizable residues with strong interactions between protonation events; this appears to be a special feature of enzyme active sites.

18.
The structures of DNA-protein complexes have illuminated the diversity of DNA-protein binding mechanisms shown by different protein families. This lack of generality could pose a great challenge for predicting DNA-protein interactions. To address this issue, we have developed a knowledge-based method, DNA-binding Domain Hunter (DBD-Hunter), for identifying DNA-binding proteins and associated binding sites. The method combines structural comparison and the evaluation of a statistical potential, which we derive to describe interactions between DNA base pairs and protein residues. We demonstrate that DBD-Hunter is an accurate method for predicting DNA-binding function of proteins, and that DNA-binding protein residues can be reliably inferred from the corresponding templates if identified. In benchmark tests on approximately 4000 proteins, our method achieved an accuracy of 98% and a precision of 84%, which significantly outperforms three previous methods. We further validate the method on DNA-binding protein structures determined in DNA-free (apo) state. We show that the accuracy of our method is only slightly affected on apo-structures compared to the performance on holo-structures cocrystallized with DNA. Finally, we apply the method to approximately 1700 structural genomics targets and predict that 37 targets with previously unknown function are likely to be DNA-binding proteins. DBD-Hunter is freely available at http://cssb.biology.gatech.edu/skolnick/webservice/DBD-Hunter/.

19.
Automated function prediction (AFP) methods increasingly use knowledge discovery algorithms to map sequence, structure, literature, and/or pathway information about proteins whose functions are unknown into functional ontologies, typically (a portion of) the Gene Ontology (GO). While there are a growing number of methods within this paradigm, the general problem of assessing the accuracy of such prediction algorithms has not been seriously addressed. We present first an application for function prediction from protein sequences using the POSet Ontology Categorizer (POSOC) to produce new annotations by analyzing collections of GO nodes derived from annotations of protein BLAST neighborhoods. We then also present hierarchical precision and hierarchical recall as new evaluation metrics for assessing the accuracy of any predictions in hierarchical ontologies, and discuss results on a test set of protein sequences. We show that our method provides substantially improved hierarchical precision (measure of predictions made that are correct) when applied to the nearest BLAST neighbors of target proteins, as compared with simply imputing that neighborhood's annotations to the target. Moreover, when our method is applied to a broader BLAST neighborhood, hierarchical precision is enhanced even further. In all cases, such increased hierarchical precision performance is purchased at a modest expense of hierarchical recall (measure of all annotations that get predicted at all).

20.
The increasing number and diversity of protein sequence families requires new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparison of sub-type specific sequence profiles, and analysis of positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict sub-type for unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues to the sequence for which a prediction is to be made (by a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above, and a sequence identity threshold of 30 %, our best method gives an accuracy of 96 % compared to 80 % obtained for sequence similarity and 74 % for BLAST. We describe the derivation of a set of sub-type groupings derived from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this to perform a large-scale assessment. The best method gives an average accuracy of 94 % compared to 68 % for sequence similarity and 79 % for BLAST. 
We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.
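The notion of positional relative entropy between sub-type profiles can be sketched in a few lines. The abstract does not give the exact formula, so this is an assumption-laden illustration: each alignment column is summarized by amino acid frequencies with a uniform pseudocount, and the score is the Kullback-Leibler divergence of one sub-type's column distribution from that of the remaining sequences. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(sequences, pos, pseudocount=1.0):
    """Amino acid frequencies at one alignment column, with pseudocounts.

    Gap and non-standard characters are ignored. The pseudocount keeps
    every frequency positive so the log ratio below is always defined.
    """
    counts = Counter(seq[pos] for seq in sequences if seq[pos] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(subtype_seqs, other_seqs, pos):
    """KL divergence (in bits) of the sub-type column from the other sequences."""
    p = column_frequencies(subtype_seqs, pos)
    q = column_frequencies(other_seqs, pos)
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


if __name__ == "__main__":
    # Toy aligned sequences: column 0 separates the groups, column 1 does not.
    subtype = ["AK", "AK", "AR"]
    others = ["DK", "DK", "DR"]
    for pos in range(2):
        print(pos, relative_entropy(subtype, others, pos))
```

Columns where the score is high are candidates for positions that determine sub-type specificity; columns with identical distributions in both groups score zero.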
