Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
Protein remote homology detection is one of the most important problems in bioinformatics. Discriminative methods such as support vector machines (SVMs) have shown superior performance. However, the performance of SVM-based methods depends on the vector representations of the protein sequences. Prior work has demonstrated that sequence-order effects are relevant for discrimination, but little work has explored how to incorporate sequence-order information together with amino acid physicochemical properties into the prediction. To incorporate sequence-order effects into protein remote homology detection, the physicochemical distance transformation (PDT) method is proposed. Each protein sequence is converted into a series of numbers using the physicochemical property scores in the amino acid index (AAIndex), and the resulting series is then converted into a fixed-length vector by PDT. This approach includes sequence-order information in the feature vector at little computational cost. Finally, the feature vectors are input into a support vector machine classifier to detect protein remote homologies. Experiments on a well-known benchmark show that the proposed method, SVM-PDT, achieves performance superior or comparable to that of current state-of-the-art methods, and its computational cost is considerably lower than that of other methods. When the evolutionary information extracted from frequency profiles is combined with the PDT method, the profile-based PDT approach improves performance by 3.4% and 11.4% in terms of ROC score and ROC50 score, respectively. The proposed PDT efficiently captures the local sequence-order information of the protein and incorporates the physicochemical properties extracted from the amino acid index into the prediction. The physicochemical distance transformation provides a general framework that would be a valuable tool for protein-level studies.
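The abstract above describes turning a sequence into property scores and then into a fixed-length vector. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that the PDT value for a distance d is the mean squared difference between the normalized scores of residues d positions apart, and it uses the Kyte-Doolittle hydropathy scale as a stand-in for a single AAIndex property. The function names `normalize` and `pdt_features` are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here as a stand-in for one
# AAIndex property (an assumption; the abstract does not name the indices).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = list(scale.values())
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / sd for aa, v in scale.items()}


def pdt_features(seq, scale, max_distance):
    """Return one feature per distance d = 1..max_distance.

    Each feature is the mean squared difference between the property
    scores of residues that are d positions apart (assumed definition).
    """
    if len(seq) <= max_distance:
        raise ValueError("sequence must be longer than max_distance")
    scores = [scale[aa] for aa in seq]
    features = []
    for d in range(1, max_distance + 1):
        pairs = len(scores) - d
        total = sum((scores[i] - scores[i + d]) ** 2 for i in range(pairs))
        features.append(total / pairs)
    return features


if __name__ == "__main__":
    scale = normalize(HYDROPATHY)
    # Sequences of different lengths map to vectors of the same length.
    print(pdt_features("MKTAYIAKQR", scale, 3))
    print(pdt_features("MKTAYIAKQRQISFVKSHFSRQ", scale, 3))
```

With several property indices, the per-index vectors would simply be concatenated, so the vector length depends only on the number of indices and on `max_distance`, not on the sequence length.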

2.
3.
Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method, designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyze sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organization into functional subgroups and residues that characterize the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction. 
ProfileView outperforms the functional classification approach PANTHER, the two k-mer-based methods CUPP and eCAMI, and a neural network approach based on Restricted Boltzmann Machines. It also overcomes the time-complexity limitations of the latter.

4.
Experimental protein-protein interaction (PPI) networks are increasingly being exploited in diverse ways for biological discovery. Accordingly, it is vital to discern their underlying natures by identifying and classifying the various types of deterministic (specific) and probabilistic (nonspecific) interactions detected. To this end, we have analyzed PPI networks determined using a range of high-throughput experimental techniques with the aim of systematically quantifying any biases that arise from the varying cellular abundances of the proteins. We confirm that PPI networks determined using affinity purification methods for yeast and Escherichia coli incorporate a correlation between protein degree, or number of interactions, and cellular abundance. The observed correlations are small but statistically significant and occur in both unprocessed (raw) and processed (high-confidence) data sets. In contrast, the yeast two-hybrid system yields networks that contain no such relationship. Although this difference has previously been noted on the basis of mRNA abundance, our more extensive analysis based on protein abundance confirms a systematic difference between PPI networks determined from the two technologies. We additionally demonstrate that the centrality-lethality rule, which implies that higher-degree proteins are more likely to be essential, may be misleading, as protein abundance measurements identify essential proteins to be more prevalent than nonessential proteins. In fact, we generally find that when there is a degree/abundance correlation, the degree distributions of nonessential and essential proteins are also disparate. Conversely, when there is no degree/abundance correlation, the degree distributions of nonessential and essential proteins do not differ. However, we show that essentiality manifests itself as a biological property in all of the yeast PPI networks investigated here via enrichments of interactions between essential proteins. 
These findings provide valuable insights into the underlying natures of the various high-throughput technologies used to detect PPIs and should lead to more effective strategies for the inference and analysis of high-quality PPI data sets.

5.

Background

Despite continuing progress in X-ray crystallography and high-field NMR spectroscopy for determining three-dimensional protein structures, the number of unsolved and newly discovered sequences grows much faster than the number of determined structures. With advances in computational science, protein modeling methods may bridge this large sequence-structure gap. A grand challenge is to predict the three-dimensional structure of a protein from its primary structure (residue sequence) alone. Predicting residue contact maps is a crucial and promising intermediate step towards that goal. Better predictions of local and non-local contacts between residues can turn protein sequence alignment into structure alignment, which in turn can greatly improve template-based three-dimensional protein structure predictors.

Methods

CNNcon, an improved contact map predictor based on multiple neural networks, with six sub-networks and one final cascade network, was developed in this paper. Both the sub-networks and the final cascade network were trained and tested on their corresponding data sets. For testing, the target protein was first encoded and then input to its corresponding sub-networks for prediction. The intermediate results were then input to the cascade network to produce the final prediction.

Results

CNNcon correctly predicts, on average, 58.86% of contacts at a distance cutoff of 8 Å for proteins with lengths ranging from 51 to 450 residues. Comparison shows that the method performs better than the state-of-the-art predictors it was compared with. In particular, prediction accuracy remains steady as protein sequence length increases, which indicates that CNNcon overcomes the thin density problem that troubles other current predictors. This advantage makes the method valuable for the prediction of long proteins.
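To make the 8 Å cutoff in the results above concrete, the sketch below builds a reference contact map from residue coordinates and scores a list of predicted contacts against it. It is an illustration under stated assumptions, not CNNcon itself: one representative atom per residue, a contact defined as a distance at or below the cutoff, and accuracy taken as the fraction of predicted pairs that are true contacts. The names `contact_map` and `fraction_correct` are hypothetical.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return a symmetric boolean matrix of residue-residue contacts.

    coords holds one (x, y, z) tuple per residue, in angstroms.
    Pairs closer in sequence than min_separation are never marked.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = True
                contacts[j][i] = True
    return contacts


def fraction_correct(predicted_pairs, true_map):
    """Fraction of predicted (i, j) pairs that are contacts in true_map."""
    if not predicted_pairs:
        return 0.0
    hits = sum(1 for i, j in predicted_pairs if true_map[i][j])
    return hits / len(predicted_pairs)


if __name__ == "__main__":
    # Four residues spaced 3.8 angstroms apart along a straight line.
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    true_map = contact_map(coords)
    print(fraction_correct([(0, 1), (0, 2), (0, 3)], true_map))
```

The quadratic loop is fine at the lengths quoted in the abstract (up to 450); a larger study would more likely use a vectorized distance computation.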

6.
Protein functional annotation relies on the identification of accurate relationships, with sequence divergence being a key factor. This is especially evident when distant protein relationships can be demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach that purposefully bridges gaps between related protein families through the directed design of protein-like “linker” sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another additionally containing designed sequences. Searches of the augmented database showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum needed to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be “plugged into” routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our statistically well-supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.
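The roulette wheel-based design step mentioned above can be illustrated with a generic fitness-proportionate sampler. This is a sketch only: the abstract does not say how the wheel was weighted, so the code assumes that each alignment column supplies residue probabilities and that one residue is drawn per column. The profile layout and the names `roulette_pick` and `design_sequence` are hypothetical.

```python
import random


def roulette_pick(weights, rng):
    """Pick one key with probability proportional to its weight."""
    total = sum(w for w in weights.values() if w > 0)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    spin = rng.random() * total  # position of the pointer on the wheel
    cumulative = 0.0
    chosen = None
    for item, weight in weights.items():
        if weight <= 0:
            continue  # zero-weight residues occupy no slice of the wheel
        cumulative += weight
        chosen = item
        if spin < cumulative:
            break
    return chosen


def design_sequence(profile, rng):
    """Draw one residue per profile column to build a protein-like sequence."""
    return "".join(roulette_pick(column, rng) for column in profile)


if __name__ == "__main__":
    # Toy three-column profile with made-up probabilities.
    profile = [
        {"M": 1.0},
        {"K": 0.6, "R": 0.4},
        {"L": 0.5, "I": 0.3, "V": 0.2},
    ]
    rng = random.Random(0)
    print([design_sequence(profile, rng) for _ in range(5)])
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which matters when millions of designed sequences have to be regenerated or audited.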

7.
To improve the precision of mixed major gene plus polygene genetic analysis and to reduce experimental error, family means of B1:2 and B2:2, or F2:3, families for low-heritability traits, grown in a randomized block design with grouping within replications, were used for genetic analysis, with the error variance estimated from the analysis of variance. The AIC criterion and a set of goodness-of-fit tests were used to compare the no-major-gene model (A-0) with five genetic models, namely one major gene (A), two major genes (B), polygene (C), one major gene plus polygene (D), and two major genes plus polygene (E), in order to identify the mode of inheritance. The iterated ECM (IECM) algorithm was used to obtain maximum likelihood estimates of the mixed-model parameters. The procedure is illustrated with F2:3 family means for days from planting to first flowering in the rapeseed cross HSTC14 × Ningyou 7 (宁油7号).
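The model comparison by AIC described in this abstract can be sketched as follows, using the standard definition AIC = -2 ln L + 2k, where L is the maximized likelihood and k is the number of estimated parameters. The log-likelihoods and parameter counts below are invented purely for illustration and are not results from the paper; the function names are hypothetical. In the described procedure the lowest-AIC candidates would still have to pass the goodness-of-fit tests.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 ln L + 2k (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def rank_models(fits):
    """Sort model names by increasing AIC.

    fits maps a model name to (maximized log-likelihood, parameter count).
    """
    return sorted(fits, key=lambda name: aic(*fits[name]))


if __name__ == "__main__":
    # Made-up values for the six model classes named in the abstract.
    fits = {
        "A-0": (-520.0, 2),
        "A": (-505.0, 4),
        "B": (-504.2, 6),
        "C": (-512.0, 3),
        "D": (-503.5, 6),
        "E": (-503.0, 9),
    }
    for name in rank_models(fits):
        print(name, aic(*fits[name]))
```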

8.
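Zhang Yuanming (章元明), Gai Junyi (盖钧镒), Qi Cunkou (戚存扣). 《遗传》, 2001, 23(4): 329-332
To improve the precision of mixed major gene plus polygene genetic analysis and to reduce experimental error, a randomized block design with grouping within replications was adopted, and family means of B1:2 and B2:2, or F2:3, families for low-heritability traits were analyzed genetically. The AIC criterion and goodness-of-fit tests were used to compare the no-major-gene (A-0), one-major-gene (A), two-major-gene (B), polygene (C), one-major-gene plus polygene (D), and two-major-gene plus polygene (E) models in order to identify the mode of inheritance. The IECM algorithm was used to estimate the parameters of the mixed model. The method is illustrated with F2:3 family means for the first-flowering stage of the rapeseed cross HSTC14 × Ningyou 7 (宁油7号).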

9.
10.
With the development of bioinformatics, more and more protein sequence information has become available, while the number of known protein–protein interactions (PPIs) remains very limited. In this article, we propose a new method for predicting interacting protein pairs using a Bayesian method based on a new feature representation. We trained our model on 6,459 PPI pairs from the yeast Saccharomyces cerevisiae core subset. On six species from the DIP database, our model achieves an average prediction accuracy of 93.67%. The results show that our method is superior to other methods in both computing time and prediction accuracy.

11.
The functional annotation of new protein sequences represents a major bottleneck for genomic science. The best way to suggest the function of a protein from its sequence is to find a related one for which biological information is available. Current alignment algorithms display a list of protein sequence stretches that show significant similarity to different protein targets, ordered by their respective mathematical scores. However, statistical and biological significance do not always coincide; reordering the program output according to characteristics that are more biological than the mathematical score would therefore help functional annotation. A new method is described that predicts the putative function of a protein by integrating the results of the PSI-BLAST program with a fuzzy logic algorithm. Several protein sequence characteristics were tested for their ability to reorder a PSI-BLAST profile in closer agreement with biological function. Four of them (amino acid content, matched segment length, and hydropathic and flexibility profiles) contributed positively to the accurate prediction of the function of a protein from its sequence when integrated by a fuzzy logic algorithm into a program, BYPASS. Antonio Gómez and Juan Cedano contributed equally to this work.

12.
The metabolic cycle of Saccharomyces cerevisiae consists of alternating oxidative (respiration) and reductive (glycolysis) energy-yielding reactions. The intracellular concentrations of amino acid precursors generated by these reactions oscillate accordingly, attaining maximal concentration during the middle of their respective yeast metabolic cycle phases. Typically, the amino acids themselves are most abundant at the end of their precursor’s phase. We show that this metabolic cycling has likely biased the amino acid composition of proteins across the S. cerevisiae genome. In particular, we observed that the metabolic source of amino acids is the single most important source of variation in the amino acid compositions of functionally related proteins and that this signal appears only in (facultative) organisms using both oxidative and reductive metabolism. Periodically expressed proteins are enriched for amino acids generated in the preceding phase of the metabolic cycle. Proteins expressed during the oxidative phase contain more glycolysis-derived amino acids, whereas proteins expressed during the reductive phase contain more respiration-derived amino acids. Rare amino acids (e.g., tryptophan) are greatly overrepresented or underrepresented, relative to the proteomic average, in periodically expressed proteins, whereas common amino acids vary by a few percent. Genome-wide, we infer that 20,000 to 60,000 residues have been modified by this previously unappreciated pressure. This trend is strongest in ancient proteins, suggesting that oscillating endogenous amino acid availability exerted genome-wide selective pressure on protein sequences across evolutionary time. Electronic supplementary material  The online version of this article (doi:) contains supplementary material, which is available to authorized users. Benjamin L. de Bivort and Ethan O. Perlstein have contributed equally to this work.

13.
The dramatic increase in heterogeneous types of biological data, in particular the abundance of new protein sequences, requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity (GPCRs and kinases from humans, and the crotonase superfamily of enzymes), we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.

14.
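Polymerase chain reaction (PCR) was used to screen patients from three Duchenne muscular dystrophy (DMD) families for deletions of nine exons of the dystrophin gene; deletions of exons 45, 48, and 51 were detected in two of the families. PCR was also used to amplify intronic short tandem repeats (STRs) located within the dystrophin gene in order to perform prenatal diagnosis in the non-deletion DMD family; the fetus was a normal female. Exon-deletion detection for the dystrophin gene is rapid, sensitive, and accurate and is suitable for wider clinical use; STR polymorphism analysis can be used for prenatal genetic diagnosis and carrier detection in DMD families.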

15.
16.
Given sufficiently large protein families, and using a global statistical inference approach, it is possible to predict protein residue contacts accurately enough to predict the structure of many proteins. However, these approaches do not consider that the contacts in a protein are neither randomly nor independently distributed, but follow precise rules governed by the structure of the protein and are thus interdependent. Here, we present PconsC2, a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions. A substantial improvement is seen for all contacts, independent of the number of aligned sequences, residue separation, or secondary structure type, but it is largest for β-sheet-containing proteins. In addition to being superior to earlier methods based on statistical inference, PconsC2 is superior to state-of-the-art machine learning methods for families with more than 100 effective sequence homologs. The improved contact prediction enables improved structure prediction.

17.
We undertook this project in response to the rapidly increasing number of protein structures with unknown functions in the Protein Data Bank. Here, we combined a genetic algorithm with a support vector machine to predict protein–protein binding sites. In an experiment on a testing dataset, we predicted the binding sites for 66% of our datasets, made up of 50 testing hetero-complexes. This classifier achieved greater sensitivity (60.17%), specificity (58.17%), accuracy (64.08%), and F-measure (54.79%), and a higher correlation coefficient (0.2502) than those of the support vector machine. This result can be used to guide biologists in designing specific experiments for protein analysis.
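The evaluation measures listed above can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions and assumes that the "correlation coefficient" is the Matthews correlation coefficient, which the abstract does not state explicitly. The counts in the example are invented and the function name `binary_metrics` is hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common per-residue metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)  # recall on true binding residues
    specificity = _ratio(tn, tn + fp)  # recall on non-binding residues
    precision = _ratio(tp, tp + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    f_measure = _ratio(2 * precision * sensitivity, precision + sensitivity)
    # Matthews correlation coefficient (assumed meaning of "correlation coefficient").
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, mcc_denominator)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    for name, value in binary_metrics(tp=6, fp=4, tn=6, fn=4).items():
        print(f"{name}: {value:.4f}")
```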

18.
19.
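Ubiquitination is a post-translational modification that currently attracts wide attention and plays an important regulatory role in many cellular processes, such as protein degradation and DNA repair. Drawing on domestic and international research on the prediction of protein ubiquitination sites, this paper analyzes the features used to predict ubiquitination sites, summarizes the feature selection methods used to optimize those features, and gives an overview of the machine learning classifiers used in the prediction process.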

20.
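Detection of Staphylococcus aureus with a quantum dot fluorescent probe based on multiple sequence alignment   Total citations: 1; self-citations: 0; citations by others: 1
Fluorescence resonance energy transfer (FRET) systems that use a quantum dot (QD) as the donor and an organic fluorescent dye as the acceptor are a very important means of detecting macromolecules such as nucleic acids. This paper establishes a new method for detecting the species-specific 16S rDNA of Staphylococcus aureus. In this method, a carboxyl-modified 525 nm quantum dot is coupled to amino-modified DNA by dehydration in the presence of EDC to form a QD-DNA conjugate that serves as the FRET donor, and DNA modified with the organic fluorophore ROX serves as the FRET acceptor; together they form a detection probe that can hybridize with the species-specific 16S rDNA of S. aureus. When the probe hybridizes with the target sequence, the distance between the 525 nm QD donor and the ROX acceptor is shortened to within the range in which FRET occurs efficiently. If the quantum dot is then excited at a wavelength that cannot excite ROX directly, the fluorescence intensity of the QD decreases while that of ROX increases. In the absence of the target sequence, no such change in fluorescence intensity occurs. The change in the QD and ROX fluorescence intensities is what makes this detection system rapid and simple.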
