Similar articles
 20 similar articles found.
1.
Protein–protein interactions play a key role in many biological systems. High‐throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false‐positive and false‐negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co‐complex relationship, and (3) pathway co‐membership. To investigate systematically the utility of different data sources and the way the data are encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions: Random Forest (RF), RF similarity‐based k‐Nearest‐Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co‐complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top‐ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. 
Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast‐2‐hybrid system were not among the top‐ranking features under any condition. Proteins 2006. © 2006 Wiley‐Liss, Inc.
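The abstract above ranks features by the Gini index used as the Random Forest splitting function. As a rough illustration only (not the authors' code), the sketch below computes the Gini impurity decrease of a single split; a forest-level importance is typically obtained by accumulating such decreases over all splits that use a given feature. The toy labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical toy data: 1 = interacting pair, 0 = non-interacting pair.
parent = [1, 1, 0, 0]
perfect = gini_decrease(parent, [1, 1], [0, 0])        # split separates the classes
uninformative = gini_decrease(parent, [1, 0], [1, 0])  # split changes nothing
print(perfect, uninformative)
```

A feature whose splits separate the classes well accumulates a large total decrease, which is why it ends up near the top of such a ranking.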

2.
Proteins do not carry out their functions alone. Instead, they often act by participating in macromolecular complexes and play different functional roles depending on the other members of the complex. It is therefore interesting to identify co-complex relationships. Although protein complexes can be identified in a high-throughput manner by experimental technologies such as affinity purification coupled with mass spectrometry (APMS), these large-scale datasets often suffer from high false positive and false negative rates. Here, we present a computational method that predicts co-complexed protein pair (CCPP) relationships using kernel methods from heterogeneous data sources. We show that a diffusion kernel based on random walks on the full network topology yields good performance in predicting CCPPs from protein interaction networks. In the setting of direct ranking, a diffusion kernel performs much better than the mutual clustering coefficient. In the setting of SVM classifiers, a diffusion kernel performs much better than a linear kernel. We also show that combination of complementary information improves the performance of our CCPP recognizer. A summation of three diffusion kernels based on two-hybrid, APMS, and genetic interaction networks and three sequence kernels achieves better performance than the sequence kernels or diffusion kernels alone. Inclusion of additional features achieves a still better ROC(50) of 0.937. Assuming a negative-to-positive ratio of 600:1, the final classifier achieves 89.3% coverage at an estimated false discovery rate of 10%. Finally, we applied our prediction method to two recently described APMS datasets. We find that our predicted positives are highly enriched with CCPPs that are identified by both datasets, suggesting that our method successfully identifies true CCPPs. An SVM classifier trained from heterogeneous data sources provides accurate predictions of CCPPs in yeast. 
This computational method thereby provides an inexpensive method for identifying protein complexes that extends and complements high-throughput experimental data.
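The abstract does not say how the false discovery rate was estimated. Under the standard definition FDR = FP / (FP + TP), and assuming the stated 600:1 negative-to-positive ratio, the sketch below shows how small the per-pair false positive rate must be to reach 10% FDR at 89.3% coverage. The function names are illustrative.

```python
def estimated_fdr(tpr, fpr, neg_per_pos):
    """FDR implied by a true/false positive rate and a class ratio.

    Per positive example there are `neg_per_pos` negatives, so the expected
    counts are TP = tpr and FP = fpr * neg_per_pos.
    """
    fp = fpr * neg_per_pos
    return fp / (fp + tpr)

def fpr_for_target_fdr(tpr, fdr, neg_per_pos):
    """False positive rate that yields the target FDR at the given coverage."""
    return tpr * fdr / ((1.0 - fdr) * neg_per_pos)

# Figures quoted in the abstract: 89.3% coverage, 10% FDR, 600:1 ratio.
fpr = fpr_for_target_fdr(0.893, 0.10, 600)
print(f"required false positive rate: {fpr:.2e}")
print(f"check, FDR = {estimated_fdr(0.893, fpr, 600):.3f}")
```

With so many more negatives than positives, only a false positive rate on the order of 10^-4 keeps the predicted set this clean.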

3.
Chao Fang  Yi Shang  Dong Xu 《Proteins》2018,86(5):592-598
Protein secondary structure prediction can provide important information for protein 3D structure prediction and protein functions. Deep learning offers a new opportunity to significantly improve prediction accuracy. In this article, a new deep neural network architecture, named the Deep inception‐inside‐inception (Deep3I) network, is proposed for protein secondary structure prediction and implemented as a software tool MUFOLD‐SS. The input to MUFOLD‐SS is a carefully designed feature matrix corresponding to the primary amino acid sequence of a protein, which consists of a rich set of information derived from individual amino acids, as well as the context of the protein sequence. Specifically, the feature matrix is a composition of physico‐chemical properties of amino acids, PSI‐BLAST profile, and HHBlits profile. MUFOLD‐SS is composed of a sequence of nested inception modules and maps the input matrix to either eight states or three states of secondary structures. The architecture of MUFOLD‐SS enables effective processing of local and global interactions between amino acids to make accurate predictions. In extensive experiments on multiple datasets, MUFOLD‐SS outperformed the best existing methods and other deep neural networks significantly. MUFOLD‐SS can be downloaded from http://dslsrv8.cs.missouri.edu/~cf797/MUFoldSS/download.html .

4.
The ultimate goal of functional genomics is to define the function of all the genes in the genome of an organism. A large body of information on the biological roles of genes has been accumulated and aggregated in the past decades of research, both from traditional experiments detailing the role of individual genes and proteins, and from newer experimental strategies that aim to characterize gene function on a genomic scale. It is clear that the goal of functional genomics can only be achieved by integrating information and data sources from the variety of these different experiments. Integration of different data is thus an important challenge for bioinformatics. The integration of different data sources often helps to uncover non-obvious relationships between genes, but there are also two further benefits. First, it is likely that whenever information from multiple independent sources agrees, it should be more valid and reliable. Secondly, by looking at the union of multiple sources, one can cover larger parts of the genome. This is obvious for integrating results from multiple single gene or protein experiments, but also necessary for many of the results from genome-wide experiments since they are often confined to certain (although sizable) subsets of the genome. In this paper, we explore an example of such a data integration procedure. We focus on the prediction of membership in protein complexes for individual genes. For this, we recruit six different data sources that include expression profiles, interaction data, essentiality and localization information. Each of these data sources individually contains some weakly predictive information with respect to protein complexes, but we show how this prediction can be improved by combining all of them. Supplementary information is available at http://bioinfo.mbb.yale.edu/integrate/interactions/. Abbreviations: TP: true positive; TN: true negative; FP: false positive; FN: false negative; Y2H: yeast two-hybrid.

5.
Considerable effort has gone into classifying residues as binding sites in proteins using machine learning methods. The prediction task can be translated into the computational challenge of assigning each residue the label binding site or non‐binding site. Observational data come from various, possibly highly correlated, sources. They include the structure of the protein but not the structure of the complex. The model class of conditional random fields (CRFs) has previously been used successfully for protein binding site prediction. Here, a new CRF‐approach is presented that models the dependencies of residues using a general graphical structure defined as a neighborhood graph; our model thus makes fewer independence assumptions on the labels than sequential labeling approaches. A novel node feature “change in free energy” is introduced into the model, which is then denoted by ΔF‐CRF. Parameters are trained with an online large‐margin algorithm. Using the standard feature class relative accessible surface area alone, the general graph‐structure CRF already achieves higher prediction accuracy than the linear chain CRF of Li et al. ΔF‐CRF performs significantly better on a large range of false positive rates than the support‐vector‐machine‐based program PresCont of Zellner et al. on a homodimer set containing 128 chains. ΔF‐CRF has a broader scope than PresCont since it is not constrained to protein subgroups and requires no multiple sequence alignment. The improvement is attributed to the advantageous combination of the novel node feature with the standard feature and to the adopted parameter training method. Proteins 2015; 83:844–852. © 2015 Wiley Periodicals, Inc.

6.
Investigation of protein‐ligand interactions obtained from experiments plays a crucial part in the design of new and effective drugs. Analyzing the data extracted from known interactions could help scientists to predict the binding affinities of promising ligands before conducting experiments. The objective of this study is to advance the CIFAP (compressed images for affinity prediction) method, which is relevant to a protein‐ligand model, identifying 2D electrostatic potential images by separating the binding site of protein‐ligand complexes and using the images for predicting the computational affinity information represented by pIC50 values. The CIFAP method has 2 phases, namely, data modeling and prediction. In the data modeling phase, the separated 3D structure of the binding pocket with the ligand inside is fitted into an electrostatic potential grid box, which is then compressed through 3 orthogonal directions into three 2D images for each protein‐ligand complex. The sequential floating forward selection technique is used to acquire prediction patterns from the images. In the prediction phase, support vector regression (SVR) and partial least squares regression are used for testing the quality of the CIFAP method for predicting the binding affinity of 45 CHK1 inhibitors derived from 2‐aminothiazole‐4‐carboxamide. The results show that the CIFAP method using both support vector regression and partial least squares regression is very effective for predicting the binding affinities of CHK1‐ligand complexes with low‐error values and high correlation. As future work, the results could be improved by working on the pose of the ligands inside the grid.

7.
SUMMARY: Recent advances in high-throughput technology have increased the quantity of available data on protein complexes and stimulated the development of many new prediction methods. In this article, we present ProCope, a Java software suite for the prediction and evaluation of protein complexes from affinity purification experiments which integrates the major methods for calculating interaction scores and predicting protein complexes published over the last years. Methods can be accessed via a graphical user interface, command line tools and a Java API. Using ProCope, existing algorithms can be applied quickly and reproducibly on new experimental results, individual steps of the different algorithms can be combined in new and innovative ways and new methods can be implemented and integrated in the existing prediction framework. AVAILABILITY: Source code and executables are available at http://www.bio.ifi.lmu.de/Complexes/ProCope/.

8.
Since protein complexes play a crucial role in biological cells, one of the major goals in bioinformatics is the elucidation of protein complexes. A general approach is to build a prediction rule based on multiple data sources, e.g. gene expression data and protein interaction data, to assess the likelihood of two proteins having complex association. We critically revisit the step of predictor construction, i.e. the determination of a proper training set, an optimal classifier, and, most importantly, an optimal feature set. We use an exhaustive set of features, which includes the 2hop-feature as introduced by Wong et al. for predicting synthetic sick or lethal interactions. Post-processing of the likelihoods of protein interaction is then required to extract protein complexes. We propose a new protocol for combining these likelihood estimates. The protocol interprets the probabilities of complex association as output by the prediction rule as distances and employs hierarchical clustering to find groups of interacting proteins. In contrast to the computationally expensive search-and-score approach of Sharan et al., this protocol is very fast and can be applied to fully connected graphs. The protocol identifies trusted protein complexes with high confidence. We show that the 2hop-feature is relevant for predicting protein complexes. Furthermore, several interesting hypotheses about new protein complexes have been generated. For example, our approach linked the protein FYV4 to the mitochondrial ribosomal subunit. Interestingly, it is known that this protein is located in the mitochondrion, but its biological role is unknown. Vid22 and YGR071C were also linked, which corresponds to the new TAP data of Krogan et al.
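The protocol in the abstract above treats predicted co-complex probabilities as distances and clusters them hierarchically. The abstract gives neither the linkage rule nor the cut-off, so the sketch below assumes distance = 1 - probability, average linkage, and a fixed distance cut-off; the protein names and probabilities are made up for illustration.

```python
def cluster_pairs(proteins, prob, cutoff):
    """Average-linkage agglomerative clustering with distance = 1 - probability.

    `prob` maps frozenset({a, b}) to the predicted co-complex probability;
    pairs missing from `prob` are treated as probability 0 (an assumption).
    """
    def dist(a, b):
        return 1.0 - prob.get(frozenset((a, b)), 0.0)

    def linkage(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[p] for p in proteins]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical predictions for four proteins.
prob = {
    frozenset(("A", "B")): 0.9,
    frozenset(("C", "D")): 0.8,
    frozenset(("A", "C")): 0.1,
}
complexes = cluster_pairs(["A", "B", "C", "D"], prob, cutoff=0.5)
print(complexes)
```

Because every pair has a distance (missing pairs default to 1), the procedure works on a fully connected graph, which is the property the abstract emphasizes.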

9.
We develop an integrated probabilistic model to combine protein physical interactions, genetic interactions, highly correlated gene expression networks, protein complex data, and domain structures of individual proteins to predict protein functions. The model is an extension of our previous model for protein function prediction based on Markovian random field theory. The model is flexible in that other protein pairwise relationship information and features of individual proteins can be easily incorporated. Two features distinguish the integrated approach from other available methods for protein function prediction. One is that the integrated approach uses all available sources of information with different weights for different sources of data. It is a global approach that takes the whole network into consideration. The second feature is that the posterior probability that a protein has the function of interest is assigned. The posterior probability indicates how confident we are about assigning the function to the protein. We apply our integrated approach to predict functions of yeast proteins based upon MIPS protein function classifications and upon the interaction networks based on MIPS physical and genetic interactions, gene expression profiles, tandem affinity purification (TAP) protein complex data, and protein domain information. We study the recall and precision of the integrated approach using different sources of information by the leave-one-out approach. In contrast to using MIPS physical interactions only, the integrated approach combining all of the information increases the recall from 57% to 87% when the precision is set at 57%, an increase of 30 percentage points.

10.
Predicting protein binding affinities from structural data has remained elusive, a difficulty owing to the variety of protein binding modes. Using the structure‐affinity‐benchmark (SAB, 144 cases with bound/unbound crystal structures and experimental affinity measurements), prediction has been undertaken either by fitting a model using a handful of predefined variables, or by training a complex model from a large pool of parameters (typically hundreds). The former route unnecessarily restricts the model space, while the latter is prone to overfitting. We design models in a third tier, using 12 variables describing enthalpic and entropic variations upon binding, and a model selection procedure identifying the best sparse model built from a subset of these variables. Using these models, we report three main results. First, we present models yielding a marked improvement of affinity predictions. For the whole dataset, we present a model predicting Kd within 1 and 2 orders of magnitude for 48% and 79% of cases, respectively. These statistics jump to 62% and 89%, respectively, for the subset of the SAB consisting of high resolution structures. Second, we show that this performance is due to a new parameter encoding interface morphology and packing properties of interface atoms. Third, we argue that interface flexibility and prediction hardness do not correlate, and that for flexible cases, a performance matching that of the whole SAB can be achieved. Overall, our work suggests that the affinity prediction problem could be partly solved using databases of high resolution complexes whose affinity is known. Proteins 2016; 84:9–20. © 2015 Wiley Periodicals, Inc.
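The abstract reports accuracy as the share of cases whose predicted Kd falls within 1 or 2 orders of magnitude of experiment. One plausible reading of that metric, assumed here, is |log10(Kd_pred) - log10(Kd_exp)| <= k. The sketch below computes it on invented values, not on SAB data.

```python
import math

def fraction_within(pred_kd, exp_kd, orders):
    """Share of cases with |log10(pred) - log10(exp)| <= `orders`."""
    hits = sum(
        1
        for p, e in zip(pred_kd, exp_kd)
        if abs(math.log10(p) - math.log10(e)) <= orders
    )
    return hits / len(exp_kd)

# Hypothetical dissociation constants in mol/L.
exp_kd = [1e-9, 1e-8, 1e-6, 1e-7]
pred_kd = [5e-9, 5e-7, 1e-3, 2e-7]  # off by ~0.7, ~1.7, 3 and ~0.3 orders

within_1 = fraction_within(pred_kd, exp_kd, 1)
within_2 = fraction_within(pred_kd, exp_kd, 2)
print(within_1, within_2)
```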

11.
Quantitative prediction of protein–protein binding affinity is essential for understanding protein–protein interactions. In this article, an atomic level potential of mean force (PMF) considering volume correction is presented for the prediction of protein–protein binding affinity. The potential is obtained by statistically analyzing X‐ray structures of protein–protein complexes in the Protein Data Bank. This approach circumvents the complicated steps of the volume correction process and is very easy to implement in practice. It yields a more reasonable pair potential than the traditional PMF and reproduces the classic Lennard‐Jones picture of nonbonded atom‐pair interactions. To evaluate the prediction ability for protein–protein binding affinity, six test sets are examined. Sets 1–5 were each used as the test set in one of five published studies, and set 6 was the union of sets 1–5, with a total of 86 protein–protein complexes. The correlation coefficient (R) and standard deviation (SD) of fitting predicted affinity to experimental data were calculated to compare our performance with that reported in the literature. Our predictions on sets 1–5 were as good as the best prediction reported in the published studies, and for union set 6, R = 0.76, SD = 2.24 kcal/mol. Furthermore, we found that the volume correction can significantly improve the prediction ability. This approach can also promote the research on docking and protein structure prediction.

12.
13.
Interactions between proteins and other molecules play essential roles in all biological processes. Although it is widely held that a protein's ligand specificity is determined primarily by its three‐dimensional structure, the general principles by which structure determines ligand binding remain poorly understood. Here we use statistical analyses of a large number of protein–ligand complexes with associated binding‐affinity measurements to quantitatively characterize how combinations of atomic interactions contribute to ligand affinity. We find that there are significant differences in how atomic interactions determine ligand affinity for proteins that bind small chemical ligands, those that bind DNA/RNA and those that interact with other proteins. Although protein‐small molecule and protein‐DNA/RNA binding affinities can be accurately predicted from structural data, models predicting one type of interaction perform poorly on the others. Additionally, the particular combinations of atomic interactions required to predict binding affinity differed between small‐molecule and DNA/RNA data sets, consistent with the conclusion that the structural bases determining ligand affinity differ among interaction types. In contrast to what we observed for small‐molecule and DNA/RNA interactions, no statistical models were capable of predicting protein–protein affinity with >60% correlation. We demonstrate the potential usefulness of protein‐DNA/RNA binding prediction as a possible tool for high‐throughput virtual screening to guide laboratory investigations, suggesting that quantitative characterization of diverse molecular interactions may have practical applications as well as fundamentally advancing our understanding of how molecular structure translates into function. Proteins 2015; 83:2100–2114. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.

14.
Since the development of affinity chromatography, affinity purification technology has been applied to many aspects of biological research, becoming an indispensable tool. Efficient strategies for the identification of biologically active compounds based on biochemical specificity have not yet been established, despite widespread interest in identifying chemicals that directly alter biomolecular functions. Here, we report a novel method for purifying chemicals that specifically interact with a target biomolecule using reverse affinity beads, a receptor-immobilized high-performance solid-phase matrix. When FK506-binding protein 12 (FKBP12) immobilized beads were used in this process, FK506 was efficiently purified in one step either from a mixture of chemical compounds or from fermented broth extract. The reverse affinity beads facilitated identification of drug/receptor complex binding proteins by reconstitution of immobilized ligand/receptor complexes on the beads. When FKBP12/FK506 and FKBP12/rapamycin complexes were immobilized, calcineurin and FKBP/rapamycin-associated protein were purified from a crude cell extract, respectively. These data indicate that reverse affinity beads are powerful tools for identification of both specific ligands and proteins that interact with receptor/ligand complexes.

15.
Zhao XM  Wang Y  Chen L  Aihara K 《Proteins》2008,72(1):461-473
Domains are structural and functional units of proteins and play an important role in functional genomics. Theoretically, the functions of a protein can be directly inferred if the biological functions of its component domains are determined. Despite the important role that domains play, only a small number of domains have been annotated so far, and few studies have attempted to predict the functions of domains. Hence, it is necessary to develop automatic methods for predicting domain functions based on various available data. In this article, two new methods, namely the threshold-based classification method and the support vector machines method, are proposed for protein domain function prediction by integrating heterogeneous information sources, including protein-domain mapping features, domain-domain interactions, and domain coexisting features. We show that the integration of heterogeneous information sources improves not only prediction accuracy but also annotation reliability when compared with the methods using only individual information sources.

16.
Biological processes are commonly controlled by precise protein‐protein interactions. These connections rely on specific amino acids at the binding interfaces. Here we predict the binding residues of such interprotein complexes. We have developed a suite of methods, i‐Patch, which predict the interprotein contact sites by considering the two proteins as a network, with residues as nodes and contacts as edges. i‐Patch starts with two proteins, A and B, which are assumed to interact, but for which the structure of the complex is not available. However, we assume that for each protein, we have a reference structure and a multiple sequence alignment of homologues. i‐Patch then uses the propensities of patches of residues to interact, to predict interprotein contact sites. i‐Patch outperforms several other tested algorithms for prediction of interprotein contact sites. It gives 59% precision with 20% recall on a blind test set of 31 protein pairs. Combining the i‐Patch scores with an existing correlated mutation algorithm, McBASC, using a logistic model gave little improvement. Results from a case study, on bacterial chemotaxis protein complexes, demonstrate that our predictions can identify contact residues, as well as suggesting unknown interfaces in multiprotein complexes. Proteins 2010. © 2010 Wiley‐Liss, Inc.

17.
Computational prediction of RNA‐binding residues is helpful in uncovering the mechanisms underlying protein‐RNA interactions. Traditional algorithms individually applied feature‐ or template‐based prediction strategy to recognize these crucial residues, which could restrict their predictive power. To improve RNA‐binding residue prediction, herein we propose the first integrative algorithm termed RBRDetector (RNA‐Binding Residue Detector) by combining these two strategies. We developed a feature‐based approach that is an ensemble learning predictor comprising multiple structure‐based classifiers, in which well‐defined evolutionary and structural features in conjunction with sequential or structural microenvironment were used as the inputs of support vector machines. Meanwhile, we constructed a template‐based predictor to recognize the putative RNA‐binding regions by structurally aligning the query protein to the RNA‐binding proteins with known structures. The final RBRDetector algorithm is an ingenious fusion of our feature‐ and template‐based approaches based on a piecewise function. By validating our predictors with diverse types of structural data, including bound and unbound structures, native and simulated structures, and protein structures binding to different RNA functional groups, we consistently demonstrated that RBRDetector not only had clear advantages over its component methods, but also significantly outperformed the current state‐of‐the‐art algorithms. Nevertheless, the major limitation of our algorithm is that it performed relatively well on DNA‐binding proteins and thus incorrectly predicted the DNA‐binding regions as RNA‐binding interfaces. Finally, we implemented the RBRDetector algorithm as a user‐friendly web server, which is freely accessible at http://ibi.hzau.edu.cn/rbrdetector . Proteins 2014; 82:2455–2471. © 2014 Wiley Periodicals, Inc.

18.
Isatin (indol‐2,3‐dione) is an endogenous non‐peptide regulator exhibiting a wide range of biological and pharmacological activities, which are poorly characterized in terms of their molecular mechanisms. Identification of many isatin‐binding proteins in the mammalian brain and liver suggests that isatin may influence their functions. We have hypothesized that besides direct action on particular protein targets, isatin can act as a regulator of protein–protein interactions (PPIs). In this surface plasmon resonance‐based biosensor study we have found that physiologically relevant concentrations of isatin (25‐100 μM) increase affinity of interactions between human recombinant ferrochelatase (FECH) and NADPH‐dependent adrenodoxin reductase (ADR). In the presence of increasing concentrations of isatin the Kd values demonstrated a significant (up to 6‐fold) decrease. It is especially important that the interaction of isatin with each individual protein (FECH, ADR) was basically negligible and therefore could not contribute to the observed effect. This effect was specific only for the FECH/ADR complex formation and was not observed for other protein complexes studied: FECH/cytochrome b5(CYB5A) and FECH/SMAD4.

19.
Substrate binding to Hsp70 chaperones is involved in many biological processes, and the identification of potential substrates is important for a comprehensive understanding of these events. We present a multi‐scale pipeline for an accurate, yet efficient prediction of peptides binding to the Hsp70 chaperone BiP by combining sequence‐based prediction with molecular docking and MMPBSA calculations. First, we measured the binding of 15mer peptides from known substrate proteins of BiP by peptide array (PA) experiments and performed an accuracy assessment of the PA data by fluorescence anisotropy studies. Several sequence‐based prediction models were fitted using this and other peptide binding data. A structure‐based position‐specific scoring matrix (SB‐PSSM) derived solely from structural modeling data forms the core of all models. The matrix elements are based on a combination of binding energy estimations, molecular dynamics simulations, and analysis of the BiP binding site, which led to new insights into the peptide binding specificities of the chaperone. Using this SB‐PSSM, peptide binders could be predicted with high selectivity even without training of the model on experimental data. Additional training further increased the prediction accuracies. Subsequent molecular docking (DynaDock) and MMGBSA/MMPBSA‐based binding affinity estimations for predicted binders allowed the identification of the correct binding mode of the peptides as well as the calculation of nearly quantitative binding affinities. The general concept behind the developed multi‐scale pipeline can readily be applied to other protein‐peptide complexes with linearly bound peptides, for which sufficient experimental binding data for the training of classical sequence‐based prediction models is not available. Proteins 2016; 84:1390–1407. © 2016 Wiley Periodicals, Inc.
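A position-specific scoring matrix such as the SB‐PSSM described above is usually applied by sliding it along a peptide and summing the per-position scores. The sketch below shows that generic scan; the 3-column matrix, its values, the peptide, and the "higher is better" sign convention are all hypothetical and are not taken from the published model.

```python
def score_window(window, pssm, default=0.0):
    """Sum the per-position scores of one window; unlisted residues score `default`."""
    return sum(col.get(aa, default) for aa, col in zip(window, pssm))

def best_window(peptide, pssm):
    """Return (score, start_index) of the highest-scoring window in `peptide`."""
    width = len(pssm)
    scored = [
        (score_window(peptide[i:i + width], pssm), i)
        for i in range(len(peptide) - width + 1)
    ]
    return max(scored)

# Hypothetical 3-position matrix: one dict of residue scores per position.
pssm = [
    {"L": 1.0, "D": -1.0},
    {"V": 0.5},
    {"L": 1.0, "K": -0.5},
]
score, start = best_window("ADLVLKG", pssm)
print(score, start)
```

In a pipeline like the one described, the top-scoring windows would then be passed on to the more expensive docking and free-energy steps.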

20.
Protein engineering and synthetic biology stand to benefit immensely from recent advances in in silico tools for structural and functional analyses of proteins. In the context of designing novel proteins, current in silico tools inform the user on individual parameters of a query protein, with output scores/metrics unique to each parameter. In reality, proteins feature multiple “parts”/functions, and modification of a protein aimed at altering a given part typically has collateral impact on other protein parts. A system for prediction of the combined effect of design parameters on the overall performance of the final protein does not exist. Function2Form Bridge (F2F-Bridge) attempts to address this by combining the scores of different design parameters pertaining to the protein being analyzed into a single easily interpreted output describing overall performance. The strategy comprises (a) a mathematical strategy combining data from a myriad of in silico tools into an OP-score (a singular score informing on a user-defined overall performance) and (b) the F2F Plot, a graphical means of informing the wetlab biologist holistically on designed construct suitability in the context of multiple parameters, highlighting scope for improvement. F2F predictive output was compared with wetlab data from a range of synthetic proteins designed, built, and tested for this study. Statistical/machine learning approaches for predicting overall performance, for use alongside the F2F plot, were also examined. Comparisons between wetlab performance and F2F predictions demonstrated close and reliable correlations. This user-friendly strategy represents a pivotal enabler in increasing the accessibility of synthetic protein building and de novo protein design.

