Similar literature
20 similar articles found (search time: 93 ms)
1.
Guo J  Wu X  Zhang DY  Lin K 《Nucleic acids research》2008,36(6):2002-2011
High-throughput studies of protein interactions may have produced, experimentally and computationally, the most comprehensive protein–protein interaction datasets in the completely sequenced genomes. This provides an opportunity to discover, on a proteome scale, the underlying protein interaction patterns. Here, we propose an approach to discovering motif pairs at interaction sites (often 3–8 residues) that are essential for understanding protein functions and helpful for the rational design of protein engineering and folding experiments. A gold standard positive (interacting) dataset and a gold standard negative (non-interacting) dataset were mined to infer the interacting motif pairs that are significantly overrepresented in the positive dataset compared to the negative dataset. Four negative datasets assembled by different strategies were evaluated and the one with the best performance was used as the gold standard negatives for further analysis. Meanwhile, to assess the efficiency of our method in detecting potential interacting motif pairs, other approaches developed previously were compared, and we found that our method achieved the highest prediction accuracy. In addition, many uncharacterized motif pairs of interest were found to be functional with experimental evidence in other species. This investigation demonstrates the important effects of a high-quality negative dataset on the performance of such statistical inference.

2.
Accurate and large-scale prediction of protein–protein interactions directly from amino-acid sequences is one of the great challenges in computational biology. Here we present a new Bayesian network method that predicts interaction partners using only multiple alignments of amino-acid sequences of interacting protein domains, without tunable parameters, and without the need for any training examples. We first apply the method to bacterial two-component systems and comprehensively reconstruct two-component signaling networks across all sequenced bacteria. Comparisons of our predictions with known interactions show that our method infers interaction partners genome-wide with high accuracy. To demonstrate the general applicability of our method we show that it also accurately predicts interaction partners in a recent dataset of polyketide synthases. Analysis of the predicted genome-wide two-component signaling networks shows that cognates (interacting kinase/regulator pairs, which lie adjacent on the genome) and orphans (which lie isolated) form two relatively independent components of the signaling network in each genome. In addition, while most genes are predicted to have only a small number of interaction partners, we find that 10% of orphans form a separate class of ‘hub’ nodes that distribute and integrate signals to and from up to tens of different interaction partners.

3.
4.
Protein–protein interactions (PPIs) play very important roles in many cellular processes, and provide rich information for discovering biological facts and knowledge. Although various experimental approaches have been developed to generate large amounts of PPI data for different organisms, high-throughput experimental data usually suffers from high error rates, and as a consequence, the biological knowledge discovered from this data is distorted or incorrect. Therefore, it is vital to assess the quality of protein interaction data and extract reliable protein interactions from the high-throughput experimental data. In this paper, we propose a new Semantic Reliability (SR) method to assess the reliability of each protein interaction and identify potential false-positive protein interactions in a dataset. For each pair of target interacting proteins, the SR method takes into account the semantic influence between proteins that interact with the target proteins, and the semantic influence between the target proteins themselves when assessing the interaction reliability. Evaluations on real protein interaction datasets demonstrated that our method outperformed other existing methods in terms of extracting more reliable interactions from original protein interaction datasets.

5.
Identifying the genes that change their expression between two conditions (such as normal versus cancer) is a crucial task that can help in understanding the causes of diseases. Differential networking has emerged as a powerful approach to detect the changes in network structures and to identify the differentially connected genes between two networks. However, existing differential network-based methods primarily depend on pairwise comparisons of the genes based on their connectivity. Therefore, these methods cannot capture the essential topological changes in the network structures. In this paper, we propose a novel algorithm, DiffRank, which ranks the genes based on their contribution to the differences between the two networks. To achieve this goal, we define two novel structural scoring measures: a local structure measure (differential connectivity) and a global structure measure (differential betweenness centrality). These measures are optimized by propagating the scores through the network structure and then ranking the genes based on these propagated scores. We demonstrate the effectiveness of DiffRank on synthetic and real datasets. For the synthetic datasets, we developed a simulator for generating synthetic differential scale-free networks, and we compared our method with existing methods. The comparisons show that our algorithm outperforms these existing methods. For the real datasets, we apply the proposed algorithm to several gene expression datasets and demonstrate that the proposed method provides biologically interesting results.
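
The abstract above names a local measure, differential connectivity, without giving its formula. As a rough illustration only, the sketch below assumes the simplest possible definition: for each gene, count the neighbours it has in one network but not in the other. The function names, the toy edge lists, and this definition are assumptions made for illustration; this is not DiffRank, which additionally uses a global measure and score propagation.

```python
from collections import defaultdict

def neighbour_sets(edges):
    """Map each gene to the set of its neighbours in an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours

def differential_connectivity(edges_a, edges_b):
    """Assumed definition: number of neighbours gained or lost between networks."""
    nb_a, nb_b = neighbour_sets(edges_a), neighbour_sets(edges_b)
    genes = set(nb_a) | set(nb_b)
    # Symmetric difference = neighbours present in exactly one of the two networks.
    return {g: len(nb_a.get(g, set()) ^ nb_b.get(g, set())) for g in genes}

# Toy networks (hypothetical gene names).
normal = [("g1", "g2"), ("g1", "g3"), ("g2", "g3")]
tumour = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g1", "g5")]

scores = differential_connectivity(normal, tumour)
# Rank genes by decreasing score, breaking ties alphabetically.
ranking = sorted(scores, key=lambda g: (-scores[g], g))
```

In this toy case `g1` gains two neighbours and therefore ranks first; a purely local score like this cannot see changes in global topology, which is the limitation the abstract attributes to connectivity-only methods.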

6.

Background  

Several studies have demonstrated that synthetic lethal genetic interactions between gene mutations provide an indication of functional redundancy between molecular complexes and pathways. These observations help explain the finding that organisms are able to tolerate single gene deletions for a large majority of genes. For example, system-wide gene knockout/knockdown studies in S. cerevisiae and C. elegans revealed non-viable phenotypes for a mere 18% and 10% of the genome, respectively. It has been postulated that the low percentage of essential genes reflects the extensive amount of genetic buffering that occurs within genomes. Consistent with this hypothesis, systematic double-knockout screens in S. cerevisiae and C. elegans show that, on average, 0.5% of tested gene pairs are synthetic sick or synthetic lethal. While knowledge of synthetic lethal interactions provides valuable insight into molecular functionality, testing all combinations of gene pairs represents a daunting task for molecular biologists, as the combinatorial nature of these relationships imposes a large experimental burden. Still, the task of mapping pairwise interactions between genes is essential to discovering functional relationships between molecular complexes and pathways, as they form the basis of genetic robustness. Towards the goal of alleviating the experimental workload, computational techniques that accurately predict genetic interactions can potentially aid in targeting the most likely candidate interactions. Building on previous studies that analyzed properties of network topology to predict genetic interactions, we apply random walks on biological networks to accurately predict pairwise genetic interactions. Furthermore, we incorporate all published non-interactions into our algorithm for measuring the topological relatedness between two genes. We apply our method to S. cerevisiae and C. elegans datasets and, using a decision tree classifier, integrate diverse biological networks and show that our method outperforms established methods.
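
The abstract above uses random walks on a network to measure topological relatedness between two genes, but does not say which variant. The sketch below assumes one common choice, a random walk with restart computed by power iteration, and treats the stationary visiting probability of a gene (starting from a seed gene) as its relatedness to the seed. The restart probability, iteration count, function name, and toy graph are all illustrative assumptions, not the authors' settings.

```python
def random_walk_with_restart(adjacency, seed, restart=0.3, iterations=100):
    """Approximate stationary visiting probabilities of a walk restarting at `seed`.

    `adjacency` maps every node to a list of its neighbours (all of which
    must also be keys of `adjacency`).
    """
    nodes = sorted(adjacency)
    prob = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        # With probability `restart` the walker jumps back to the seed.
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for n in nodes:
            nbrs = adjacency[n]
            if not nbrs:
                # Isolated node: send its remaining mass back to the seed.
                nxt[seed] += (1 - restart) * prob[n]
                continue
            share = (1 - restart) * prob[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        prob = nxt
    return prob

# Toy undirected network (hypothetical gene labels).
network = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
relatedness = random_walk_with_restart(network, seed="a")
```

Genes close to the seed in the network (here `b`) receive more probability mass than distant ones (here `e`); such scores could then serve as one feature among several in a classifier, as the abstract describes.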

7.
JH Oh  HP Wong  X Wang  JO Deasy 《PloS one》2012,7(6):e38870
The number of biomarker candidates is often much larger than the number of clinical patient data points available, which motivates the use of a rational candidate variable filtering methodology. The goal of this paper is to apply such a bioinformatics filtering process to isolate a modest number (<10) of key interacting genes and their associated single nucleotide polymorphisms involved in radiation response, and to ultimately serve as a basis for using clinical datasets to identify new biomarkers. In Step 1, we surveyed the literature on genetic and protein correlates to radiation response, in vivo or in vitro, across cellular, animal, and human studies. In Step 2, we analyzed two publicly available microarray datasets and identified genes in which mRNA expression changed in response to radiation. Combining results from Step 1 and Step 2, we identified 20 genes that were common to all three sources. As a final step, a curated database of protein interactions was used to generate the most statistically reliable protein interaction network among any subset of the 20 genes resulting from Steps 1 and 2, resulting in identification of a small, tightly interacting network with 7 out of 20 input genes. We further ranked the genes in terms of likely importance, based on their location within the network using a graph-based scoring function. The resulting core interacting network provides an attractive set of genes likely to be important to radiation response.

8.
Cell-cell communication is mediated by many soluble mediators, including over 40 cytokines. Cytokines, e.g. TNF, IL1β, IL5, IL6, IL12 and IL23, represent important therapeutic targets in immune-mediated inflammatory diseases (IMIDs), such as inflammatory bowel disease (IBD), psoriasis, asthma, rheumatoid and juvenile arthritis. The identification of cytokines that are causative drivers of, and not just associated with, inflammation is fundamental for selecting therapeutic targets that should be studied in clinical trials. As in vitro models of cytokine interactions provide a simplified framework to study complex in vivo interactions, and can easily be perturbed experimentally, they are key for identifying such targets. We present a method to extract a minimal, weighted cytokine interaction network, given in vitro data on the effects of the blockage of single cytokine receptors on the secretion rate of other cytokines. Existing biological network inference methods typically consider the correlation structure of the underlying dataset, but this can make them poorly suited for highly connected, non-linear cytokine interaction data. Our method uses ordinary differential equation systems to represent cytokine interactions, and efficiently computes the configuration with the lowest Akaike information criterion value for all possible network configurations. It enables us to study indirect cytokine interactions and quantify inhibition effects. The extracted network can also be used to predict the combined effects of inhibiting various cytokines simultaneously. The model equations can easily be adjusted to incorporate more complicated dynamics and accommodate temporal data. We validate our method using synthetic datasets and apply our method to an experimental dataset on the regulation of IL23, a cytokine with therapeutic relevance in psoriasis and IBD. We validate several model predictions against experimental data that were not used for model fitting. 
In summary, we present a novel method specifically designed to efficiently infer cytokine interaction networks from cytokine perturbation data in the context of IMIDs.
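
The abstract above selects, among all possible network configurations, the one with the lowest Akaike information criterion (AIC). The sketch below illustrates only that selection step, assuming the least-squares form AIC = n ln(RSS/n) + 2k. The edge labels, residual sums of squares, and parameter counts are made-up toy values; fitting the ODE models themselves is not shown.

```python
import math

def aic_least_squares(rss, n_obs, n_params):
    """AIC for a least-squares fit: n * ln(RSS / n) + 2k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

N_OBS = 20  # hypothetical number of measurements

# Hypothetical candidate configurations: edge set -> (RSS of the fit, parameter count).
candidates = {
    (): (10.0, 0),
    ("A->C",): (4.0, 1),
    ("A->C", "B->C"): (3.9, 2),
}

aic = {
    edges: aic_least_squares(rss, N_OBS, k)
    for edges, (rss, k) in candidates.items()
}
best = min(aic, key=aic.get)
```

With these toy numbers the second edge lowers the residual only slightly, so the penalty of 2 per parameter outweighs the gain and the single-edge configuration is selected; this is how AIC favours a minimal network.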

9.
DREAM is an initiative that allows researchers to assess how well their methods or approaches can describe and predict networks of interacting molecules [1]. Each year, recently acquired datasets are released to predictors ahead of publication. Researchers typically have about three months to predict the masked data or network of interactions, using any predictive method. Predictions are assessed prior to an annual conference where the best predictions are unveiled and discussed. Here we present the strategy we used to make a winning prediction for the DREAM3 phosphoproteomics challenge. We used Amelia II, a multiple imputation software method developed by Gary King, James Honaker and Matthew Blackwell [2] in the context of social sciences, to predict the 476 out of 4624 measurements that had been masked for the challenge. To choose the best possible multiple imputation parameters to apply for the challenge, we evaluated how transforming the data and varying the imputation parameters affected the ability to predict additionally masked data. We discuss the accuracy of our findings and show that multiple imputation applied to this dataset is a powerful method to accurately estimate the missing data. We postulate that multiple imputation methods might become an integral part of experimental design as a means to achieve cost savings or to increase the quantity of samples that could be handled for a given cost.

10.
11.
Ma X  Tarone AM  Li W 《PloS one》2008,3(4):e1922

Background

Synthetic lethal genetic interaction analysis has been successfully applied to predicting the functions of genes and their pathway identities. In the context of synthetic lethal interaction data alone, the global similarity of synthetic lethal interaction patterns between two genes is used to predict gene function. With physical interaction data, such as protein-protein interactions, the enrichment of physical interactions within subsets of genes and the enrichment of synthetic lethal interactions between those subsets of genes are used as an indication of compensatory pathways.

Result

In this paper, we propose a method of mapping genetically compensatory pathways from synthetic lethal interactions. Our method is designed to discover pairs of gene-sets in which synthetic lethal interactions are depleted among the genes in an individual set and where such gene-set pairs are connected by many synthetic lethal interactions. By its nature, our method could select compensatory pathway pairs that buffer the deleterious effect of the failure of either one, without the need of physical interaction data. By focusing on compensatory pathway pairs where genes in each individual pathway have a highly homogenous cellular function, we show that many cellular functions have genetically compensatory properties.

Conclusion

We conclude that synthetic lethal interaction data are a powerful source to map genetically compensatory pathways, especially in systems lacking physical interaction information, and that the cellular function network contains abundant compensatory properties.

12.
One of the central goals of human genetics is the identification of loci with alleles or genotypes that confer increased susceptibility. The availability of dense maps of single-nucleotide polymorphisms (SNPs) along with high-throughput genotyping technologies has set the stage for routine genome-wide association studies that are expected to significantly improve our ability to identify susceptibility loci. Before this promise can be realized, there are some significant challenges that need to be addressed. We address here the challenge of detecting epistasis or gene–gene interactions in genome-wide association studies. Discovering epistatic interactions in high dimensional datasets remains a challenge due to the computational complexity resulting from the analysis of all possible combinations of SNPs. One potential way to overcome the computational burden of a genome-wide epistasis analysis would be to devise a logical way to prioritize the many SNPs in a dataset so that the data may be analyzed more efficiently and yet still retain important biological information. One of the strongest demonstrations of the functional relationship between genes is protein-protein interaction. Thus, it is plausible that the expert knowledge extracted from protein interaction databases may allow for a more efficient analysis of genome-wide studies as well as facilitate the biological interpretation of the data. In this review we will discuss the challenges of detecting epistasis in genome-wide genetic studies and the means by which we propose to apply expert knowledge extracted from protein interaction databases to facilitate this process. We explore some of the fundamentals of protein interactions and the databases that are publicly available.

13.
Cells respond to variable environments by changing gene expression and gene interactions. To study how human cells respond to stress, we analyzed the expression of >5000 genes in cultured B cells from nearly 100 normal individuals following endoplasmic reticulum stress and exposure to ionizing radiation. We identified thousands of genes that are induced or repressed. Then, we constructed coexpression networks and inferred interactions among genes. We used coexpression and machine learning analyses to study how genes interact with each other in response to stress. The results showed that for most genes, their interactions with each other are the same at baseline and in response to different stresses; however, a small set of genes acquired new interacting partners to engage in stress-specific responses. These genes with altered interacting partners are associated with diseases in which endoplasmic reticulum stress response or sensitivity to radiation has been implicated. Thus, our findings showed that to understand disease-specific pathways, it is important to identify not only genes that change expression levels but also those that alter interactions with other genes.

14.

Background

The problems of correlation and classification are long-standing in the fields of statistics and machine learning, and techniques have been developed to address these problems. We are now in the era of high-dimensional data, which can involve billions of variables. These data present new challenges. In particular, it is difficult to discover predictive variables when each variable has little marginal effect. An example concerns Genome-wide Association Studies (GWAS) datasets, which involve millions of single nucleotide polymorphisms (SNPs), where some of the SNPs interact epistatically to affect disease status. Towards determining these interacting SNPs, researchers developed techniques that addressed this specific problem. However, the problem is more general, and so these techniques are applicable to other problems concerning interactions. A difficulty with many of these techniques is that they do not distinguish whether a learned interaction is actually an interaction or whether it involves several variables with strong marginal effects.

Methodology/Findings

We address this problem using information gain and Bayesian network scoring. First, we identify candidate interactions by determining whether together variables provide more information than they do separately. Then we use Bayesian network scoring to see if a candidate interaction really is a likely model. Our strategy is called MBS-IGain. Using 100 simulated datasets and a real GWAS Alzheimer’s dataset, we investigated the performance of MBS-IGain.

Conclusions/Significance

When analyzing the simulated datasets, MBS-IGain substantially out-performed nine previous methods at locating interacting predictors, and at identifying interactions exactly. When analyzing the real Alzheimer’s dataset, we obtained new results and results that substantiated previous findings. We conclude that MBS-IGain is highly effective at finding interactions in high-dimensional datasets. This result is significant because we have increasingly abundant high-dimensional data in many domains, and to learn causes and perform prediction/classification using these data, we often must first identify interactions.
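
The Methodology section of the abstract above identifies candidate interactions by checking whether variables together provide more information about the outcome than they do separately. The sketch below shows that idea in its generic form, assuming the quantity IG(X1, X2; Y) - IG(X1; Y) - IG(X2; Y) with information gain measured in bits. The function names and the XOR toy data are illustrative assumptions; the exact criterion and the Bayesian network scoring step used by MBS-IGain are not given in the abstract and are not reproduced here.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(features, target):
    """I(features; target) = H(target) + H(features) - H(features, target)."""
    joint = list(zip(features, target))
    return entropy(target) + entropy(features) - entropy(joint)

def interaction_strength(x1, x2, target):
    """Information the pair carries jointly beyond the sum of its parts."""
    pair = list(zip(x1, x2))
    return (information_gain(pair, target)
            - information_gain(x1, target)
            - information_gain(x2, target))

# Toy data: the outcome is the XOR of two binary predictors, so neither
# predictor has any marginal effect but together they determine the outcome.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

strength = interaction_strength(x1, x2, y)
```

For this XOR example each predictor alone carries 0 bits and the pair carries 1 bit, so the interaction strength is 1 bit. Two predictors with strong but purely additive marginal effects would score near zero, which is the distinction the Background section says many earlier techniques fail to make.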

15.
Drug-drug interactions account for up to 30% of adverse drug reactions. Increasing prevalence of electronic health records (EHRs) offers a unique opportunity to build machine learning algorithms to identify drug-drug interactions that drive adverse events. In this study, we investigated hospitalization data to study drug interactions with non-steroidal anti-inflammatory drugs (NSAIDs) that result in drug-induced liver injury (DILI). We propose a logistic-regression-based machine learning algorithm that unearths several known interactions from an EHR dataset of about 400,000 hospitalizations. Our proposed modeling framework is successful in detecting 87.5% of the positive controls, which are defined by drugs known to interact with diclofenac causing an increased risk of DILI, and correctly ranks aggregate risk of DILI for eight commonly prescribed NSAIDs. We found that our modeling framework is particularly successful in inferring associations of drug-drug interactions from relatively small EHR datasets. Furthermore, we have identified a novel and potentially hepatotoxic interaction that might occur during concomitant use of meloxicam and esomeprazole, which are commonly prescribed together to allay NSAID-induced gastrointestinal (GI) bleeding. Empirically, we validate our approach against prior methods for signal detection on EHR datasets, in which our proposed approach outperforms all the compared methods across most metrics, such as area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC).

16.
Recent studies have shown evidence for the coevolution of functionally-related genes. This coevolution is a result of constraints to maintain functional relationships between interacting proteins. The studies have focused on the correlation in gene tree branch lengths of proteins that are directly interacting with each other. We here hypothesize that the correlation in branch lengths is not limited to proteins that directly interact, but extends to proteins that operate within the same pathway. Using generalized linear models as a basis of identifying correlation, we attempted to predict the gene ontology (GO) terms of a gene based on its gene tree branch lengths. We applied our method to a dataset consisting of proteins from ten prokaryotic species. We found that the degree of accuracy to which we could predict the function of the proteins from their gene tree varied substantially with different GO terms. In particular, our model could accurately predict genes involved in translation and certain ribosomal activities, with an area under the receiver operating characteristic curve of up to 92%. Further analysis showed that the similarity between the trees of genes labeled with similar GO terms was not limited to genes that physically interacted, but also extended to genes functioning within the same pathway. We discuss the relevance of our findings as they relate to the use of phylogenetic methods in comparative genomics.

17.
18.
Identifying protein–protein interactions (PPIs) is critical for understanding the cellular function of the proteins and the machinery of a proteome. Data of PPIs derived from high-throughput technologies are often incomplete and noisy. Therefore, it is important to develop computational methods and high-quality interaction datasets for predicting PPIs. A sequence-based method is proposed by combining correlation coefficient (CC) transformation and support vector machine (SVM). CC transformation not only adequately considers the neighboring effect of protein sequence but also describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset MIPS Core and a gold standard negatives (non-interacting) dataset GO-NEG of yeast Saccharomyces cerevisiae were mined to objectively evaluate the above method and attenuate the bias. The SVM model combined with CC transformation yielded the best performance with a high accuracy of 87.94% using gold standard positives and gold standard negatives datasets. The MATLAB source code and the datasets are available on request from smgsmg@mail.ustc.edu.cn.

19.
BACKGROUND: Complex diseases are commonly caused by multiple genes and their interactions with each other. Genome-wide association (GWA) studies provide the opportunity to capture those disease-associated genes and gene-gene interactions through panels of SNP markers. However, a proper filtering procedure is critical to reduce the search space prior to the computationally intensive gene-gene interaction identification step. In this study, we show that two commonly used SNP-SNP interaction filtering algorithms, ReliefF and tuned ReliefF (TuRF), are sensitive to the order of the samples in the dataset, giving rise to unstable and suboptimal results. However, we observe that the 'unstable' results from multiple runs of these algorithms can provide valuable information about the dataset. We therefore hypothesize that aggregating results from multiple runs of the algorithm may improve the filtering performance. RESULTS: We propose a simple and effective ensemble approach in which the results from multiple runs of an unstable filter are aggregated based on the general theory of ensemble learning. The ensemble versions of the ReliefF and TuRF algorithms, referred to as ReliefF-E and TuRF-E, are robust to sample order dependency and enable a more informative investigation of data characteristics. Using simulated and real datasets, we demonstrate that both the ensemble of ReliefF and the ensemble of TuRF can generate a much more stable SNP ranking than the original algorithms. Furthermore, the ensemble of TuRF achieved the highest success rate in comparison to many state-of-the-art algorithms as well as traditional χ²-test and odds ratio methods in terms of retaining gene-gene interactions.
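
The abstract above aggregates the results of multiple runs of an order-sensitive filter, each on a differently ordered copy of the samples. The sketch below illustrates that ensemble idea under two loud assumptions: the aggregation is a plain average of per-SNP scores (the abstract does not specify the rule), and the filter is a deliberately simple, hypothetical stand-in called `first_half_filter`, not ReliefF or TuRF.

```python
import random

def first_half_filter(samples):
    """Toy order-sensitive filter (a stand-in, NOT ReliefF).

    Scores each SNP by the absolute case/control difference in mean genotype,
    using only the first half of the samples, so the scores depend on order.
    """
    half = samples[: len(samples) // 2]
    cases = [g for g, label in half if label == 1]
    controls = [g for g, label in half if label == 0]
    n_snps = len(samples[0][0])
    if not cases or not controls:
        return [0.0] * n_snps  # cannot compare groups in this run
    return [
        abs(sum(g[i] for g in cases) / len(cases)
            - sum(g[i] for g in controls) / len(controls))
        for i in range(n_snps)
    ]

def ensemble_scores(filter_fn, samples, n_runs=50, seed=0):
    """Average the filter's per-SNP scores over n_runs random sample orders."""
    rng = random.Random(seed)
    n_snps = len(samples[0][0])
    totals = [0.0] * n_snps
    for _ in range(n_runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        for i, score in enumerate(filter_fn(shuffled)):
            totals[i] += score
    return [t / n_runs for t in totals]

# Toy data: SNP 0 equals the disease label, SNP 1 is unrelated to it.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
samples = [((label, i % 2), label) for i, label in enumerate(labels)]

averaged = ensemble_scores(first_half_filter, samples)
# Rank SNPs by decreasing averaged score, breaking ties by index.
ranking = sorted(range(len(averaged)), key=lambda i: (-averaged[i], i))
```

A single run of the toy filter can give a misleading score when the first half happens to be unrepresentative; averaging over many orders smooths this out, which is the stabilising effect the abstract reports for ReliefF-E and TuRF-E.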

20.
Wu X  Zhu L  Guo J  Zhang DY  Lin K 《Nucleic acids research》2006,34(7):2137-2150
A map of protein–protein interactions provides valuable insight into the cellular function and machinery of a proteome. By measuring the similarity between two Gene Ontology (GO) terms with a relative specificity semantic relation, here, we proposed a new method of reconstructing a yeast protein–protein interaction map that is solely based on the GO annotations. The method was validated using high-quality interaction datasets for its effectiveness. Based on a Z-score analysis, a positive dataset and a negative dataset for protein–protein interactions were derived. Moreover, a gold standard positive (GSP) dataset with the highest level of confidence that covered 78% of the high-quality interaction dataset and a gold standard negative (GSN) dataset with the lowest level of confidence were derived. In addition, we assessed four high-throughput experimental interaction datasets using the positives and the negatives as well as GSPs and GSNs. Our predicted network reconstructed from GSPs consists of 40753 interactions among 2259 proteins, and forms 16 connected components. We mapped all of the MIPS complexes except for homodimers onto the predicted network. As a result, ~35% of complexes were identified as interconnected. For seven complexes, we also identified some nonmember proteins that may be functionally related to the complexes concerned. This analysis is expected to provide a new approach for predicting the protein–protein interaction maps from other completely sequenced genomes with high-quality GO-based annotations.
