Similar articles
20 similar articles found
1.
Context-sensitive data integration and prediction of biological networks   Cited by: 4 (self-citations: 0, citations by others: 4)
MOTIVATION: Several recent methods have addressed the problem of heterogeneous data integration and network prediction by modeling the noise inherent in high-throughput genomic datasets, which can dramatically improve specificity and sensitivity and allow the robust integration of datasets with heterogeneous properties. However, experimental technologies capture different biological processes with varying degrees of success, and thus, each source of genomic data can vary in relevance depending on the biological process one is interested in predicting. Accounting for this variation can significantly improve network prediction, but to our knowledge, no previous approaches have explicitly leveraged this critical information about biological context. RESULTS: We confirm the presence of context-dependent variation in functional genomic data and propose a Bayesian approach for context-sensitive integration and query-based recovery of biological process-specific networks. By applying this method to Saccharomyces cerevisiae, we demonstrate that leveraging contextual information can significantly improve the precision of network predictions, including assignment for uncharacterized genes. We expect that this general context-sensitive approach can be applied to other organisms and prediction scenarios. AVAILABILITY: A software implementation of our approach is available on request from the authors. SUPPLEMENTARY INFORMATION: Supplementary data are available at http://avis.princeton.edu/contextPIXIE/
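The idea of context-sensitive Bayesian integration can be illustrated with a minimal naive Bayes sketch. This is not the authors' implementation: the function name `posterior_linked`, the dataset names, the context labels, and every probability below are hypothetical, chosen only to show how the same evidence can yield different posteriors when dataset reliability is estimated per biological context.

```python
def posterior_linked(prior, likelihoods, observations):
    """Naive Bayes posterior that a gene pair is functionally linked.

    likelihoods[d]  = (P(obs=1 | linked), P(obs=1 | not linked)) for dataset d
    observations[d] = True if dataset d reports the pair, else False
    """
    p_yes, p_no = prior, 1.0 - prior
    for dataset, seen in observations.items():
        p1_linked, p1_unlinked = likelihoods[dataset]
        p_yes *= p1_linked if seen else 1.0 - p1_linked
        p_no *= p1_unlinked if seen else 1.0 - p1_unlinked
    return p_yes / (p_yes + p_no)


# Hypothetical context-specific reliabilities for two evidence types.
context_likelihoods = {
    "ribosome biogenesis": {"coexpression": (0.8, 0.1), "two_hybrid": (0.3, 0.2)},
    "signal transduction": {"coexpression": (0.3, 0.2), "two_hybrid": (0.7, 0.05)},
}

# The same evidence for one gene pair, scored under each context.
evidence = {"coexpression": True, "two_hybrid": False}
scores = {
    context: posterior_linked(0.05, likelihoods, evidence)
    for context, likelihoods in context_likelihoods.items()
}
```

With these made-up numbers, co-expression counts for more in the first context than in the second, so the pair receives a higher posterior there.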

2.

Background  

Machine-learning tools have gained considerable attention during the last few years for analyzing biological networks for protein function prediction. Kernel methods are suitable for learning from graph-based data such as biological networks, as they only require the abstraction of the similarities between objects into the kernel matrix. One key issue in kernel methods is the selection of a good kernel function. Diffusion kernels, the discretization of the familiar Gaussian kernel of Euclidean space, are commonly used for graph-based data.
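As a rough illustration of a diffusion kernel on a graph, the sketch below computes K = exp(-beta * L), where L = D - A is the graph Laplacian. It uses a truncated Taylor series in plain Python so that it stays self-contained; in practice a library routine such as `scipy.linalg.expm` would be the usual choice. The helper names, the toy path graph, and the value of `beta` are assumptions made for this example.

```python
def graph_laplacian(n, edges):
    """Return L = D - A for an undirected graph on nodes 0..n-1."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    return L


def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(L, beta, terms=40):
    """Approximate K = exp(-beta * L) with a truncated Taylor series."""
    n = len(L)
    M = [[-beta * x for x in row] for row in L]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in K]  # current series term, starts at identity
    for k in range(1, terms):
        # term becomes M^k / k!
        term = [[x / k for x in row] for row in matmul(term, M)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K


# Toy path graph 0 - 1 - 2 - 3
K = diffusion_kernel(graph_laplacian(4, [(0, 1), (1, 2), (2, 3)]), beta=0.5)
```

The resulting similarity between two nodes decays with their distance along the path, which is the behaviour that makes such kernels useful for propagating annotations over a network.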

3.
Dramatic improvements in high throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function, compared to proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross validation study we show that the integrated model increases recall by 18% at 50% precision, compared to using PPI data alone. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement vs. PPI depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function.
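The neighbor assumption behind a functional linkage graph can be sketched as a simple voting scheme. This is a deliberately simplified stand-in, not the probabilistic model of the abstract above: the function `neighbor_vote`, the protein names, and the term labels are all hypothetical.

```python
from collections import Counter


def neighbor_vote(graph, annotations, protein):
    """Score each term by the fraction of annotated neighbors that carry it."""
    annotated = [n for n in graph.get(protein, ()) if n in annotations]
    if not annotated:
        return {}
    counts = Counter(term for n in annotated for term in annotations[n])
    return {term: count / len(annotated) for term, count in counts.items()}


# Hypothetical linkage graph: the query protein has four neighbors,
# one of which (P4) has no annotation and is therefore ignored.
graph = {"query": ["P1", "P2", "P3", "P4"]}
annotations = {
    "P1": {"term_X"},
    "P2": {"term_X", "term_Y"},
    "P3": {"term_X"},
}

scores = neighbor_vote(graph, annotations, "query")
best_term = max(scores, key=scores.get)
```

A fuller model would weight each edge by the strength of its evidence and combine the graph score with domain, phenotype, and localization data, as the abstract describes.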

4.
In an era exploding with genome-scale data, a major challenge for developmental biologists is how to extract significant clues from these publicly available data to benefit our studies of individual genes, and how to use them to improve our understanding of development at a systems level. Several studies have successfully demonstrated new approaches to classic developmental questions by computationally integrating various genome-wide data sets. Such computational approaches have shown great potential for facilitating research: instead of testing 20,000 genes, researchers might test 200 to the same effect. We discuss the nature and state of this art as it applies to developmental research.

5.
As the protein databases continue to expand at an exponential rate, fed by daily uploads from multiple large scale genomic and metagenomic projects, the problem of assigning a function to each new protein has become the focus of significant research interest in recent times. Herein, we review the most recent advances in the field of automated function prediction (AFP). We begin by defining what is meant by biological “function” and the means of describing such functions using standardised machine readable ontologies. We then focus on the various function-prediction programs available, both sequence and structure based, and outline their associated strengths and weaknesses. Finally, we conclude with a brief overview of the future challenges and the questions in the field that remain unanswered.

6.

Background  

Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity.

7.
Although Arabidopsis (Arabidopsis thaliana) is the best studied plant species, the biological role of one-third of its proteins is still unknown. We developed a probabilistic protein function prediction method that integrates information from sequences, protein-protein interactions, and gene expression. The method was applied to proteins from Arabidopsis. Evaluation of prediction performance showed that our method has improved performance compared with single source-based prediction approaches and two existing integration approaches. An innovative feature of our method is that it enables transfer of functional information between proteins that are not directly associated with each other. We provide novel function predictions for 5,807 proteins. Recent experimental studies confirmed several of the predictions. We highlight these in detail for proteins predicted to be involved in flowering and floral organ development.

8.
MOTIVATION: Inferring networks of proteins from biological data is a central issue of computational biology. Most network inference methods, including Bayesian networks, take unsupervised approaches in which the network is totally unknown in the beginning, and all the edges have to be predicted. A more realistic supervised framework, proposed recently, assumes that a substantial part of the network is known. We propose a new kernel-based method for supervised graph inference based on multiple types of biological datasets such as gene expression, phylogenetic profiles and amino acid sequences. Notably, our method assigns a weight to each type of dataset and thereby selects informative ones. Data selection is useful for reducing data collection costs. For example, when a similar network inference problem must be solved for other organisms, the dataset excluded by our algorithm need not be collected. RESULTS: First, we formulate supervised network inference as a kernel matrix completion problem, where the inference of edges boils down to estimation of missing entries of a kernel matrix. Then, an expectation-maximization algorithm is proposed to simultaneously infer the missing entries of the kernel matrix and the weights of multiple datasets. By introducing the weights, we can integrate multiple datasets selectively and thereby exclude irrelevant and noisy datasets. Our approach is favorably tested in two biological networks: a metabolic network and a protein interaction network. AVAILABILITY: Software is available on request.

9.
Discovery of biological networks from diverse functional genomic data   Cited by: 1 (self-citations: 0, citations by others: 1)
We have developed a general probabilistic system for query-based discovery of pathway-specific networks through integration of diverse genome-wide data. This framework was validated by accurately recovering known networks for 31 biological processes in Saccharomyces cerevisiae and experimentally verifying predictions for the process of chromosomal segregation. Our system, bioPIXIE, a public, comprehensive system for integration, analysis, and visualization of biological network predictions for S. cerevisiae, is freely accessible over the World Wide Web.

10.
Membrane proteins account for about 30% of the genomes sequenced to date and play important roles in a variety of cellular functions. However, determining the three-dimensional structures of membrane proteins continues to pose a major challenge for structural biologists due to difficulties in recombinant expression and purification. We describe here a high throughput pipeline for Escherichia coli-based membrane protein expression and purification. A ligation-independent cloning (LIC)-based vector encoding a C-terminal green fluorescent protein (GFP) tag was used for cloning in a high throughput mode. The GFP tag facilitated expression screening in E. coli through both cell culture fluorescence measurements and in-gel fluorescence imaging. Positive candidates from the GFP screening were subsequently sub-cloned into a LIC-based, GFP-free vector for further expression and purification. The expressed, C-terminal His-tagged membrane proteins were purified via membrane enrichment and Ni-affinity chromatography. The Thermofluor technique was applied to screen optimal buffers and detergents for the purified membrane proteins. This pipeline has been successfully tested for membrane proteins from E. coli and can potentially be extended to other prokaryotes.

11.
How DNA is repaired after retrovirus integration is not well understood. DNA-dependent protein kinase (DNA-PK) is known to play a central role in the repair of double-stranded DNA breaks. Recently, a role for DNA-PK in retroviral DNA integration has been proposed (R. Daniel, R. A. Katz, and A. M. Skalka, Science 284:644-647, 1999). Reduced transduction efficiency and increased cell death by apoptosis were observed upon retrovirus infection of cultured scid cells. We have used a human immunodeficiency virus (HIV) type 1 (HIV-1)-derived lentivirus vector system to further investigate the role of DNA-PK during integration. We measured lentivirus transduction of scid mouse embryonic fibroblasts (MEF) and xrs-5 or xrs-6 cells. These cells are deficient in the catalytic subunit of DNA-PK and in Ku, the DNA-binding subunit of DNA-PK, respectively. At low vector titers, efficient and stable lentivirus transduction was obtained, excluding an essential role for DNA-PK in lentivirus integration. Likewise, transduction of HIV-derived vectors in scid mouse brain was as efficient as that in control mice, without evidence of apoptosis. We observed increased cell death in scid MEF and xrs-5 or xrs-6 cells, but only after transduction with high vector titers (multiplicity of infection [MOI], >1 transducing unit [TU]/cell) and subsequent passage of the transduced cells. At an MOI of <1 TU/cell, however, transduction efficiency was even higher in DNA-PK-deficient cells than in control cells. Taken together, the data suggest a protective role of DNA-PK against cellular toxicity induced by high levels of retrovirus integrase or integration. Another candidate cellular enzyme that has been claimed to play an important role during retrovirus integration is poly(ADP-ribose) polymerase (PARP). However, no inhibition of lentivirus vector-mediated transduction or HIV-1 replication by 3-methoxybenzamide, a known PARP inhibitor, was observed. In conclusion, DNA-PK and PARP are not essential for lentivirus integration.

12.
13.
The informational spectrum method was applied to the analysis of lymphotoxin and tumor necrosis factor. A correlation was established between the information contained in the primary structure of these tumor toxins and that of some oncogene transforming proteins. This correlation implies the possibility of a competitive action between these two groups of proteins. "Hot spot" positions in the primary structure of lymphotoxin and tumor necrosis factor were predicted for functional "up" and "down" mutations.

14.
Background:

The wide availability of genome-scale data for several organisms has stimulated interest in computational approaches to gene function prediction. Diverse machine learning methods have been applied to unicellular organisms with some success, but few have been extensively tested on higher level, multicellular organisms. A recent mouse function prediction project (MouseFunc) brought together nine bioinformatics teams applying a diverse array of methodologies to mount the first large-scale effort to predict gene function in the laboratory mouse.

Results:

In this paper, we describe our contribution to this project, an ensemble framework based on the support vector machine that integrates diverse datasets in the context of the Gene Ontology hierarchy. We carry out a detailed analysis of the performance of our ensemble and provide insights into which methods work best under a variety of prediction scenarios. In addition, we applied our method to Saccharomyces cerevisiae and have experimentally confirmed functions for a novel mitochondrial protein.

Conclusion:

Our method consistently performs among the top methods in the MouseFunc evaluation. Furthermore, it exhibits good classification performance across a variety of cellular processes and functions in both a multicellular organism and a unicellular organism, indicating its ability to discover novel biology in diverse settings.


15.
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. 
Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.

16.

Background  

The omics fields promise to revolutionize our understanding of biology and biomedicine. However, their potential is compromised by the challenge of analyzing the huge datasets produced. Analysis of omics data is plagued by the curse of dimensionality, resulting in imprecise estimates of model parameters and performance. Moreover, the integration of omics data with other data sources is difficult to shoehorn into classical statistical models. This has resulted in ad hoc approaches to address specific problems.

17.
18.
Chemical genetic analysis of protein kinases involves engineering kinases to be uniquely sensitive to inhibitors and ATP analogs that are not recognized by wild-type kinases. Despite the successful application of this approach to over two dozen kinases, several kinases do not tolerate the necessary modification to the ATP binding pocket, as they lose catalytic activity or cellular function upon mutation of the 'gatekeeper' residue that governs inhibitor and nucleotide substrate specificity. Here we describe the identification of second-site suppressor mutations to rescue the activity of 'intolerant' kinases. A bacterial genetic selection for second-site suppressors using an aminoglycoside kinase APH(3')-IIIa revealed several suppressor hotspots in the kinase domain. Informed by results from this selection, we focused on the beta sheet in the N-terminal subdomain and generated a structure-based sequence alignment of protein kinases in this region. From this alignment, we identified second-site suppressors for several divergent kinases including Cdc5, MEKK1, GRK2 and Pto. The ability to identify second-site suppressors to rescue the activity of intolerant kinases should facilitate chemical genetic analysis of the majority of protein kinases in the genome.

19.

Background  

Clearly visualized biopathways are a great help in understanding biological systems. However, manual drawing of large-scale biopathways is time consuming. We previously proposed a grid layout algorithm that can handle gene-regulatory networks and signal transduction pathways by considering edge-edge crossings, node-edge crossings, a distance measure between nodes, and subcellular localization information from Gene Ontology. This layout algorithm succeeded in drastically reducing these crossings in the apoptosis model. However, for larger-scale networks, we encountered three problems: (i) the initial layout is often very far from any local optimum because nodes are initially placed at random; (ii) from a biological viewpoint, human-drawn layouts are still easier to understand than automatic layouts because, apart from subcellular localization, the algorithm does not fully use the biological information of pathways; and (iii) the algorithm employs a local search strategy in which the neighborhood is obtained by moving one node at each step, whereas automatic layouts suggest that simultaneous movements of multiple nodes are necessary for better layouts, although such an extension may worsen the time complexity.
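One ingredient of such a layout cost, the number of edge-edge crossings, can be sketched as follows. This is only an illustrative fragment under stated assumptions: the function names, the toy node positions, and the choice to count only proper crossings (edges that share a node are skipped) are not taken from the paper, whose cost function also covers node-edge crossings, node distances, and subcellular localization.

```python
from itertools import combinations


def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def crosses(a, b, c, d):
    """True if segments ab and cd cross at a point interior to both."""
    return (orient(a, b, c) * orient(a, b, d) < 0
            and orient(c, d, a) * orient(c, d, b) < 0)


def edge_crossings(pos, edges):
    """Count properly crossing edge pairs for node positions `pos`."""
    count = 0
    for (u, v), (x, y) in combinations(edges, 2):
        if len({u, v, x, y}) < 4:
            continue  # edges sharing a node are not counted as crossing
        if crosses(pos[u], pos[v], pos[x], pos[y]):
            count += 1
    return count


# Two hypothetical grid layouts of the same four-node network.
edges = [("a", "b"), ("c", "d")]
layout_crossed = {"a": (0, 0), "b": (2, 2), "c": (0, 2), "d": (2, 0)}
layout_clean = {"a": (0, 0), "b": (2, 2), "c": (0, 2), "d": (1, 2)}

crossed = edge_crossings(layout_crossed, edges)
clean = edge_crossings(layout_clean, edges)
```

A local search of the kind described above would move one node to a neighboring grid cell at a time and keep the move when a cost like this one decreases.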

20.
Protein data, from sequence and structure to interaction, are being generated through many diverse methodologies; they are stored and reported in numerous forms and multiple places. The magnitude of the data limits researchers' ability to utilize all the information generated. Effective integration of protein data can be accomplished through better data modeling. We demonstrate this through the MIPD project.
