Similar articles
 20 similar articles found (search time: 7 ms)
1.
Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a “big computer” with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a “small computer”. The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The proposed approach then consists of two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then conduct a weighted average over multiple bootstrap samples. For data with a large p and a large n, the natural marriage of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
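The large-n strategy described above (bootstrap a small number of subjects, apply penalized estimation, then take a weighted average) can be illustrated with a minimal sketch. It is not the authors' R package: it assumes a single covariate with no intercept, uses the closed-form one-dimensional lasso solution, and weights each bootstrap estimate by the inverse of its residual mean squared error. The weighting scheme and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_1d(xs, ys, lam):
    """Closed-form minimizer of (1/2n) * sum((y - b*x)^2) + lam * |b|."""
    n = len(xs)
    xy = sum(x * y for x, y in zip(xs, ys)) / n
    xx = sum(x * x for x in xs) / n
    return soft_threshold(xy, lam) / xx


def subsample_penalized_estimate(xs, ys, lam, subsample_size, n_rounds, seed=0):
    """Average penalized estimates over small bootstrap samples.

    Weights are inverse residual MSE (an assumption, not from the abstract).
    """
    rng = random.Random(seed)
    n = len(xs)
    estimates, weights = [], []
    for _ in range(n_rounds):
        # Draw a small bootstrap sample of subjects (with replacement).
        idx = [rng.randrange(n) for _ in range(subsample_size)]
        sx = [xs[i] for i in idx]
        sy = [ys[i] for i in idx]
        b = lasso_1d(sx, sy, lam)
        mse = sum((y - b * x) ** 2 for x, y in zip(sx, sy)) / subsample_size
        estimates.append(b)
        weights.append(1.0 / (mse + 1e-12))
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, estimates)) / total
```

Each round touches only `subsample_size` observations, so the rounds are independent and could be run in parallel on small machines, which is the computational point the abstract makes.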

2.
YV Sun 《Human genetics》2012,131(10):1677-1686
Millions of genetic variants have been assessed for their effects on the trait of interest in genome-wide association studies (GWAS). Complex traits are affected by sets of inter-related genes. However, a typical GWAS examines the association of only a single genetic variant at a time. Individual variant effects on a complex trait are usually small, and the simple sum of these individual effects may not reflect the holistic effect of the genetic system. High-throughput methods enable genomic studies to produce a large amount of data to expand the knowledge base of biological systems. Biological networks and pathways are built to represent the functional or physical connectivity among genes. Integrated with GWAS data, network- and pathway-based methods complement single-variant analysis and may improve the power to identify trait-associated genes. By taking advantage of biological knowledge, these approaches help to interpret the functional role of genetic variants and to further understand the molecular mechanisms influencing the traits. Network- and pathway-based methods have demonstrated their utility and will be increasingly important in addressing a number of challenges facing mainstream GWAS.

3.

Background  

Scientific literature is a source of the most reliable and comprehensive knowledge about molecular interaction networks. Formalization of this knowledge is necessary for computational analysis and is achieved by automatic fact extraction using various text-mining algorithms. Most of these techniques suffer from high false positive rates and redundancy of the extracted information. The extracted facts form a large network with no pathways defined.

4.
We describe a method for extracting Boolean implications (if-then relationships) in very large amounts of gene expression microarray data. A meta-analysis of data from thousands of microarrays for humans, mice, and fruit flies finds millions of implication relationships between genes that would be missed by other methods. These relationships capture gender differences, tissue differences, development, and differentiation. New relationships are discovered that are preserved across all three species.
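To make the idea of a Boolean implication concrete, the following is a simplified sketch rather than the method of the abstract. It assumes one fixed expression threshold for all genes and a hypothetical tolerance `max_violation_rate`, and it only tests rules of the form "A high implies B high"; the published method may use a different discretization and statistical test.

```python
from itertools import permutations


def boolean_implications(expr, threshold, max_violation_rate=0.05):
    """Find rules "A high => B high" in expr (dict: gene -> list of values).

    All genes must list their values in the same sample order.
    """
    # Discretize every gene into high (True) / low (False) calls.
    calls = {g: [v > threshold for v in vals] for g, vals in expr.items()}
    found = []
    for a, b in permutations(calls, 2):
        a_high = [i for i, hi in enumerate(calls[a]) if hi]
        if not a_high:
            continue  # no evidence for a rule whose premise never holds
        # A violation is a sample where A is high but B is low.
        violations = sum(1 for i in a_high if not calls[b][i])
        if violations / len(a_high) <= max_violation_rate:
            found.append((a, b))
    return found
```

Note that an implication is asymmetric: "A high implies B high" can hold while the converse fails, which is why such relationships can be missed by symmetric measures such as correlation.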

5.
6.
7.
We present an analytical framework to analyze lists of proteins with large undirected graphs representing their known functional relationships. We consider edge-count variables such as the number of interactions between a protein and a list, the size of a subgraph induced by a list, and the number of interactions bridging two lists. We derive approximate analytical expressions for the probability distributions of these variables in a model of a random graph with given expected degrees. Probabilities obtained with the analytical expressions are used to mine a protein interaction network for functional modules, characterize the connectedness of protein functional categories, and measure the strength of relations between modules.
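The three edge-count variables named above can be computed directly from an adjacency structure. The sketch below assumes the graph is stored as a dictionary mapping each protein to the set of its neighbours, with every undirected edge recorded in both directions and no self-loops; the function names are illustrative only. It covers the counting step, not the probability distributions derived in the paper.

```python
def edges_to_list(graph, protein, protein_list):
    """Number of interactions between one protein and a list of proteins."""
    members = set(protein_list)
    return sum(1 for nb in graph.get(protein, ()) if nb in members)


def induced_subgraph_size(graph, protein_list):
    """Number of edges in the subgraph induced by a list of proteins."""
    members = set(protein_list)
    # Every internal edge is seen once from each endpoint, hence the halving.
    twice = sum(1 for u in members for v in graph.get(u, ()) if v in members)
    return twice // 2


def bridging_edges(graph, list_a, list_b):
    """Number of edges with one endpoint in each list (lists assumed disjoint)."""
    a, b = set(list_a), set(list_b)
    return sum(1 for u in a for v in graph.get(u, ()) if v in b)
```

Observed counts like these would then be compared with their distribution under the random graph model to decide whether a list is unusually well connected.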

8.
MOTIVATION: Association rule analysis methods are important techniques applied to gene expression data for finding expression relationships between genes. However, previous methods implicitly assume that all genes have similar importance, or they ignore the individual importance of each gene. The relation intensity between any two items has never been taken into consideration. Therefore, we propose an algorithm named REMMAR (RElational-based Multiple Minimum supports Association Rules) to tackle this problem. This method adjusts the minimum relation support (MRS) for each gene pair depending on the regulatory relation intensity, in order to discover more important association rules with stronger biological meaning. RESULTS: In the case study of this research, REMMAR used the shortest distance between any two genes in the Saccharomyces cerevisiae gene regulatory network (GRN) as the relation intensity to discover association rules from two S.cerevisiae gene expression datasets. Under experimental evaluation, REMMAR generates more rules with stronger relation intensity and filters out rules without biological meaning in the protein-protein interaction network (PPIN). Furthermore, according to a literature survey of the discovered rules, the proposed method has a higher precision (100%) than the reference Apriori method (87.5%). Therefore, the proposed REMMAR algorithm can discover stronger association rules in biological relationships that traditional methods overlook, assisting biologists in complicated genetic exploration.

9.
In recent years, the study of species' occurrence has benefited from the increased availability of large-scale citizen-science data. While abundance data from standardized monitoring schemes are biased toward well-studied taxa and locations, opportunistic data are available for many taxonomic groups, from a large number of locations and across long timescales. Hence, these data provide opportunities to measure species' changes in occurrence, particularly through the use of occupancy models, which account for imperfect detection. These opportunistic datasets can be substantially large, numbering hundreds of thousands of sites, and hence present a challenge from a computational perspective, especially within a Bayesian framework. In this paper, we develop a unifying framework for Bayesian inference in occupancy models that account for both spatial and temporal autocorrelation. We make use of the Pólya-Gamma scheme, which allows for fast inference, and incorporate spatio-temporal random effects using Gaussian processes (GPs), for which we consider two efficient approximations: subset of regressors and nearest neighbor GPs. We apply our model to data on two UK butterfly species, one common and widespread and one rare, using records from the Butterflies for the New Millennium database, producing occupancy indices spanning 45 years. Our framework can be applied to a wide range of taxa, providing measures of variation in species' occurrence, which are used to assess biodiversity change.

10.
The advent of the "omics" era in biology research has brought new challenges and requires the development of novel strategies to answer previously intractable questions. Molecular interaction networks provide a framework to visualize cellular processes, but their complexity often makes their interpretation an overwhelming task. The inherently artificial nature of interaction detection methods and the incompleteness of currently available interaction maps call for a careful and well-informed utilization of this valuable data. In this tutorial, we aim to give an overview of the key aspects that any researcher needs to consider when working with molecular interaction data sets and we outline an example for interactome analysis. Using the molecular interaction database IntAct, the software platform Cytoscape, and its plugins BiNGO and clusterMaker, and taking as a starting point a list of proteins identified in a mass spectrometry-based proteomics experiment, we show how to build, visualize, and analyze a protein-protein interaction network.

11.
Pathways are typically the central concept in the analysis of biochemical reaction networks. A pathway can be interpreted as a chain of enzymatic reactions performing a specific biological function. A common way to study metabolic networks is through elementary modes, which are minimal pathways that can operate at steady state. The theory of chemical organizations has recently been used to decompose biochemical networks into algebraically closed and self-maintaining subnetworks termed organizations. The aim of this paper is to elucidate the relation between these two concepts. Whereas elementary modes represent the boundaries of the potential behavior of the network, organizations define metabolite compositions that are likely to be present in biologically feasible situations. Hence, steady state organizations consist of combinations of elementary modes. On the other hand, it is possible to assign a unique (and possibly empty) set of organizations to each elementary mode, indicating the metabolites accompanying the active pathway in a feasible steady state.

12.
Inferring gene regulatory networks from multiple microarray datasets   Cited by: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: Microarray gene expression data has increasingly become the common data source that can provide insights into biological processes at a system-wide level. One of the major problems with microarrays is that a dataset consists of relatively few time points with respect to a large number of genes, which makes the problem of inferring a gene regulatory network an ill-posed one. On the other hand, gene expression data generated by different groups worldwide are increasingly accumulated on many species and can be accessed from public databases or individual websites, although each experiment has only a limited number of time points. RESULTS: This paper proposes a novel method to combine multiple time-course microarray datasets from different conditions for inferring gene regulatory networks. The proposed method, called GNR (Gene Network Reconstruction tool), is based on linear programming and a decomposition procedure. The method theoretically ensures the derivation of the most consistent network structure with respect to all of the datasets, thereby not only significantly alleviating the problem of data scarcity but also remarkably improving the prediction reliability. We tested GNR using both simulated data and experimental data in yeast and Arabidopsis. The results demonstrate the effectiveness of GNR in predicting new gene regulatory relationships in yeast and Arabidopsis. AVAILABILITY: The software is available from http://zhangorup.aporc.org/bioinfo/grninfer/, http://digbio.missouri.edu/grninfer/ and http://intelligent.eic.osaka-sandai.ac.jp or upon request from the authors.

13.
14.
Michael Nute, Tandy Warnow 《BMC genomics》2016,17(10):764-144

Background

Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today.

Results

We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences.

Conclusions

Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.

15.
16.

Background  

Experimental screening of large sets of peptides with respect to their MHC binding capabilities is still very demanding due to the large number of possible peptide sequences and the extensive polymorphism of the MHC proteins. Therefore, there is significant interest in the development of computational methods for predicting the binding capability of peptides to MHC molecules, as a first step towards selecting peptides for actual screening.

17.
18.
19.
Networks are proving to be central to the study of gene function, protein-protein interaction, and biochemical pathway data. Visualization of networks is important for their study, but visualization tools are often inadequate for working with very large biological networks. Here, we present an algorithm, called large graph layout (LGL), which can be used to dynamically visualize large networks on the order of hundreds of thousands of vertices and millions of edges. LGL applies a force-directed iterative layout guided by a minimal spanning tree of the network in order to generate coordinates for the vertices in two or three dimensions, which are subsequently visualized and interactively navigated with companion programs. We demonstrate the use of LGL in visualizing an extensive protein map summarizing the results of approximately 21 billion sequence comparisons between 145579 proteins from 50 genomes. Proteins are positioned in the map according to sequence homology and gene fusions, with the map ultimately serving as a theoretical framework that integrates inferences about gene function derived from sequence homology, remote homology, gene fusions, and higher-order fusions. We confirm that protein neighbors in the resulting map are functionally related, and that distinct map regions correspond to distinct cellular systems, enabling a computational strategy for discovering proteins' functions on the basis of the proteins' map positions. Using the map produced by LGL, we infer general functions for 23 uncharacterized protein families.

20.
Proteins are essential macromolecules of life that carry out most cellular processes. Since proteins aggregate to perform function, and since protein-protein interaction (PPI) networks model these aggregations, one would expect to uncover new biology from PPI network topology. Hence, using PPI networks to predict protein function and the role of protein pathways in disease has received attention. A debate remains open about whether network properties of "biologically central (BC)" genes (i.e., their protein products), such as those involved in aging, cancer, infectious diseases, or signaling and drug-targeted pathways, exhibit some topological centrality compared to the rest of the proteins in the human PPI network. To help resolve this debate, we design new network-based approaches and apply them to get new insight into biological function and disease. We hypothesize that BC genes have a topologically central (TC) role in the human PPI network. We propose two different concepts of topological centrality. We design a new centrality measure to capture complex wirings of proteins in the network that identifies as TC those proteins that reside in dense extended network neighborhoods. Also, we use the notion of domination and find dominating sets (DSs) in the PPI network, i.e., sets of proteins such that every protein is either in the DS or is a neighbor of the DS. Clearly, a DS has a TC role, as it enables efficient communication between different network parts. We find statistically significant enrichment in BC genes of TC nodes and outperform the existing methods, indicating that genes involved in key biological processes occupy topologically complex and dense regions of the network and correspond to its "spine", which connects all other network parts and can thus pass cellular signals efficiently throughout the network. To our knowledge, this is the first study that explores domination in the context of PPI networks.
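The definition of a dominating set given above (every protein is either in the set or is a neighbour of a member) can be illustrated with a short sketch. The greedy heuristic below is an assumption chosen for brevity and is not claimed to be the procedure used in the study; it returns a valid dominating set but not necessarily a minimum one. The graph is assumed to be a dictionary mapping each node to the set of its neighbours, with undirected edges stored in both directions.

```python
def greedy_dominating_set(graph):
    """Greedy heuristic: repeatedly add the node that dominates the most
    still-undominated nodes (itself plus its neighbours)."""
    undominated = set(graph)
    dominating = set()
    while undominated:
        # Sorting makes tie-breaking deterministic.
        best = max(
            sorted(graph),
            key=lambda n: len(({n} | graph[n]) & undominated),
        )
        dominating.add(best)
        undominated -= {best} | graph[best]
    return dominating


def is_dominating_set(graph, candidate):
    """Check the definition: each node is in the set or adjacent to it."""
    chosen = set(candidate)
    return all(n in chosen or graph[n] & chosen for n in graph)
```

The separate `is_dominating_set` check mirrors the definition directly, so it can be used to validate a set produced by any other method.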
