Similar articles
 Found 20 similar articles; search took 15 ms
1.
2.
3.
Extracting three-way gene interactions from microarray data   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: It is an important and difficult task to extract gene network information from high-throughput genomic data. A common approach is to cluster genes using pairwise correlation as a distance metric. However, pairwise correlation is clearly too simplistic to describe the complex relationships among real genes, since co-expression relationships are often restricted to a specific set of biological conditions/processes. In this study, we describe a three-way gene interaction model that captures the dynamic nature of the co-expression relationship between a gene pair through the introduction of a controller gene. RESULTS: We surveyed 0.4 billion possible three-way interactions among 1000 genes in a microarray dataset containing 678 human cancer samples. To test the reproducibility and statistical significance of our results, we randomly split the samples into a training set and a testing set. We found that the gene triplets with the strongest interactions (i.e. with the smallest P-values from appropriate statistical tests) in the training set also had the strongest interactions in the testing set. A distinctive pattern of three-way interaction emerged from these gene triplets: depending on whether the third gene is expressed or not, the remaining two genes can be either co-expressed or mutually exclusive (i.e. expression of either one of them would repress the other). Such three-way interactions can exist without apparent pairwise correlations. The identified three-way interactions may constitute candidates for further experimentation using techniques such as RNA interference, so that novel gene networks or pathways could be identified.
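
A rough illustration of the pattern this abstract describes: split the samples by whether a controller gene is expressed, then compare the correlation of the remaining gene pair in the two groups. This is only a sketch under stated assumptions. The function names, the expression threshold, the toy data, and the use of a Fisher z comparison are all hypothetical choices for illustration; the abstract does not say which statistical test the authors used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def controlled_correlation(x, y, z, threshold):
    """Correlation of x and y where controller z is above / not above threshold."""
    hi = [(a, b) for a, b, c in zip(x, y, z) if c > threshold]
    lo = [(a, b) for a, b, c in zip(x, y, z) if c <= threshold]
    r_hi = pearson([a for a, _ in hi], [b for _, b in hi])
    r_lo = pearson([a for a, _ in lo], [b for _, b in lo])
    return r_hi, len(hi), r_lo, len(lo)

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    return (math.atanh(r1) - math.atanh(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

# Toy data (assumed): x and y rise together when z is on, and oppose when z is off.
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.0, 4.1, 3.0, 1.9, 1.2]
z = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

r_hi, n_hi, r_lo, n_lo = controlled_correlation(x, y, z, threshold=0.5)
pooled = pearson(x, y)
z_stat = fisher_z_difference(r_hi, n_hi, r_lo, n_lo)
```

In the toy data the pooled correlation is close to zero while the two conditional correlations are strong and of opposite sign, which matches the abstract's remark that three-way interactions can exist without apparent pairwise correlation.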

4.
Static expression experiments analyze samples from many individuals. These samples are often snapshots of the progression of a certain disease such as cancer. This raises an intriguing question: Can we determine a temporal order for these samples? Such an ordering can lead to better understanding of the dynamics of the disease and to the identification of genes associated with its progression. In this paper we formally prove, for the first time, that under a model for the dynamics of the expression levels of a single gene, it is indeed possible to recover the correct ordering of the static expression datasets by solving an instance of the traveling salesman problem (TSP). In addition, we devise an algorithm that combines a TSP heuristic and probabilistic modeling for inferring the underlying temporal order of the microarray experiments. This algorithm constructs probabilistic continuous curves to represent expression profiles, leading to accurate temporal reconstruction for human data. Applying our method to cancer expression data, we show that the ordering derived agrees well with survival duration. A classifier that utilizes this ordering improves upon other classifiers suggested for this task. The set of genes displaying consistent behavior for the determined ordering is enriched for genes associated with cancer progression.
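
To make the TSP framing concrete, the sketch below orders samples by repeatedly stepping to the nearest unvisited expression profile. It is a generic nearest-neighbour heuristic, not the algorithm of the paper (which combines a TSP heuristic with probabilistic modelling). The Euclidean distance, the known starting sample, and the toy profiles are assumptions made for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def nearest_neighbor_order(samples, start=0):
    """Greedy path through all samples, always moving to the closest unvisited one."""
    unvisited = set(range(len(samples)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        last = samples[order[-1]]
        # Ties are broken by sample index so the result is deterministic.
        nxt = min(unvisited, key=lambda i: (distance(last, samples[i]), i))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def path_length(samples, order):
    """Total length of the path that visits samples in the given order."""
    return sum(distance(samples[a], samples[b]) for a, b in zip(order, order[1:]))

# Toy profiles (assumed): two genes drifting upward as the disease progresses,
# listed in shuffled order.
samples = [[0.0, 0.0], [3.0, 3.1], [1.0, 0.9], [2.0, 2.1], [4.0, 3.9]]
order = nearest_neighbor_order(samples, start=0)
```

A greedy path like this can be far from optimal on noisy data, so in practice it would usually be refined or replaced by a stronger TSP heuristic.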

5.
A genetic map is an ordering of genetic markers calculated from a population of known lineage. While traditionally a map has been generated from a single population for each species, recently researchers have created maps from multiple populations. In the face of these new data, we address the need to find a consensus map: a map that combines the information from multiple partial and possibly inconsistent input maps. We model each input map as a partial order and formulate the consensus problem as finding a median partial order. Finding the median of multiple total orders (preferences or rankings) is a well studied problem in social choice. We choose to find the median using the weighted symmetric difference distance, a more general version of both the symmetric difference distance and the Kemeny distance. Finding a median order using this distance is NP-hard. We show that for our chosen weight assignment, a median order satisfies the positive responsiveness, extended Condorcet, and unanimity criteria. Our solution involves finding the maximum acyclic subgraph of a weighted directed graph. We present a method that dynamically switches between an exact branch and bound algorithm and a heuristic algorithm, and show that for real data from closely related organisms, an exact median can often be found. We present experimental results using seven populations of the crop plant Zea mays.
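
The consensus idea can be sketched on a toy scale: count how many input maps place each marker before each other marker, then pick the ordering that agrees with the most of those pairwise votes. This is a deliberately simplified stand-in. It searches total orders exhaustively with unit weights, whereas the paper works with partial orders, a weighted symmetric difference distance, and a branch and bound / heuristic hybrid. The marker names and input maps are invented for illustration.

```python
from itertools import permutations

def precedence_weights(maps):
    """For each ordered pair (a, b), count the input maps that place a before b."""
    weights = {}
    for marker_order in maps:
        for i, a in enumerate(marker_order):
            for b in marker_order[i + 1:]:
                weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights

def consensus_order(maps):
    """Exhaustively find the total order that agrees with the most pairwise votes."""
    weights = precedence_weights(maps)
    markers = sorted({m for marker_order in maps for m in marker_order})

    def agreement(order):
        return sum(weights.get((a, b), 0)
                   for i, a in enumerate(order) for b in order[i + 1:])

    # Feasible only for a handful of markers: the search space grows factorially.
    return max(permutations(markers), key=agreement)

# Toy input maps (assumed): partial and mutually inconsistent marker orders.
maps = [["a", "b", "c", "d"], ["a", "c", "b", "d"], ["a", "b", "d"], ["b", "c", "d"]]
consensus = consensus_order(maps)
```

Keeping only the precedence edges that agree with the chosen order yields an acyclic subgraph of the vote graph, which is the connection to the maximum acyclic subgraph formulation mentioned in the abstract.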

6.
Protein structure prediction techniques proceed in two steps, namely the generation of many structural models for the protein of interest, followed by an evaluation of all these models to identify those that are native-like. In theory, the second step is easy, as native structures correspond to minima of their free energy surfaces. It is well known, however, that the situation is more complicated, as the current force fields used for molecular simulations fail to distinguish native states from misfolded structures. In an attempt to solve this problem, we follow an alternate approach and derive a new potential from geometric knowledge extracted from native and misfolded conformers of protein structures. This new potential, Metric Protein Potential (MPP), has two main features that are key to its success. Firstly, it is composite in that it includes local and nonlocal geometric information on proteins. At the short range level, it captures and quantifies the mapping between the sequences and structures of short (7-mer) fragments of protein backbones through the introduction of a new local energy term. The local energy term is then augmented with a nonlocal residue-based pairwise potential, and a solvent potential. Secondly, it is optimized to yield a maximized correlation between the energy of a structural model and its root mean square (RMS) to the native structure of the corresponding protein. We have shown that MPP yields high correlation values between RMS and energy and that it is able to retrieve the native structure of a protein from a set of high-resolution decoys. Proteins 2013. © 2012 Wiley Periodicals, Inc.

7.
8.

Background  

Time series gene expression data analysis is used widely to study the dynamics of various cell processes. Most of the time series data available today consist of only a few time points, which makes the application of standard clustering techniques difficult.

9.
The accurate extraction of species-abundance information from DNA-based data (metabarcoding, metagenomics) could contribute usefully to diet analysis and food-web reconstruction, the inference of species interactions, the modelling of population dynamics and species distributions, the biomonitoring of environmental state and change, and the inference of false positives and negatives. However, multiple sources of bias and noise in sampling and processing combine to inject error into DNA-based data sets. To understand how to extract abundance information, it is useful to distinguish two concepts. (i) Within-sample across-species quantification describes relative species abundances in one sample. (ii) Across-sample within-species quantification describes how the abundance of each individual species varies from sample to sample, such as over a time series, an environmental gradient or different experimental treatments. First, we review the literature on methods to recover across-species abundance information (by removing what we call "species pipeline biases") and within-species abundance information (by removing what we call "pipeline noise"). We argue that many ecological questions can be answered with just within-species quantification, and we therefore demonstrate how to use a "DNA spike-in" to correct for pipeline noise and recover within-species abundance information. We also introduce a model-based estimator that can be used on data sets without a physical spike-in to approximate and correct for pipeline noise.
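
The spike-in idea can be illustrated with a small calculation: if the same amount of spike-in DNA is added to every sample, dividing each species' read count by that sample's spike-in read count puts the samples on a common scale. The sketch below assumes exactly that setup. The function name, data layout, and numbers are hypothetical, and the paper's actual procedure may include further steps.

```python
def spike_in_correct(read_counts, spike_reads):
    """Scale each sample's species read counts by that sample's spike-in reads.

    read_counts: {sample: {species: reads}}
    spike_reads: {sample: reads assigned to the spike-in}
    """
    corrected = {}
    for sample, counts in read_counts.items():
        spike = spike_reads[sample]
        if spike <= 0:
            raise ValueError(f"sample {sample!r} has no spike-in reads")
        corrected[sample] = {species: n / spike for species, n in counts.items()}
    return corrected

# Toy counts (assumed): species X has identical raw reads in both samples,
# but sample B recovered only half as many spike-in reads.
read_counts = {"A": {"X": 200, "Y": 50}, "B": {"X": 200, "Y": 100}}
spike_reads = {"A": 100, "B": 50}
corrected = spike_in_correct(read_counts, spike_reads)
```

After correction, species X scores 2.0 in sample A and 4.0 in sample B, so the identical raw counts turn out to reflect different abundances relative to the spike-in. The corrected values are comparable across samples for one species, not across species within a sample.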

10.
Biological imaging continues to improve, capturing continually longer-term, richer, and more complex data, penetrating deeper into live tissue. How do we gain insight into the dynamic processes of disease and development from terabytes of multidimensional image data? Here I describe a collaborative approach to extracting meaning from biological imaging data. The collaboration consists of teams of biologists and engineers working together. Custom computational tools are built to best exploit application-specific knowledge in order to visualize and analyze large and complex data sets. The image data are summarized, extracting and modeling the features that capture the objects and relationships in the data. The summarization is validated, the results visualized, and errors corrected as needed. Finally, the customized analysis and visualization tools together with the image data and the summarization results are shared. This Perspective provides a brief guide to the mathematical ideas that rigorously quantify the notion of extracting meaning from biological images, and to the practical approaches that have been used to apply these ideas to a wide range of applications in cell and tissue optical imaging.

11.
12.
There is a growing realization that multi-way chromatin contacts formed in chromosome structures are fundamental units of gene regulation. However, due to the paucity and complexity of such contacts, it is challenging to detect and identify them using experiments. Assuming that chromosome structures can be mapped onto a Gaussian polymer network, we derive analytic expressions for n-body contact probabilities (n > 2) among chromatin loci based on pairwise genomic contact frequencies available in Hi-C, and show that multi-way contact probability maps can in principle be extracted from Hi-C. The three-body (triplet) contact probabilities, calculated from our theory, are in good correlation with those from measurements including Tri-C, MC-4C and SPRITE. Maps of multi-way chromatin contacts calculated from our analytic expressions can not only complement experimental measurements but also offer a better understanding of related issues, such as cell-line dependent assemblies of multiple genes and enhancers into chromatin hubs, competition between long-range and short-range multi-way contacts, and condensates of multiple CTCF anchors.

13.
The production of high-throughput gene expression data has generated a crucial need for bioinformatics tools to generate biologically interesting hypotheses. Whereas many tools are available for extracting global patterns, less attention has been focused on local pattern discovery. We propose here an original way to discover knowledge from gene expression data by means of the so-called formal concepts which hold in derived Boolean gene expression datasets. We first encoded the over-expression properties of genes in human cells using human SAGE data. This gave rise to a Boolean matrix from which we extracted the complete collection of formal concepts, i.e., all the largest sets of over-expressed genes associated with a largest set of biological situations in which their over-expression is observed. Complete collections of such patterns tend to be huge. Since their interpretation is a time-consuming task, we propose a new method to rapidly visualize clusters of formal concepts. This designates a reasonable number of Quasi-Synexpression-Groups (QSGs) for further analysis. The interest of our approach is illustrated using human SAGE data and interpreting one of the extracted QSGs. The assessment of its biological relevance leads to the formulation of both previously proposed and new biological hypotheses.
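
A formal concept, as used in this abstract, pairs a set of situations with the set of genes over-expressed in all of them, such that neither set can be enlarged. The sketch below enumerates the concepts of a tiny Boolean context by closing every subset of situations. It is a brute-force illustration under assumed names and toy data, exponential in the number of situations, and not the extraction method the authors used.

```python
from itertools import combinations

def common_genes(context, situations):
    """Genes over-expressed in every one of the given situations."""
    genes = set().union(*context.values())
    for s in situations:
        genes &= context[s]
    return genes

def supporting_situations(context, genes):
    """Situations in which all of the given genes are over-expressed."""
    return {s for s, expressed in context.items() if genes <= expressed}

def formal_concepts(context):
    """All (situations, genes) pairs that are closed in both directions."""
    concepts = set()
    names = sorted(context)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            genes = common_genes(context, subset)
            situations = supporting_situations(context, genes)
            concepts.add((frozenset(situations), frozenset(genes)))
    return concepts

# Toy Boolean context (assumed): situation -> set of over-expressed genes.
context = {
    "s1": {"g1", "g2"},
    "s2": {"g1", "g2", "g3"},
    "s3": {"g3"},
}
concepts = formal_concepts(context)
```

Even this three-situation context yields four concepts, which hints at why complete collections on real SAGE data become huge and need the kind of clustering and visualization the abstract proposes.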

14.
Extracting binary signals from microarray time-course data   Total citations: 1, self-citations: 0, citations by others: 1
This article presents a new method for analyzing microarray time courses by identifying genes that undergo abrupt transitions in expression level, and the time at which the transitions occur. The algorithm matches the sequence of expression levels for each gene against temporal patterns having one or two transitions between two expression levels. The algorithm reports a P-value for the matching pattern of each gene, and a global false discovery rate can also be computed. After matching, genes can be sorted by the direction and time of transitions. Genes can be partitioned into sets based on the direction and time of change for further analysis, such as comparison with Gene Ontology annotations or binding site motifs. The method is evaluated on simulated and actual time-course data. On microarray data for budding yeast, it is shown that the groups of genes that change in similar ways and at similar times have significant and relevant Gene Ontology annotations.
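
The pattern-matching step can be pictured with a least-squares fit of a single step function: try every possible transition time, fit one mean before and one after, and keep the split with the smallest squared error. This sketch covers only the one-transition case and reports no P-value or false discovery rate. The function name and toy time courses are assumptions, and the fitting criterion is a generic choice rather than the one in the article.

```python
def best_single_transition(levels):
    """Return (time_index, direction) of the best-fitting single step.

    time_index is the first time point of the second expression level.
    Requires at least two time points.
    """
    best = None
    for t in range(1, len(levels)):
        before, after = levels[:t], levels[t:]
        m1 = sum(before) / len(before)
        m2 = sum(after) / len(after)
        sse = (sum((v - m1) ** 2 for v in before)
               + sum((v - m2) ** 2 for v in after))
        if best is None or sse < best[0]:
            # A flat series (m1 == m2) is labelled "down" by this simple rule.
            best = (sse, t, "up" if m2 > m1 else "down")
    return best[1], best[2]

# Toy time courses (assumed): one gene switching on, one switching off.
rising = [0.1, 0.0, 0.2, 2.1, 1.9, 2.0]
falling = [3.0, 2.9, 0.1, 0.2]
rising_fit = best_single_transition(rising)
falling_fit = best_single_transition(falling)
```

Sorting genes by the returned pair gives the grouping by direction and time of change that the abstract describes as the starting point for annotation analysis.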

15.

Background  

A common observation in the analysis of gene expression data is that many genes display similarity in their expression patterns and therefore appear to be co-regulated. However, the variation associated with microarray data and the complexity of the experimental designs make the acquisition of co-expressed genes a challenge. We developed a novel method for Extracting microarray gene expression Patterns and Identifying co-expressed Genes, designated as EPIG. The approach utilizes the underlying structure of gene expression data to extract patterns and identify co-expressed genes that are responsive to experimental conditions.

16.

Background  

Much of the publicly accessible cancer microarray data is asymmetric, belonging to datasets containing no samples from normal tissue. Asymmetric data cannot be used in standard meta-analysis approaches (such as the inverse variance method) to obtain large sample sizes for statistical power enrichment. Noting that plenty of normal tissue microarray samples exist in studies not involving cancer, we investigated the viability and accuracy of an integrated microarray analysis approach based on significance analysis of microarrays (merged SAM) using a collection of data from separate diseased and normal samples.
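
For context, the inverse variance method mentioned in this abstract pools per-study effect estimates by weighting each with the reciprocal of its variance, which is why every study must supply its own effect estimate (and hence its own normal samples). The sketch below shows the standard fixed-effect form. The function name and toy numbers are illustrative assumptions, and it does not implement the merged SAM approach the abstract investigates.

```python
def inverse_variance_pool(effects, variances):
    """Fixed-effect pooled estimate and its variance.

    Each study is weighted by 1 / variance, so more precise studies count more.
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("need one variance per effect and at least one study")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    return pooled, 1.0 / total

# Toy studies (assumed): log fold changes with their variances.
equal = inverse_variance_pool([1.0, 2.0], [1.0, 1.0])
unequal = inverse_variance_pool([1.0, 3.0], [0.25, 1.0])
```

With equal variances the pooled value is the plain average; with unequal variances it moves toward the more precise study.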

17.
Spruill SE, Lu J, Hardy S, Weir B. BioTechniques 2002, 33(4): 916-20, 922-3
Experiments using microarrays abound in genomic research, yet one factor remains in question. Without replication, how much stock can we put into the findings of microarray experiments? In addition, there is a growing desire to integrate microarray data with other molecular databases. To accomplish this in a scientifically acceptable manner, we must be able to measure the validity and quality of microarray data. Otherwise, it would be the weakest link in any integration process. Validating and evaluating the quality of data requires the ability to determine the reproducibility of results. Data obtained from a microarray experiment designed as a feasibility test provided a unique opportunity to partition and quantify several sources of variation that are likely to be present in most microarray experiments. We use this opportunity to discuss the origins of variability observed in microarray experiments and provide some suggestions for how to minimize or avoid them when designing an experiment.

18.
Chen Y, Wang W, Zhou Y, Shields R, Chanda SK, Elston RC, Li J. PloS one 2011, 6(6): e21137
Identifying disease genes is crucial to the understanding of disease pathogenesis, and to the improvement of disease diagnosis and treatment. In recent years, many researchers have proposed approaches to prioritize candidate genes by considering the relationship of candidate genes and existing known disease genes, reflected in other data sources. In this paper, we propose an expandable framework for gene prioritization that can integrate multiple heterogeneous data sources by taking advantage of a unified graphic representation. Gene-gene relationships and gene-disease relationships are then defined based on the overall topology of each network using a diffusion kernel measure. These relationship measures are in turn normalized to derive an overall measure across all networks, which is utilized to rank all candidate genes. Based on the informativeness of available data sources with respect to each specific disease, we also propose an adaptive threshold score to select a small subset of candidate genes for further validation studies. We performed large scale cross-validation analysis on 110 disease families using three data sources. Results have shown that our approach consistently outperforms two other state-of-the-art programs. A case study using Parkinson disease (PD) has identified four candidate genes (UBB, SEPT5, GPR37 and TH) that ranked higher than our adaptive threshold, all of which are involved in the PD pathway. In particular, a very recent study has observed a deletion of TH in a patient with PD, which supports the importance of the TH gene in PD pathogenesis. A web tool has been implemented to assist scientists in their genetic studies.

19.
Current analyses of co-expressed genes are often based on global approaches such as clustering or bi-clustering. An alternative is to employ local methods and search for patterns, that is, sets of genes displaying specific expression properties in a set of situations. The main bottleneck of this type of analysis is twofold: computational costs and an overwhelming number of candidate patterns which can hardly be further exploited. A timely application of background knowledge available in literature databases, biological ontologies and other sources can help to focus on the most plausible patterns only. The paper proposes, implements and tests a flexible constraint-based framework that enables the effective mining and representation of meaningful over-expression patterns representing intrinsic associations among genes and biological situations. The framework can be simultaneously applied to a wide spectrum of genomic data, and we demonstrate that it allows the generation of new biological hypotheses with clinical implications.

20.
Field monitoring can vary from simple volunteer opportunistic observations to professional standardised monitoring surveys, leading to a trade-off between data quality and data collection costs. Such variability in data quality may result in biased predictions obtained from species distribution models (SDMs). We aimed to identify the limitations of different monitoring data sources for developing species distribution maps and to evaluate their potential for spatial data integration in a conservation context. Using Maxent, SDMs were generated from three different bird data sources in Catalonia, which differ in the degree of standardisation and available sample size. In addition, an alternative approach for modelling species distributions was applied, which combined the three data sources at a large spatial scale and then downscaled to the required resolution. Finally, SDM predictions were used to identify species richness and high quality areas (hotspots) from different treatments. Models were evaluated by using high quality Atlas information. We show that both sample size and survey methodology used to collect the data are important in delivering robust information on species distributions. Models based on standardized monitoring provided higher accuracy with a lower sample size, especially when modelling common species. Accuracy of models from opportunistic observations substantially increased when modelling uncommon species, giving similar accuracy to a more standardized survey. Although downscaling data through an SDM approach appears to be a useful tool in cases of data shortage or low data quality and heterogeneity, it will tend to overestimate species distributions. In order to identify distributions of species, data with different quality may be appropriate. However, to identify biodiversity hotspots, high quality information is needed.
