Similar literature
1.
2.
3.

Background

The quality of gene expression data can vary dramatically from platform to platform, study to study, and sample to sample. As reliable statistical analysis rests on reliable data, determining such quality is of the utmost importance. Quality measures to spot problematic samples exist, but they are platform-specific, and cannot be used to compare studies.

Results

As a proxy for quality, we propose a signal-to-noise ratio for microarray data, the “Signal-to-Noise Applied to Gene Expression Experiments”, or SNAGEE. SNAGEE is based on the consistency of gene-gene correlations. We applied SNAGEE to a compendium of 80 large datasets on 37 platforms, for a total of 24,380 samples, and assessed the signal-to-noise ratio of studies and samples. This allowed us to discover serious issues with three studies. We show that signal-to-noise ratios of both studies and samples are linked to the statistical significance of the biological results.

Conclusions

We showed that SNAGEE is an effective way to measure data quality for most types of gene expression studies, and that it often outperforms existing techniques. Furthermore, SNAGEE is platform-independent and does not require raw data files. The SNAGEE R package is available in BioConductor.
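
The idea of scoring quality by "the consistency of gene-gene correlations" can be illustrated with a small sketch. This is not the SNAGEE R package or its algorithm; it is one plausible reading, under the assumption that a study's gene-gene correlations are compared against those of a trusted reference dataset. The function names (`pearson`, `gene_gene_correlations`, `consistency_score`) and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def gene_gene_correlations(expr):
    """expr maps gene -> expression values across samples.
    Returns the correlation of every gene pair, in a fixed (sorted) order."""
    genes = sorted(expr)
    return [pearson(expr[g], expr[h]) for g, h in combinations(genes, 2)]


def consistency_score(study, reference):
    """Hypothetical quality proxy: agreement between the study's gene-gene
    correlations and the reference's, computed over the genes they share."""
    shared = set(study) & set(reference)
    s = gene_gene_correlations({g: study[g] for g in shared})
    r = gene_gene_correlations({g: reference[g] for g in shared})
    return pearson(s, r)


# Toy data (invented for illustration): four genes, five samples each.
reference = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],
    "C": [5, 4, 3, 2, 1],
    "D": [1, 3, 2, 5, 4],
}
# Same genes, but B and C are scrambled, so their correlations no longer match.
scrambled = {
    "A": [1, 2, 3, 4, 5],
    "B": [8, 2, 11, 4, 6],
    "C": [2, 5, 1, 4, 3],
    "D": [1, 3, 2, 5, 4],
}

clean_score = consistency_score(reference, reference)
noisy_score = consistency_score(scrambled, reference)
```

Because the score only needs processed expression values and gene identifiers, a comparison of this kind does not depend on the platform or on raw data files, which matches the properties the abstract claims for SNAGEE.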

4.
5.
6.
MOTIVATION: The increasing availability of gene expression microarray technology has resulted in the publication of thousands of microarray gene expression datasets investigating various biological conditions. This vast repository is still underutilized due to the lack of methods for fast, accurate exploration of the entire compendium. RESULTS: We have collected Saccharomyces cerevisiae gene expression microarray data containing roughly 2400 experimental conditions. We analyzed the functional coverage of this collection and we designed a context-sensitive search algorithm for rapid exploration of the compendium. A researcher using our system provides a small set of query genes to establish a biological search context; based on this query, we weight each dataset's relevance to the context, and within these weighted datasets we identify additional genes that are co-expressed with the query set. Our method exhibits an average increase in accuracy of 273% compared to previous mega-clustering approaches when recapitulating known biology. Further, we find that our search paradigm identifies novel biological predictions that can be verified through further experimentation. Our methodology provides the ability for biological researchers to explore the totality of existing microarray data in a manner useful for drawing conclusions and formulating hypotheses, which we believe is invaluable for the research community. AVAILABILITY: Our query-driven search engine, called SPELL, is available at http://function.princeton.edu/SPELL. SUPPLEMENTARY INFORMATION: Several additional data files, figures and discussions are available at http://function.princeton.edu/SPELL/supplement.
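
The two-step search described above (weight each dataset by its relevance to the query, then rank genes by weighted co-expression with the query) can be sketched as follows. This is a simplified illustration, not the SPELL implementation: the weighting rule (mean pairwise correlation among query genes, floored at zero), the function names, and the toy data are all assumptions. The sketch also assumes at least two query genes, each present in every dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def dataset_weight(dataset, query):
    """Assumed relevance rule: mean pairwise correlation among the query
    genes in this dataset, floored at 0 so incoherent datasets are ignored."""
    pairs = list(combinations(sorted(query), 2))
    mean_r = sum(pearson(dataset[g], dataset[h]) for g, h in pairs) / len(pairs)
    return max(mean_r, 0.0)


def rank_genes(datasets, query):
    """Rank non-query genes by weighted mean correlation with the query set."""
    weights = [dataset_weight(d, query) for d in datasets]
    total = sum(weights)
    candidates = set().union(*datasets) - set(query)
    scores = {}
    for gene in candidates:
        score = 0.0
        for d, w in zip(datasets, weights):
            if w == 0 or gene not in d:
                continue
            mean_r = sum(pearson(d[gene], d[q]) for q in query) / len(query)
            score += w * mean_r
        scores[gene] = score / total if total else 0.0
    return sorted(scores, key=scores.get, reverse=True)


# Toy compendium (invented): in d1 the query genes move together, in d2 they do not.
d1 = {"q1": [1, 2, 3, 4], "q2": [2, 4, 6, 8], "X": [1, 2, 3, 5], "Y": [4, 1, 3, 2]}
d2 = {"q1": [1, 2, 3, 4], "q2": [4, 3, 2, 1], "X": [2, 1, 2, 1], "Y": [1, 2, 3, 4]}
query = ["q1", "q2"]

ranking = rank_genes([d1, d2], query)
```

In this toy example only `d1` receives a non-zero weight, so gene `X`, which tracks the query genes there, ranks ahead of `Y` even though `Y` follows `q1` perfectly in the down-weighted `d2`.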

7.
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

8.
9.
10.
11.
MOTIVATION: Protein-protein interactions have proved to be a valuable starting point for understanding the inner workings of the cell. Computational methodologies have been built which both predict interactions and use interaction datasets in order to predict other protein features. Such methods require gold standard positive (GSP) and negative (GSN) interaction sets. Here we examine and demonstrate the usefulness of homologous interactions in predicting good quality positive and negative interaction datasets. RESULTS: We generate GSP interaction sets as subsets from experimental data using only interaction and sequence information. We can therefore produce sets for several species (many of which at present have no identified GSPs). Comprehensive error rate testing demonstrates the power of the method. We also show how the use of our datasets significantly improves the predictive power of algorithms for interaction prediction and function prediction. Furthermore, we generate GSN interaction sets for yeast and examine the use of homology along with other protein properties such as localization, expression and function. Using a novel method to assess the accuracy of a negative interaction set, we find that the best single selector for negative interactions is a lack of co-function. However, an integrated method using all the characteristics shows significant improvement over any current method for identifying GSN interactions. The nature of homologous interactions is also examined and we demonstrate that interologs are found more commonly within species than across species. CONCLUSION: GSP sets built using our homologous verification method are demonstrably better than standard sets in terms of predictive ability. We can build such GSP sets for several species. When generating GSNs we show a combination of protein features and lack of homologous interactions gives the highest quality interaction sets. 
AVAILABILITY: GSP and GSN datasets for all the studied species can be downloaded from http://www.stats.ox.ac.uk/~deane/HPIV.

12.
13.
14.
15.
The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. Here we propose a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on 2 different datasets and showed that the use of a hidden Markov model/Gaussian mixture models hybrid explores the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction.

16.
Complex multi-dimensional datasets are now pervasive in science and elsewhere in society. Better interactive tools are needed for visual data exploration so that patterns in such data may be easily discovered, data can be proofread, and subsets of data can be chosen for algorithmic analysis. In particular, synthetic research such as ecological interaction research demands effective ways to examine multiple datasets. This paper describes our integration of hundreds of food-web datasets into a common platform, and the visualization software, EcoLens, we developed for exploring this information. This publicly-available application and integrated dataset have been useful for our research predicting large complex food webs, and EcoLens is favorably reviewed by other researchers. Many habitats are not well represented in our large database. We confirm earlier results about the small size and lack of taxonomic resolution in early food webs but find that they and a non-food-web source provide trophic information about a large number of taxa absent from more modern studies. Corroboration of Tuesday Lake trophic links across studies is usually possible, but lack of links among congeners may have several explanations. While EcoLens does not provide all kinds of analytical support, its label- and item-based approach is effective at addressing concerns about the comparability and taxonomic resolution of food-web data.

17.
Population genomics is a useful tool to support integrated pest management as it can elucidate population dynamics, demography, and histories of invasion. Here, we use a restriction site‐associated DNA sequencing approach combined with whole‐genome amplification (WGA) to assess genomic population structure of a newly described pest of canola, the diminutive canola flower midge, Contarinia brassicola. Clustering analyses recovered little geographic structure across the main canola production region but differentiated several geographically disparate populations at edges of the agricultural zone. Given a lack of alternative hypotheses for this pattern, we suggest these data support alternative hosts for this species and thus our canola‐centric view of this midge as a pest has limited our understanding of its biology. These results speak to the need for increased surveying efforts across multiple habitats and other potential hosts within Brassicaceae to improve both our ecological and evolutionary knowledge of this species and contribute to effective management strategies. We additionally found that use of WGA prior to library preparation was an effective method for increasing DNA quantity of these small insects prior to restriction site‐associated DNA sequencing and had no discernible impact on genotyping consistency for population genetic analysis; WGA is therefore likely to be tractable for other similar studies that seek to randomly sample markers across the genome in small organisms.

18.
19.
20.
The concept of “housekeeping gene” has been used for four decades but remains loosely defined. Housekeeping genes are commonly described as “essential for cellular existence regardless of their specific function in the tissue or organism”, and “stably expressed irrespective of tissue type, developmental stage, cell cycle state, or external signal”. However, experimental support for the tenet that gene essentiality is linked to stable expression across cell types, conditions, and organisms has been limited. Here we use genome-scale functional genomic screens together with bulk and single-cell sequencing technologies to test this link and optimize a quantitative and experimentally validated definition of housekeeping gene. Using the optimized definition, we identify, characterize, and provide as resources, housekeeping gene lists extracted from several human datasets, and 10 other animal species that include primates, chicken, and C. elegans. We find that stably expressed genes are not necessarily essential, and that the individual genes that are essential and stably expressed can considerably differ across organisms; yet the pathways enriched among these genes are conserved. Further, the level of conservation of housekeeping genes across the analyzed organisms captures their taxonomic groups, showing evolutionary relevance for our definition. Therefore, we present a quantitative and experimentally supported definition of housekeeping genes that can contribute to better understanding of their unique biological and evolutionary characteristics.
