Similar Articles
20 similar articles found.
1.
lumi: a pipeline for processing Illumina microarray   (cited 2 times: 0 self-citations, 2 by others)
The Illumina microarray is becoming a popular microarray platform. Illumina's BeadArray technology makes its preprocessing and quality control different from those of other microarray technologies. Unfortunately, most analyses have not taken advantage of the unique properties of the BeadArray system and have simply incorporated preprocessing methods originally designed for Affymetrix microarrays. lumi is a Bioconductor package designed specifically to process Illumina microarray data. It includes data input, quality control, variance stabilization, normalization and gene annotation components. In particular, the lumi package includes a variance-stabilizing transformation (VST) algorithm that takes advantage of the technical replicates available on every Illumina microarray. The package provides several normalization options and multiple quality control plots. To better annotate Illumina data, a vendor-independent nucleotide universal identifier (nuID) was devised to identify the probes of the Illumina microarray. The nuID annotation packages and the output of lumi-processed results can be easily integrated with other Bioconductor packages to construct a statistical data analysis pipeline for Illumina data. Availability: the lumi Bioconductor package, www.bioconductor.org

2.
MOTIVATION: Authors of several recent papers have independently introduced a family of transformations (the generalized-log family) which stabilizes the variance of microarray data to first order. However, for data from two-color arrays, tests for differential expression may require that the variance of the difference of transformed observations be constant, rather than that of the transformed observations themselves. RESULTS: We introduce a transformation within the generalized-log family which stabilizes, to first order, the variance of the difference of transformed observations. We also introduce transformations from the 'started-log' and log-linear-hybrid families which provide good approximate variance stabilization of differences. Examples using control-control data show that any of these transformations may provide sufficient variance stabilization for practical applications, and all perform well compared with log ratios.
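The generalized-log family mentioned above has a simple closed form. The sketch below is illustrative Python, not the authors' code; the calibration constant `c` would in practice be estimated from the data (it reflects the background noise level), and the value used here is arbitrary.

```python
import math

def glog(x, c=64.0):
    """Generalized-log transform: behaves like log2(x) for large x,
    but stays finite and variance-stabilizing near background.
    c is a calibration constant normally estimated from the data;
    the default here is purely illustrative."""
    return math.log2((x + math.sqrt(x * x + c * c)) / 2.0)

# Far above background, glog approaches the ordinary log2:
print(glog(10000.0))           # close to log2(10000)
# Near zero it remains finite instead of diverging like log2:
print(glog(0.0))               # log2(c / 2) = log2(32) = 5.0
```

Plain log ratios inflate the variance of low-intensity spots; because `glog` flattens out near zero, differences of `glog`-transformed values keep a roughly constant variance across the intensity range, which is the property the paper's transformations are designed to provide.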

3.
Conventional statistical methods for interpreting microarray data require large numbers of replicates in order to provide sufficient levels of sensitivity. We recently described a method for identifying differentially-expressed genes in one-channel microarray data [1]. Based on the idea that the variance structure of microarray data can itself be a reliable measure of noise, this method allows statistically sound interpretation of as few as two replicates per treatment condition. Unlike the one-channel array, the two-channel platform simultaneously compares gene expression in two RNA samples. This leads to covariation of the measured signals. Hence, by accounting for covariation in the variance model, we can significantly increase the power of the statistical test. We believe that this approach has the potential to overcome limitations of existing methods. We present here a novel approach for the analysis of microarray data that involves modeling the variance structure of paired expression data in the context of a Bayesian framework. We also describe a novel statistical test that can be used to identify differentially-expressed genes. This method, bivariate microarray analysis (BMA), demonstrates dramatically improved sensitivity over existing approaches. We show that with only two array replicates, it is possible to detect gene expression changes that are at best detected with six array replicates by other methods. Further, we show that combining results from BMA with Gene Ontology annotation yields biologically significant results in a ligand-treated macrophage cell system.

4.
MOTIVATION: Many standard statistical techniques are effective on data that are normally distributed with constant variance. Microarray data typically violate these assumptions since they come from non-Gaussian distributions with a non-trivial mean-variance relationship. Several methods have been proposed that transform microarray data to stabilize variance and draw its distribution towards the Gaussian. Some methods, such as log or generalized log, rely on an underlying model for the data. Others, such as the spread-versus-level plot, do not. We propose an alternative data-driven multiscale approach, called the Data-Driven Haar-Fisz for microarrays (DDHFm) with replicates. DDHFm has the advantage of being 'distribution-free' in the sense that no parametric model for the underlying microarray data is required to be specified or estimated; hence, DDHFm can be applied very generally, not just to microarray data. RESULTS: DDHFm achieves very good variance stabilization of microarray data with replicates and produces transformed intensities that are approximately normally distributed. Simulation studies show that it performs better than other existing methods. Application of DDHFm to real one-color cDNA data validates these results. AVAILABILITY: The R package of the Data-Driven Haar-Fisz transform (DDHFm) for microarrays is available in Bioconductor and CRAN.

5.
Analysis of variance components in gene expression data   (cited 5 times: 0 self-citations, 5 by others)
MOTIVATION: A microarray experiment is a multi-step process, and each step is a potential source of variation. There are two major sources of variation: biological variation and technical variation. This study presents a variance-components approach to investigating animal-to-animal, between-array, within-array and day-to-day variation for two data sets. The first data set involved estimation of technical variances for pooled control and pooled treated RNA samples. The variance components included the between-array variance and two nested within-array variances: between-section (the upper and lower sections of the array are replicates) and within-section (two adjacent spots of the same gene are printed within each section). The second experiment was conducted over four different weeks. Each week there were reference and test samples with a dye-flip replicate across two hybridization days. The variance components included week-to-week, animal-to-animal, between-array and within-array variances. RESULTS: We applied the linear mixed-effects model to quantify the different sources of variation. In the first data set, we found that the between-array variance is greater than the between-section variance, which, in turn, is greater than the within-section variance. In the second data set, for the reference samples, the week-to-week variance is larger than the between-array variance, which, in turn, is slightly larger than the within-array variance. For the test samples, the week-to-week variance shows the largest variation. The animal-to-animal variance is slightly larger than the between-array and within-array variances. However, in a gene-by-gene analysis, the animal-to-animal variance is smaller than the between-array variance in four out of five housekeeping genes. In summary, the largest variation observed is the week-to-week effect. Another important source of variability is the animal-to-animal variation. Finally, we describe the use of variance-component estimates to determine optimal numbers of animals, arrays per animal and sections per array in planning microarray experiments.
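The nested variance decomposition described above can be illustrated with the simplest balanced case. The sketch below is a method-of-moments estimate for a one-way random-effects layout (e.g. within-array vs. between-array variation); the study itself fits a richer linear mixed-effects model with more nesting levels, which this does not reproduce.

```python
def variance_components(groups):
    """Method-of-moments variance components for a balanced one-way
    random-effects design: returns (between-group, within-group)
    variance estimates. Groups could be arrays, each containing
    replicate spot measurements of one gene. A simplified sketch of
    the idea, not the paper's full mixed-effects model."""
    k = len(groups)                      # number of groups (e.g. arrays)
    m = len(groups[0])                   # replicates per group (balanced)
    group_means = [sum(g) / m for g in groups]
    grand_mean = sum(group_means) / k
    # Mean square within: pooled variance of replicates around group means.
    ms_within = sum((x - gm) ** 2
                    for g, gm in zip(groups, group_means)
                    for x in g) / (k * (m - 1))
    # Mean square between: spread of group means, scaled by group size.
    ms_between = m * sum((gm - grand_mean) ** 2
                         for gm in group_means) / (k - 1)
    sigma2_within = ms_within
    sigma2_between = max(0.0, (ms_between - ms_within) / m)
    return sigma2_between, sigma2_within

# Three "arrays" with two replicate spots each (hypothetical numbers):
vb, vw = variance_components([[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]])
print(vb, vw)   # between-array component dominates the within-array one
```

When the between-group component dominates, adding more replicates within a group helps little; this is exactly the reasoning behind using such estimates to choose numbers of animals, arrays and sections.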

6.
MOTIVATION: Standard statistical techniques often assume that data are normally distributed, with constant variance not depending on the mean of the data. Data that violate these assumptions can often be brought in line with the assumptions by application of a transformation. Gene-expression microarray data have a complicated error structure, with a variance that changes with the mean in a non-linear fashion. Log transformations, which are often applied to microarray data, can inflate the variance of observations near background. RESULTS: We introduce a transformation that stabilizes the variance of microarray data across the full range of expression. Simulation studies also suggest that this transformation approximately symmetrizes microarray data.

7.
The Illumina Infinium HumanMethylation27 BeadChip (Illumina 27k) microarray is a high-throughput platform capable of interrogating the human DNA methylome. In a search for autosomal sex-specific DNA methylation using this microarray, we discovered autosomal CpG loci showing significant methylation differences between the sexes. However, we found that the majority of these probes cross-reacted with sequences from sex chromosomes. Moreover, we determined that 6-10% of the microarray probes are non-specific and map to highly homologous genomic sequences. Using probes targeting different CpGs that are exact duplicates of each other, we investigated the precision of these repeat measurements and concluded that the overall precision of this microarray is excellent. In addition, we identified a small number of probes targeting CpGs that include single-nucleotide polymorphisms. Overall, our findings address several technical issues associated with the Illumina 27k microarray that, once considered, will enhance the analysis and interpretation of data generated from this platform.

8.
We introduce a novel experimental methodology for the reverse-phase protein microarray platform which reduces the typical measurement CV by as much as 70%. The methodology, referred to as array microenvironment normalization, increases the statistical power of the platform. In the experiment, it enabled the detection of a 1.1-fold shift in prostate specific antigen concentration using approximately six technical replicates rather than the 37 replicates previously required. The improved reproducibility and statistical power should facilitate clinical implementation of the platform.

9.
Transformation and normalization of oligonucleotide microarray data   (cited 3 times: 0 self-citations, 3 by others)
MOTIVATION: Most methods of analyzing microarray data or doing power calculations have an underlying assumption of constant variance across all levels of gene expression. The most common transformation, the logarithm, results in data that have constant variance at high levels but not at low levels. Rocke and Durbin showed that data from spotted arrays fit a two-component model, and Durbin, Hardin, Hawkins and Rocke, Huber et al. and Munson provided a transformation that stabilizes the variance as well as symmetrizes and normalizes the error structure. We wish to evaluate the applicability of this transformation to the error structure of GeneChip microarrays. RESULTS: We demonstrate in an example study a simple way to use the two-component model of Rocke and Durbin and the data transformation of Durbin, Hardin, Hawkins and Rocke, Huber et al. and Munson on Affymetrix GeneChip data. In addition we provide a method for normalization of Affymetrix GeneChips simultaneous with the determination of the transformation, producing a data set without chip or slide effects but with constant variance and with symmetric errors. This transformation/normalization process can be thought of as a machine calibration in that it requires a few biologically constant replicates of one sample to determine the constant needed to specify the transformation and normalize. It is hypothesized that this constant needs to be found only once for a given technology in a lab, perhaps with periodic updates. It does not require extensive replication in each study. Furthermore, the variance of the transformed pilot data can be used to do power calculations using standard power analysis programs. AVAILABILITY: S-PLUS code for the transformation/normalization for four replicates is available from the first author upon request. A program written in C is available from the last author.

10.
INTRODUCTION: Microarray experiments often have complex designs that include sample pooling, biological and technical replication, sample pairing and dye-swapping. This article demonstrates how statistical modelling can illuminate issues in the design and analysis of microarray experiments, and this information can then be used to plan effective studies. METHODS: A very detailed statistical model for microarray data is introduced, to show the possible sources of variation that are present in even the simplest microarray experiments. Based on this model, the efficacy of common experimental designs, normalisation methodologies and analyses is determined. RESULTS: When the cost of the arrays is high compared with the cost of samples, sample pooling and spot replication are shown to be efficient variance reduction methods, whereas technical replication of whole arrays is demonstrated to be very inefficient. Dye-swap designs can use biological replicates rather than technical replicates to improve efficiency and simplify analysis. When the cost of samples is high and technical variation is a major portion of the error, technical replication can be cost effective. Normalisation by centring on a small number of spots may reduce array effects, but can introduce considerable variation in the results. Centring using the bulk of spots on the array is less variable. Similarly, normalisation methods based on regression can introduce variability. Except for normalisation methods based on spiking controls, all normalisation requires that most genes do not differentially express. Methods based on spatial location and/or intensity also require that the non-differentially expressing genes are distributed at random with respect to location and intensity. Spotting designs should be laid out carefully so that spot replicates are widely spaced on the array, and genes with similar expression patterns are not clustered. DISCUSSION: The tools for statistical design of experiments can be applied to microarray experiments to improve both efficiency and validity of the studies. Given the high cost of microarray experiments, the benefits of statistical input prior to running the experiment cannot be over-emphasised.

11.
MOTIVATION: Microarray data are susceptible to a wide range of artifacts, many of which occur on physical scales comparable to the spatial dimensions of the array. These artifacts introduce biases that are spatially correlated. The ability of current methodologies to detect and correct such biases is limited. RESULTS: We introduce a new approach for analyzing spatial artifacts, termed 'conditional residual analysis for microarrays' (CRAM). CRAM requires a microarray design that contains technical replicates of representative features and a limited number of negative controls, but is free of the assumptions that constrain existing analytical procedures. The key idea is to extract residuals from sets of matched replicates to generate residual images. The residual images reveal spatial artifacts with single-feature resolution. Surprisingly, spatial artifacts were found to coexist independently as additive and multiplicative errors. Efficient procedures for bias estimation were devised to correct the spatial artifacts on both intensity scales. In a survey of 484 published single-channel datasets, variance fell 4- to 12-fold in 5% of the datasets after bias correction. Thus, inclusion of technical replicates in a microarray design affords benefits far beyond what one might expect with a conventional 'n = 5' averaging, and should be considered when designing any microarray for which randomization is feasible. AVAILABILITY: CRAM is implemented as version 2 of the hoptag software package for R, which is included in the Supplementary information.

12.
Quality control of a microarray experiment has become an important issue for both research and regulation. External RNA controls (ERCs), which can be either added at the total RNA level (tERCs) or introduced right before hybridization (cERCs), are designed and recommended by commercial microarray platforms for assessment of the performance of a microarray experiment. However, the utility of ERCs has not been fully realized, mainly due to the lack of sufficient data resources. The US Food and Drug Administration (FDA)-led community-wide Microarray Quality Control (MAQC) study generated a large amount of microarray data with implementation of ERCs across several commercial microarray platforms. The utility of ERCs in quality control by assessing the ERCs' concentration-response behavior was investigated in the MAQC study. In this work, an ERC-based correlation analysis was conducted to assess the quality of a microarray experiment. We found that the pairwise correlations of tERCs are sample independent, indicating that the array data obtained from different biological samples can be treated as technical replicates in analysis of tERCs. Consequently, the commonly used quality control method of applying correlation analysis to technical replicates can be adopted for assessing array performance based on different biological samples using tERCs. The proposed approach is sensitive in identifying outlying assays and does not depend on the choice of normalization method.
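The correlation analysis at the heart of this QC approach is ordinary pairwise Pearson correlation over the control-probe intensities. The sketch below is illustrative Python, not the MAQC study's code, and the intensity values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two vectors of control-probe
    intensities. Because tERC signals are sample independent, two
    assays should correlate highly on their tERCs even when they
    measure different biological samples; a low correlation flags
    an outlying assay. Purely a sketch of the idea."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two assays whose (hypothetical) tERC intensities track each other:
r = pearson([120.0, 240.0, 480.0, 960.0], [118.0, 250.0, 470.0, 940.0])
print(r)   # close to 1 -> assay passes this QC check
```

An assay whose tERC correlations against all other assays fall well below the rest would be the "outlying assay" the abstract refers to.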

13.
The proper identification of differentially methylated CpGs is central in most epigenetic studies. The Illumina HumanMethylation450 BeadChip is widely used to quantify DNA methylation; nevertheless, the design of an appropriate analysis pipeline faces severe challenges due to the convolution of biological and technical variability and the presence of a signal bias between Infinium I and II probe design types. Despite recent attempts to investigate how to analyze DNA methylation data with such an array design, it has not been possible to perform a comprehensive comparison between different bioinformatics pipelines due to the lack of appropriate data sets having both large sample size and a sufficient number of technical replicates. Here we perform such a comparative analysis, targeting the problems of reducing the technical variability, eliminating the probe design bias and reducing the batch effect by exploiting two unpublished data sets, which included technical replicates and were profiled for DNA methylation on peripheral blood, monocytes or muscle biopsies. We evaluated the performance of different analysis pipelines and demonstrated that: (1) it is critical to correct for the probe design type, since the amplitude of the measured methylation change depends on the underlying chemistry; (2) the effect of different normalization schemes is mixed, and the most effective methods in our hands were quantile normalization and Beta Mixture Quantile dilation (BMIQ); (3) it is beneficial to correct for batch effects. In conclusion, our comparative analysis using a comprehensive data set suggests an efficient pipeline for proper identification of differentially methylated CpGs using the Illumina 450K arrays.

14.
We propose an extension to quantile normalization that removes unwanted technical variation using control probes. We adapt our algorithm, functional normalization, to the Illumina 450k methylation array and address the open problem of normalizing methylation data with global epigenetic changes, such as human cancers. Using data sets from The Cancer Genome Atlas and a large case–control study, we show that our algorithm outperforms all existing normalization methods with respect to replication of results between experiments, and yields robust results even in the presence of batch effects. Functional normalization can be applied to any microarray platform, provided suitable control probes are available.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-014-0503-2) contains supplementary material, which is available to authorized users.
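Functional normalization extends quantile normalization, so the classical quantile step is the natural reference point. The sketch below shows only that core step in plain Python; the control-probe regression that distinguishes functional normalization is not reproduced here.

```python
def quantile_normalize(columns):
    """Classical quantile normalization: force every array (column)
    to share the same empirical distribution. Functional normalization
    goes further by removing only the variation explained by control
    probes; this sketch covers just the classical core."""
    n = len(columns[0])
    # The mean of the k-th smallest values across arrays becomes the
    # k-th value of the shared reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(vals) / len(columns) for vals in zip(*sorted_cols)]
    # Map each value back through its within-array rank.
    result = []
    for col in columns:
        ranks = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(ranks):
            out[idx] = reference[rank]
        result.append(out)
    return result

na, nb = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(na)   # [5.5, 1.5, 3.5]
print(nb)   # [3.5, 1.5, 5.5] -- same values, original rank order kept
```

The limitation the paper addresses is visible here: forcing identical distributions can erase genuine global shifts, such as genome-wide hypomethylation in cancer, which is why restricting the correction to control-probe variation matters.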

15.
Optimal experimental design is important for the efficient use of modern high-throughput technologies such as microarrays and proteomics. Multiple factors, including the reliability of the measurement system, which itself must be estimated from prior experimental work, can influence design decisions. In this study, we describe how the optimal number of replicate measures (technical replicates) for each biological sample (biological replicate) can be determined. Different allocations of biological and technical replicates were evaluated by minimizing the variance of the ratio of technical variance (measurement error) to the total variance (sum of sampling error and measurement error). We demonstrate that if the number of biological replicates and the number of technical replicates per biological sample are variable, while the total number of available measures is fixed, then the optimal allocation of replicates for measurement evaluation experiments requires two technical replicates for each biological replicate. Therefore, it is recommended to use two technical replicates for each biological replicate if the goal is to evaluate the reproducibility of measurements.
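The trade-off behind such allocation decisions comes from a simple variance decomposition. The sketch below illustrates that decomposition for estimating a treatment mean; note that the paper's own criterion (minimizing the variance of the technical-to-total variance ratio) is a different objective, and it is that criterion which yields the two-technical-replicates recommendation quoted above. All numbers here are hypothetical.

```python
def var_of_grand_mean(sigma2_bio, sigma2_tech, n_bio, m_tech):
    """Variance of the grand mean when n_bio biological replicates
    are each measured m_tech times. Biological variance is divided
    only by the number of biological replicates; technical variance
    is divided by the total number of measurements."""
    return sigma2_bio / n_bio + sigma2_tech / (n_bio * m_tech)

# With 12 total measurements and biological variance dominating,
# spending them on biological rather than technical replicates wins:
print(var_of_grand_mean(4.0, 1.0, 12, 1))   # 12 animals, measured once each
print(var_of_grand_mean(4.0, 1.0, 6, 2))    # 6 animals, measured twice each
```

This shows why the answer depends on the goal: for estimating means, biological replication is usually preferable, whereas for evaluating measurement reproducibility the paper shows that pairs of technical replicates are optimal.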

16.
MOTIVATION: Due to advances in experimental technologies, such as microarray, mass spectrometry and nuclear magnetic resonance, it is feasible to obtain large-scale data sets in which measurements for a large number of features can be simultaneously collected. However, the sample sizes of these data sets are usually small due to their relatively high costs, which leads to the issue of concordance among different data sets collected for the same study: features should have consistent behavior in different data sets. There is a lack of rigorous statistical methods for evaluating this concordance or discordance. METHODS: Based on a three-component normal-mixture model, we propose two likelihood ratio tests for evaluating the concordance and discordance between two large-scale data sets with two sample groups. The parameter estimation is achieved through the expectation-maximization (EM) algorithm. A normal-distribution-quantile-based method is used for data transformation. RESULTS: To evaluate the proposed tests, we conducted simulation studies, which suggested satisfactory performance. As applications, the proposed tests were applied to three SELDI-MS data sets with replicates. One data set has replicates from different platforms and the other two have replicates from the same platform. We found that data generated by SELDI-MS showed satisfactory concordance between replicates from the same platform but unsatisfactory concordance between replicates from different platforms. AVAILABILITY: The R code is freely available at http://home.gwu.edu/~ylai/research/Concordance.

17.
SUMMARY: With their many replicates and their random layouts, Illumina BeadArrays provide greater scope for detecting spatial artefacts than do other microarray technologies. They are also robust to artefact exclusion, yet there is a lack of tools that can perform these tasks for Illumina. We present BASH, a tool for this purpose. BASH adopts the concepts of Harshlight, but implements them in a manner that utilizes the unique characteristics of the Illumina technology. Using bead-level data, spatial artefacts of various kinds can thus be identified and excluded from further analyses. AVAILABILITY: The beadarray Bioconductor package (version 1.10 onwards), www.bioconductor.org

18.
Rosetta error model for gene expression analysis   (cited 4 times: 0 self-citations, 4 by others)
MOTIVATION: In microarray gene expression studies, the number of replicated microarrays is usually small because of cost and sample availability, resulting in unreliable variance estimation and thus unreliable statistical hypothesis tests. The unreliable variance estimation is further complicated by the fact that the technology-specific variance is intrinsically intensity-dependent. RESULTS: The Rosetta error model captures the variance-intensity relationship for various types of microarray technologies, such as single-color arrays and two-color arrays. This error model conservatively estimates intensity error and uses this value to stabilize the variance estimation. We present two commonly used error models: the intensity error model for single-color microarrays and the ratio error model for two-color microarrays or ratios built from two single-color arrays. We present examples to demonstrate the strength of our error models in improving the statistical power of microarray data analysis, particularly in increasing expression detection sensitivity and specificity when the number of replicates is limited.

19.
Accurately identifying differentially expressed genes from microarray data is not a trivial task, partly because of poor variance estimates of gene expression signals. Here, after analyzing 380 replicated microarray experiments, we found that probesets have typical, distinct variances that can be estimated based on a large number of microarray experiments. These probeset-specific variances depend at least in part on the function of the probed gene: genes for ribosomal or structural proteins often have a small variance, while genes implicated in stress responses often have large variances. We used these variance estimates to develop a statistical test for differentially expressed genes called EVE (external variance estimation). The EVE algorithm performs better than the t-test and LIMMA on some real-world data, where external information from appropriate databases is available. Thus, EVE helps to maximize the information gained from a typical microarray experiment. Nonetheless, only a large number of replicates will guarantee the identification of nearly all truly differentially expressed genes. However, our simulation studies suggest that even limited numbers of replicates will usually result in good coverage of strongly differentially expressed genes.

20.
BACKGROUND: In mouse genetics, technologies such as microarray-based expression profiling have dramatically increased data availability and sensitivity. These methods are, however, vulnerable to the unavoidable heterogeneity of in vivo material and may therefore report genes that are differentially expressed between mouse strains but irrelevant to a targeted experiment. The aim of this study was not to elaborate on the usefulness of microarray analysis in general, but to expand our knowledge of this potential "background noise" for the widely used Illumina microarray platform. We go beyond existing data, which focused primarily on the adult sensory and nervous system, by analyzing patterns of gene expression at different embryonic stages using wild-type strains and modern transgenic models of often non-isogenic backgrounds. RESULTS: Wild-type embryos of 11 mouse strains commonly used in transgenic and molecular genetic studies were subjected to Illumina microarray expression profiling at three developmental time points in a strain-by-strain comparison. Our data robustly reflect known gene expression patterns during mid-gestation development. Decreasing the diversity of the input tissue and/or increasing strain diversity raised the sensitivity of the array to the genetic background. Consistent strain sensitivity of some probes was attributed to genetic polymorphisms or probe-design-related artifacts. CONCLUSION: Our study provides an extensive reference list of gene expression profiling background noise, of value to anyone in developmental biology and transgenic research performing microarray expression profiling with the widely used Illumina microarray platform. The probes identified as strain-specific background noise further allow microarray expression profiling on its own to serve as a valuable tool for establishing genealogies of mouse inbred strains.
