Similar Articles
20 similar articles found.
3.
Advancements in mass spectrometry-based proteomics have enabled experiments encompassing hundreds of samples. While these large sample sets deliver much-needed statistical power, handling them introduces technical variability known as batch effects. Here, we present a step-by-step protocol for the assessment, normalization, and batch correction of proteomic data. We review established methodologies from related fields and describe solutions specific to proteomic challenges, such as ion intensity drift and missing values in quantitative feature matrices. Finally, we compile a set of techniques for controlling the quality of batch-effect adjustment. We provide an R package, "proBatch", containing functions required for each step of the protocol. We demonstrate the utility of this methodology on five proteomic datasets, each encompassing hundreds of samples and spanning multiple experimental designs. In conclusion, we provide guidelines and tools to make the extraction of true biological signal from large proteomic studies more robust and transparent, ultimately facilitating reliable and reproducible research in clinical proteomics and systems biology.
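As a rough illustration of the kind of per-batch adjustment such a protocol applies (this is not the proBatch API; the matrix layout and function name below are hypothetical), a minimal Python sketch of median centering of log-intensities within batches, tolerant of missing values:

```python
import pandas as pd

def median_center_batches(log_intensity: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Center each feature within each batch on the global feature median.

    log_intensity: features x samples matrix of log-transformed intensities,
                   possibly containing NaNs for missing values.
    batch:         per-sample batch labels, indexed by sample name.
    """
    corrected = log_intensity.copy()
    global_median = log_intensity.median(axis=1, skipna=True)
    for b in batch.unique():
        cols = batch.index[batch == b]
        batch_median = log_intensity[cols].median(axis=1, skipna=True)
        # shift every feature so its batch median matches the global median
        corrected[cols] = log_intensity[cols].sub(batch_median - global_median, axis=0)
    return corrected
```

Medians are used rather than means so that missing values and outlier intensities have less influence on the estimated batch offsets.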

5.
Genome-wide analysis of gene expression or protein binding patterns using different array or sequencing based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several types of experiments are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP-studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increase. We propose the following work-flow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate whether the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed; otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the work-flow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher sensitivity and lower bias than can be attained using standard and invariant normalization methods.
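The DSE-test and HMM-assisted re-normalization are specific to the authors' implementation; the sketch below only illustrates the decision logic of the proposed work-flow, using a generic skewness check and a mode-centering fallback as stand-ins (the function name and threshold are hypothetical):

```python
import numpy as np
from scipy import stats

def analyze(log_ratios: np.ndarray, skewness_threshold: float = 0.5) -> np.ndarray:
    """Illustrative work-flow: check for a skewed distribution of effects,
    and only re-center the data when skewness is detected."""
    # step 2: a generic skewness check (stand-in for the DSE-test)
    skew = stats.skew(log_ratios, nan_policy='omit')
    if abs(skew) <= skewness_threshold:
        # step 3: distribution not skewed -> keep the standard normalization
        return log_ratios
    # stand-in for the HMM-assisted re-normalization: re-center on the mode
    # of the presumed unchanged variables rather than on the global median
    values = log_ratios[~np.isnan(log_ratios)]
    hist, edges = np.histogram(values, bins=100)
    mode_center = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    return log_ratios - mode_center
```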

7.

Background  

It is well known that the normalization step of microarray data makes a difference in the downstream analysis. All normalization methods rely on certain assumptions, so differences in results can be traced to different sensitivities to violation of those assumptions. Illustrating this lack of robustness, all existing normalization methods fail on a striking spike-in experiment because of an imbalance between up- and down-regulated genes. It therefore remains important to develop a normalization method that is robust against violation of the standard assumptions.

8.

Background

The normalization of DNA microarrays allows comparison among samples by adjusting for individual hybridization intensities. The most commonly used approaches are global normalization methods, which are based on the expression of all genes on the slide and assume that only a small proportion of genes is modulated. Alternative approaches must be developed for microarrays where the proportion of modulated genes and their distribution are unknown, or may be biased towards up- or down-regulation.

Results

The aim of this work is to study the use of spike-in controls to normalize low-density microarrays. Our test-array was designed to analyze gene modulation in response to hypoxia (a condition of low oxygen tension) in a macrophage cell line. RNA was extracted from control cells and from cells exposed to hypoxia, mixed with spike RNA, labeled and hybridized to our test-array. We used eight bacterial RNAs as sources of spikes. The test-array contained oligonucleotides specific for 178 mouse genes and for the eight spikes. We assessed the quality of the spike signals, the reproducibility of the results and, in general, the nature of the variability. The small coefficients of variation revealed high reproducibility of our platform, both in replicated spots and in technical replicates. We demonstrated that the spike-in system was suitable for normalizing our platform and for determining the threshold for discriminating the hypoxia-modulated genes. We assessed the application of the spike-in normalization method to microarrays in which the distribution of the expression values was symmetric or asymmetric. We found that this system is accurate, reproducible and comparable to other normalization methods when the distribution of the expression values is symmetric. In contrast, we found that the spike-in normalization method is superior and necessary when the distribution of gene expression is asymmetric and biased towards up-regulated genes.

Conclusion

We demonstrate that spike-in-control-based normalization is a reliable and reproducible method whose major advantage is that it is also applicable to biased platforms where the distribution of up- and down-regulated genes is asymmetric, as may occur in diagnostic chips.
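As a rough sketch of spike-in based normalization of a two-color array (not the authors' exact procedure; it assumes one intensity vector per channel and known spike-in rows):

```python
import numpy as np

def spike_in_normalize(red: np.ndarray, green: np.ndarray, spike_rows: np.ndarray) -> np.ndarray:
    """Center log-ratios on the spike-in probes, which are expected to be
    unchanged between the two channels, instead of on all genes."""
    log_ratio = np.log2(red) - np.log2(green)
    spike_offset = np.nanmedian(log_ratio[spike_rows])
    return log_ratio - spike_offset
```

Because only the spikes are assumed unchanged, this correction remains valid even when most genes shift in one direction, which is exactly the situation where global normalization fails.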

9.
Mass spectrometry-based global proteomics experiments generate large data sets that can be converted into useful information only with an appropriate statistical approach. We present Diffprot, a software tool for statistical analysis of MS-derived quantitative data. With its resampling-based statistical test and local variance estimate, Diffprot makes it possible to draw significant results from small-scale experiments while effectively eliminating false positives. To demonstrate the advantages of this software, we performed two spike-in tests with complex biological matrices, one label-free and one based on iTRAQ quantification; in addition, we performed an iTRAQ experiment on bacterial samples. In the spike-in tests, the estimated protein ratios were in good agreement with theoretical values; statistical significance was assigned to the spiked proteins, and Diffprot produced one or no false positive results. We compared the performance of Diffprot with two other statistical tests: the widely used t-test and the non-parametric Wilcoxon test. In contrast to Diffprot, both generated many false positive hits in the spike-in experiment. This demonstrates the superior specificity of the resampling-based method, making Diffprot a rational choice for small-scale high-throughput experiments, where the need to control the false positive rate is particularly pressing.
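Diffprot's exact test statistic is its own; the sketch below only illustrates the general resampling idea: assess significance by comparing the observed group difference with differences obtained after random relabeling of the samples.

```python
import numpy as np

def permutation_p_value(group_a: np.ndarray, group_b: np.ndarray,
                        n_perm: int = 10_000, seed=None) -> float:
    """Two-sided permutation test on the difference of means, the kind of
    resampling-based significance assessment such tools rely on."""
    rng = np.random.default_rng(seed)
    observed = abs(group_a.mean() - group_b.mean())
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled[:n_a].mean() - pooled[n_a:].mean()) >= observed:
            hits += 1
    # add one to numerator and denominator so the p-value is never zero
    return (hits + 1) / (n_perm + 1)
```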

11.
Gene set analysis methods are popular tools for identifying differentially expressed gene sets in microarray data. Most existing methods use a permutation test to assess significance for each gene set. The permutation test's assumption of exchangeable samples is often not satisfied for time-series data and complex experimental designs, and it also requires a certain number of samples to compute p-values accurately. The method presented here uses a rotation test rather than a permutation test to assess significance. The rotation test can compute accurate p-values even for very small sample sizes. The method can handle complex designs and is particularly suited for longitudinal microarray data, where the samples may have complex correlation structures. Dependencies between genes, modeled with the use of gene networks, are incorporated in the estimation of correlations between samples. In addition, the method can test both for gene sets that are differentially expressed and for gene sets that show strong time trends. We show on simulated longitudinal data that the ability to identify important gene sets may be improved by taking the correlation structure between samples into account. Applied to real data, the method identifies both gene sets with constant expression and gene sets with strong time trends.
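The published method additionally models gene networks, sample correlation structures and time trends; as a minimal illustration of the rotation idea itself, the sketch below tests whether per-gene contrast scores are centered away from zero by replacing the data vector with random vectors of the same norm:

```python
import numpy as np

def rotation_test(contrast: np.ndarray, n_rot: int = 10_000, seed=None) -> float:
    """Minimal one-sample rotation test: is the mean of the per-gene
    contrast scores different from zero?  Each rotation replaces the data
    vector by a random vector with the same norm, which preserves the null
    distribution when the data are spherically symmetric around zero."""
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(contrast)
    observed = abs(contrast.mean())
    hits = 0
    for _ in range(n_rot):
        z = rng.standard_normal(contrast.size)
        rotated = norm * z / np.linalg.norm(z)
        if abs(rotated.mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_rot + 1)
```

Unlike a permutation test, the number of distinct rotations is not limited by the sample size, which is why accurate p-values remain available for very small studies.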

12.

Background  

Many of the most popular pre-processing methods for Affymetrix expression arrays, such as RMA, gcRMA, and PLIER, simultaneously analyze data across a set of predetermined arrays to improve the precision of the final expression measures. One problem with these algorithms is that expression measurements for a particular sample depend strongly on the set of samples used for normalization, so results obtained by normalizing with a different set may not be comparable. A related problem is that an organization producing and/or storing large amounts of data in a sequential fashion must either re-run the pre-processing algorithm every time an array is added or store the arrays in batches that are pre-processed together. Furthermore, pre-processing large numbers of arrays requires loading all the feature-level data into memory, which is difficult even on modern computers. We utilize a scheme that produces, from a very large training set, all the information necessary for pre-processing, so that samples outside the training set can be summarized against it and all subsequent pre-processing tasks can be done on an individual-array basis. We demonstrate the utility of this approach by defining a new version of the Robust Multi-chip Averaging (RMA) algorithm, which we refer to as refRMA.
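A minimal sketch of the frozen-reference idea (not the refRMA implementation, which also freezes probe effects for summarization): quantile-normalize a single new array against reference quantiles computed once from the training set.

```python
import numpy as np

def reference_quantiles_from_training(training: np.ndarray) -> np.ndarray:
    """training: probes x arrays matrix; returns the mean sorted profile,
    computed once and stored alongside the frozen model."""
    return np.sort(training, axis=0).mean(axis=1)

def quantile_normalize_to_reference(new_array: np.ndarray,
                                    reference_quantiles: np.ndarray) -> np.ndarray:
    """Normalize one new array against the frozen reference quantiles,
    so the training arrays never need to be re-processed."""
    ranks = np.argsort(np.argsort(new_array))   # 0-based rank of each probe
    return reference_quantiles[ranks]
```

Because the reference quantiles are fixed, adding a new array changes nothing for arrays processed earlier, which addresses the sequential-storage problem described above.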

13.
There are many options in handling microarray data that can affect study conclusions, sometimes drastically. Working with a two-color platform, this study uses ten spike-in microarray experiments to evaluate the relative effectiveness of some of these options for the experimental goal of detecting differential expression. We consider two data transformations, background subtraction and intensity normalization, as well as six different statistics for detecting differentially expressed genes. Our findings support the use of an intensity-based normalization procedure and indicate that local background subtraction can be detrimental to the effective detection of differential expression. We also verify that robust statistics outperform t-statistics in identifying differentially expressed genes when there are few replicates. Finally, we find that the choice of image analysis software can also substantially influence experimental conclusions.

14.
Systematic variations can occur at various steps of a cDNA microarray experiment and affect the measurement of gene expression levels. Accepted standards integrated into every cDNA microarray analysis can assess these variabilities and aid the interpretation of cDNA microarray experiments from different sources. A universally applicable approach to evaluate parameters such as input and output ratios, signal linearity, hybridization specificity and consistency across an array, as well as normalization strategies, is the utilization of exogenous control genes as spike-in and negative controls. We suggest that the use of such control sets, together with a sufficient number of experimental repeats, in-depth statistical analysis and thorough data validation should be made mandatory for the publication of cDNA microarray data.

15.
We propose an extension to quantile normalization that removes unwanted technical variation using control probes. We adapt our algorithm, functional normalization, to the Illumina 450k methylation array and address the open problem of normalizing methylation data with global epigenetic changes, such as human cancers. Using data sets from The Cancer Genome Atlas and a large case–control study, we show that our algorithm outperforms all existing normalization methods with respect to replication of results between experiments, and yields robust results even in the presence of batch effects. Functional normalization can be applied to any microarray platform, provided suitable control probes are available.
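A rough sketch of the core idea (not the published implementation): summarize the control probes per sample, take their leading principal components, and regress the between-sample variation of each quantile on those components so that only control-driven, i.e. technical, variation is removed.

```python
import numpy as np

def remove_control_variation(sample_quantiles: np.ndarray,
                             control_summary: np.ndarray,
                             n_pc: int = 2) -> np.ndarray:
    """Regress each quantile's across-sample variation on the first
    principal components of the control-probe summaries and remove it.

    sample_quantiles: samples x n_quantiles matrix (one row per array)
    control_summary:  samples x n_controls matrix of control-probe values
    """
    # principal component scores of the column-centered control summaries
    centered = control_summary - control_summary.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pcs = u[:, :n_pc] * s[:n_pc]                        # samples x n_pc
    design = np.column_stack([np.ones(len(pcs)), pcs])  # intercept + PCs
    # least-squares fit of every quantile on the control PCs
    coef, *_ = np.linalg.lstsq(design, sample_quantiles, rcond=None)
    fitted = design @ coef
    # keep the intercept (mean quantile) and remove the control-driven part
    return sample_quantiles - fitted + design[:, :1] @ coef[:1]
```

Because only variation explained by the control probes is removed, genuine global shifts such as cancer-associated hypomethylation are left intact.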


17.
While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correcting SSEs require a data set from which to calculate the association of SSEs with various read features and sequence contexts. This data set is typically either part of the data set being "recalibrated" (Genome Analysis Toolkit, or GATK) or a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA and using GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration using reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs at certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
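The quantity being recalibrated follows the standard Phred definition, Q = -10 log10(p). With reads mapped to spike-in standards, every mismatch can be treated as a sequencing error, so the empirical quality in a given context can be computed directly (a simplified illustration, not the GATK recalibration model):

```python
import math

def empirical_phred(n_mismatches: int, n_observations: int) -> float:
    """Empirical base quality from reads aligned to spike-in standards,
    where every mismatch is assumed to be a sequencing error."""
    # pseudocounts keep the estimate defined even with zero mismatches
    error_rate = (n_mismatches + 1) / (n_observations + 2)
    return -10.0 * math.log10(error_rate)

# e.g. 50 mismatches in 1,000,000 spike-in bases gives ~Q43, while
# 5,000 mismatches gives ~Q23; comparing such values with the reported
# quality scores reveals over- or under-estimation at specific contexts.
```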

18.
The proper identification of differentially methylated CpGs is central to most epigenetic studies. The Illumina HumanMethylation450 BeadChip is widely used to quantify DNA methylation; nevertheless, the design of an appropriate analysis pipeline faces severe challenges due to the convolution of biological and technical variability and the presence of a signal bias between Infinium I and II probe design types. Despite recent attempts to investigate how to analyze DNA methylation data with such an array design, it has not been possible to perform a comprehensive comparison between different bioinformatics pipelines due to the lack of appropriate data sets with both large sample sizes and a sufficient number of technical replicates. Here we perform such a comparative analysis, targeting the problems of reducing technical variability, eliminating the probe design bias and reducing batch effects, by exploiting two unpublished data sets which included technical replicates and were profiled for DNA methylation on peripheral blood, monocytes or muscle biopsies. We evaluated the performance of different analysis pipelines and demonstrated that: (1) it is critical to correct for the probe design type, since the amplitude of the measured methylation change depends on the underlying chemistry; (2) the effect of different normalization schemes is mixed, and the most effective methods in our hands were quantile normalization and Beta Mixture Quantile dilation (BMIQ); (3) it is beneficial to correct for batch effects. In conclusion, our comparative analysis using a comprehensive data set suggests an efficient pipeline for the proper identification of differentially methylated CpGs using the Illumina 450K arrays.
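For reference, the two methylation scales handled by such pipelines can be computed from the methylated and unmethylated channel intensities with the usual definitions (the offset of 100 for beta-values is a common default, not specific to this study):

```python
import numpy as np

def beta_value(methylated: np.ndarray, unmethylated: np.ndarray,
               offset: float = 100.0) -> np.ndarray:
    """Proportion-scale methylation estimate in the range 0..1."""
    return methylated / (methylated + unmethylated + offset)

def m_value(beta: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Logit-scale methylation estimate, better suited to statistical testing."""
    b = np.clip(beta, eps, 1 - eps)
    return np.log2(b / (1 - b))
```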

19.
A benchmark for Affymetrix GeneChip expression measures (total citations: 11; self-citations: 0; citations by others: 11)

20.
Background

Altered levels of circulating extracellular miRNA in plasma and serum have shown promise as non-invasive biomarkers of disease. However, unlike the assessment of cellular miRNA levels, for which there are accepted housekeeping genes, analogous reference controls for the normalization of circulating miRNA are lacking. Here, we provide an approach to identify and validate circulating miRNA reference controls on a de novo basis, and demonstrate the advantages of these customized internal controls in different disease settings. Importantly, these internal controls overcome key limitations of external spike-in controls.

Methods

Using a global RT-qPCR screen of 1066 miRNAs in plasma from pulmonary arterial hypertension (PAH) patients and healthy subjects as a case example, we identified a large pool of initial candidate miRNAs that were systematically ranked according to their plasma level stability using a predefined algorithm. The performance of the top candidates was validated against multiple comparators, and in a second independent cohort of PAH and control subjects. The broader utility of this approach was demonstrated in a completely different disease setting, with 372 miRNAs screened in plasma from septic shock patients and healthy controls.

Results

Normalization of data with specific internal reference controls significantly reduced the overall variation in circulating miRNA levels between subjects (relative to raw data), provided a more balanced distribution of up- and down-regulated miRNAs, replicated the results obtained by the benchmark geometric averaging of all detected miRNAs, and outperformed the commonly used external spike-in strategy.

Conclusions

We demonstrate the feasibility of identifying circulating reference controls that can reduce extraneous technical variation and improve the assessment of disease-related changes in plasma miRNA levels. This study provides a novel conceptual framework that addresses a critical and previously unmet need if circulating miRNAs are to advance as reliable diagnostic tools in medicine.
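The paper's ranking algorithm is its own; a generic version of the underlying idea, ranking candidate miRNAs by the stability of their Cq values across all samples and between groups, could look like the following (the column layout and scoring weights are assumptions):

```python
import pandas as pd

def rank_reference_candidates(cq: pd.DataFrame, group: pd.Series) -> pd.Series:
    """Rank candidate reference miRNAs by Cq stability.

    cq:    miRNAs x samples matrix of raw Cq values (NaN = not detected)
    group: per-sample disease/control labels, indexed by sample name

    Candidates must be detected in every sample; they are then scored by
    the overall standard deviation plus the absolute between-group
    difference, so low scores mean stable, group-independent expression.
    """
    detected = cq.dropna(axis=0)
    overall_sd = detected.std(axis=1)
    group_means = detected.T.groupby(group).mean().T     # miRNAs x groups
    between_group = group_means.max(axis=1) - group_means.min(axis=1)
    score = overall_sd + between_group
    return score.sort_values()                            # most stable first
```

Normalization would then subtract the mean Cq of the selected reference miRNAs from each target miRNA's Cq (a delta-Cq approach), analogous to using housekeeping genes for cellular miRNA.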
