1.

Evaluation of normalization methods for microarray data

Taesung Park Sung-Gon Yi Sung-Hyun Kang SeungYeoun Lee Yong-Sung Lee Richard Simon 《BMC bioinformatics》2003,4(1):33

Background

Microarray technology allows the expression levels of thousands of genes to be monitored simultaneously. This technique helps us to understand gene regulation, as well as gene-by-gene interactions, more systematically. In microarray experiments, however, many undesirable systematic variations are observed, and some are commonly seen even in replicated experiments. Normalization is the process of removing some of the sources of variation that affect the measured gene expression levels. Although a number of normalization methods have been proposed, it has been difficult to decide which methods perform best. Normalization plays an important role in the early stages of microarray data analysis, and the results of subsequent analyses are highly dependent on it.

Results

In this paper, we use the variability among the replicated slides to compare performance of normalization methods. We also compare normalization methods with regard to bias and mean square error using simulated data.

Conclusions

Our results show that intensity-dependent normalization often performs better than global normalization methods, and that linear and nonlinear normalization methods perform similarly. These conclusions are based on analysis of 36 cDNA microarrays of 3,840 genes obtained in an experiment to search for changes in gene expression profiles during neuronal differentiation of cortical stem cells. Simulation studies confirm our findings.
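
As a rough illustration of the distinction drawn in the Conclusions, the sketch below contrasts global normalization (one constant subtracted from every log-ratio) with a linear intensity-dependent normalization (a fitted line in average log-intensity subtracted). It is a minimal Python sketch on made-up two-channel intensities, not the authors' code; the function names and the toy data are assumptions for illustration only.

```python
import math
from statistics import mean, median


def ma_values(red, green):
    """Return (M, A): M = log2(R/G), A = average of log2 R and log2 G."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def global_median_normalize(m):
    """Global normalization: subtract a single constant (the median) from every M."""
    center = median(m)
    return [x - center for x in m]


def linear_intensity_normalize(m, a):
    """Intensity-dependent linear normalization: subtract a least-squares line M ~ A."""
    a_bar, m_bar = mean(a), mean(m)
    sxx = sum((x - a_bar) ** 2 for x in a)
    sxy = sum((x - a_bar) * (y - m_bar) for x, y in zip(a, m))
    slope = sxy / sxx
    intercept = m_bar - slope * a_bar
    return [y - (intercept + slope * x) for x, y in zip(a, m)]


# Hypothetical toy data: the dye bias grows with spot intensity.
green = [100.0, 200.0, 400.0, 800.0, 1600.0]
red = [g ** 1.1 for g in green]

m, a = ma_values(red, green)
m_global = global_median_normalize(m)
m_linear = linear_intensity_normalize(m, a)
```

On this toy data the global method centers the log-ratios but leaves the intensity trend in place, whereas the linear fit removes it. A nonlinear variant would replace the straight line with a smoother such as lowess.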

2.

A statistical selection strategy for normalization procedures in LC-MS proteomics experiments through dataset-dependent ranking of normalization scaling factors

Webb-Robertson BJ Matzke MM Jacobs JM Pounds JG Waters KM 《Proteomics》2011,11(24):4736-4741

Quantification of LC-MS peak intensities assigned during peptide identification in a typical comparative proteomics experiment will deviate from run to run of the instrument due to both technical and biological variation. Thus, normalization of peak intensities across an LC-MS proteomics dataset is a fundamental step in pre-processing. However, the downstream analysis of LC-MS proteomics data can be dramatically affected by the normalization method selected. Current normalization procedures for LC-MS proteomics data are presented in the context of normalization values derived from subsets of the full collection of identified peptides. The distribution of these normalization values is unknown a priori. If they are not independent of the biological factors associated with the experiment, the normalization process can introduce bias into the data, possibly affecting downstream statistical biomarker discovery. We present a novel approach to evaluate normalization strategies, which includes the peptide selection component associated with the derivation of normalization values. Our approach evaluates the effect of normalization on the between-group variance structure in order to identify the most appropriate normalization methods that improve the structure of the data without introducing bias into the normalized peak intensities.

3.

A statistical framework for the design of microarray experiments and effective detection of differential gene expression (total citations: 3; self-citations: 0; citations by others: 3)

Zhang SD Gant TW 《Bioinformatics (Oxford, England)》2004,20(16):2821-2828

MOTIVATION: Microarray experiments generate a high data volume. However, often due to financial or experimental considerations, e.g. lack of sample, there is little or no replication of the experiments or hybridizations. These factors, combined with the intrinsic variability associated with the measurement of gene expression, can result in an unsatisfactory detection rate of differential gene expression (DGE). Our motivation was to provide an easy-to-use measure of the success rate of DGE detection that could find routine use in the design of microarray experiments or in post-experiment assessment. RESULTS: In this study, we address the problem of both random errors and systematic biases in microarray experimentation. We propose a mathematical model for the measured data in microarray experiments and, on the basis of this model, present a t-based statistical procedure to determine DGE. We have derived a formula to determine the success rate of DGE detection that takes into account the number of microarrays, the number of genes, the magnitude of DGE, and the variance from biological and technical sources. The formula, and look-up tables based on it, can be used to assist in the design of microarray experiments. We also propose an ad hoc method for estimating the fraction of non-differentially expressed genes within a set of genes being tested. This will help to increase the power of DGE detection. AVAILABILITY: The functions to calculate the success rate of DGE detection have been implemented as a Java application, which is accessible at http://www.le.ac.uk/mrctox/microarray_lab/Microarray_Softwares/Microarray_Softwares.htm

4.

Data variance and statistical significance in 2D-gel electrophoresis and DIGE experiments: comparison of the effects of normalization methods

Keeping AJ Collins RA 《Journal of proteome research》2011,10(3):1353-1360

Identifying changes in the relative abundance of proteins between different biological samples is often confounded by technical noise. In this work, we compared eight normalization methods commonly used in two-dimensional gel electrophoresis and difference gel electrophoresis (DIGE) experiments for their ability to reduce noise and for their influence on the list of proteins whose difference in abundance between two samples is determined to be statistically significant. With respect to reducing noise we find that, while all methods improve upon unnormalized data, cyclic linear normalization is the least well suited to gel-based proteomics and the performances of the other methods are similar. We also find in DIGE data that the choice of normalization method has less of an impact on the noise than does the decision to use an internal reference in the experimental design and that both normalization and standardization using the internal reference are required to maximally reduce variance. Despite the similar noise reduction achieved by most normalization methods, the list of proteins whose abundance was determined to differ significantly between biological groups differed depending on the choice of normalization method. This work provides a direct comparison of the impact of normalization methods in the context of common experimental designs.

5.

Bayesian normalization and identification for differential gene expression data.

Dabao Zhang Martin T Wells Christine D Smart William E Fry 《Journal of computational biology》2005,12(4):391-406

Commonly accepted intensity-dependent normalization in spotted microarray studies takes account of measurement errors in the differential expression ratio but ignores measurement errors in the total intensity, although the definitions imply the same measurement error components are involved in both statistics. Furthermore, identification of differentially expressed genes is usually considered separately following normalization, which is statistically problematic. By incorporating the measurement errors in both total intensities and differential expression ratios, we propose a measurement-error model for intensity-dependent normalization and identification of differentially expressed genes. This model is also flexible enough to incorporate intra-array and inter-array effects. A Bayesian framework is proposed for the analysis of the proposed measurement-error model to avoid the potential risk of using the common two-step procedure. We also propose a Bayesian identification of differentially expressed genes to control the false discovery rate instead of the ad hoc thresholding of the posterior odds ratio. The simulation study and an application to real microarray data demonstrate promising results.

6.

Evaluation of normalization methods in mammalian microRNA-Seq data

Garmire LX Subramaniam S 《RNA (New York, N.Y.)》2012,18(6):1279-1288

Simple total tag count normalization is inadequate for microRNA sequencing data generated by next-generation sequencing technology. However, a systematic evaluation of normalization methods on microRNA sequencing data has so far been lacking. We comprehensively evaluate seven commonly used normalization methods: global normalization, Lowess normalization, Trimmed Mean Method (TMM), quantile normalization, scaling normalization, variance stabilization, and the invariant method. We assess these methods on two individual experimental data sets with the empirical statistical metrics of mean square error (MSE) and the Kolmogorov-Smirnov (K-S) statistic. Additionally, we evaluate the methods with results from quantitative PCR validation. Our results consistently show that Lowess normalization and quantile normalization perform the best, whereas TMM, a method applied to RNA-Sequencing normalization, performs the worst. The poor performance of TMM normalization is further evidenced by abnormal results from the test of differential expression (DE) of microRNA-Seq data. Compared with the models used for DE, the choice of normalization method is the primary factor that affects the results of DE. In summary, Lowess normalization and quantile normalization are recommended for normalizing microRNA-Seq data, whereas the TMM method should be used with caution.
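
Quantile normalization, one of the two methods recommended above, can be sketched in a few lines of Python. This is a minimal illustration of the general technique rather than the pipeline evaluated in the study; the function name and the toy counts are hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length samples.

    Each value is replaced by the mean, taken across samples, of the values
    that hold the same rank. Ties are broken by position (a simplification).
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same length")

    # Mean of the k-th smallest value across all samples, for every rank k.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(col[k] for col in sorted_samples) / len(samples) for k in range(n)
    ]

    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices, smallest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical tag counts for four microRNAs in two libraries.
sample_a = [5.0, 2.0, 3.0, 4.0]
sample_b = [4.0, 1.0, 6.0, 2.0]
norm_a, norm_b = quantile_normalize([sample_a, sample_b])
```

After normalization both libraries share the same set of values and differ only in which microRNA holds which rank.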

7.

Evaluation of microarray data normalization procedures using spike-in experiments

Patrik Rydén Henrik Andersson Mattias Landfors Linda Näslund Blanka Hartmanová Laila Noppa Anders Sjöstedt 《BMC bioinformatics》2006,7(1):300-17

Background

Recently, a large number of methods for the analysis of microarray data have been proposed but there are few comparisons of their relative performances. By using so-called spike-in experiments, it is possible to characterize the analyzed data and thereby enable comparisons of different analysis methods.

8.

Microarray probe expression measures, data normalization and statistical validation

Saviozzi S Calogero RA 《Comparative and Functional Genomics》2003,4(4):442-446

DNA microarray technology is a high-throughput method for gaining information on gene function. Microarray technology is based on the ordered deposition or synthesis, on a solid surface, of thousands of EST sequences, genes, or oligonucleotides. Because of the large number of data points generated, computational tools are essential in microarray data analysis and mining to extract knowledge from experimental results. In this review, we focus on some of the methodologies currently available for defining gene expression intensity measures, for microarray data normalization, and for statistical validation of differential expression.

9.

A scaling normalization method for differential expression analysis of RNA-seq data (total citations: 1; self-citations: 0; citations by others: 1)

Mark D Robinson Alicia Oshlack 《Genome biology》2010,11(3):R25

10.

Statistical tests for differential expression in cDNA microarray experiments (total citations: 13; self-citations: 0; citations by others: 13)

Cui X Churchill GA 《Genome biology》2003,4(4):210

Extracting biological information from microarray data requires appropriate statistical methods. The simplest statistical method for detecting differential expression is the t test, which can be used to compare two conditions when there is replication of samples. With more than two conditions, analysis of variance (ANOVA) can be used, and the mixed ANOVA model is a general and powerful approach for microarray experiments with multiple factors and/or several sources of variation.
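
The two-condition case can be sketched as a per-gene t statistic. The Python example below uses Welch's unequal-variance form, which is one common variant and an assumption here, since the abstract does not say which form of the t test is meant; the gene names and expression values are hypothetical, and converting the statistic to a p-value would additionally need a t distribution, which is left out.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t test."""
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each condition needs at least two replicates")
    vx, vy = variance(x) / nx, variance(y) / ny  # squared standard errors
    if vx + vy == 0:
        raise ValueError("both conditions have zero variance")
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical log-expression values: gene -> (condition 1, condition 2).
expression = {
    "gene_a": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "gene_b": ([2.0, 3.0, 4.0], [2.5, 3.5, 3.0]),
}

t_stats = {gene: welch_t(c1, c2)[0] for gene, (c1, c2) in expression.items()}
# Genes ordered from strongest to weakest evidence of differential expression.
ranked = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)
```

With thousands of genes tested at once, the resulting p-values would also need a multiple-testing adjustment before any gene is declared differentially expressed.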

11.

Empirical comparison of cross-platform normalization methods for gene expression data

Jason Rudy Faramarz Valafar 《BMC bioinformatics》2011,12(1):1-22

Background

The prediction of secondary structure, i.e. the set of canonical base pairs between nucleotides, is a first step in developing an understanding of the function of an RNA sequence. The most accurate computational methods predict conserved structures for a set of homologous RNA sequences. These methods usually suffer from high computational complexity. In this paper, TurboFold, a novel and efficient method for secondary structure prediction for multiple RNA sequences, is presented.

Results

TurboFold takes, as input, a set of homologous RNA sequences and outputs estimates of the base pairing probabilities for each sequence. The base pairing probabilities for a sequence are estimated by combining intrinsic information, derived from the sequence itself via the nearest neighbor thermodynamic model, with extrinsic information, derived from the other sequences in the input set. For a given sequence, the extrinsic information is computed by using pairwise-sequence-alignment-based probabilities for co-incidence with each of the other sequences, along with estimated base pairing probabilities, from the previous iteration, for the other sequences. The extrinsic information is introduced as free energy modifications for base pairing in a partition function computation based on the nearest neighbor thermodynamic model. This process yields updated estimates of base pairing probability. The updated base pairing probabilities in turn are used to recompute extrinsic information, resulting in the overall iterative estimation procedure that defines TurboFold. TurboFold is benchmarked on a number of ncRNA datasets and compared against alternative secondary structure prediction methods. The iterative procedure in TurboFold is shown to improve estimates of base pairing probability with each iteration, though only small gains are obtained beyond three iterations. Secondary structures composed of base pairs with estimated probabilities higher than a significance threshold are shown to be more accurate for TurboFold than for alternative methods that estimate base pairing probabilities. TurboFold-MEA, which uses base pairing probabilities from TurboFold in a maximum expected accuracy algorithm for secondary structure prediction, has accuracy comparable to the best performing secondary structure prediction methods. 
The computational and memory requirements for TurboFold are modest and, in terms of sequence length and number of sequences, scale much more favorably than joint alignment and folding algorithms.

Conclusions

TurboFold is an iterative probabilistic method for predicting secondary structures for multiple RNA sequences that efficiently and accurately combines the information from the comparative analysis between sequences with the thermodynamic folding model. Unlike most other multi-sequence structure prediction methods, TurboFold does not enforce strict commonality of structures and is therefore useful for predicting structures for homologous sequences that have diverged significantly. TurboFold can be downloaded as part of the RNAstructure package at http://rna.urmc.rochester.edu.

12.

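Comparison of methods for screening differentially expressed genes from microarray data (total citations: 1; self-citations: 0; citations by others: 1)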

单文娟 童春发 施季森 《遗传》2008,30(12):1640-1646

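Abstract: Using computer-simulated data and real microarray data, eight methods for screening differentially expressed genes were compared in order to assess how well each method performs on microarray data. Analysis of the simulated data showed that all eight methods identified and detected uniformly distributed differentially expressed genes well. In terms of algorithm, SAM and the Wilcoxon rank-sum test performed better; in terms of data distribution, identification was better for the normal distribution and poorer for the chi-square and exponential distributions. Analysis of poplar cDNA microarrays showed that SAM, Samroc, and the regression model method gave similar results, whereas the Wilcoxon rank-sum test differed considerably from them.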

13.

NanoStriDE: normalization and differential expression analysis of NanoString nCounter data

Christopher D Brumbaugh Hyunsung J Kim Mario Giovacchini Nader Pourmand 《BMC bioinformatics》2011,12(1):1-4

Background

Relationships between species, genes and genomes have been printed as trees for over a century. Whilst this may have been the best format for exchanging and sharing phylogenetic hypotheses during the 20th century, the worldwide web now provides faster and automated ways of transferring and sharing phylogenetic knowledge. However, novel software is needed to defrost these published phylogenies for the 21st century.

Results

TreeRipper is a simple website for the fully-automated recognition of multifurcating phylogenetic trees (http://linnaeus.zoology.gla.ac.uk/~jhughes/treeripper/). The program accepts a range of input image formats (PNG, JPG/JPEG or GIF). The underlying command-line C++ program follows a number of cleaning steps to detect lines, remove node labels, patch up broken lines and corners, and detect line edges. The edge contour is then determined to detect the branch length, tip label positions and the topology of the tree. Optical Character Recognition (OCR) is used to convert the tip labels into text with the freely available tesseract-ocr software. 32% of images meeting the prerequisites for TreeRipper were successfully recognised; the largest tree had 115 leaves.

Conclusions

Despite the diversity of ways phylogenies have been illustrated making the design of a fully automated tree recognition software difficult, TreeRipper is a step towards automating the digitization of past phylogenies. We also provide a dataset of 100 tree images and associated tree files for training and/or benchmarking future software. TreeRipper is an open source project licensed under the GNU General Public Licence v3.

14.

A discussion of statistical methods for design and analysis of microarray experiments for plant scientists (total citations: 3; self-citations: 0; citations by others: 3)

Nettleton D 《The Plant cell》2006,18(9):2112-2121

15.

Evaluation of candidate reference genes in Clostridium difficile for gene expression normalization

Devon Metcalf Shayan Sharif J. Scott Weese 《Anaerobe》2010,16(4):439-443

Quantitative real-time polymerase chain reaction (qPCR) is a sensitive, efficient and reproducible technique for studying gene expression. Identification of stably expressed reference genes is required to avoid bias in these studies, yet mostly unvalidated reference genes are used in studying gene expression in Clostridium difficile. Here, we sought to identify a set of stable reference genes for normalizing C. difficile expression data comparing exponential versus stationary phases of growth. Eight candidate reference genes (rpoA, rrs, gyrA, gluD, adk, rpsJ, tpi, and rho) were assessed in 3 C. difficile genotypes (ribotypes 027, 078, and 001). The primers were analyzed for efficiency and the 8 genes were ranked according to their stability. Overall, the genes rrs, adk, and rpsJ ranked among the most stable. Identification of the most stable genes was, however, strain dependent, which suggests that selection of reference genes in a heterogeneous species, such as C. difficile, requires multiple genes to be assessed to confirm their stability within the strains being studied.

16.

A Laplace mixture model for identification of differential expression in microarray experiments (total citations: 1; self-citations: 0; citations by others: 1)

Bhowmick D Davison AC Goldstein DR Ruffieux Y 《Biostatistics (Oxford, England)》2006,7(4):630-641

Microarrays have become an important tool for studying the molecular basis of complex disease traits and fundamental biological processes. A common purpose of microarray experiments is the detection of genes that are differentially expressed under two conditions, such as treatment versus control or wild type versus knockout. We introduce a Laplace mixture model as a long-tailed alternative to the normal distribution when identifying differentially expressed genes in microarray experiments, and provide an extension to asymmetric over- or underexpression. This model permits greater flexibility than models in current use as it has the potential, at least with sufficient data, to accommodate both whole genome and restricted coverage arrays. We also propose likelihood approaches to hyperparameter estimation which are equally applicable in the Normal mixture case. The Laplace model appears to give some improvement in fit to data, though simulation studies show that our method performs similarly to several other statistical approaches to the problem of identification of differential expression.

17.

Use of within-array replicate spots for assessing differential expression in microarray experiments (total citations: 16; self-citations: 0; citations by others: 16)

Smyth GK Michaud J Scott HS 《Bioinformatics (Oxford, England)》2005,21(9):2067-2075

MOTIVATION: Spotted arrays are often printed with probes in duplicate or triplicate, but current methods for assessing differential expression are not able to make full use of the resulting information. The usual practice is to average the duplicate or triplicate results for each probe before assessing differential expression. This results in the loss of valuable information about genewise variability. RESULTS: A method is proposed for extracting more information from within-array replicate spots in microarray experiments by estimating the strength of the correlation between them. The method involves fitting separate linear models to the expression data for each gene but with a common value for the between-replicate correlation. The method greatly improves the precision with which the genewise variances are estimated and thereby improves inference methods designed to identify differentially expressed genes. The method may be combined with empirical Bayes methods for moderating the genewise variances between genes. The method is validated using data from a microarray experiment involving calibration and ratio control spots in conjunction with spiked-in RNA. Comparing results for calibration and ratio control spots shows that the common correlation method results in substantially better discrimination of differentially expressed genes from those which are not. The spike-in experiment also confirms that the results may be further improved by empirical Bayes smoothing of the variances when the sample size is small. AVAILABILITY: The methodology is implemented in the limma software package for R, available from the CRAN repository http://www.r-project.org

18.

Evaluation of statistical methods for avoidance data of schooling fish

S. I. Hartwell H. J. Jin D. S. Cherry J. Cairns Jr. 《Hydrobiologia》1986,131(1):63-76

Concentrations of heavy metal blends avoided by schools of fathead minnows and alkaline pH levels avoided by schools of bluegill sunfish, fathead minnows, golden shiners, and rainbow trout were determined in a boundary layer avoidance chamber. Parameters measured were residence time, activity, and sequential fish location counts. Data were evaluated using linear, quadratic, and polynomial regression, log₁₀ transformations, analysis of variance, covariance analysis, Duncan's multiple range test, and Hochberg's GT2 test. The best methods of analysis are quadratic regression and covariance analysis.

19.

Nonparametric tests for differential gene expression and interaction effects in multi-factorial microarray experiments

Xin Gao Peter XK Song 《BMC bioinformatics》2005,6(1):186

Background

Numerous nonparametric approaches have been proposed in the literature to detect differential gene expression in the setting of two user-defined groups. However, there is a lack of nonparametric procedures for analyzing microarray data in which multiple factors contribute to gene expression. Furthermore, although incorporating interaction effects in the analysis of microarray data has long been of great interest to biological scientists, little of this has been investigated in the nonparametric framework.

20.

DNA microarray normalization methods can remove bias from differential protein expression analysis of 2D difference gel electrophoresis results

Kreil DP Karp NA Lilley KS 《Bioinformatics (Oxford, England)》2004,20(13):2026-2034

MOTIVATION: Two-dimensional Difference Gel Electrophoresis (DIGE) measures expression differences for thousands of proteins in parallel. In contrast to DNA microarray analysis, however, there have been few systematic studies on the validity of differential protein expression analysis, and the effects of normalization methods have not yet been investigated. To address this need, we assessed a series of same-same comparisons, evaluating how random experimental variance influenced differential expression analysis. RESULTS: The strong fluctuations observed were reflected in large discrepancies between the distributions of the spot intensities for different gels. Correct normalization for pooling of multiple gels for analysis is, therefore, essential. We show that both dye-specific background levels and the differences in scale of the spot intensity distributions must be accounted for. A variance stabilizing transform that had been developed for DNA microarray analysis combined with a robust Z-score allowed the determination of gel-independent signal thresholds based on the empirical distributions from same-same comparisons. In contrast, similar thresholds holding up to cross-validation could not be proposed for data normalized using methods established in the field of proteomics. AVAILABILITY: Software is available on request from the authors. SUPPLEMENTARY INFORMATION: There is supplementary material available online at http://www.flychip.org.uk/kreil/pub/2dgels/
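
The abstract does not define its robust Z-score. One widely used definition, assumed here purely for illustration, centers each value on the median and scales by the median absolute deviation (MAD) multiplied by 1.4826, so that the score is comparable to an ordinary Z-score for normally distributed data. The Python sketch below uses a hypothetical function name and made-up log-ratios and is not the authors' software.

```python
from statistics import median

# Scaling constant that makes the MAD estimate the standard deviation
# when the data are normally distributed.
MAD_TO_SD = 1.4826


def robust_z_scores(values):
    """Return (value - median) / (1.4826 * MAD) for every value."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust Z-scores are undefined")
    scale = MAD_TO_SD * mad
    return [(v - center) / scale for v in values]


# Hypothetical normalized log-ratios from one gel, with a single outlying spot.
log_ratios = [1.0, 2.0, 3.0, 4.0, 100.0]
z = robust_z_scores(log_ratios)
```

Because the median and the MAD are barely affected by a few extreme spots, the outlier stands out clearly instead of inflating the scale it is measured against.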