1.

Power and sample size estimation in microarray studies

Wei-Jiun Lin Huey-Miin Hsueh James J Chen 《BMC bioinformatics》2010,11(1):48

Background

Before conducting a microarray experiment, one important issue that needs to be determined is the number of arrays required in order to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods for sample size estimation are formulated as the minimum sample size necessary to achieve a specified sensitivity (proportion of detected truly differentially expressed genes) on average at a specified false discovery rate (FDR) level and specified expected proportion (π₁) of the truly differentially expressed genes in the array. Unfortunately, the probability of detecting the specified sensitivity in such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method using a small pilot dataset to estimate sample size is proposed. This method accounts for correlation and effect size heterogeneity among genes.

2.

OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression

Lawrence Hunter Zhiyong Lu James Firby William A Baumgartner Jr Helen L Johnson Philip V Ogren K Bretonnel Cohen 《BMC bioinformatics》2008,9(1):1-11

Background

Microarray technology provides an efficient means for globally exploring physiological processes governed by the coordinated expression of multiple genes. However, identification of genes differentially expressed in microarray experiments is challenging because of their potentially high type I error rate. Methods for large-scale statistical analyses have been developed but most of them are applicable to two-sample or two-condition data.

Results

We developed a large-scale multiple-group F-test based method, named ranking analysis of F-statistics (RAF), which is an extension of ranking analysis of microarray data (RAM) for the two-sample t-test. In this method, we proposed a novel random splitting approach to generate the null distribution instead of using permutation, which may not be appropriate for microarray data. We also implemented a two-simulation strategy to estimate the false discovery rate. Simulation results suggested that it has higher efficiency in finding differentially expressed genes among multiple classes at a lower false discovery rate than some commonly used methods. By applying our method to the experimental data, we found 107 genes with significantly differential expression among 4 treatments at <0.7% FDR, of which 31 are expressed sequence tags (ESTs) and 76 are unique genes that have known functions in the brain or central nervous system and belong to six major functional groups.

Conclusion

Our method is suitable for identifying differentially expressed genes among multiple groups, particularly when the sample size is small.

3.

RNA-Seq vs Dual- and Single-Channel Microarray Data: Sensitivity Analysis for Differential Expression and Clustering

Alina Sîrbu Gráinne Kerr Martin Crane Heather J. Ruskin 《PloS one》2012,7(12)

With the fast development of high-throughput sequencing technologies, a new generation of genome-wide gene expression measurements is under way. This is based on mRNA sequencing (RNA-seq), which complements the already mature technology of microarrays, and is expected to overcome some of the latter’s disadvantages. These RNA-seq data pose new challenges, however, as strengths and weaknesses have yet to be fully identified. Ideally, Next (or Second) Generation Sequencing measures can be integrated for more comprehensive gene expression investigation to facilitate analysis of whole regulatory networks. At present, however, the nature of these data is not very well understood. In this paper we study three alternative gene expression time series datasets for the Drosophila melanogaster embryo development, in order to compare three measurement techniques: RNA-seq, single-channel and dual-channel microarrays. The aim is to study the state of the art for the three technologies, with a view to assessing overlapping features, data compatibility and integration potential, in the context of time series measurements. This involves using established tools for each of the three different technologies, and technical and biological replicates (for RNA-seq and microarrays, respectively), due to the limited availability of biological RNA-seq replicates for time series data. The approach consists of a sensitivity analysis for differential expression and clustering. In general, the RNA-seq dataset displayed the highest sensitivity to differential expression. The single-channel data performed similarly for the differentially expressed genes common to gene sets considered. Cluster analysis was used to identify different features of the gene space for the three datasets, with higher similarities found for the RNA-seq and single-channel microarray dataset.

4.

Finding differentially expressed genes in two-channel DNA microarray datasets: how to increase reliability of data preprocessing

Rotter A Hren M Baebler S Blejec A Gruden K 《Omics : a journal of integrative biology》2008,12(3):171-182

Due to the great variety of preprocessing tools in two-channel expression microarray data analysis, it is difficult to choose the most appropriate one for a given experimental setup. In our study, two independent two-channel in-house microarray experiments as well as a publicly available dataset were used to investigate the influence of the selection of preprocessing methods (background correction, normalization, and duplicate spots correlation calculation) on the discovery of differentially expressed genes. We show that both the list of differentially expressed genes and the expression values of selected genes depend significantly on the preprocessing approach applied. The choice of normalization method had the highest impact on the results. We propose a simple but efficient approach to increase the reliability of obtained results, where two normalization methods which are theoretically distinct from one another are used on the same dataset. Then the intersection of results, that is, the lists of differentially expressed genes, is used in order to get a more accurate estimation of the genes that were de facto differentially expressed.

5.

A semi-parametric statistical model for integrating gene expression profiles across different platforms

Lyu Yafei Li Qunhua 《BMC bioinformatics》2016,17(1):51-60

Determining differentially expressed genes (DEGs) between biological samples is the key to understanding how genotype gives rise to phenotype. RNA-seq and microarray are two main technologies for profiling gene expression levels. However, considerable discrepancy has been found between DEGs detected using the two technologies. Integrating data across these two platforms has the potential to improve the power and reliability of DEG detection. We propose a rank-based semi-parametric model to determine DEGs using information across different sources and apply it to the integration of RNA-seq and microarray data. By incorporating both the significance of differential expression and the consistency across platforms, our method effectively detects DEGs with moderate but consistent signals. We demonstrate the effectiveness of our method using simulation studies, MAQC/SEQC data and a synthetic microRNA dataset. Our integration method is not only robust to noise and heterogeneity in the data, but also adaptive to the structure of data. In our simulations and real data studies, our approach shows a higher discriminative power and identifies more biologically relevant DEGs than eBayes, DESeq and some commonly used meta-analysis methods.

6.

Powers of Multiple-Testing Procedures for Identification of Genes Significantly Differentially Expressed in Microarray Experiments

TAN Yuan-De YAN Heng-Mei 《Acta Genetica Sinica》2006,33(12):1132-1140

Because of the high operating costs of microarray experiments, determining the number of replicates required to detect a significantly differentially expressed gene under a given multiple-testing procedure is of considerable importance. Calculation of the power or replicate number required in multiple-testing procedures provides design guidance for microarray experiments. To this end, a mixture distribution model of gene expression noise was constructed on the basis of permutation resampling. The mixture distribution model is suitable for various microarray data types, whether obtained from a single noise source or from multiple noise sources. Using this model and a chosen multiple-testing procedure, one can determine the number of biological replicates required for a given power, determine the power to detect a significantly differentially expressed gene for a given sample size, or choose the best multiple-testing method. As an example, a single-distribution model of the t-statistic was fitted to an observed microarray dataset of 3,000 genes responsive to stroke in rat, and then used to calculate the powers of four popular multiple-testing procedures to detect a gene with an expression change D. The results show that the B-procedure had the lowest power to detect a gene with a small change among the multiple-testing procedures, whereas the BH-procedure had the highest power. However, all multiple-testing procedures had the same power to identify a gene having the largest change. Similar to a single test, the power of the BH-procedure to detect a small change does not vary as the number of genes increases, but the powers of the other three multiple-testing procedures decline as the number of genes increases.
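
The contrast between the B-procedure and the BH-procedure can be illustrated with a minimal Python sketch. It assumes that "B" denotes the Bonferroni correction and "BH" the Benjamini-Hochberg step-up rule, which the abstract does not spell out. The function names and p-values are hypothetical, and the sketch does not reproduce the authors' mixture-model power calculation.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H0 for p-values at or below alpha / m (family-wise error control)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up rule (false discovery rate control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k * alpha / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values for eight genes (illustration only).
pvals = [0.001, 0.008, 0.012, 0.020, 0.030, 0.200, 0.500, 0.900]
n_bonferroni = sum(bonferroni_reject(pvals))
n_bh = sum(benjamini_hochberg_reject(pvals))
print(n_bonferroni, n_bh)
```

With these illustrative values the Bonferroni cutoff rejects one hypothesis and the BH rule rejects five, which is the kind of power gap for modest effects that the abstract describes.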

7.

A normalization strategy for comparing tag count data

Kadota K Nishiyama T Shimizu K 《Algorithms for molecular biology : AMB》2012,7(1):5-13

Background

High-throughput sequencing, such as ribonucleic acid sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq) analyses, enables various features of organisms to be compared through tag counts. Recent studies have demonstrated that the normalization step for RNA-seq data is critical for a more accurate subsequent analysis of differential gene expression. Development of a more robust normalization method is desirable for identifying the true difference in tag count data.

Results

We describe a strategy for normalizing tag count data, focusing on RNA-seq. The key concept is to remove data assigned as potential differentially expressed genes (DEGs) before calculating the normalization factor. Several R packages for identifying DEGs are currently available, and each package uses its own normalization method and gene ranking algorithm. We compared a total of eight package combinations: four R packages (edgeR, DESeq, baySeq, and NBPSeq) with their default normalization settings and with our normalization strategy. Many synthetic datasets under various scenarios were evaluated on the basis of the area under the curve (AUC) as a measure for both sensitivity and specificity. We found that packages using our strategy in the data normalization step overall performed well. This result was also observed for a real experimental dataset.

Conclusion

Our results showed that the elimination of potential DEGs is essential for more accurate normalization of RNA-seq data. The concept of this normalization strategy can widely be applied to other types of tag count data and to microarray data.
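
The key idea, removing potential DEGs before computing the normalization factor, can be sketched for two samples in plain Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the DEG flagging rule (largest absolute log-ratio after a first-pass normalization) is a crude stand-in for the gene ranking done by packages such as edgeR or DESeq, and all function names and counts are hypothetical.

```python
import math


def scaling_factor(counts_a, counts_b, keep=None):
    """Ratio of total counts (B over A), optionally restricted to kept genes."""
    idx = range(len(counts_a)) if keep is None else keep
    return sum(counts_b[i] for i in idx) / sum(counts_a[i] for i in idx)


def deg_aware_scaling_factor(counts_a, counts_b, drop_fraction=0.2):
    """Recompute the scaling factor after dropping likely DEGs.

    Step 1: normalize with all genes.
    Step 2: flag the genes with the largest absolute log-ratio as potential
            DEGs (a crude stand-in for a real DEG-ranking method).
    Step 3: recompute the factor from the remaining genes only.
    """
    first_pass = scaling_factor(counts_a, counts_b)
    # Pseudocount of 1 avoids log(0); sample B is rescaled by the first pass.
    log_ratio = [
        abs(math.log2((b / first_pass + 1) / (a + 1)))
        for a, b in zip(counts_a, counts_b)
    ]
    n_drop = int(len(counts_a) * drop_fraction)
    ranked = sorted(range(len(counts_a)), key=lambda i: log_ratio[i])
    keep = ranked[: len(counts_a) - n_drop]
    return scaling_factor(counts_a, counts_b, keep)


# Hypothetical counts: nine unchanged genes and one strongly induced gene.
sample_a = [100] * 9 + [100]
sample_b = [100] * 9 + [1100]
naive = scaling_factor(sample_a, sample_b)
refined = deg_aware_scaling_factor(sample_a, sample_b, drop_fraction=0.1)
print(naive, refined)
```

In this toy case the single induced gene inflates the naive factor to 2.0, while the factor computed without the flagged gene is 1.0, matching the nine genes that did not change.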

8.

Global Analysis of Differentially Expressed Genes and Proteins in the Wheat Callus Infected by Agrobacterium tumefaciens

Xiaohong Zhou Ke Wang Dongwen Lv Chengjun Wu Jiarui Li Pei Zhao Zhishan Lin Lipu Du Yueming Yan Xingguo Ye 《PloS one》2013,8(11)

Agrobacterium-mediated plant transformation is an extremely complex and evolved process involving genetic determinants of both the bacteria and the host plant cells. However, the mechanism of the determinants remains obscure, especially in some cereal crops such as wheat, which is recalcitrant to Agrobacterium-mediated transformation. In this study, differentially expressed genes (DEGs) and differentially expressed proteins (DEPs) were analyzed in wheat callus cells co-cultured with Agrobacterium by using RNA sequencing (RNA-seq) and two-dimensional electrophoresis (2-DE) in conjunction with mass spectrometry (MS). A set of 4,889 DEGs and 90 DEPs were identified, respectively. Most of them are related to metabolism, chromatin assembly or disassembly and immune defense. After comparative analysis, 24 of the 90 DEPs were detected in RNA-seq and proteomics datasets simultaneously. In addition, real-time RT-PCR experiments were performed to check the differential expression of the 24 genes, and the results were consistent with the RNA-seq data. According to gene ontology (GO) analysis, we found that a large proportion of these differentially expressed genes were related to the process of stress or immunity response. Several putative determinants and candidate effectors responsive to Agrobacterium-mediated transformation of wheat cells were discussed. We speculate that some of these genes are possibly related to Agrobacterium infection. Our results will help to understand the interaction between Agrobacterium and host cells, and may facilitate developing efficient transformation strategies in cereal crops.

9.

Effects of Sample Size on Differential Gene Expression,Rank Order and Prediction Accuracy of a Gene Signature

Cynthia Stretch Sheehan Khan Nasimeh Asgarian Roman Eisner Saman Vaisipour Sambasivarao Damaraju Kathryn Graham Oliver F. Bathe Helen Steed Russell Greiner Vickie E. Baracos 《PloS one》2013,8(6)

10.

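Statistical power of multiple-testing procedures for identifying differentially expressed genes in microarray studies

Tan Yuan-De Yan Heng-Mei 《Acta Genetica Sinica》2006,33(12):1132-1140

Given the cost of microarray experiments, the first consideration in designing one is how many replicates are needed to detect a significantly differentially expressed gene. Calculating the number of replicates (sample size) or the power required by a multiple-testing procedure provides important guidance for microarray experimental design. To this end, we constructed a mixture distribution model of gene expression noise based on permutation resampling. The method applies to all kinds of gene expression data, whether the expression noise comes from a single source or from multiple sources. Using the mixture model and a multiple-testing procedure at a given statistical power, researchers can obtain the minimum number of biological replicates needed in a microarray experiment; or determine, from the sample size, the power to detect a significantly differentially expressed gene; or choose the best statistical test given the sample size and power. As an example, the method was fitted to microarray data on 3,000 stroke-related genes in rat, yielding a single-distribution model (that is, a model with a single source of expression noise). Based on this model, we calculated the statistical power of four multiple-testing procedures to identify a gene with an expression difference D. The results show that, for detecting a small difference D, the B procedure has the lowest power among the four procedures and the BH procedure the highest. However, for identifying a gene with the largest expression difference, the four procedures have the same power. As with a traditional single test, the power of the BH procedure to detect a small change does not vary as the number of genes increases, whereas the power of the other three procedures declines as the number of genes increases.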

11.

How Many Genes Are Expressed in a Transcriptome? Estimation and Results for RNA-Seq

Luis Fernando García-Ortega Octavio Martínez 《PloS one》2015,10(6)

12.

Sample size for identifying differentially expressed genes in microarray experiments.

Sue-Jane Wang James J Chen 《Journal of computational biology》2004,11(4):714-726

Microarray technology allows simultaneous comparison of expression levels of thousands of genes under each condition. This paper concerns sample size calculation in the identification of differentially expressed genes between a control and a treated sample. In a typical experiment, only a fraction of genes (altered genes) is expected to be differentially expressed between two samples. Sample size determination depends on a number of factors including the specified significance level (alpha), the desired statistical power (1-beta), the fraction (eta) of truly altered genes out of the total g genes studied, and the effect sizes (Delta) for the altered genes. This paper proposes a method to calculate the number of arrays required to detect at least 100lambda % (where 0 < lambda ≤ 1) of the truly altered genes under the model of an equal effect size for all altered genes. The required numbers of arrays are tabulated for various values of alpha, beta, Delta, eta, and lambda for the one-sample and two-sample t-tests for g = 10,000. Based on the proposed approach, to identify up to 90% of truly altered genes among the unknown number of truly altered genes, the estimated numbers of arrays needed appear to be manageable. For instance, when the standardized effect size is at least 2.0, the number of arrays needed is less than or equal to 14 for the two-sample t-test and is less than or equal to 10 for the one-sample t-test. As the cost per array declines, such array numbers become practical. The proposed method offers a simple, intuitive, and practical way to determine the number of arrays needed in microarray experiments in which the true correlation structure among the genes under investigation cannot be reasonably assumed. An example dataset is used to illustrate the use of the proposed approach to plan microarray experiments.
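
How the significance level, power and standardized effect size drive the array count can be seen from the textbook normal-approximation formula for a single gene, n per group = 2(z_{1-alpha/2} + z_{1-beta})^2 / Delta^2. The Python sketch below uses that formula only. It is not the method proposed in the paper, which works with t-tests and also accounts for the fraction lambda of altered genes to be detected among g genes; the function name and the chosen settings are hypothetical.

```python
import math
from statistics import NormalDist


def arrays_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means with standardized effect size Delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Hypothetical settings: a stringent per-gene alpha, as is common when
# thousands of genes are tested at once.
for delta in (1.0, 1.5, 2.0):
    print(delta, arrays_per_group(delta, alpha=0.001, power=0.9))
```

Because the normal approximation ignores the extra uncertainty of estimating the variance from few arrays, it tends to give smaller numbers than a t-test-based calculation and should be read as a rough lower guide.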

13.

Split-Plot Microarray Experiments

Tsai PW Lee ML 《Applied bioinformatics》2005,4(3):187-194

This article focuses on microarray experiments with two or more factors in which treatment combinations of the factors corresponding to the samples paired together onto arrays are not completely random. A main effect of one (or more) factor(s) is confounded with arrays (the experimental blocks). This is called a split-plot microarray experiment. We utilise an analysis of variance (ANOVA) model to assess differentially expressed genes for between-array and within-array comparisons that are generic under a split-plot microarray experiment. Instead of standard t- or F-test statistics that rely on mean square errors of the ANOVA model, we use a robust method, referred to as 'a pooled percentile estimator', to identify genes that are differentially expressed across different treatment conditions. We illustrate the design and analysis of split-plot microarray experiments based on a case application described by Jin et al. A brief discussion of power and sample size for split-plot microarray experiments is also presented.

14.

Generation of patterns from gene expression data by assigning confidence to differentially expressed genes

Manduchi E Grant GR McKenzie SE Overton GC Surrey S Stoeckert CJ 《Bioinformatics (Oxford, England)》2000,16(8):685-698

15.

Global transcriptome analysis of Clostridium thermocellum ATCC 27405 during growth on dilute acid pretreated Populus and switchgrass

Charlotte M Wilson Miguel Rodriguez Jr Courtney M Johnson Stanton L Martin Tzu Ming Chu Russ D Wolfinger Loren J Hauser Miriam L Land Dawn M Klingeman Mustafa H Syed Arthur J Ragauskas Timothy J Tschaplinski Jonathan R Mielenz Steven D Brown 《Biotechnology for biofuels》2013,6(1):179

16.

Probe-level measurement error improves accuracy in detecting differential gene expression

Liu X Milo M Lawrence ND Rattray M 《Bioinformatics (Oxford, England)》2006,22(17):2107-2113

MOTIVATION: Finding differentially expressed genes is a fundamental objective of a microarray experiment. Numerous methods have been proposed to perform this task. Existing methods are based on point estimates of gene expression level obtained from each microarray experiment. This approach discards potentially useful information about measurement error that can be obtained from an appropriate probe-level analysis. Probabilistic probe-level models can be used to measure gene expression and also provide a level of uncertainty in this measurement. This probe-level measurement error provides useful information which can help in the identification of differentially expressed genes. RESULTS: We propose a Bayesian method to include probe-level measurement error into the detection of differentially expressed genes from replicated experiments. A variational approximation is used for efficient parameter estimation. We compare this approximation with MAP and MCMC parameter estimation in terms of computational efficiency and accuracy. The method is used to calculate the probability of positive log-ratio (PPLR) of expression levels between conditions. Using the measurements from a recently developed Affymetrix probe-level model, multi-mgMOS, we test PPLR on a spike-in dataset and a mouse time-course dataset. Results show that the inclusion of probe-level measurement error improves accuracy in detecting differential gene expression. AVAILABILITY: The MAP approximation and variational inference described in this paper have been implemented in an R package pplr. The MCMC method is implemented in Matlab. Both are available from http://umber.sbs.man.ac.uk/resources/puma.

17.

Identification of differentially expressed genes using multi-resolution wavelet transformation analysis combined with SAM

Yazhou Wu Ling Zhang Ling Liu Yanqi Zhang Dong Yi 《Gene》2012

Although many statistical methods have been proposed for identifying differentially expressed genes, the optimal approach has still not been established. Therefore, it is necessary to develop more efficient methods of finding differentially expressed genes while accounting for noise and false discovery rate (FDR). We propose a method based on multi-resolution wavelet transformation analysis combined with SAM for identifying differentially expressed genes by adjusting the Δ and computing the FDR. This method was applied to a microarray expression dataset from adenoma patients and normal subjects. The number of differentially expressed genes gradually reduced with an increasing Δ value, and the FDR was reduced after wavelet transformation. At a given Δ value, the FDR was also reduced before and after wavelet transformation. In conclusion, a greater number and quality of differentially expressed genes were detected using the method when compared to non-transformed data, and the FDRs were notably more controlled and reduced.

18.

Genome-Wide PhoB Binding and Gene Expression Profiles Reveal the Hierarchical Gene Regulatory Network of Phosphate Starvation in Escherichia coli

Chi Yang Tzu-Wen Huang Shiau-Yi Wen Chun-Yang Chang Shih-Feng Tsai Whei-Fen Wu Chuan-Hsiung Chang 《PloS one》2012,7(10)

19.

Differential gene expression detection and sample classification using penalized linear regression models

Wu B 《Bioinformatics (Oxford, England)》2006,22(4):472-476

Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p > n), microarray data analysis poses big challenges for statistical analysis. An obvious problem owing to the 'large p small n' is over-fitting. Just by chance, we are likely to find some non-differentially expressed genes that can classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in microarray data analysis. The SAM statistics proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple, intuitive and prove to be useful in empirical studies. Recently Wu proposed the penalized t/F-statistics with shrinkage by formally using the L1 penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discuss the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. using the L1 penalized regression models. And we show that the penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
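
The shrinkage effect of an L1 penalty can be seen in its simplest form: for a single coefficient, minimizing 0.5 * (z - b)^2 + lam * |b| over b gives the soft-thresholding rule sign(z) * max(|z| - lam, 0). The Python sketch below shows only this generic operator, as a hedged illustration of why small differences are set exactly to zero. It does not reproduce the penalized t/F-statistics or the shrunken centroid derivation of the paper, and the function name and values are hypothetical.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - b) ** 2 + lam * abs(b) over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small effects are shrunk exactly to zero


# Hypothetical per-gene differences between two class means.
differences = [2.5, -0.3, 0.8, -1.9, 0.1]
shrunken = [soft_threshold(d, lam=1.0) for d in differences]
print(shrunken)
```

Only the two genes whose difference exceeds the penalty in absolute value keep a nonzero value, which is how an L1 penalty performs gene selection and shrinkage in one step.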

20.

Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm

Saurav Mallik Zhongming Zhao 《BMC systems biology》2018,12(8):126

Background

Gene signatures are important for representing the molecular changes in disease genomes or in cells under specific conditions, and have often been used to separate samples into different groups for better research or clinical treatment. While many methods and applications are available in the literature, there is still a lack of powerful ones that can take account of the complex data and detect the most informative signatures.

Methods

In this article, we present a new framework for identifying gene signatures using Pareto-optimal cluster size identification for RNA-seq data. We first performed pre-filtering steps and normalization, then utilized the empirical Bayes test in the Limma package to identify the differentially expressed genes (DEGs). Next, we used a multi-objective optimization technique, “Multi-objective optimization for collecting cluster alternatives” (the MOCCA R package), on these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data based on the optimal cluster size. The best cluster was obtained by computing the average Spearman’s Correlation Score among all the genes belonging to the module in a pairwise manner. The best cluster is treated as the signature for the respective disease or cellular condition.
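
The final scoring step, an average pairwise Spearman correlation per cluster, can be sketched in plain Python. This is a minimal illustration that assumes the best cluster is the one with the highest average score, which the abstract implies but does not state outright. The Limma, MOCCA and k-means steps are not reproduced, and all function names and expression values are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_pairwise_spearman(cluster):
    """Average Spearman correlation over all gene pairs in one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(spearman(x, y) for x, y in pairs) / len(pairs)


# Hypothetical expression profiles (rows = genes, columns = samples).
tight = [[1, 2, 3, 4, 5], [2, 4, 5, 8, 9], [10, 20, 25, 40, 80]]
loose = [[1, 2, 3, 4, 5], [5, 3, 4, 1, 2], [2, 5, 1, 4, 3]]
best = max([tight, loose], key=mean_pairwise_spearman)
```

Here the cluster whose genes rise together across samples scores higher than the cluster with unrelated profiles, so it would be selected as the candidate signature under the stated assumption.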

Results

We applied our framework to a cervical cancer RNA-seq dataset, which included 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. We identified a total of 582 DEGs by Limma analysis of SCC versus ADENO samples. Among them, 260 are up-regulated genes and 322 are down-regulated genes. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster has a total of 35 DEGs consisting of all-upregulated genes. For validation, we ran PAMR (prediction analysis for microarrays) classifier on the selected best cluster, and assessed the classification performance. Our evaluation, measured by sensitivity, specificity, precision, and accuracy, showed high confidence.

Conclusions

Our framework identified a multi-objective based cluster that is treated as a signature that can classify the disease and control group of samples with higher classification performance (accuracy 0.935) for the corresponding disease. Our method is useful for finding signatures in any RNA-seq or microarray data.
