Similar articles
 Found 20 similar articles (search time: 31 ms)
1.

Background  

Before conducting a microarray experiment, one important issue is determining the number of arrays required to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods formulate sample size estimation as finding the minimum sample size necessary to achieve, on average, a specified sensitivity (the proportion of truly differentially expressed genes that are detected) at a specified false discovery rate (FDR) level and a specified expected proportion (π1) of truly differentially expressed genes in the array. Unfortunately, the probability of actually attaining the specified sensitivity under such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method that uses a small pilot dataset to estimate sample size is proposed. This method accounts for correlation and effect size heterogeneity among genes.

2.
MOTIVATION: Microarray experiments often involve hundreds or thousands of genes. In a typical experiment, only a fraction of genes are expected to be differentially expressed; in addition, the measured intensities of different genes may be correlated. Depending on the experimental objectives, sample size calculations can be based on one of three specified measures: the sensitivity, true discovery, or accuracy rate. The sample size problem is formulated as the number of arrays needed to achieve the desired fraction of the specified measure at the desired family-wise power, for a given type I error and (standardized) effect size. RESULTS: We present a general approach for estimating sample size under independent and equally correlated models using binomial and beta-binomial models, respectively. The sample sizes needed for a two-sample z-test are computed; the computed theoretical numbers agree well with the Monte Carlo simulation results. However, under more general correlation structures, the beta-binomial model can underestimate the number of arrays needed by about 1-5. CONTACT: jchen@nctr.fda.gov.
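The binomial formulation for independent genes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, every truly differentially expressed gene is assumed to share the same standardized effect size, groups are balanced, and the per-gene test is a two-sided two-sample z-test with known variance.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def per_gene_power(n_per_group, effect_size, alpha):
    """Power of a two-sided two-sample z-test with n arrays per group."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Both rejection tails are counted.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def prob_at_least(k, m, p):
    """P(X >= k) for X ~ Binomial(m, p)."""
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k, m + 1))


def arrays_per_group(m_de, effect_size, alpha, sensitivity, family_power,
                     n_max=1000):
    """Smallest n per group such that, with probability >= family_power,
    at least `sensitivity` of the m_de truly DE genes are detected
    (independence assumed, so the number detected is binomial)."""
    k = ceil(sensitivity * m_de - 1e-9)  # tolerance guards float round-off
    for n in range(2, n_max + 1):
        if prob_at_least(k, m_de, per_gene_power(n, effect_size, alpha)) >= family_power:
            return n
    raise ValueError("no sample size up to n_max meets the requirement")
```

For example, `arrays_per_group(50, 1.0, 0.001, 0.8, 0.95)` searches for the number of arrays per group needed to detect at least 40 of 50 truly differentially expressed genes with 95% probability. Handling equally correlated genes would require replacing the binomial tail with a beta-binomial one, which this sketch does not attempt.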

3.
Accurately identifying differentially expressed genes from microarray data is not a trivial task, partly because of poor variance estimates of gene expression signals. Here, after analyzing 380 replicated microarray experiments, we found that probesets have typical, distinct variances that can be estimated based on a large number of microarray experiments. These probeset-specific variances depend at least in part on the function of the probed gene: genes for ribosomal or structural proteins often have a small variance, while genes implicated in stress responses often have large variances. We used these variance estimates to develop a statistical test for differentially expressed genes called EVE (external variance estimation). The EVE algorithm performs better than the t-test and LIMMA on some real-world data, where external information from appropriate databases is available. Thus, EVE helps to maximize the information gained from a typical microarray experiment. Only a large number of replicates can guarantee the identification of nearly all truly differentially expressed genes. Nevertheless, our simulation studies suggest that even limited numbers of replicates will usually result in good coverage of strongly differentially expressed genes.

4.
MOTIVATION: Microarray technology has emerged as a powerful tool in life science. One major application of microarray technology is to identify differentially expressed genes under various conditions. Currently, the statistical methods used to analyze microarray data are generally unsatisfactory, mainly due to the lack of understanding of the distribution and error structure of microarray data. RESULTS: We develop a generalized likelihood ratio (GLR) test based on the two-component model proposed by Rocke and Durbin to identify differentially expressed genes from microarray data. Simulation studies show that the GLR test is more powerful than commonly used methods, such as the fold-change method and the two-sample t-test. When applied to microarray data, the GLR test identifies more differentially expressed genes than the t-test, has a lower false discovery rate and shows more consistency over independently repeated experiments. AVAILABILITY: The approach is implemented in software called GLR, which is freely available for downloading at http://www.cc.utah.edu/~jw27c60

5.
False discovery rate, sensitivity and sample size for microarray studies   Cited by: 10 (self-citations: 0; citations by others: 10)
MOTIVATION: In microarray data studies most researchers are keenly aware of the potentially high rate of false positives and the need to control it. One key statistical shift is the move away from the well-known P-value to the false discovery rate (FDR). Perhaps less discussion has been devoted to the sensitivity or the associated false negative rate (FNR). The purpose of this paper is to explain in simple ways why the shift from P-value to FDR for statistical assessment of microarray data is necessary, to elucidate the determining factors of FDR and, for a two-sample comparative study, to discuss its control via sample size at the design stage. RESULTS: We use a mixture model, involving differentially expressed (DE) and non-DE genes, that captures the most common problem of finding DE genes. Factors determining FDR are (1) the proportion of truly differentially expressed genes, (2) the distribution of the true differences, (3) measurement variability and (4) sample size. Many current small microarray studies are plagued with large FDR, but controlling FDR alone can lead to unacceptably large FNR. In evaluating a design of a microarray study, sensitivity or FNR curves should be computed routinely together with FDR curves. Under certain assumptions, the FDR and FNR curves coincide, thus simplifying the choice of sample size for controlling the FDR and FNR jointly.
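How the four listed factors enter the FDR can be illustrated with a simple two-component mixture. The sketch below is an assumption-laden illustration rather than the paper's model: it takes a single common standardized effect size (true difference divided by the measurement standard deviation) for all DE genes, a two-sided two-sample z-test at a fixed per-gene significance level `alpha`, and the approximation FDR ≈ π0·α / (π0·α + π1·sensitivity). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def fdr_and_sensitivity(n_per_group, pi1, effect_size, alpha):
    """Expected FDR and sensitivity under a two-component mixture.

    pi1         -- proportion of truly DE genes
    effect_size -- true difference / standard deviation (assumed common)
    alpha       -- per-gene two-sided significance level
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    sensitivity = std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)
    false_pos = (1 - pi1) * alpha   # expected share of genes falsely rejected
    true_pos = pi1 * sensitivity    # expected share of genes correctly rejected
    return false_pos / (false_pos + true_pos), sensitivity


# Tabulate both curves against sample size for one hypothetical design.
for n in (5, 10, 20, 40):
    fdr, sens = fdr_and_sensitivity(n, pi1=0.05, effect_size=1.0, alpha=0.001)
    print(f"n={n:3d}  FDR={fdr:.3f}  sensitivity={sens:.3f}")
```

Plotting both columns against `n` gives the kind of paired FDR and sensitivity curves the abstract recommends computing at the design stage.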

6.
Testing for differentially expressed genes with microarray data   Cited by: 1 (self-citations: 1; citations by others: 0)
This paper uses Monte Carlo simulations to compare the type I error and power of the one- and two-sample t-tests and the one- and two-sample permutation tests for detecting differences in gene expression between two microarray samples with replicates. When data are generated from a normal distribution, the type I errors and powers of the one-sample parametric t-test and one-sample permutation test are very close, as are those of the two-sample t-test and two-sample permutation test, provided that the number of replicates is adequate. When data are generated from a t-distribution, the permutation tests outperform the corresponding parametric tests if the number of replicates is at least five. For data from a two-color dye swap experiment, the one-sample test appears to perform better than the two-sample test since expression measurements for control and treatment samples from the same spot are correlated. For data from independent samples, such as a one-channel array experiment or a two-channel array experiment using a reference design, the two-sample t-tests appear more powerful than the one-sample t-tests.
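A two-sample permutation test for a single gene can be sketched as follows. This is a generic exact version that enumerates every relabelling of the pooled replicates and uses the absolute difference in group means as the statistic; the paper's own statistic and simulation settings are not given in the abstract, so the function name and these choices are assumptions.

```python
from itertools import combinations
from math import comb


def permutation_p_value(x, y):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    total = sum(pooled)

    def abs_mean_diff(sum_first):
        # |mean of first group - mean of second group| from the first group's sum
        return abs(sum_first / n1 - (total - sum_first) / n2)

    observed = abs_mean_diff(sum(x))
    hits = 0
    for idx in combinations(range(len(pooled)), n1):
        # Small tolerance so relabellings that tie with the observed value count.
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            hits += 1
    return hits / comb(len(pooled), n1)
```

With balanced groups of n replicates and no ties, the smallest attainable two-sided p-value is 2 / C(2n, n): 0.1 for n = 3 and about 0.008 for n = 5. This granularity is one plausible reason, not stated in the abstract, why permutation tests need an adequate number of replicates to be useful.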

7.

Background

Microarray technology provides an efficient means for globally exploring physiological processes governed by the coordinated expression of multiple genes. However, identification of genes differentially expressed in microarray experiments is challenging because of their potentially high type I error rate. Methods for large-scale statistical analyses have been developed but most of them are applicable to two-sample or two-condition data.

Results

We developed a large-scale multiple-group F-test based method, named ranking analysis of F-statistics (RAF), which is an extension of ranking analysis of microarray data (RAM) for the two-sample t-test. In this method, we propose a novel random splitting approach to generate the null distribution instead of using permutation, which may not be appropriate for microarray data. We also implemented a two-simulation strategy to estimate the false discovery rate. Simulation results suggest that it has higher efficiency in finding differentially expressed genes among multiple classes at a lower false discovery rate than some commonly used methods. By applying our method to the experimental data, we found 107 genes with significantly different expression among 4 treatments at <0.7% FDR, of which 31 belong to the expressed sequence tags (ESTs) and 76 are unique genes that have known functions in the brain or central nervous system and belong to six major functional groups.

Conclusion

Our method is suitable for identifying differentially expressed genes among multiple groups, in particular when the sample size is small.

8.
Microarray experiments contribute significantly to the progress in disease treatment by enabling a precise and early diagnosis. One of the major objectives of microarray experiments is to identify differentially expressed genes under various conditions. The statistical methods currently used to analyse microarray data are inadequate, mainly due to the lack of understanding of the distribution of microarray data. We present a nonparametric likelihood ratio (NPLR) test to identify differentially expressed genes using microarray data. The NPLR test is highly robust against extreme values and makes no assumption about the distribution of the parent population. Simulation studies show that the NPLR test is more powerful than some of the commonly used methods, such as the two-sample t-test, the Mann-Whitney U-test and significance analysis of microarrays (SAM). When applied to microarray data, the NPLR test identifies more differentially expressed genes than its competitors. The asymptotic distribution of the NPLR test statistic and the p-value function are presented. The application of the NPLR method is shown using both synthetic and real-life data. The biological significance of some of the genes detected only by the NPLR method is discussed.

9.
Hu J, Wright FA. Biometrics, 2007, 63(1): 41-49
The identification of the genes that are differentially expressed in two-sample microarray experiments remains a difficult problem when the number of arrays is very small. We discuss the implications of using ordinary t-statistics and examine other commonly used variants. For oligonucleotide arrays with multiple probes per gene, we introduce a simple model relating the mean and variance of expression, possibly with gene-specific random effects. Parameter estimates from the model have natural shrinkage properties that guard against inappropriately small variance estimates, and the model is used to obtain a differential expression statistic. A limiting value to the positive false discovery rate (pFDR) for ordinary t-tests provides motivation for our use of the data structure to improve variance estimates. Our approach performs well compared to other proposed approaches in terms of the false discovery rate.

10.
Microarray technology is rapidly emerging for genome-wide screening of differentially expressed genes between clinical subtypes or different conditions of human diseases. Traditional statistical testing approaches, such as the two-sample t-test or Wilcoxon test, are frequently used for evaluating statistical significance of informative expressions but require adjustment for large-scale multiplicity. Due to its simplicity, Bonferroni adjustment has been widely used to circumvent this problem. It is well known, however, that the standard Bonferroni test is often very conservative. In the present paper, we compare three multiple testing procedures in the microarray context: the original Bonferroni method, a Bonferroni-type improved single-step method and a step-down method. The latter two methods are based on nonparametric resampling, by which the null distribution can be derived with the dependency structure among gene expressions preserved and the family-wise error rate accurately controlled at the desired level. We also present a sample size calculation method for designing microarray studies. Through simulations and data analyses, we find that the proposed methods for testing and sample size calculation are computationally fast and control error and power precisely.
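The contrast between a single-step Bonferroni adjustment and a step-down adjustment can be seen in a small sketch. Note the substitution: the step-down procedure shown here is Holm's method, which works directly on p-values, whereas the paper's single-step and step-down methods are resampling-based and preserve the dependency among genes. The code is therefore an analogue for intuition only, and the function names are illustrative.

```python
def bonferroni(p_values):
    """Single-step adjustment: multiply every p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment (not resampling-based).

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted
```

Both procedures control the family-wise error rate, and every Holm-adjusted p-value is at most the corresponding Bonferroni-adjusted value, so the step-down version never rejects fewer genes.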

11.
Tan YD, Fornage M, Fu YX. Genomics, 2006, 88(6): 846-854
Microarray technology provides a powerful tool for profiling the expression of thousands of genes simultaneously, which makes it possible to explore the molecular and metabolic etiology of the development of a complex disease under study. However, classical statistical methods and technologies are not applicable to microarray data. Therefore, it is necessary to develop powerful methods for large-scale statistical analyses. In this paper, we describe a novel method called Ranking Analysis of Microarray Data (RAM). RAM, which is a large-scale two-sample t-test method, is based on comparisons between a set of ranked T statistics and a set of ranked Z values (a set of ranked estimated null scores) yielded by a "randomly splitting" approach instead of a "permutation" approach, together with a two-simulation strategy for estimating the proportion of genes identified by chance, i.e., the false discovery rate (FDR). The results obtained from the simulated and observed microarray data show that RAM is more efficient than Significance Analysis of Microarrays in identifying differentially expressed genes and estimating the FDR under undesirable conditions such as a large fudge factor, small sample size, or mixture distribution of noises.

12.
We consider identifying differentially expressed genes between two patient groups using a microarray experiment. We propose a sample size calculation method for a specified number of true rejections while controlling the false discovery rate at a desired level. Input parameters for the sample size calculation include the allocation proportion in each group, the number of genes in each array, the number of differentially expressed genes and the effect sizes among the differentially expressed genes. We have a closed-form sample size formula if the projected effect sizes are equal among differentially expressed genes. Otherwise, our method requires a numerical method to solve an equation. Simulation studies are conducted to show that the calculated sample sizes are accurate in practical settings. The proposed method is demonstrated with a real study.
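For the equal-effect-size case, a closed-form calculation can be derived from the listed inputs under a normal approximation. The sketch below is such a derivation and should not be read as the paper's exact formula: it assumes one-sided two-sample z-tests, approximates the FDR by m0·α / (m0·α + r1), and solves for the total sample size that gives each DE gene power r1/m1. All parameter names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def fdr_sample_size(m, m1, r1, fdr, effect_size, a1=0.5):
    """Total number of arrays (both groups) under a normal approximation.

    m           -- genes on the array
    m1          -- truly differentially expressed genes
    r1          -- desired expected number of true rejections (0 < r1 < m1)
    fdr         -- target false discovery rate
    effect_size -- common standardized effect size of the DE genes
    a1          -- allocation proportion of group 1 (group 2 gets 1 - a1)
    """
    if not 0 < r1 < m1:
        raise ValueError("r1 must lie strictly between 0 and m1")
    m0 = m - m1
    a2 = 1 - a1
    # Per-gene level at which m0*alpha / (m0*alpha + r1) equals the target FDR.
    alpha = r1 * fdr / (m0 * (1 - fdr))
    if alpha >= 1:
        raise ValueError("target FDR is too lenient for this approximation")
    beta = 1 - r1 / m1  # allowed per-gene type II error
    z = NormalDist().inv_cdf
    n = (z(1 - alpha) + z(1 - beta)) ** 2 / (a1 * a2 * effect_size ** 2)
    return ceil(n)
```

A call such as `fdr_sample_size(m=4000, m1=40, r1=24, fdr=0.01, effect_size=1.0)` returns the total number of arrays for a balanced design; unbalanced allocation (for instance `a1=0.3`) increases the requirement because the product a1·a2 shrinks.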

13.
Wang D, Cheng L, Zhang Y, Wu R, Wang M, Gu Y, Zhao W, Li P, Li B, Zhang Y, Wang H, Huang Y, Wang C, Guo Z. Molecular BioSystems, 2012, 8(3): 818-827
Based on the assumption that only a few genes are differentially expressed in a disease and have balanced upward and downward expression level changes, researchers usually normalise microarray data by forcing all of the arrays to have the same probe intensity distributions to remove technical variations in the data. However, accumulated evidence suggests that gene expression could be widely altered in cancer, so we need to evaluate the sensitivities of biological discoveries to violation of the normalisation assumption. Here, we show that the medians of the original probe intensities increase in most of the ten cancer types analyzed in this paper, indicating that genes may be widely up-regulated in many cancer types. Thus, at least for cancer studies, normalising all arrays to have the same distribution of probe intensities regardless of the state (diseased vs. normal) tends to falsely produce many down-regulated differentially expressed (DE) genes while missing many truly up-regulated DE genes. We also show that the DE genes detected solely in the non-normalised data for cancers are highly reproducible across different datasets for the same cancers, indicating that effective biological signals naturally exist in the non-normalised data. Because the power of current statistical analyses using the non-normalised data tends to be low, we suggest selecting DE genes in both normalised and non-normalised data and then filtering out the false DE genes extracted from the normalised data that show opposite deregulation directions in the non-normalised data.

14.
This article focuses on microarray experiments with two or more factors in which treatment combinations of the factors corresponding to the samples paired together onto arrays are not completely random. A main effect of one (or more) factor(s) is confounded with arrays (the experimental blocks). This is called a split-plot microarray experiment. We utilise an analysis of variance (ANOVA) model to assess differentially expressed genes for between-array and within-array comparisons that are generic under a split-plot microarray experiment. Instead of standard t- or F-test statistics that rely on mean square errors of the ANOVA model, we use a robust method, referred to as 'a pooled percentile estimator', to identify genes that are differentially expressed across different treatment conditions. We illustrate the design and analysis of split-plot microarray experiments based on a case application described by Jin et al. A brief discussion of power and sample size for split-plot microarray experiments is also presented.

15.
MOTIVATION: A major focus of current cancer research is to identify genes that can be used as markers for prognosis and diagnosis, and as targets for therapy. Microarray technology has been applied extensively for this purpose, even though it has been reported that the agreement between microarray platforms is poor. A critical question is: how can we best combine the measurements of matched genes across microarray platforms to develop diagnostic and prognostic tools related to the underlying biology? RESULTS: We introduce a statistical approach within a Bayesian framework to combine the microarray data on matched genes from three investigations of gene expression profiling of B-cell chronic lymphocytic leukemia (CLL) and normal B cells (NBC) using three different microarray platforms: oligonucleotide arrays, cDNA arrays printed on glass slides and cDNA arrays printed on nylon membranes. Using this approach, we identified a number of genes that were consistently differentially expressed between CLL and NBC samples.

16.
Hu J, Xu J. BMC Genomics, 2010, 11(Z2): S3

Motivation

Identification of differentially expressed genes from microarray datasets is one of the most important analyses for microarray data mining. Popular algorithms such as the t-test rank genes based on a single statistic. The false positive rate of these methods can be reduced by considering other features of differentially expressed genes.

Results

We propose a pattern recognition strategy for identifying differentially expressed genes. Genes are mapped to a two-dimensional feature space composed of the average difference of gene expression and the average expression level. A density based pruning algorithm (DB pruning) is developed to screen out potential differentially expressed genes, which are usually located in the sparse boundary region. Biases of popular algorithms for identifying differentially expressed genes are visually characterized. Experiments on 17 datasets from the Gene Expression Omnibus (GEO) database with experimentally verified differentially expressed genes showed that DB pruning can significantly improve the prediction accuracy of popular identification algorithms such as t-test, rank product, and fold change.

Conclusions

Density based pruning of non-differentially expressed genes is an effective method for enhancing statistical testing based algorithms for identifying differentially expressed genes. It improves t-test, rank product, and fold change by 11% to 50% in the numbers of identified true differentially expressed genes. The source code of DB pruning is freely available on our website http://mleg.cse.sc.edu/degprune

17.
One of the main objectives in the analysis of microarray experiments is the identification of genes that are differentially expressed under two experimental conditions. This task is complicated by the noisiness of the data and the large number of genes that are examined simultaneously. Here, we present a novel technique for identifying differentially expressed genes that does not originate from a sophisticated statistical model but rather from an analysis of biological reasoning. The new technique, which is based on calculating rank products (RP) from replicate experiments, is fast and simple. At the same time, it provides a straightforward and statistically stringent way to determine the significance level for each gene and allows for the flexible control of the false-detection rate and familywise error rate in the multiple testing situation of a microarray experiment. We use the RP technique on three biological data sets and show that in each case it performs more reliably and consistently than the non-parametric t-test variant implemented in Tusher et al.'s significance analysis of microarrays (SAM). We also show that the RP results are reliable in highly noisy data. An analysis of the physiological function of the identified genes indicates that the RP approach is powerful for identifying biologically relevant expression changes. In addition, using RP can lead to a sharp reduction in the number of replicate experiments needed to obtain reproducible results.
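The core rank product statistic can be sketched briefly. The version below is a minimal illustration under assumptions not spelled out in the abstract: genes are ranked within each replicate by log fold change (rank 1 for the most up-regulated gene), the statistic is the geometric mean of a gene's ranks across replicates, and ties are broken by input order. The significance assessment is omitted, and the function name is hypothetical.

```python
from math import prod


def rank_products(log_ratios):
    """Rank product per gene for up-regulation.

    log_ratios[g][k] is the log fold change of gene g in replicate k.
    A small rank product means the gene is consistently near the top.
    """
    n_genes = len(log_ratios)
    n_reps = len(log_ratios[0])
    ranks = [[0] * n_reps for _ in range(n_genes)]
    for k in range(n_reps):
        # Rank 1 goes to the largest log ratio in this replicate.
        order = sorted(range(n_genes), key=lambda g: log_ratios[g][k], reverse=True)
        for rank, g in enumerate(order, start=1):
            ranks[g][k] = rank
    # Geometric mean of the ranks across replicates.
    return [prod(gene_ranks) ** (1.0 / n_reps) for gene_ranks in ranks]
```

A gene ranked first in every replicate gets a rank product of 1, the smallest possible value. To find down-regulated genes the same function can be applied to the negated log ratios.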

18.
This paper presents Fuzzy-Adaptive-Subspace-Iteration-based Two-way Clustering (FASIC) of microarray data for finding differentially expressed genes (DEGs) from two-sample microarray experiments. The concept of fuzzy membership is introduced to transform the hard adaptive subspace iteration (ASI) algorithm into a fuzzy-ASI algorithm to perform two-way clustering. The proposed approach follows a progressive framework to assign a relevance value to genes associated with each cluster. Subsequently, each gene cluster is scored and ranked based on its potential to provide a correct classification of the sample classes. These ranks are converted into P values using the R-test, and the significance of each gene is determined. A fivefold validation is performed on the DEGs selected using the proposed approach. Empirical analyses on a number of simulated microarray data sets are conducted to quantify the results obtained using the proposed approach. To exemplify the efficacy of the proposed approach, further analyses on different real microarray data sets are also performed.

19.
Heart failure (HF) is a major cause of mortality and morbidity in the developed world. Gene expression profiles of animal models of heart failure have been used in a number of studies to understand human cardiac disease. In this study, statistical methods for analysing microarray data on cardiac tissues from dogs with pacing-induced HF were used to identify genes differentially expressed between normal and two abnormal tissues. The unsupervised techniques principal component analysis (PCA) and cluster analysis were explored to distinguish between three different groups of 12 arrays and to separate the genes that are up-regulated in different conditions among the 23912 genes in the canine heart failure microarray data. Using one-way analysis of variance (ANOVA), 1802 of the 23912 genes were found to be differentially expressed among the three groups at the 5% level of significance and 496 genes at the 1% level of significance. The genes clustered using PCA and cluster analysis were explored to understand HF, and a small number of differentially expressed genes related to HF were identified.

20.
A key step in the analysis of microarray data is the selection of genes that are differentially expressed. Ideally, such experiments should be properly replicated in order to infer both technical and biological variability, and the data should be subjected to rigorous hypothesis tests to identify the differentially expressed genes. However, in microarray experiments involving the analysis of very large numbers of biological samples, replication is not always practical. Therefore, there is a need for a method to select differentially expressed genes in a rational way from insufficiently replicated data. In this paper, we describe a simple method that uses bootstrapping to generate an error model from a replicated pilot study that can be used to identify differentially expressed genes in subsequent large-scale studies on the same platform, but in which there may be no replicated arrays. The method builds a stratified error model that includes array-to-array variability, feature-to-feature variability and the dependence of error on signal intensity. We apply this model to the characterization of the host response in a model of bacterial infection of human intestinal epithelial cells. We demonstrate the effectiveness of error model based microarray experiments and propose this as a general strategy for a microarray-based screening of large collections of biological samples.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号