Similar articles
20 similar articles found.
1.
High-throughput technologies, such as gene expression arrays and protein mass spectrometry, allow one to simultaneously evaluate thousands of potential biomarkers that could distinguish different tissue types. Of particular interest here is distinguishing between cancerous and normal organ tissues. We consider statistical methods to rank genes (or proteins) with regard to differential expression between tissues. Various statistical measures are considered, and we argue that two measures related to the Receiver Operating Characteristic Curve are particularly suitable for this purpose. We also propose that sampling variability in the gene rankings be quantified, and suggest using the "selection probability function," the probability distribution of rankings for each gene. This is estimated via the bootstrap. A real dataset, derived from gene expression arrays of 23 normal and 30 ovarian cancer tissues, is analyzed. Simulation studies are also used to assess the relative performance of different statistical gene ranking measures and our quantification of sampling variability. Our approach leads naturally to a procedure for sample-size calculations, appropriate for exploratory studies that seek to identify differentially expressed genes.
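As a rough illustration of the idea in this abstract, the sketch below ranks genes by the empirical area under the ROC curve (AUC) and uses the bootstrap to estimate how often each gene lands in the top k. It is not the authors' code: the function names, the choice of AUC as the single ROC-related measure, the one-sided ranking (over-expression in cancer) and the top-k summary of the ranking distribution are all assumptions made for the example.

```python
import random

def auc(normal, cancer):
    """Empirical AUC: P(cancer > normal) + 0.5 * P(tie)."""
    wins = sum((c > n) + 0.5 * (c == n) for c in cancer for n in normal)
    return wins / (len(cancer) * len(normal))

def rank_genes(normal, cancer):
    """Gene indices ordered from highest to lowest AUC.

    normal, cancer: lists of samples, each sample a list of expression values.
    """
    n_genes = len(normal[0])
    scores = [auc([s[g] for s in normal], [s[g] for s in cancer])
              for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: -scores[g])

def selection_probability(normal, cancer, k, n_boot=200, seed=0):
    """Bootstrap estimate of P(gene is ranked in the top k), per gene."""
    rng = random.Random(seed)
    hits = [0] * len(normal[0])
    for _ in range(n_boot):
        # Resample tissues with replacement, separately within each group.
        nb = [rng.choice(normal) for _ in normal]
        cb = [rng.choice(cancer) for _ in cancer]
        for g in rank_genes(nb, cb)[:k]:
            hits[g] += 1
    return [h / n_boot for h in hits]
```

Evaluating `selection_probability` over a range of k values traces out an approximation to each gene's ranking distribution.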

2.
Identifying differentially expressed genes in cDNA microarray experiments. (Total citations: 1; self-citations: 0; citations by others: 1)
A major goal of microarray experiments is to determine which genes are differentially expressed between samples. Differential expression has been assessed by taking ratios of expression levels of different samples at a spot on the array and flagging spots (genes) where the magnitude of the fold difference exceeds some threshold. More recent work has attempted to incorporate the fact that the variability of these ratios is not constant. Most methods are variants of Student's t-test. These variants standardize the ratios by dividing by an estimate of the standard deviation of that ratio; spots with large standardized values are flagged. Estimating these standard deviations requires replication of the measurements, either within a slide or between slides, or the use of a model describing what the standard deviation should be. Starting from considerations of the kinetics driving microarray hybridization, we derive models for the intensity of a replicated spot, when replication is performed within and between arrays. Replication within slides leads to a beta-binomial model, and replication between slides leads to a gamma-Poisson model. These models predict how the variance of a log ratio changes with the total intensity of the signal at the spot, independent of the identity of the gene. Ratios for genes with a small amount of total signal are highly variable, whereas ratios for genes with a large amount of total signal are fairly stable. Log ratios are scaled by the standard deviations given by these functions, giving model-based versions of Studentization. An example is given.

3.
MOTIVATION: An important application of microarray experiments is to identify differentially expressed genes. Because microarray data are often not normally distributed, nonparametric methods have been suggested for their statistical analysis. Here, the Baumgartner-Weiss-Schindler test, a novel and powerful test based on ranks, is investigated and compared with the parametric t-test as well as with two other nonparametric tests (Wilcoxon rank sum test, Fisher-Pitman permutation test) recently recommended for the analysis of gene expression data. RESULTS: Simulation studies show that an exact permutation test based on the Baumgartner-Weiss-Schindler statistic B is preferable to the other three tests. It is less conservative than the Wilcoxon test and more powerful, particularly for asymmetric or heavy-tailed distributions. When the underlying distribution is symmetric the differences in power between the tests are relatively small. Thus, the Baumgartner-Weiss-Schindler test is recommended for the usual situation in which the underlying distribution is a priori unknown. AVAILABILITY: SAS code available on request from the authors.
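The exact permutation test described above can be sketched for small samples as follows. This is not the authors' SAS code. The statistic B is written in its commonly cited form (the average of two weighted sums of squared deviations of the ordered ranks from their expected positions) and should be checked against the original publication before use; the function names and the mid-rank handling of ties are assumptions of this sketch.

```python
from itertools import combinations

def _midranks(values):
    """Ranks 1..N in the pooled sample, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def _half(ranks, n, m):
    """Contribution of one group (size n) measured against the other (size m)."""
    total = 0.0
    for i, r in enumerate(sorted(ranks), start=1):
        frac = i / (n + 1)
        total += (r - (n + m) / n * i) ** 2 / (frac * (1 - frac) * m * (n + m) / n)
    return total / n

def bws_statistic(rx, ry):
    n, m = len(rx), len(ry)
    return 0.5 * (_half(rx, n, m) + _half(ry, m, n))

def bws_exact_test(x, y):
    """Return (B, exact p-value), enumerating every split of the pooled ranks."""
    ranks = _midranks(list(x) + list(y))
    n, m = len(x), len(y)
    observed = bws_statistic(ranks[:n], ranks[n:])
    count = total = 0
    for chosen in combinations(range(n + m), n):
        picked = set(chosen)
        bx = [ranks[i] for i in chosen]
        by = [ranks[i] for i in range(n + m) if i not in picked]
        total += 1
        count += bws_statistic(bx, by) >= observed - 1e-12
    return observed, count / total
```

Full enumeration is only practical for the small group sizes typical of replicated microarray experiments; larger samples would need random permutations instead.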

4.

Background  

Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates. While increasing sample size can increase statistical power and decrease error rates, with too many samples, valuable resources are not used efficiently. The issue of how many replicates are required in a typical experimental system needs to be addressed. Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).

5.
MOTIVATION: A common objective of microarray experiments is the detection of differential gene expression between samples obtained under different conditions. The task of identifying differentially expressed genes consists of two aspects: ranking and selection. Numerous statistics have been proposed to rank genes in order of evidence for differential expression. However, no one statistic is universally optimal and there is seldom any basis or guidance that can direct toward a particular statistic of choice. RESULTS: Our new approach, which addresses both ranking and selection of differentially expressed genes, integrates differing statistics via a distance synthesis scheme. Using a set of (Affymetrix) spike-in datasets, in which differentially expressed genes are known, we demonstrate that our method compares favorably with the best individual statistics, while achieving robustness properties that the individual statistics lack. We further evaluate performance on one other microarray study.

6.
7.
Microarray technology allows simultaneous comparison of expression levels of thousands of genes under each condition. This paper concerns sample size calculation in the identification of differentially expressed genes between a control and a treated sample. In a typical experiment, only a fraction of genes (altered genes) is expected to be differentially expressed between two samples. Sample size determination depends on a number of factors including the specified significance level (alpha), the desired statistical power (1-beta), the fraction (eta) of truly altered genes out of the total g genes studied, and the effect sizes (Delta) for the altered genes. This paper proposes a method to calculate the number of arrays required to detect at least 100 x lambda % (where 0 < lambda ≤ 1) of the truly altered genes under the model of an equal effect size for all altered genes. The required numbers of arrays are tabulated for various values of alpha, beta, Delta, eta, and lambda for the one-sample and two-sample t-tests for g = 10,000. Based on the proposed approach, to identify up to 90% of truly altered genes among the unknown number of truly altered genes, the estimated numbers of arrays needed appear to be manageable. For instance, when the standardized effect size is at least 2.0, the number of arrays needed is less than or equal to 14 for the two-sample t-test and is less than or equal to 10 for the one-sample t-test. As the cost per array declines, such array numbers become practical. The proposed method offers a simple, intuitive, and practical way to determine the number of arrays needed in microarray experiments in which the true correlation structure among the genes under investigation cannot be reasonably assumed. An example dataset is used to illustrate the use of the proposed approach to plan microarray experiments.
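How the required number of arrays depends on alpha, beta and Delta can be illustrated with the textbook normal approximation for a single two-sample comparison. This is a deliberate simplification and not the method proposed in the abstract: it ignores lambda, eta and g, uses normal rather than t quantiles, and the function name `arrays_per_group` is made up for this example, so it will not reproduce the tabulated values mentioned above.

```python
import math
from statistics import NormalDist

def arrays_per_group(alpha, power, delta):
    """Approximate arrays per group for a two-sided, two-sample comparison.

    alpha: per-gene significance level
    power: desired per-gene power (1 - beta)
    delta: standardized effect size (mean difference / standard deviation)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2)
```

Because many genes are tested at once, alpha would in practice be set far below 0.05, which raises the required number of arrays for a fixed Delta.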

8.
MOTIVATION: Microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. In time-course experiments in which gene expression is monitored over time, we are interested in testing gene expression profiles for different experimental groups. However, no sophisticated analytic methods have yet been proposed to handle time-course experiment data. RESULTS: We propose a statistical test procedure based on the ANOVA model to identify genes that have different gene expression profiles among experimental groups in time-course experiments. In particular, we propose a permutation test which does not require the normality assumption. For this test, we use residuals from an ANOVA model with time effects only. Using this test, we detect genes that have different gene expression profiles among experimental groups. The proposed model is illustrated using cDNA microarrays of 3840 genes obtained in an experiment to search for changes in gene expression profiles during neuronal differentiation of cortical stem cells.

9.

Background  

Thousands of genes in a genome-wide data set are tested against some null hypothesis to detect differentially expressed genes in microarray experiments. The expected proportion of false positive genes in a set of genes, called the False Discovery Rate (FDR), has been proposed to measure the statistical significance of this set. Various procedures exist for controlling the FDR. However, the threshold (generally 5%) is arbitrary and a specific measure associated with each gene would be worthwhile.
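One widely used FDR-controlling procedure is the Benjamini-Hochberg step-up method. The sketch below computes Benjamini-Hochberg adjusted p-values, which attach a number to every gene instead of applying a single cut-off. It is offered only as background on FDR control; it is not the gene-specific measure these authors go on to propose, and the function name is an assumption of the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Selecting every gene whose adjusted value is at most 0.05 corresponds to the conventional 5% threshold mentioned in the abstract.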

10.

Background  

This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, namely, (a) two-way clustering and (b) combined adaptive ranking, to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are selected using FDR analysis. The third module performs a three-fold validation of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using false discovery rate analysis. In addition, the clustering-based validation of the DEGs is performed by employing an adaptive subspace-based clustering algorithm on the training and the test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework.
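The p-value fusion step can be illustrated with Fisher's combined probability test: for k independent p-values, -2 times the sum of their logarithms follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a generic implementation of that test, assuming independent p-values strictly greater than zero; the function name is hypothetical and nothing here is taken from the framework's own code.

```python
import math

def fisher_combine(pvalues):
    """Combined p-value for k independent p-values (each must be > 0)."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # X / 2, where X = -2 * sum(ln p)
    # Chi-square survival function with 2k degrees of freedom has the closed
    # form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For the two rank-derived p-values of a gene, `fisher_combine([p_clustering, p_ranking])` would return the fused value that then enters the FDR analysis (both argument names are placeholders).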

11.
MOTIVATION: An important goal in analyzing microarray data is to determine which genes are differentially expressed across two kinds of tissue samples or samples obtained under two experimental conditions. Various parametric tests, such as the two-sample t-test, have been used, but their possibly too strong parametric assumptions or large sample justifications may not hold in practice. As alternatives, a class of three nonparametric statistical methods, including the empirical Bayes method of Efron et al. (2001), the significance analysis of microarray (SAM) method of Tusher et al. (2001) and the mixture model method (MMM) of Pan et al. (2001), have been proposed. All three methods depend on constructing a test statistic and a so-called null statistic such that the null statistic's distribution can be used to approximate the null distribution of the test statistic. However, relatively little effort has been directed toward assessment of the performance or the underlying assumptions of the methods in constructing such test and null statistics. RESULTS: We point out a problem with a current method of constructing the test and null statistics, which may lead to largely inflated Type I errors (i.e. false positives). We also propose two modifications that overcome the problem. In the context of MMM, the improved performance of the modified methods is demonstrated using simulated data. In addition, our numerical results also provide evidence to support the utility and effectiveness of MMM.

12.
We introduce a non-parametric approach using bootstrap-assisted correspondence analysis to identify and validate genes that are differentially expressed in factorial microarray experiments. Model comparison showed that although both parametric and non-parametric methods capture the different profiles in the data, our method is less prone to false positive results due to dimension reduction in data analysis.

13.
Motivation: The proliferation of public data repositories creates a need for meta-analysis methods to efficiently evaluate, integrate and validate related datasets produced by independent groups. A t-based approach has been proposed to integrate effect size from multiple studies by modeling both intra- and between-study variation. Recently, a non-parametric 'rank product' method, which is derived based on biological reasoning of fold-change criteria, has been applied to directly combine multiple datasets into one meta study. Fisher's Inverse χ² method, which only depends on P-values from individual analyses of each dataset, has been used in a couple of medical studies. While these methods address the question from different angles, it is not clear how they compare with each other. Results: We comparatively evaluate the three methods: t-based hierarchical modeling, rank products and Fisher's Inverse χ² test with P-values from either the t-based or the rank product method. A simulation study shows that the rank product method, in general, has higher sensitivity and selectivity than the t-based method in both individual and meta-analysis, especially in the setting of small sample size and/or large between-study variation. Not surprisingly, Fisher's χ² method highly depends on the method used in the individual analysis. Application to real datasets demonstrates that meta-analysis achieves more reliable identification than an individual analysis, and rank products are more robust in gene ranking, which leads to a much higher reproducibility among independent studies. Though t-based meta-analysis greatly improves over the individual analysis, it suffers from a potentially large number of false positives when P-values serve as threshold. We conclude that careful meta-analysis is a powerful tool for integrating multiple array studies. Contact: fxhong{at}jimmy.harvard.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: David Rocke. Present address: Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, 44 Binney Street, Boston, MA 02115, USA.
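The core of the rank product method can be sketched in a few lines: within each replicate or study, genes are ranked by fold change, and each gene is scored by the geometric mean of its ranks. The sketch below covers only that scoring step for up-regulated genes. The permutation-based significance assessment normally paired with it is omitted, ties are broken by gene order, and the function name and input layout are assumptions of this example, not the authors' implementation.

```python
import math

def rank_products(fold_changes):
    """Geometric mean rank per gene (rank 1 = most up-regulated).

    fold_changes[r][g] is the log fold change of gene g in replicate or
    study r; every replicate must list the same genes in the same order.
    """
    n_rep = len(fold_changes)
    n_genes = len(fold_changes[0])
    log_sum = [0.0] * n_genes
    for rep in fold_changes:
        order = sorted(range(n_genes), key=lambda g: -rep[g])
        for rank, g in enumerate(order, start=1):
            log_sum[g] += math.log(rank)  # sum logs to avoid overflow
    return [math.exp(s / n_rep) for s in log_sum]
```

Genes with the smallest rank product are the most consistently up-regulated across replicates; repeating the calculation on negated fold changes gives the down-regulated list.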

14.

Background  

The small sample sizes often used for microarray experiments result in poor estimates of variance if each gene is considered independently. Yet accurately estimating variability of gene expression measurements in microarray experiments is essential for correctly identifying differentially expressed genes. Several recently developed methods for testing differential expression of genes utilize hierarchical Bayesian models to "pool" information from multiple genes. We have developed a statistical testing procedure that further improves upon current methods by incorporating the well-documented relationship between the absolute gene expression level and the variance of gene expression measurements into the general empirical Bayes framework.

15.

Background  

Differentially expressed genes are typically identified by analyzing the variation between replicate measurements. These procedures implicitly assume that there are no systematic errors in the data even though several sources of systematic error are known.

16.
Matsui S, Noma H. Biometrics, 2011, 67(4): 1225-1235
Summary: In microarray screening for differentially expressed genes using multiple testing, assessment of power or sample size is of particular importance to ensure that few relevant genes are removed from further consideration prematurely. In this assessment, adequate estimation of the effect sizes of differentially expressed genes is crucial because of its substantial impact on power and sample-size estimates. However, conventional methods using top genes with the largest observed effect sizes would be subject to overestimation due to random variation. In this article, we propose a simple estimation method based on hierarchical mixture models with a nonparametric prior distribution to accommodate random variation and possible large diversity of effect sizes across differential genes, separated from nuisance, nondifferential genes. Based on empirical Bayes estimates of effect sizes, the power and false discovery rate (FDR) can be estimated to monitor them simultaneously in gene screening. We also propose a power index that concerns selection of top genes with the largest effect sizes, called partial power. This new power index could provide a practical compromise for the difficulty in achieving high levels of usual overall power as confronted in many microarray experiments. Applications to two real datasets from cancer clinical studies are provided.

17.
MOTIVATION: A common task in analyzing microarray data is to determine which genes are differentially expressed across two kinds of tissue samples or samples obtained under two experimental conditions. Recently several statistical methods have been proposed to accomplish this goal when there are replicated samples under each condition. However, it may not be clear how these methods compare with each other. Our main goal here is to compare three methods, the t-test, a regression modeling approach (Thomas et al., Genome Res., 11, 1227-1236, 2001) and a mixture model approach (Pan et al., http://www.biostat.umn.edu/cgi-bin/rrs?print+2001,2001a,b) with particular attention to their different modeling assumptions. RESULTS: It is pointed out that all three methods are based on using the two-sample t-statistic or a minor variation of it, but they differ in how they associate a statistical significance level with the corresponding statistic, leading to possibly large differences in the resulting significance levels and the numbers of genes detected. In particular, we give an explicit formula for the test statistic used in the regression approach. Using the leukemia data of Golub et al. (Science, 285, 531-537, 1999), we illustrate these points. We also briefly compare the results with those of several other methods, including the empirical Bayesian method of Efron et al. (J. Am. Stat. Assoc., to appear, 2001) and the Significance Analysis of Microarray (SAM) method of Tusher et al. (Proc. Natl Acad. Sci. USA, 98, 5116-5121, 2001).

18.
19.
MOTIVATION: Currently most of the methods for identifying differentially expressed genes fall into the category of so-called single-gene-analysis, performing hypothesis testing on a gene-by-gene basis. In a single-gene-analysis approach, estimating the variability of each gene is required to determine whether a gene is differentially expressed or not. Poor accuracy of variability estimation makes it difficult to identify genes with small fold-changes unless a very large number of replicate experiments are performed. RESULTS: We propose a method that can avoid the difficult task of estimating variability for each gene, while reliably identifying a group of differentially expressed genes with low false discovery rates, even when the fold-changes are very small. In this article, a new characterization of differentially expressed genes is established based on a theorem about the distribution of ranks of genes sorted by (log) ratios within each array. This characterization of differentially expressed genes based on rank is an example of all-gene-analysis instead of single-gene-analysis. We apply the method to a cDNA microarray dataset, and many genes with low fold-changes (as low as 1.3-fold) are reliably identified without carrying out hypothesis testing on a gene-by-gene basis. The false discovery rate is estimated in two different ways reflecting the variability from all the genes without the complications related to multiple hypothesis testing. We also provide some comparisons between our approach and single-gene-analysis based methods. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

20.
Testing for differentially expressed genes with microarray data (Total citations: 1; self-citations: 1; citations by others: 0)
This paper compares the type I error and power of the one- and two-sample t-tests, and the one- and two-sample permutation tests for detecting differences in gene expression between two microarray samples with replicates using Monte Carlo simulations. When data are generated from a normal distribution, type I errors and powers of the one-sample parametric t-test and one-sample permutation test are very close, as are the two-sample t-test and two-sample permutation test, provided that the number of replicates is adequate. When data are generated from a t-distribution, the permutation tests outperform the corresponding parametric tests if the number of replicates is at least five. For data from a two-color dye swap experiment, the one-sample test appears to perform better than the two-sample test since expression measurements for control and treatment samples from the same spot are correlated. For data from independent samples, such as the one-channel array or two-channel array experiment using reference design, the two-sample t-tests appear more powerful than the one-sample t-tests.
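A one-sample permutation test of the kind compared in this abstract can be sketched as an exact sign-flip test on per-spot log ratios. Under the null hypothesis of no differential expression, the sign of each log ratio is equally likely to be positive or negative, so all sign assignments are enumerated. The use of the absolute sum as the test statistic and the function name are assumptions of this sketch; the paper's own choice of statistic is not stated in the abstract.

```python
from itertools import product

def one_sample_permutation_test(log_ratios):
    """Exact two-sided sign-flip test of H0: the mean log ratio is zero."""
    observed = abs(sum(log_ratios))
    count = total = 0
    for signs in product((1, -1), repeat=len(log_ratios)):
        stat = abs(sum(s * v for s, v in zip(signs, log_ratios)))
        count += stat >= observed - 1e-12  # tolerance for floating-point ties
        total += 1
    return count / total
```

With n replicates the smallest attainable two-sided p-value is 2 / 2^n, so very small replicate numbers cannot reach conventional significance levels with this test.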

