Similar articles
 20 similar articles found.
1.

Background

With the growing abundance of microarray data, statistical methods are increasingly needed to integrate results across studies. Two common approaches for meta-analysis of microarrays include either combining gene expression measures across studies or combining summaries such as p-values, probabilities or ranks. Here, we compare two Bayesian meta-analysis models that are analogous to these methods.

Results

Two Bayesian meta-analysis models for microarray data have recently been introduced. The first model combines standardized gene expression measures across studies into an overall mean, accounting for inter-study variability, while the second combines probabilities of differential expression without combining expression values. Both models produce the gene-specific posterior probability of differential expression, which is the basis for inference. Since the standardized expression integration model includes inter-study variability, it may improve accuracy of results versus the probability integration model. However, due to the small number of studies typical in microarray meta-analyses, the variability between studies is challenging to estimate. The probability integration model eliminates the need to model variability between studies, and thus its implementation is more straightforward. We found in simulations of two and five studies that combining probabilities outperformed combining standardized gene expression measures for three comparison values: the percent of true discovered genes in meta-analysis versus individual studies; the percent of true genes omitted in meta-analysis versus separate studies, and the number of true discovered genes for fixed levels of Bayesian false discovery. We identified similar results when pooling two independent studies of Bacillus subtilis. We assumed that each study was produced from the same microarray platform with only two conditions: a treatment and control, and that the data sets were pre-scaled.

Conclusion

The Bayesian meta-analysis model that combines probabilities across studies does not aggregate gene expression measures, thus an inter-study variability parameter is not included in the model. This results in a simpler modeling approach than aggregating expression measures, which accounts for variability across studies. The probability integration model identified more true discovered genes and fewer true omitted genes than combining expression measures, for our data sets.

2.
MOTIVATION: In microarray studies gene discovery based on fold-change values is often misleading because error variability for each gene is heterogeneous under different biological conditions and intensity ranges. Several statistical testing methods for differential gene expression have been suggested, but some of these approaches are underpowered and result in high false positive rates because within-gene variance estimates are based on a small number of replicated arrays. RESULTS: We propose to use local-pooled-error (LPE) estimates and robust statistical tests for evaluating significance of each gene's differential expression. Our LPE estimation is based on pooling errors within genes and between replicate arrays for genes in which expression values are similar. We have applied our LPE method to compare gene expression in naïve and activated CD8+ T-cells. Our results show that the LPE method effectively identifies significant differential-expression patterns with a small number of replicated arrays. AVAILABILITY: The methodology is implemented with S-PLUS and R functions available at http://hesweb1.med.virginia.edu/bioinformatics

3.
We demonstrate here that SMART PCR-amplified cDNAs arrayed on a nylon membrane are suitable for high-throughput tissue expression profiling when starting biological materials are limited. We show that SMART cDNA accurately reflects gene expression patterns found in total RNA by comparing the expression level of several target genes in SMART PCR-amplified cDNAs and their corresponding total RNAs. We also arrayed cDNAs from 68 matched tumor and normal samples on a nylon membrane to determine whether SMART PCR-amplified cDNA could be used for detecting differentially expressed genes in these tissues. These arrays containing normalized tumor and normal cDNAs were hybridized with probes for glutathione peroxidase and gelsolin. The hybridization results revealed cancer-related and patient-specific gene expression differences between tumor and normal tissues for these genes. These studies show that SMART PCR-amplified cDNAs maintain the complexity of the original mRNA population and are thus suitable for high-throughput studies to compare the relative abundance of target genes and to detect differentially expressed genes in a wide variety of tissues simultaneously.

4.
MOTIVATION: We present statistical methods for determining the number of per gene replicate spots required in microarray experiments. The purpose of these methods is to obtain an estimate of the sampling variability present in microarray data, and to determine the number of replicate spots required to achieve a high probability of detecting a significant fold change in gene expression, while maintaining a low error rate. Our approach is based on data from control microarrays, and involves the use of standard statistical estimation techniques. RESULTS: After analyzing two experimental data sets containing control array data, we were able to determine the statistical power available for the detection of significant differential expression given differing levels of replication. The inclusion of replicate spots on microarrays not only allows more accurate estimation of the variability present in an experiment, but more importantly increases the probability of detecting genes undergoing significant fold changes in expression, while substantially decreasing the probability of observing fold changes due to chance rather than true differential expression.
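The trade-off between replication, power and error rate described in this abstract can be illustrated with a textbook normal approximation. The sketch below is not the authors' procedure. It assumes that the log2 ratios of replicate spots are independent and normally distributed with a known standard deviation sigma (for example, one estimated from control arrays), and the names power_for and replicates_needed are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def power_for(n, delta, sigma, alpha):
    """Approximate power of a two-sided z-test on the mean of n replicate log ratios.

    delta is the true log2 fold change, sigma the per-spot standard deviation,
    alpha the Type I error rate. The negligible opposite tail is ignored.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(delta) * math.sqrt(n) / sigma - z_crit)


def replicates_needed(delta, sigma, alpha, power):
    """Smallest n for which power_for(n, ...) reaches the requested power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_crit + z_power) * sigma / abs(delta)) ** 2)
```

Under these assumptions, a two-fold change (delta = 1 on the log2 scale) with sigma = 0.5, alpha = 0.001 and a target power of 0.95 gives replicates_needed(1.0, 0.5, 0.001, 0.95) == 7. Doubling sigma roughly quadruples the requirement, which is why an accurate variability estimate matters.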

5.
Analysis of multivariate data sets from, for example, microarray studies frequently results in lists of genes which are associated with some response of interest. The biological interpretation is often complicated by the statistical instability of the obtained gene lists, which may partly be due to the functional redundancy among genes, implying that multiple genes can play exchangeable roles in the cell. In this paper, we use the concept of exchangeability of random variables to model this functional redundancy and thereby account for the instability. We present a flexible framework to incorporate the exchangeability into the representation of lists. The proposed framework supports straightforward comparison between any 2 lists. It can also be used to generate new more stable gene rankings incorporating more information from the experimental data. Using 2 microarray data sets, we show that the proposed method provides more robust gene rankings than existing methods with respect to sampling variations, without compromising the biological significance of the rankings.

6.

Background

DNA microarrays are a powerful technology that can provide a wealth of gene expression data for disease studies, drug development, and a wide scope of other investigations. Because of the large volume and inherent variability of DNA microarray data, many new statistical methods have been developed for evaluating the significance of the observed differences in gene expression. However, until now little attention has been given to the characterization of dispersion of DNA microarray data.

Results

Here we examine the expression data obtained from 682 Affymetrix GeneChips® of 22 different types and we demonstrate that the Gaussian (normal) frequency distribution is characteristic for the variability of gene expression values. However, typically 5 to 15% of the samples deviate from normality. Furthermore, it is shown that the frequency distributions of the difference of expression in subsets of ordered, consecutive pairs of genes (consecutive samples) in pair-wise comparisons of replicate experiments are also normal. We describe a consecutive sampling method, which is employed to calculate the characteristic function approximating standard deviation and show that the standard deviation derived from the consecutive samples is equivalent to the standard deviation obtained from individual genes. Finally, we determine the boundaries of probability intervals and demonstrate that the coefficients defining the intervals are independent of sample characteristics, variability of data, laboratory conditions and type of chips. These coefficients are very closely correlated with Student's t-distribution.

Conclusion

In this study we ascertained that the non-systematic variations follow a Gaussian distribution, determined the probability intervals and demonstrated that the Kα coefficients defining these intervals are invariant; these coefficients offer a convenient universal measure of dispersion of data. The fact that the Kα distributions are so close to the t-distribution and independent of conditions and type of arrays suggests that the quantitative data provided by Affymetrix technology give a "true" representation of the physical processes involved in measurement of RNA abundance.

Reviewers

This article was reviewed by Yoav Gilad (nominated by Doron Lancet), Sach Mukherjee (nominated by Sandrine Dudoit) and Amir Niknejad and Shmuel Friedland (nominated by Neil Smalheiser).

7.
Novak JP  Sladek R  Hudson TJ 《Genomics》2002,79(1):104-113
Large-scale gene expression measurement techniques provide a unique opportunity to gain insight into biological processes under normal and pathological conditions. To interpret the changes in expression profiles for thousands of genes, we face the nontrivial problem of understanding the significance of these changes. In practice, the sources of background variability in expression data can be divided into three categories: technical, physiological, and sampling. To assess the relative importance of these sources of background variation, we generated replicate gene expression profiles on high-density Affymetrix GeneChip oligonucleotide arrays, using either identical RNA samples or RNA samples obtained under similar biological states. We derived a novel measure of dispersion in two-way comparisons, using a linear characteristic function. When comparing expression profiles from replicate tests using the same RNA sample (a test for technical variability), we observed a level of dispersion similar to the pattern obtained with RNA samples from replicate cultures of the same cell line (a test for physiological variability). On the other hand, a higher level of dispersion was observed when tissue samples of different animals were compared (an example of sampling variability). This implies that, in experiments in which samples from different subjects are used, the variation induced by the stimulus may be masked by non-stimulus-related differences in the subjects' biological state. These analyses underscore the need for replicate experiments to reliably interpret large-scale expression data sets, even with simple microarray experiments.

8.
Many gene expression studies attempt to develop a predictor of pre-defined diagnostic or prognostic classes. If the classes are similar biologically, then the number of genes that are differentially expressed between the classes is likely to be small compared to the total number of genes measured. This motivates a two-step process for predictor development: a subset of differentially expressed genes is selected for use in the predictor, and then the predictor is constructed from these. Both these steps will introduce variability into the resulting classifier, so both must be incorporated in sample size estimation. We introduce a methodology for sample size determination for prediction in the context of high-dimensional data that captures variability in both steps of predictor development. The methodology is based on a parametric probability model, but permits sample size computations to be carried out in a practical manner without extensive requirements for preliminary data. We find that many prediction problems do not require a large training set of arrays for classifier development.

9.
Real-time PCR has become increasingly important in gene expression profiling research, and it is widely agreed that normalized data are required for accurate estimates of messenger RNA (mRNA) expression. With increased gene expression profiling in preclinical research and toxicogenomics, a need for reference genes in the rat has emerged, and the studies in this area have not yet been thoroughly evaluated. The purpose of our study was to evaluate a panel of rat reference genes for variation of gene expression in different tissue types. We selected 48 known target genes based on their putative invariability. The gene expression of all targets was examined in 11 types of rat tissues using TaqMan low density array (LDA) technology. The variability of each gene was assessed using a two-step statistical model. The analysis of mean expression using multiple reference genes was shown to provide accurate and reliable normalized expression data. The five least variable genes from each specific tissue were recommended for future tissue-specific studies. Finally, a subset of investigated rat reference genes showing the least variation is recommended for further evaluation using the LDA platform. Our work should considerably enhance a researcher's ability to simply and efficiently identify appropriate reference genes for given experiments.
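Normalization against the mean of several reference genes, as mentioned in this abstract, is commonly done on the cycle-threshold (Ct) scale. The sketch below is a generic illustration, not the two-step statistical model used in the study. It assumes perfect amplification efficiency (the product doubles every cycle), and the function name relative_expression is hypothetical. Averaging Ct values arithmetically is equivalent to taking the geometric mean of the underlying reference quantities.

```python
from statistics import mean


def relative_expression(ct_target, ct_references):
    """Expression of a target gene relative to the mean of several reference genes.

    ct_target: Ct value of the target gene in one sample.
    ct_references: Ct values of the reference genes in the same sample.
    Assumes 100% amplification efficiency, so one cycle equals a two-fold change.
    """
    delta_ct = ct_target - mean(ct_references)
    return 2.0 ** (-delta_ct)
```

For example, a target with Ct 24 measured against references with Ct 20 and 22 has a delta Ct of 3, so relative_expression(24.0, [20.0, 22.0]) evaluates to 0.125.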

10.
Summaries of Affymetrix GeneChip probe level data (total citations: 9; self-citations: 0; citations by others: 9)
High density oligonucleotide array technology is widely used in many areas of biomedical research for quantitative and highly parallel measurements of gene expression. Affymetrix GeneChip arrays are the most popular. In this technology each gene is typically represented by a set of 11–20 pairs of probes. In order to obtain expression measures it is necessary to summarize the probe level data. Using two extensive spike-in studies and a dilution study, we developed a set of tools for assessing the effectiveness of expression measures. We found that the performance of the current version of the default expression measure provided by Affymetrix Microarray Suite can be significantly improved by the use of probe level summaries derived from empirically motivated statistical models. In particular, improvements in the ability to detect differentially expressed genes are demonstrated.

11.
The expression patterns of 62 genes interacting with p53 have been investigated in 24 normal and cancerous tissues using NIH's dbEST library. The expression levels of individual genes, such as the TP53 gene itself, but also other genes, vary up to 33-fold among the 24 different tissues and no consistent pattern can be recognized. However, when expression levels for all 63 genes are summed, these "cumulated levels" are surprisingly constant over the 24 investigated normal tissues. In cancers, the variation is further reduced. Essentially, the cumulated expression levels in cancer are independent of those in normal tissue. We furthermore constructed a linear statistical classifier, i.e., a weighted sum of gene expression levels, which robustly distinguishes normal from cancer tissue independent of the particular kind of tissue. Thus, despite very large differences for individual genes and considerable changes during carcinogenesis, the cumulated expressions have narrowly defined levels.
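The two quantities named in this abstract, a cumulated expression level and a linear classifier built from a weighted sum, can be sketched in a few lines. The abstract does not give the weights, the decision threshold or how they were fitted, so the gene labels, weights and bias below are purely hypothetical placeholders, as are the function names.

```python
def cumulated_level(expression):
    """Sum of the expression levels of all genes in one tissue sample."""
    return sum(expression.values())


def linear_score(expression, weights, bias=0.0):
    """Weighted sum of expression levels plus a constant offset."""
    return bias + sum(weights[gene] * expression[gene] for gene in weights)


def classify(expression, weights, bias=0.0):
    """Label a sample 'cancer' when the linear score is positive, else 'normal'."""
    return "cancer" if linear_score(expression, weights, bias) > 0 else "normal"


# Hypothetical weights for two placeholder genes, not taken from the paper.
example_weights = {"gene_a": 1.0, "gene_b": -2.0}
example_sample = {"gene_a": 5.0, "gene_b": 1.0}
example_label = classify(example_sample, example_weights, bias=-1.0)
```

In practice the weights would be estimated from labelled normal and cancer samples; the sketch only shows how a fitted rule of this form would be applied.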

12.
Recent developments in microarray technology make it possible to capture the gene expression profiles for thousands of genes at once. With this data researchers are tackling problems ranging from the identification of 'cancer genes' to the formidable task of adding functional annotations to our rapidly growing gene databases. Specific research questions suggest patterns of gene expression that are interesting and informative: for instance, genes with large variance or groups of genes that are highly correlated. Cluster analysis and related techniques are proving to be very useful. However, such exploratory methods alone do not provide the opportunity to engage in statistical inference. Given the high dimensionality (thousands of genes) and small sample sizes (often <30) encountered in these datasets, an honest assessment of sampling variability is crucial and can prevent the over-interpretation of spurious results. We describe a statistical framework that encompasses many of the analytical goals in gene expression analysis; our framework is completely compatible with many of the current approaches and, in fact, can increase their utility. We propose the use of a deterministic rule, applied to the parameters of the gene expression distribution, to select a target subset of genes that are of biological interest. In addition to subset membership, the target subset can include information about relationships between genes, such as clustering. This target subset presents an interesting parameter that we can estimate by applying the rule to the sample statistics of microarray data. The parametric bootstrap, based on a multivariate normal model, is used to estimate the distribution of these estimated subsets and relevant summary measures of this sampling distribution are proposed. We focus on rules that operate on the mean and covariance. 
Using Bernstein's Inequality, we obtain consistency of the subset estimates, under the assumption that the sample size converges faster to infinity than the logarithm of the number of genes. We also provide a conservative sample size formula guaranteeing that the sample mean and sample covariance matrix are uniformly within a distance epsilon > 0 of the population mean and covariance. The practical performance of the method using a cluster-based subset rule is illustrated with a simulation study. The method is illustrated with an analysis of a publicly available leukemia data set.
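The idea of applying a deterministic rule to sample statistics and then using a parametric bootstrap to assess how stable the selected subset is can be sketched as follows. This is a simplified illustration rather than the authors' method: it draws each gene from an independent normal distribution, which ignores the between-gene covariance that the multivariate normal model in the abstract would capture, and the rule (sample mean above a threshold) and all function names are hypothetical.

```python
import random
from statistics import mean, stdev


def select_subset(data, threshold):
    """Deterministic rule: keep genes whose sample mean exceeds the threshold."""
    return {gene for gene, values in data.items() if mean(values) > threshold}


def bootstrap_selection_frequency(data, threshold, n_boot=200, seed=0):
    """Fraction of parametric bootstrap samples in which each gene is selected.

    data maps a gene name to its list of expression values (at least two per gene).
    Each bootstrap sample is drawn from a normal distribution fitted to that gene,
    independently of the other genes (a simplifying assumption).
    """
    rng = random.Random(seed)
    fitted = {gene: (mean(v), stdev(v), len(v)) for gene, v in data.items()}
    counts = {gene: 0 for gene in data}
    for _ in range(n_boot):
        resampled = {
            gene: [rng.gauss(mu, sd) for _ in range(n)]
            for gene, (mu, sd, n) in fitted.items()
        }
        for gene in select_subset(resampled, threshold):
            counts[gene] += 1
    return {gene: counts[gene] / n_boot for gene in data}
```

Genes with a selection frequency near 1 or 0 are stably inside or outside the target subset, while intermediate frequencies flag membership that is sensitive to sampling variability.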

13.
14.
15.
Pan W  Lin J  Le CT 《Genome biology》2002,3(5):research0022.1-research002210

Background  

It has been recognized that replicates of arrays (or spots) may be necessary for reliably detecting differentially expressed genes in microarray experiments. However, the often-asked question of how many replicates are required has barely been addressed in the literature. In general, the answer depends on several factors: a given magnitude of expression change, a desired statistical power (that is, probability) to detect it, a specified Type I error rate, and the statistical method being used to detect the change. Here, we discuss how to calculate the number of replicates in the context of applying a nonparametric statistical method, the normal mixture model approach, to detect changes in gene expression.

16.
Statistical design of reverse dye microarrays (total citations: 7; self-citations: 0; citations by others: 7)
MOTIVATION: In cDNA microarray experiments all samples are labelled with either Cy3 dye or Cy5 dye. Certain genes exhibit dye bias, a tendency to bind more efficiently to one of the dyes. The common reference design avoids the problem of dye bias by running all arrays 'forward', so that the samples being compared are always labelled with the same dye. But comparison of samples labelled with different dyes is sometimes of interest. In these situations, it is necessary to run some arrays 'reverse' (with the dye labelling reversed) in order to correct for the dye bias. The design of these experiments will impact one's ability to identify genes that are differentially expressed in different tissues or conditions. We address the design issue of how many specimens are needed, how many forward and reverse labelled arrays to perform, and how to optimally assign Cy3 and Cy5 labels to the specimens. RESULTS: We consider three types of experiments for which some reverse labelling is needed: paired samples, samples from two predefined groups, and reference design data when comparison with the reference is of interest. We present simple probability models for the data, derive optimal estimators for relative gene expression, and compare the efficiency of the estimators for a range of designs. In each case, we present the optimal design and sample size formulas. We show that reverse labelling of individual arrays is generally not required.

17.
Microarray experiments are being increasingly used in molecular biology. A common task is to detect genes with differential expression across two experimental conditions, such as two different tissues or the same tissue at two time points of biological development. To take proper account of statistical variability, some statistical approaches based on the t-statistic have been proposed. In constructing the t-statistic, one needs to estimate the variance of gene expression levels. With a small number of replicated array experiments, the variance estimation can be challenging. For instance, although the sample variance is unbiased, it may have large variability, leading to a large mean squared error. For duplicated array experiments, a new approach based on simple averaging has recently been proposed in the literature. Here we consider two more general approaches based on nonparametric smoothing. Our goal is to assess the performance of each method empirically. The three methods are applied to a colon cancer data set containing 2,000 genes. Using two arrays, we compare the variance estimates obtained from the three methods. We also consider their impact on the t-statistics. Our results indicate that the three methods give variance estimates close to each other. Due to its simplicity and generality, we recommend the use of the smoothed sample variance for data with a small number of replicates.
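A smoothed sample variance of the kind recommended in this abstract borrows strength from genes with similar expression levels. The abstract does not specify the smoother, so the sketch below substitutes a plain running mean over genes ordered by mean expression; it is a stand-in for the nonparametric smoothers the authors studied, and the function name smoothed_variances and the window parameter are hypothetical.

```python
from statistics import mean, variance


def smoothed_variances(replicates, window=3):
    """Smooth per-gene sample variances across genes with similar mean expression.

    replicates: one list of replicate measurements per gene (at least two values each).
    window: number of neighbouring genes, in mean-expression order, to average over.
    Returns the smoothed variances in the original gene order.
    """
    means = [mean(values) for values in replicates]
    raw = [variance(values) for values in replicates]
    order = sorted(range(len(replicates)), key=lambda i: means[i])
    half = window // 2
    smoothed = [0.0] * len(replicates)
    for rank, gene in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        smoothed[gene] = mean(raw[j] for j in order[lo:hi])
    return smoothed
```

The smoothed value could then replace the raw sample variance in the denominator of a t-statistic, which is less erratic than a variance computed from two replicates alone.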

18.
Determining which reference genes have the highest stability, and are therefore appropriate for normalising data, is a crucial step in the design of real-time quantitative PCR (qPCR) gene expression studies. This is particularly warranted in non-model and ecologically important species for which appropriate reference genes are lacking, such as the mallard, a key reservoir of many diseases with relevance for human and livestock health. Previous studies assessing gene expression changes as a consequence of infection in mallards have nearly universally used β-actin and/or GAPDH as reference genes without confirming their suitability as normalisers. The use of reference genes at random, without regard for stability of expression across treatment groups, can result in erroneous interpretation of data. Here, eleven putative reference genes for use in gene expression studies of the mallard were evaluated, across six different tissues, using a low pathogenic avian influenza A virus infection model. Tissue type influenced the selection of reference genes, whereby different genes were stable in blood, spleen, lung, gastrointestinal tract and colon. β-actin and GAPDH generally displayed low stability and are therefore inappropriate reference genes in many cases. The use of different algorithms (GeNorm and NormFinder) affected stability rankings, but for both algorithms it was possible to find a combination of two stable reference genes with which to normalise qPCR data in mallards. These results highlight the importance of validating the choice of normalising reference genes before conducting gene expression studies in ducks. The fact that nearly all previous studies of the influence of pathogen infection on mallard gene expression have used a single, non-validated reference gene is problematic. The toolkit of putative reference genes provided here offers a solid foundation for future studies of gene expression in mallards and other waterfowl.

19.
20.