Similar Documents (20 results)
1.
    
We develop formulae to calculate sample sizes for ranking and selecting differentially expressed genes among different clinical subtypes or prognostic classes of disease in genome-wide microarray screening studies. The formulae control the probability that a selected subset of genes of fixed size contains enough truly top-ranking informative genes, a probability that can be assessed from the distribution of order statistics for independent genes. We provide strategies for conservative designs that cope with an unknown number of informative genes and an unknown correlation structure across genes. An application of the formulae to a clinical study of multiple myeloma is given.
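The design criterion can be checked by simulation. Below is a minimal sketch, assuming two classes with n arrays each, m informative genes shifted by delta standard deviations, and ranking by absolute Welch t-statistics; it estimates the probability that the top d genes include at least r truly informative ones. The function name and all parameter values are illustrative, not taken from the paper.

```python
# Sketch: probability that the top-d genes by |t| capture >= r of
# the m truly informative genes, estimated by simulation.
import numpy as np

def prob_top_d_captures(n, m, m_null, delta, d, r, n_sim=2000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        a = rng.normal(0.0, 1.0, (m + m_null, n))   # class A
        b = rng.normal(0.0, 1.0, (m + m_null, n))   # class B
        b[:m] += delta                              # first m genes informative
        t = (b.mean(1) - a.mean(1)) / np.sqrt(
            (a.var(1, ddof=1) + b.var(1, ddof=1)) / n)
        top = np.argsort(-np.abs(t))[:d]            # top-d by |t|
        hits += (top < m).sum() >= r
    return hits / n_sim

print(prob_top_d_captures(n=20, m=20, m_null=1000, delta=1.0, d=40, r=15))
```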

2.
    
Heo M, Leon AC. Biometrics 2008, 64(4):1256–1262.
Cluster randomized clinical trials (cluster-RCTs), in which community entities serve as clusters, often yield data with three hierarchical levels. For example, interventions are randomly assigned to clusters (level-three units); health care professionals (level-two units) within the same cluster are trained in the assigned intervention and provide care to subjects (level-one units). In this study, we derive a closed-form power function and sample size formulae for detecting an intervention effect on subject-level outcomes, using a test statistic based on maximum likelihood estimates from a mixed-effects linear regression model for three-level data. A simulation study verifies that theoretical power estimates based on the derived formulae are nearly identical to empirical estimates from simulated data. Recommendations for the design stage of a cluster-RCT are discussed.
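A rough version of such a power calculation can be written with the standard design-effect decomposition for a three-level model; this is a sketch under assumed intraclass correlations, not the authors' exact formula. Here icc2 is the professional-level ICC and icc3 the cluster-level ICC, both hypothetical values.

```python
# Sketch: normal-approximation power for a 3-level cluster-RCT,
# equal allocation, using the standard design effect for
# cluster-level randomization.
import math
from scipy.stats import norm

def power_three_level(delta, sigma, n_clusters, n_pros, n_subj,
                      icc2, icc3, alpha=0.05):
    # design effect: 1 + (subjects-1)*icc2 + (subjects per cluster - 1)*icc3
    deff = 1 + (n_subj - 1) * icc2 + (n_pros * n_subj - 1) * icc3
    n_per_arm = n_clusters * n_pros * n_subj
    se = sigma * math.sqrt(2 * deff / n_per_arm)
    return norm.cdf(abs(delta) / se - norm.ppf(1 - alpha / 2))

print(power_three_level(delta=0.4, sigma=1.0, n_clusters=10,
                        n_pros=4, n_subj=10, icc2=0.10, icc3=0.05))
```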

3.
Empirical Bayes Gibbs sampling
The wide applicability of Gibbs sampling has increased the use of complex, multi-level hierarchical models, which entail dealing with hyperparameters in the deeper levels of a hierarchy. There are three typical ways to handle these hyperparameters: specify them, estimate them, or use a 'flat' prior, and each strategy has its own associated problems. In this paper we show how, using an empirical Bayes approach, the hyperparameters can be estimated in a way that is both computationally feasible and statistically valid.
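The empirical Bayes idea is easiest to see in the simplest two-level hierarchy, where no Gibbs sampling is needed at all. A minimal sketch, assuming a normal-means model y_i ~ N(theta_i, 1) with theta_i ~ N(mu, tau^2): the hyperparameters are estimated from the marginal distribution of the y_i and then plugged in.

```python
# Sketch: empirical Bayes in a conjugate normal-means model.
# Hyperparameters (mu, tau^2) are estimated from the marginal
# y_i ~ N(mu, 1 + tau^2) instead of being fixed or flat-priored.
import numpy as np

rng = np.random.default_rng(8)
theta = rng.normal(2.0, 1.5, size=500)
y = rng.normal(theta, 1.0)

mu_hat = y.mean()
tau2_hat = max(y.var(ddof=1) - 1.0, 0.0)   # method-of-moments EB

# Plug-in posterior mean for each theta_i
shrink = tau2_hat / (tau2_hat + 1.0)
post_mean = mu_hat + shrink * (y - mu_hat)
print(np.mean((post_mean - theta) ** 2),   # EB beats the raw y_i ...
      np.mean((y - theta) ** 2))           # ... in mean squared error
```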

4.
Non-biological experimental variation, or "batch effects," is commonly observed across multiple batches of microarray experiments, often making it difficult to combine data from these batches. The ability to combine microarray data sets is advantageous to researchers: it increases statistical power to detect biological phenomena in studies where logistical considerations restrict sample size, or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but they are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies use much smaller sample sizes, existing methods are not sufficient. We propose parametric and nonparametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that they are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.
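To see what a batch adjustment does, here is a naive per-gene location/scale version; it deliberately omits the empirical Bayes shrinkage of the per-batch estimates, which is the paper's actual contribution and what makes the method stable in small batches. All names and the toy data are illustrative.

```python
# Sketch: naive per-gene location/scale batch adjustment. The
# paper's empirical Bayes step additionally shrinks the per-batch
# gene estimates toward a common prior; this version skips that.
import numpy as np

def naive_batch_adjust(X, batches):
    """X: genes x samples expression matrix; batches: one label per
    sample. Returns the adjusted matrix."""
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=1, keepdims=True)
    pooled_sd = X.std(axis=1, ddof=1, keepdims=True)
    out = np.empty_like(X)
    for b in np.unique(batches):
        idx = np.asarray(batches) == b
        mu_b = X[:, idx].mean(axis=1, keepdims=True)
        sd_b = X[:, idx].std(axis=1, ddof=1, keepdims=True)
        out[:, idx] = (X[:, idx] - mu_b) / sd_b * pooled_sd + grand_mean
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12)) + np.repeat([0.0, 1.5], 6)  # batch shift
adj = naive_batch_adjust(X, ["a"] * 6 + ["b"] * 6)
print(X[:, 6:].mean() - X[:, :6].mean())      # ~1.5 before adjustment
print(adj[:, 6:].mean() - adj[:, :6].mean())  # ~0 after adjustment
```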

5.
Microarrays provide a valuable tool for quantifying gene expression, but usually only a limited number of replicates is available, leading to unsatisfactory variance estimates in a gene-wise mixed model analysis. Because thousands of genes are measured, it is desirable to combine information across genes. When more than two tissue types or treatments are to be compared, it may be advisable to treat the array effect as random; information between arrays can then be recovered, which increases accuracy in estimation. We propose a method of variance component estimation across genes for a linear mixed model with two random effects; the method may be extended to models with more than two random effects. We assume that the variance components follow a log-normal distribution and that the sums of squares from the gene-wise analysis, given the true variance components, follow a scaled χ²-distribution, and we adopt an empirical Bayes approach: the variance components are estimated by the expectation of their posterior distribution. The new method is evaluated in a simulation study. Differentially expressed genes are more likely to be detected by tests based on these variance estimates than by tests based on gene-wise variance estimates, an effect most visible in studies with small array numbers. An analysis of a real data set on maize endosperm shows that the method works well.
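A minimal sketch of the shrinkage flavor of such estimates, using the standard scaled-inverse-chi-square ("moderated variance") construction as a simpler stand-in for the paper's log-normal prior; the prior degrees of freedom d0 and prior value s02 are assumed here rather than estimated from the marginal distribution, as a full method would do.

```python
# Sketch: empirical Bayes shrinkage of gene-wise variances via the
# posterior mean under a scaled-inverse-chi-square prior.
import numpy as np

def moderated_variances(s2, df, d0, s02):
    """Shrink gene-wise sample variances s2 (each on df degrees of
    freedom) toward the prior value s02 with prior df d0."""
    s2 = np.asarray(s2, dtype=float)
    return (d0 * s02 + df * s2) / (d0 + df)

rng = np.random.default_rng(1)
true_var = rng.lognormal(mean=0.0, sigma=0.5, size=5000)
df = 3                                   # few replicates per gene
s2 = true_var * rng.chisquare(df, 5000) / df
s02 = np.exp(np.mean(np.log(s2)))        # crude prior-value estimate
print(moderated_variances(s2, df, d0=4.0, s02=s02)[:5])
```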

6.
    
In some occupational health studies, observations occur in both exposed and unexposed individuals. If the levels of all exposed individuals have been detected, a two-part zero-inflated log-normal model is usually recommended: the data have a probability mass at zero for unexposed individuals and a continuous response at values greater than zero for exposed individuals. However, many quantitative exposure measurements are subject to left censoring because values fall below assay detection limits. A zero-inflated log-normal mixture model is suggested in this situation, since unexposed zeros cannot be distinguished from exposed values below the detection limit. In this mixture distribution, the information contributed by values falling below a fixed detection limit is used only to estimate the probability of being unexposed. We consider sample size and statistical power calculations for comparing the median of the exposed measurements to a regulatory limit, and we calculate the required sample size for data from a recent paper comparing benzene TWA exposure to a regulatory occupational exposure limit. A simulation study investigates the performance of the proposed sample size calculation methods.
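A sketch of the corresponding log-likelihood, assuming a fixed detection limit L below which an observation may be either a true zero (unexposed) or an exposed value censored at L; the parameterization and starting values are illustrative.

```python
# Sketch: likelihood of a zero-inflated log-normal with detection
# limit L. Values below L contribute a single mixed term:
# P(unexposed) + P(exposed) * P(lognormal value < L).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def negloglik(params, y, L):
    """params = (logit_p, mu, log_sigma); p = P(unexposed)."""
    logit_p, mu, log_sigma = params
    p = 1 / (1 + np.exp(-logit_p))
    sigma = np.exp(log_sigma)
    cens = y < L
    ll_cens = np.log(p + (1 - p) * norm.cdf((np.log(L) - mu) / sigma))
    ll_obs = (np.log(1 - p)
              + norm.logpdf((np.log(y[~cens]) - mu) / sigma)
              - np.log(sigma * y[~cens]))       # log-normal density
    return -(cens.sum() * ll_cens + ll_obs.sum())

rng = np.random.default_rng(2)
y = np.where(rng.random(500) < 0.3, 0.0, rng.lognormal(0.0, 1.0, 500))
fit = minimize(negloglik, x0=[0.0, 0.0, 0.0], args=(y, 0.1))
print(fit.x)   # recovers (logit p, mu, log sigma) approximately
```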

7.
8.
Hierarchical Bayes models for cDNA microarray gene expression
cDNA microarrays are used in many contexts to compare mRNA levels between samples of cells. Microarray experiments typically give expression measurements on 1,000–20,000 genes, but with few replicates per gene, so traditional methods using means and standard deviations to detect differential expression are not satisfactory in this context. A handful of alternative statistics have been developed, including several empirical Bayes methods. In the present paper we present two full hierarchical Bayes models for detecting differential gene expression, of which one (model D) describes our microarray data very well. We also compare the full Bayes and empirical Bayes approaches with respect to model assumptions, false discovery rates, and computer running time. The proposed models are compared to existing empirical Bayes models in a simulation study and on a data set (Yuen et al., 2002) in which 27 genes have been categorized by quantitative real-time PCR. It turns out that the existing empirical Bayes methods perform at least as well as the full Bayes ones.

9.
This paper concerns the estimation of the number of species in a population through a fully hierarchical Bayesian model using the Metropolis algorithm. The proposed Bayesian estimator is based on Poisson random variables whose means are distributed according to prior distributions with unknown hyperparameters. An empirical Bayes approach is also considered and compared with the fully Bayesian approach on biological data.
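A minimal sketch of the random-walk Metropolis sampler the abstract refers to, applied to a toy Poisson-gamma posterior rather than the full species-number hierarchy; the prior hyperparameters and proposal scale are assumed.

```python
# Sketch: random-walk Metropolis on a toy posterior,
# lambda ~ Gamma(a, b) prior with Poisson counts.
import numpy as np
from scipy.stats import gamma, poisson

rng = np.random.default_rng(9)
counts = rng.poisson(4.0, size=30)
a, b = 2.0, 1.0                        # assumed prior hyperparameters

def log_post(lam):
    if lam <= 0:
        return -np.inf
    return (gamma.logpdf(lam, a, scale=1 / b)
            + poisson.logpmf(counts, lam).sum())

lam, chain = 1.0, []
for _ in range(5000):
    prop = lam + rng.normal(0, 0.3)    # symmetric proposal
    if np.log(rng.random()) < log_post(prop) - log_post(lam):
        lam = prop                     # accept
    chain.append(lam)
print(np.mean(chain[1000:]))           # posterior mean, ~4
```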

10.
Mixture modeling provides an effective approach to the differential expression problem in microarray data analysis. Methods based on fully parametric mixture models are available, but lack of fit in some examples indicates that more flexible models may be beneficial. Existing, more flexible mixture models work at the level of one-dimensional gene-specific summary statistics, so when there are relatively few measurements per gene these methods may not be sensitive detectors of differential expression. We propose a hierarchical mixture model that is both sensitive in detecting differential expression and flexible enough to account for the complex variability of normalized microarray data. EM-based algorithms are used to fit both parametric and semiparametric versions of the model. We restrict attention to the two-sample comparison problem; an experiment involving Affymetrix microarrays and yeast translation provides the motivating case study. Gene-specific posterior probabilities of differential expression form the basis of statistical inference; they define short gene lists and false discovery rates. Compared to several competing methodologies, the proposed methodology exhibits good operating characteristics in a simulation study, in the analysis of spike-in data, and in a cross-validation calculation.
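The one-dimensional caricature of this idea is a two-component mixture fitted by EM to gene-level summary statistics, with the posterior probability of the non-null component playing the role of the per-gene probability of differential expression. The paper's hierarchical model is considerably richer; this sketch only illustrates the E- and M-steps.

```python
# Sketch: two-component normal mixture via EM on gene z-statistics.
# Null component fixed at N(0, 1); non-null component estimated.
import numpy as np
from scipy.stats import norm

def em_two_component(z, n_iter=200):
    p1, mu1, sd1 = 0.1, 2.0, 1.0            # non-null starting values
    for _ in range(n_iter):
        f0 = (1 - p1) * norm.pdf(z)          # null density
        f1 = p1 * norm.pdf(z, mu1, sd1)      # non-null density
        post = f1 / (f0 + f1)                # E-step
        p1 = post.mean()                     # M-step
        mu1 = np.sum(post * z) / post.sum()
        sd1 = np.sqrt(np.sum(post * (z - mu1) ** 2) / post.sum())
    return post

rng = np.random.default_rng(3)
z = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])
post = em_two_component(z)
print((post > 0.9).sum())                    # a short gene list
```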

11.
12.
    
An approach for determining the power of a case–cohort study for a single binary exposure variable and a low failure rate was recently proposed by Cai and Zeng (2004, Biometrics 60, 1015–1024). In this article, we show that computing power for a case–cohort study using a standard case–control method yields nearly identical levels of power. An advantage of the case–control approach is that existing sample size software can be used for the calculations. We also propose an additional formula for computing the power of a case–cohort study when the event is not rare.
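The case–control route the article recommends amounts to the standard two-proportion normal-approximation power formula, sketched below; the prevalences and group sizes are illustrative.

```python
# Sketch: power for comparing exposure prevalence between cases and
# controls with the usual two-proportion normal approximation.
import math
from scipy.stats import norm

def power_two_prop(p1, p0, n1, n0, alpha=0.05):
    """p1/p0: exposure prevalence in cases/controls; n1/n0: sizes."""
    pbar = (n1 * p1 + n0 * p0) / (n1 + n0)
    se0 = math.sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n0))  # under H0
    se1 = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    z = (abs(p1 - p0) - norm.ppf(1 - alpha / 2) * se0) / se1
    return norm.cdf(z)

print(power_two_prop(p1=0.30, p0=0.20, n1=150, n0=150))
```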

13.
Estimation of covariance matrices in small samples has been studied by many authors. Standard estimators, such as the unstructured maximum likelihood (ML) or restricted maximum likelihood (REML) estimator, can be very unstable, with the smallest estimated eigenvalues too small and the largest too big. A standard way to stabilize the estimate in small samples is to compute the ML or REML estimator under some simple structure that involves fewer parameters, such as compound symmetry or independence; however, these estimators are not consistent unless the hypothesized structure is correct. If interest focuses on estimating regression coefficients from correlated (or longitudinal) data, a sandwich estimator of the covariance matrix may be used to provide standard errors for the estimated coefficients that remain consistent under misspecification of the covariance structure, but with large matrices the inefficiency of the sandwich estimator becomes worrisome. We consider two general shrinkage approaches to estimating the covariance matrix and regression coefficients. The first shrinks the eigenvalues of the unstructured ML or REML estimator; the second shrinks an unstructured estimator toward a structured estimator. In both cases the data determine the amount of shrinkage. These estimators are consistent and give consistent and asymptotically efficient estimates of the regression coefficients. Simulations show the improved operating characteristics of the shrinkage estimators of the covariance matrix and the regression coefficients in finite samples. The final estimator chosen combines both approaches, shrinking the eigenvalues and then shrinking toward structure. We illustrate the approach on a sleep EEG study that requires estimation of a 24 × 24 covariance matrix and for which inferences on mean parameters depend critically on the covariance estimator chosen, and we recommend making inference with a particular shrinkage estimator that provides a reasonable compromise between structured and unstructured estimators.
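A sketch of the second shrinkage idea: pull an unstructured sample covariance toward a structured (here diagonal) target with a fixed mixing weight. Scikit-learn's LedoitWolf estimator, which chooses the weight from the data but shrinks toward a scaled identity, is shown for comparison. Neither is the article's final eigenvalue-plus-structure estimator.

```python
# Sketch: shrinking a sample covariance toward a structured target.
import numpy as np
from sklearn.covariance import LedoitWolf

def shrink_to_diagonal(X, lam):
    """Convex combination of the sample covariance and its diagonal."""
    S = np.cov(X, rowvar=False)
    return (1 - lam) * S + lam * np.diag(np.diag(S))

rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(24), np.eye(24), size=30)
print(np.linalg.cond(np.cov(X, rowvar=False)))      # unstable
print(np.linalg.cond(shrink_to_diagonal(X, 0.5)))   # tamer
print(np.linalg.cond(LedoitWolf().fit(X).covariance_))
```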

14.
When a trial involves an invasive laboratory test procedure or requires patients to commit to a restrictive test schedule, a large proportion of sampled patients can be lost through refusal to participate. Incorporating this possible loss of patients into the sample size calculation is therefore important at the planning stage of a study. In this paper, we generalize the sample size calculation procedure for the intraclass correlation by accounting for the random loss of patients at the beginning of a trial. We demonstrate that the simple ad hoc procedure of inflating the estimated sample size in the absence of loss by the factor 1/p0, where p0 is the retention probability for a randomly selected patient, is adequate when p0 is large (≥ 0.80). When p0 is small (i.e., with a high refusal rate), however, this simple ad hoc procedure tends to underestimate the required sample size. Furthermore, if the individual retention probability varies substantially among patients, the magnitude of this underestimation can be critical, and the simple direct adjustment procedure should be avoided in that situation.
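The ad hoc adjustment itself is one line of arithmetic, sketched below; per the article's warning, treat its output as a lower bound when p0 is small or retention varies across patients.

```python
# Sketch: the simple ad hoc loss adjustment, inflating the no-loss
# sample size n by 1/p0. The article shows this undershoots for
# small or heterogeneous retention probabilities.
import math

def adjust_for_loss(n, p0):
    return math.ceil(n / p0)

for p0 in (0.9, 0.8, 0.5, 0.3):
    print(p0, adjust_for_loss(120, p0))
```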

15.
Numerous initiatives are underway throughout New England and elsewhere to quantify salt marsh vegetation change, mostly in response to habitat restoration, sea level rise, and nutrient enrichment. Detecting temporal changes in vegetation at a marsh, or comparing vegetation among different marshes, with a degree of statistical certainty requires an adequate sample size. Based on sampling 1 m² vegetation plots from 11 New England salt marsh data sets, we conducted a power analysis to determine the minimum number of samples necessary to detect change between vegetation communities. Statistical power was determined for sample sizes of 5, 10, 15, and 20 vegetation plots at an alpha level of 0.05. Detecting subtle differences between vegetation data sets (e.g., comparing vegetation in the same marsh over two consecutive years) can be accomplished with a sample size of 20 plots, with a reasonable probability of detecting a difference when one truly exists. With a smaller sample size, and thus lower power, there is an increased probability of failing to detect a difference when one exists (a Type II error). However, if investigators expect to detect major changes in vegetation (e.g., between an unimpacted and a highly impacted marsh), then a sample size of 5, 10, or 15 plots may be appropriate while still maintaining adequate power. Given the relative ease of collecting vegetation data, we suggest a minimum sample size of 20 randomly located 1 m² plots when designing monitoring programs to detect vegetation community change in salt marshes. This sample size of 20 plots per New England salt marsh is appropriate regardless of marsh size or whether the plots are permanent.
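This kind of power calculation can be approximated with off-the-shelf software. A sketch at the plot counts the study considered, using a two-sample t-test on a univariate vegetation summary (the original analysis is multivariate) and illustrative Cohen's d effect sizes:

```python
# Sketch: power of a two-sample comparison at 5, 10, 15, 20 plots
# per marsh, for effect sizes from subtle to drastic.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (5, 10, 15, 20):
    for d in (0.5, 1.0, 2.0):   # Cohen's d, assumed values
        print(n, d, round(analysis.power(effect_size=d, nobs1=n,
                                         alpha=0.05), 2))
```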

16.
Large-scale microarray gene expression data make it possible to construct genetic networks or biological pathways, and Gaussian graphical models have been suggested as an effective framework for doing so. However, most available methods for constructing Gaussian graphs do not account for the sparsity of the networks and are computationally demanding or infeasible, especially in high-dimension, low-sample-size settings. We introduce a threshold gradient descent (TGD) regularization procedure for estimating the sparse precision matrix in Gaussian graphical models and demonstrate its application to identifying genetic networks. The procedure is computationally feasible and can easily incorporate prior biological knowledge about the network structure. Simulation results indicate that the proposed method yields a better estimate of the precision matrix than procedures that fail to account for the sparsity of the graph. We also present results on inference of a gene network for isoprenoid biosynthesis in Arabidopsis thaliana, demonstrating that the procedure can identify biologically meaningful genetic networks from microarray gene expression data.
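For illustration, here is sparse precision-matrix estimation on a toy chain-graph network using the graphical lasso (an L1-penalized estimator) in place of the article's threshold gradient descent; the goal is the same, since zeros in the precision matrix correspond to absent edges in the Gaussian graphical model.

```python
# Sketch: recover a sparse precision matrix in the p < n regime
# with the graphical lasso (not the article's TGD procedure).
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(6)
# Chain graph: each variable depends only on its neighbors
prec = np.eye(20) + np.diag([0.4] * 19, 1) + np.diag([0.4] * 19, -1)
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(20), cov, size=50)

model = GraphicalLassoCV().fit(X)
edges = np.abs(model.precision_) > 1e-4
np.fill_diagonal(edges, False)
print(edges.sum() // 2, "edges recovered (truth: 19)")
```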

17.
18.
    
Keleş S. Biometrics 2007, 63(1):10–21.
Chromatin immunoprecipitation followed by DNA microarray analysis (the ChIP-chip methodology) is an efficient way of mapping genome-wide protein–DNA interactions. Data from tiling arrays comprise DNA–protein interaction measurements on thousands or millions of short oligonucleotides (probes) tiling a whole chromosome or genome. We propose a new model-based method for analyzing ChIP-chip data, motivated by the widely used two-component multinomial mixture model of de novo motif finding. It uses a hierarchical gamma mixture model of binding intensities while incorporating the inherent spatial structure of the data. In this model, genomic regions belong to one of two general groups: regions with a local protein–DNA interaction (a peak) and regions lacking it. Individual probes within a genomic region are allowed to have different localization rates, accommodating different binding affinities. A novel feature of the model is a distribution for the peak size derived from the experimental design and parameters, which relaxes the fixed-peak-size assumption commonly employed when computing test statistics for these types of spatial data. Simulation studies and a real data application demonstrate good operating characteristics of the method, including high sensitivity with small sample sizes compared to available alternative methods.

19.
Tai YC, Speed TP. Biometrics 2009, 65(1):40–51.
Consider ranking genes using data from replicated microarray time course experiments with multiple biological conditions, where the genes of interest are those whose temporal profiles differ across conditions. We derive a multi-sample multivariate empirical Bayes statistic for ranking genes in order of differential expression, from both longitudinal and cross-sectional replicated developmental microarray time course data. Our longitudinal multi-sample model assumes that time course replicates are independent and identically distributed multivariate normal vectors, while the cross-sectional model uses a normal regression framework with any appropriate basis for the design matrices. In both cases, natural conjugate priors in the empirical Bayes setting guarantee closed-form solutions for the posterior odds. Simulations and two case studies using published worm and mouse microarray time course data sets indicate that the proposed approaches perform satisfactorily.
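A non-moderated cousin of this ranking statistic is the one-sample Hotelling T² computed on the per-gene replicate profile vectors, sketched below on toy data; the empirical Bayes statistic additionally shrinks the gene-wise covariance matrices, which this omits.

```python
# Sketch: rank time-course genes by one-sample Hotelling T^2 on
# replicate profile vectors (replicates x timepoints per gene).
import numpy as np

def hotelling_t2(X):
    """X: replicates x timepoints for one gene."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # timepoint covariance
    return n * xbar @ np.linalg.solve(S, xbar)

rng = np.random.default_rng(10)
null_gene = rng.normal(0, 1, (8, 5))          # 8 replicates, 5 times
de_gene = null_gene + np.linspace(0, 2, 5)    # drifting profile
print(hotelling_t2(null_gene), hotelling_t2(de_gene))
```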

20.
Cui L, Hung HM, Wang SJ. Biometrics 1999, 55(3):853–857.
In group sequential clinical trials, sample size reestimation can be a complicated issue when it allows the change of sample size to be influenced by an observed sample path. Our simulation studies show that increasing the sample size based on an interim estimate of the treatment difference can substantially inflate the probability of type I error in most practical situations. A new group sequential test procedure is developed by modifying the weights used in the traditional repeated significance two-sample mean test. The new test preserves the type I error probability at the target level and can provide a substantial gain in power when the sample size is increased.
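The weighting idea can be sketched directly: fix the stage weights at the planned information fractions and standardize the second-stage statistic by its own (possibly reestimated) sample size, so the combination stays standard normal under the null. The null simulation below, with an interim-data-driven sample size increase, is illustrative rather than a reproduction of the paper's procedure.

```python
# Sketch: fixed-weight combination of stage-wise z-statistics.
# Under the null, z1 and z2 are independent N(0,1) regardless of
# how the stage-2 sample size was chosen, so the combined statistic
# keeps its level.
import numpy as np
from scipy.stats import norm

def chw_statistic(z1, z2, t_planned):
    """z1: stage-1 z; z2: z from stage-2 data only;
    t_planned: planned fraction of information in stage 1."""
    return np.sqrt(t_planned) * z1 + np.sqrt(1 - t_planned) * z2

rng = np.random.default_rng(7)
n1, n2 = 50, 50
rejections = 0
for _ in range(20000):
    x1 = rng.normal(0, 1, (n1, 2))                 # two arms, no effect
    z1 = (x1[:, 0].mean() - x1[:, 1].mean()) / np.sqrt(2 / n1)
    n2_new = n2 * (4 if z1 > 1.0 else 1)           # data-driven increase
    x2 = rng.normal(0, 1, (n2_new, 2))
    z2 = (x2[:, 0].mean() - x2[:, 1].mean()) / np.sqrt(2 / n2_new)
    rejections += chw_statistic(z1, z2, 0.5) > norm.ppf(0.975)
print(rejections / 20000)   # close to the nominal one-sided 0.025
```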
