Similar Documents
 20 similar documents retrieved (search time: 31 ms)
1.
Rosetta error model for gene expression analysis   (cited 4 times: 0 self-citations, 4 by others)
MOTIVATION: In microarray gene expression studies, the number of replicated microarrays is usually small because of cost and sample availability, resulting in unreliable variance estimation and thus unreliable statistical hypothesis tests. The unreliable variance estimation is further complicated by the fact that the technology-specific variance is intrinsically intensity-dependent. RESULTS: The Rosetta error model captures the variance-intensity relationship for various types of microarray technologies, such as single-color arrays and two-color arrays. This error model conservatively estimates intensity error and uses this value to stabilize the variance estimation. We present two commonly used error models: the intensity error model for single-color microarrays and the ratio error model for two-color microarrays or ratios built from two single-color arrays. We present examples to demonstrate the strength of our error models in improving the statistical power of microarray data analysis, particularly in increasing expression detection sensitivity and specificity when the number of replicates is limited.
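A minimal sketch of the idea, not Rosetta's actual (unpublished-here) parameterization: assume total error has an additive floor plus a multiplicative, intensity-proportional component, and scale per-gene differences by the model-predicted error rather than the noisy per-gene sample SD. All parameter values are invented.

```r
## Hypothetical intensity-dependent error model: additive floor + multiplicative term.
error_model <- function(intensity, add_err = 30, mult_err = 0.15) {
  sqrt(add_err^2 + (mult_err * intensity)^2)
}

## Variance-stabilized test statistic for one gene: the observed difference is
## scaled by the model-predicted error instead of the 3-replicate sample SD.
stabilized_z <- function(x1, x2) {
  d    <- mean(x1) - mean(x2)
  pred <- error_model(mean(c(x1, x2)))
  d / (pred * sqrt(1 / length(x1) + 1 / length(x2)))
}

set.seed(1)
g1 <- rnorm(3, mean = 500, sd = error_model(500))  # 3 replicates, condition 1
g2 <- rnorm(3, mean = 800, sd = error_model(800))  # 3 replicates, condition 2
stabilized_z(g1, g2)
```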

2.
3.

Background  

Microarrays permit biologists to simultaneously measure the mRNA abundance of thousands of genes. An important issue facing investigators planning microarray experiments is how to estimate the sample size required for good statistical power. What is the projected sample size or number of replicate chips needed to address the multiple hypotheses with acceptable accuracy? Statistical methods exist for calculating power based upon a single hypothesis, using estimates of the variability in data from pilot studies. There is, however, a need for methods to estimate power and/or required sample sizes in situations where multiple hypotheses are being tested, such as in microarray experiments. In addition, investigators frequently do not have pilot data to estimate the sample sizes required for microarray studies.
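One simple way to fold the multiple-hypothesis issue into a standard per-gene power calculation is to tighten the per-test significance level; a Bonferroni-adjusted sketch using base R's power.t.test (effect size and SD here are illustrative assumptions, not pilot estimates):

```r
## Per-gene power for a two-group comparison when the threshold is
## Bonferroni-adjusted for m simultaneously tested genes.
m     <- 10000          # genes tested
alpha <- 0.05 / m       # family-wise adjusted per-gene threshold
power.t.test(n = 5, delta = 1, sd = 0.7, sig.level = alpha)$power

## Replicates per group needed for 80% power at the same threshold:
power.t.test(power = 0.8, delta = 1, sd = 0.7, sig.level = alpha)$n
```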

4.
Conventional methods of sample size calculation for population-based longitudinal studies tend to overestimate statistical power by overlooking important determinants of the required sample size, such as measurement error and unmeasured etiological determinants. In contrast, a simulation-based sample size calculation, if designed properly, allows these determinants to be taken into account and offers flexibility in accommodating complex study design features. The Canadian Longitudinal Study on Aging (CLSA) is a Canada-wide, 20-year follow-up study of 30,000 people between the ages of 45 and 85 years, with in-depth information collected every 3 years. A simulation study, based on an illness-death model, was conducted to: (1) investigate the statistical power profile of the CLSA to detect the effect of environmental and genetic risk factors, and their interaction, on age-related chronic diseases; and (2) explore design alternatives and implementation strategies for increasing the statistical power of population-based longitudinal studies in general. The results showed that the statistical power to identify the effect of environmental and genetic risk exposures, and their interaction, on a disease was boosted when: (1) the prevalence of the risk exposures increased; (2) the disease of interest was relatively common in the population; and (3) the risk exposures were measured accurately. In addition, collecting data every 3 years in the CLSA led to slightly lower statistical power than a design in which participants underwent health monitoring continuously. The CLSA had sufficient power to detect a small (1 < hazard ratio (HR) ≤ 1.5) or moderate (1.5 < HR ≤ 2.0) effect of the environmental risk exposure, as long as the risk exposure and the disease of interest were not rare. It had enough power to detect a moderate or large (2.0 < HR ≤ 3.0) effect of the genetic risk exposure when the prevalence of the risk exposure was not very low (≥0.1) and the disease of interest was not rare (such as diabetes or dementia). The CLSA had enough power to detect a large effect of the gene-environment interaction only when both risk exposures had relatively high prevalence (0.2) and the disease of interest was very common (such as diabetes). The minimum detectable hazard ratios (MDHR) of the CLSA for the environmental and genetic risk exposures obtained from this simulation study were larger than those calculated with the conventional sample size method. For example, the MDHR for the environmental risk exposure was 1.15 according to the conventional method when the prevalence of the risk exposure was 0.1 and the disease of interest was dementia; in this simulation study, the MDHR was 1.61 when the same exposure was measured every 3 years with a misclassification rate of 0.1. For a given sample size, higher statistical power could be achieved by increasing the measurement frequency in participants at high risk of declining health status or of changing risk exposures, and by increasing the measurement accuracy of diseases and risk exposures. A properly designed simulation-based sample size calculation is superior to conventional methods when rigorous sample size calculation is necessary.
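A stripped-down sketch of the simulation approach, with the illness-death model replaced by a single-event Cox model and a misclassified binary exposure; every parameter below is an invented stand-in, not a CLSA value:

```r
## Simulation-based power: simulate exposure (with misclassification),
## event times under a true HR, fit a Cox model, count rejections.
library(survival)

sim_power <- function(n = 2000, hr = 1.5, prev = 0.2, misclass = 0.1,
                      nsim = 200, alpha = 0.05) {
  hits <- replicate(nsim, {
    x_true <- rbinom(n, 1, prev)
    flip   <- rbinom(n, 1, misclass) == 1
    x_obs  <- ifelse(flip, 1 - x_true, x_true)   # exposure measurement error
    t_evt  <- rexp(n, rate = 0.01 * hr^x_true)   # event times under the true HR
    t_cens <- runif(n, 0, 60)                    # administrative censoring
    time   <- pmin(t_evt, t_cens)
    status <- as.integer(t_evt <= t_cens)
    fit    <- coxph(Surv(time, status) ~ x_obs)
    summary(fit)$coefficients[, "Pr(>|z|)"] < alpha
  })
  mean(hits)
}

set.seed(42)
sim_power()   # power is attenuated relative to an error-free exposure
```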

5.
Numerous initiatives are underway throughout New England and elsewhere to quantify salt marsh vegetation change, mostly in response to habitat restoration, sea level rise, and nutrient enrichment. To detect temporal changes in vegetation at a marsh, or to compare vegetation among different marshes with a degree of statistical certainty, an adequate sample size is required. Based on sampling 1 m2 vegetation plots from 11 New England salt marsh data sets, we conducted a power analysis to determine the minimum number of samples necessary to detect change between vegetation communities. Statistical power was determined for sample sizes of 5, 10, 15, and 20 vegetation plots at an alpha level of 0.05. Detection of subtle differences between vegetation data sets (e.g., comparing vegetation in the same marsh over two consecutive years) can be accomplished using a sample size of 20 plots, with a reasonable probability of detecting a difference when one truly exists. With a lower sample size, and thus lower power, there is an increased probability of not detecting a difference when one exists (i.e., a Type II error). However, if investigators expect to detect major changes in vegetation (e.g., between an un-impacted and a highly impacted marsh), then a sample size of 5, 10, or 15 plots may be appropriate while still maintaining adequate power. Due to the relative ease of collecting vegetation data, we suggest a minimum sample size of 20 randomly located 1 m2 plots when developing monitoring designs to detect vegetation community change in salt marshes. The sample size of 20 plots per New England salt marsh is appropriate regardless of marsh size or permanency (permanent or non-permanent) of the plots.
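The original analysis used multivariate vegetation data; this univariate sketch only illustrates how power climbs across the same plot counts for a two-marsh comparison of mean percent cover (the effect size and SD are invented):

```r
## Power at n = 5, 10, 15, 20 plots per marsh, alpha = 0.05.
sapply(c(5, 10, 15, 20), function(n)
  power.t.test(n = n, delta = 15, sd = 12, sig.level = 0.05)$power)
```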

6.
Many medical and biological studies entail classifying a number of observations according to two factors, where one has two and the other three possible categories. This is the case of, for example, genetic association studies of complex traits with single-nucleotide polymorphisms (SNPs), where the a priori statistical planning, analysis, and interpretation of results are of critical importance. Here, we present methodology to determine the minimum sample size required to detect dependence in 2 × 3 tables based on Fisher's exact test, assuming that neither of the two margins is fixed and only the grand total N is known in advance. We provide the numerical tools necessary to determine these sample sizes for desired power, significance level, and effect size, where only the computational time can be a limitation for extreme parameter values. These programs can be accessed at . This solution of the sample size problem for an exact test will permit experimentalists to plan efficient sampling designs, determine the extent of statistical support for their hypotheses, and gain insight into the repeatability of their results. We apply this solution to the sample size problem in three empirical studies, and discuss the results with specified power and nominal significance levels.
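A simulation version of the same calculation (the paper's tools are exact; this Monte Carlo sketch with invented cell probabilities just shows the setup in which only the grand total N is fixed):

```r
## Power of Fisher's exact test for a 2 x 3 table with multinomial cells.
fisher_power <- function(N = 120,
                         probs = c(0.10, 0.20,   # column 1: rows 1, 2
                                   0.20, 0.20,   # column 2
                                   0.15, 0.15),  # column 3
                         nsim = 500, alpha = 0.05) {
  mean(replicate(nsim, {
    tab <- matrix(rmultinom(1, N, probs), nrow = 2)
    fisher.test(tab)$p.value < alpha
  }))
}

set.seed(7)
fisher_power()   # increase N until this reaches the desired power
```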

7.
Hu Z, Xu S. Heredity. 2008;101(1):48-52.
We developed a simple method for calculating the statistical power for detecting a QTL located in an interval flanked by two markers. The statistical method for QTL detection is assumed to be Haley and Knott's simple regression method of interval mapping. This method allows us to answer one of the fundamental questions in designing a QTL mapping experiment: what is the minimum marker density required to detect a QTL explaining a certain heritable proportion of the phenotypic variance (denoted by h²) with power γ under a Type I error α in an F₂ or other mating design with a sample size n? Computing the statistical power only requires the ability to evaluate a non-central F-distribution function and the inverse of this distribution function.
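In R the required non-central F evaluation is one call to pf with an ncp argument. The non-centrality used below, λ = n·h²/(1 − h²), is an assumed simplification that ignores Hu and Xu's marker-density adjustment; degrees of freedom are for an F₂ additive-plus-dominance test:

```r
## Power to detect a QTL explaining h2 of the phenotypic variance.
qtl_power <- function(n, h2, alpha = 0.05, df1 = 2) {
  ncp   <- n * h2 / (1 - h2)                 # assumed non-centrality
  fcrit <- qf(1 - alpha, df1, n - df1 - 1)   # critical value under the null
  1 - pf(fcrit, df1, n - df1 - 1, ncp = ncp)
}

qtl_power(n = 200, h2 = 0.05)   # QTL explaining 5% of phenotypic variance
```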

8.
The transmission/disequilibrium test (TDT) and the affected sib pair test (ASP) both test for the association of a marker allele with a condition of interest. Here, we present methods for calculating the probability of detecting the association (power) for a study examining a fixed number of families for suitability for the study, and for calculating the number of such families to be examined. Both calculations use a genetic model for the association. The model posits a bi-allelic marker locus linked to a bi-allelic disease locus, with a possibly nonzero recombination fraction between the loci. The penetrance of the disease is an increasing function of the number of disease alleles. The TDT tests whether the transmission by a heterozygous parent of a particular allele at a marker locus to an affected offspring occurs with probability greater than 0.5. The ASP tests whether transmission of the same allele to two affected sibs occurs with probability greater than 0.5. In either case, evidence that the probability is greater than 0.5 is evidence for association between the marker and the disease. Study inclusion criteria (IC) can greatly affect the necessary sample size of a TDT or ASP study. The IC we consider include requiring a randomly selected parent, at least one parent, or both parents to be heterozygous; a specified minimum number of affected offspring can also be required (TDT only). We use elementary probability calculations rather than complex mathematical manipulations or asymptotic methods (large-sample approximations) to compute power and the requisite sample size for a proposed study. The advantages of these methods are simplicity and generality.
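The elementary-probability flavor of the TDT power calculation can be seen in a few lines: transmissions from heterozygous parents are Bernoulli trials, so exact power comes from the binomial distribution (the paper additionally models how inclusion criteria and the genetic model determine n and p; those steps are omitted here):

```r
## Exact one-sided binomial power for the TDT: n informative transmissions,
## true transmission probability p; reject when the count exceeds the
## critical value computed under the null of p = 0.5.
tdt_power <- function(n, p, alpha = 0.05) {
  crit <- qbinom(1 - alpha, n, 0.5)   # one-sided critical count under the null
  1 - pbinom(crit, n, p)              # P(count > crit | true p)
}

tdt_power(n = 150, p = 0.6)
```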

9.
Sun W. Biometrics. 2012;68(1):1-11.
RNA-seq may replace gene expression microarrays in the near future. Using RNA-seq, the expression of a gene can be estimated using the total number of sequence reads mapped to that gene, known as the total read count (TReC). Traditional expression quantitative trait locus (eQTL) mapping methods, such as linear regression, can be applied to TReC measurements after they are properly normalized. In this article, we show that eQTL mapping, by directly modeling TReC using discrete distributions, has higher statistical power than the two-step approach: data normalization followed by linear regression. In addition, RNA-seq provides information on allele-specific expression (ASE) that is not available from microarrays. By combining the information from TReC and ASE, we can computationally distinguish cis- and trans-eQTL and further improve the power of cis-eQTL mapping. Both simulation and real data studies confirm the improved power of our new methods. We also discuss the design issues of RNA-seq experiments. Specifically, we show that by combining TReC and ASE measurements, it is possible to minimize cost and retain the statistical power of cis-eQTL mapping by reducing sample size while increasing the number of sequence reads per sample. In addition to RNA-seq data, our method can also be employed to study the genetic basis of other types of sequencing data, such as chromatin immunoprecipitation followed by DNA sequencing data. In this article, we focus on eQTL mapping of a single gene using the association-based method. However, our method establishes a statistical framework for future developments of eQTL mapping methods using RNA-seq data (e.g., linkage-based eQTL mapping), and the joint study of multiple genetic markers and/or multiple genes.
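The core contrast — modeling TReC directly with a discrete distribution versus normalize-then-regress — can be sketched with a negative binomial GLM; this is simulated data and a generic model, not the authors' implementation:

```r
## Direct count model (negative binomial GLM with a library-size offset)
## versus the two-step approach (normalize, then linear regression).
library(MASS)

set.seed(3)
n     <- 100
geno  <- rbinom(n, 2, 0.3)               # additive genotype coded 0/1/2
depth <- runif(n, 0.5, 2)                # per-sample library-size factor
mu    <- depth * exp(4 + 0.3 * geno)     # eQTL effect on the count scale
trec  <- rnbinom(n, mu = mu, size = 5)   # total read counts

nb_fit <- glm.nb(trec ~ geno + offset(log(depth)))   # direct count model
lm_fit <- lm(log((trec + 1) / depth) ~ geno)          # normalize-then-regress

summary(nb_fit)$coefficients["geno", "Pr(>|z|)"]
summary(lm_fit)$coefficients["geno", "Pr(>|t|)"]
```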

10.
MOTIVATION: We present statistical methods for determining the number of per-gene replicate spots required in microarray experiments. The purpose of these methods is to obtain an estimate of the sampling variability present in microarray data, and to determine the number of replicate spots required to achieve a high probability of detecting a significant fold change in gene expression while maintaining a low error rate. Our approach is based on data from control microarrays and involves the use of standard statistical estimation techniques. RESULTS: After analyzing two experimental data sets containing control array data, we were able to determine the statistical power available for the detection of significant differential expression given differing levels of replication. The inclusion of replicate spots on microarrays not only allows more accurate estimation of the variability present in an experiment but, more importantly, increases the probability of detecting genes undergoing significant fold changes in expression, while substantially decreasing the probability of observing fold changes due to chance rather than true differential expression.
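The two steps described — estimate spot-level variability from control data, then compute power across replication levels — can be sketched as follows (control values here are simulated, not the study's data):

```r
## Step 1: estimate spot-level SD from control (self-self) log-ratios.
set.seed(11)
ctrl_logratio <- rnorm(500, mean = 0, sd = 0.35)   # stand-in for control spots
s <- sd(ctrl_logratio)                             # sampling-variability estimate

## Step 2: power to call a 2-fold change with r = 2..6 replicate spots.
sapply(2:6, function(r)
  power.t.test(n = r, delta = log2(2), sd = s, sig.level = 0.01,
               type = "one.sample")$power)
```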

11.
The promise of microarray technology in providing prediction classifiers for cancer outcome estimation has been confirmed by a number of demonstrable successes. However, the reliability of prediction results relies heavily on the accuracy of the statistical parameters involved in the classifiers, and these cannot be reliably estimated with only a small number of training samples. It is therefore of vital importance to determine the minimum number of training samples required to ensure the clinical value of microarrays in cancer outcome prediction. We evaluated the impact of training sample size on model performance extensively, based on 3 large-scale cancer microarray datasets provided by the second phase of the MicroArray Quality Control project (MAQC-II). An SSNR-based (scale of signal-to-noise ratio) protocol is proposed in this study for determining the minimum training sample size. External validation results based on another 3 cancer datasets confirmed that the SSNR-based approach could not only determine the minimum number of training samples efficiently, but also provide a valuable strategy for estimating the underlying performance of classifiers in advance. Once translated into routine clinical applications, the SSNR-based protocol would greatly facilitate microarray-based cancer outcome prediction by improving classifier reliability.
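The abstract does not specify the SSNR protocol itself, so the sketch below is only the generic learning-curve idea underneath it — pick the smallest training size beyond which classifier performance stops improving materially — on simulated data with an ordinary logistic model:

```r
## Generic learning curve: held-out accuracy as training size grows.
set.seed(5)
n_all <- 600
x     <- matrix(rnorm(n_all * 20), n_all)            # 20 "expression" features
y     <- rbinom(n_all, 1, plogis(x[, 1] + x[, 2]))   # outcome tied to 2 features
dat   <- data.frame(y = y, x1 = x[, 1], x2 = x[, 2])

acc_at <- function(n_train) {
  tr  <- sample(n_all, n_train)
  fit <- glm(y ~ x1 + x2, data = dat[tr, ], family = binomial)
  prd <- predict(fit, newdata = dat[-tr, ], type = "response")
  mean((prd > 0.5) == dat$y[-tr])                    # held-out accuracy
}

## Average accuracy at each candidate training size:
sapply(c(25, 50, 100, 200, 400), function(n) mean(replicate(20, acc_at(n))))
```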

12.
Scientists who use animals in research must justify the number of animals to be used, and committees that review proposals to use animals in research must review this justification to ensure the appropriateness of the number of animals to be used. This article discusses when the number of animals to be used can best be estimated from previous experience and when a simple power and sample size calculation should be performed. Even complicated experimental designs requiring sophisticated statistical models for analysis can usually be simplified to a single key or critical question, so that simple formulae can be used to estimate the required sample size. Approaches to sample size estimation for various types of hypotheses are described, and equations are provided in the Appendix. Several web sites are cited for more information and for performing actual calculations.
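The article's own equations are in its Appendix and may differ; shown here is the standard normal-approximation formula for a two-group comparison, n per group = 2(z₁₋α/₂ + z₁₋β)²(σ/δ)², as a one-function R sketch:

```r
## Animals per group to detect a mean difference delta with SD sd.
n_per_group <- function(delta, sd, alpha = 0.05, power = 0.8) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(2 * z^2 * (sd / delta)^2)
}

n_per_group(delta = 10, sd = 8)   # e.g., detect a 10-unit difference, SD 8
```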

13.
Selective phenotyping for increased efficiency in genetic mapping studies   (cited 3 times: 0 self-citations, 3 by others)
Jin C, Lan H, Attie AD, Churchill GA, Bulutuglo D, Yandell BS. Genetics. 2004;168(4):2285-2293.
The power of a genetic mapping study depends on the heritability of the trait, the number of individuals included in the analysis, and the genetic dissimilarity among them. In experiments that involve microarrays or other complex physiological assays, phenotyping can be expensive and time-consuming and may impose limits on the sample size. A random selection of individuals may not provide sufficient power to detect linkage until a large sample size is reached. We present an algorithm for selecting a subset of individuals solely on the basis of genotype data that can achieve substantial improvements in sensitivity compared to a random sample of the same size. The selective phenotyping method involves preferentially selecting individuals to maximize their genotypic dissimilarity. Selective phenotyping is most effective when prior knowledge of genetic architecture allows us to focus on specific genetic regions. However, it can also provide modest improvements in efficiency when applied on a whole-genome basis. Importantly, selective phenotyping does not reduce the efficiency of mapping as compared to a random sample in regions that are not considered in the selection process. In contrast to selective genotyping, inferences based solely on a selectively phenotyped population of individuals are representative of the whole population. The substantial improvement introduced by selective phenotyping is particularly useful when phenotyping is difficult or costly and thus limits the sample size in a genetic mapping study.
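A greedy maximin sketch of the selection idea (the authors' exact algorithm may differ): repeatedly add the individual farthest, in genotype space, from everyone already selected.

```r
## Select k individuals maximizing genotypic dissimilarity (greedy maximin).
select_diverse <- function(geno, k) {
  d   <- as.matrix(dist(geno, method = "manhattan"))  # genotypic dissimilarity
  sel <- which.max(rowSums(d))                        # seed: most extreme individual
  while (length(sel) < k) {
    cand <- setdiff(seq_len(nrow(geno)), sel)
    ## add the candidate whose minimum distance to the selected set is largest
    sel  <- c(sel, cand[which.max(apply(d[cand, sel, drop = FALSE], 1, min))])
  }
  sel
}

set.seed(9)
G <- matrix(sample(0:2, 100 * 50, replace = TRUE), 100)  # 100 inds x 50 markers
select_diverse(G, k = 20)
```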

14.
Accommodating general patterns of confounding in sample size/power calculations for observational studies is extremely challenging, both technically and scientifically. While employing previously implemented sample size/power tools is appealing, they typically ignore important aspects of the design/data structure. In this paper, we show that sample size/power calculations that ignore confounding can be much more unreliable than is conventionally thought; using real data from the US state of North Carolina, naive calculations yield sample size estimates that are half those obtained when confounding is appropriately acknowledged. Unfortunately, eliciting realistic design parameters for confounding mechanisms is difficult. To overcome this, we propose a novel two-stage strategy for observational study design that can accommodate arbitrary patterns of confounding. At the first stage, researchers establish bounds for power that facilitate the decision of whether or not to initiate the study. At the second stage, internal pilot data are used to estimate key scientific inputs that can be used to obtain realistic sample size/power. Our results indicate that the strategy is effective at replicating gold standard calculations based on knowing the true confounding mechanism. Finally, we show that consideration of the nature of confounding is a crucial aspect of the elicitation process; depending on whether the confounder is positively or negatively associated with the exposure of interest and the outcome, naive power calculations can either under- or overestimate the required sample size. Throughout, simulation is advocated as the only general means to obtain realistic estimates of statistical power; we describe, and provide in an R package, a simple algorithm for estimating power for a case-control study.
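Not the authors' R package — a bare-bones cohort-style logistic simulation standing in for the case-control design, with all parameters invented, showing how a confounder that shifts both exposure prevalence and outcome risk enters the power calculation:

```r
## Simulated power for the exposure effect, adjusting for a confounder C.
cc_power <- function(n = 1000, or_exp = 1.5, nsim = 200, alpha = 0.05) {
  mean(replicate(nsim, {
    C <- rbinom(n, 1, 0.4)
    E <- rbinom(n, 1, plogis(-1 + 1.2 * C))                  # C raises exposure odds
    Y <- rbinom(n, 1, plogis(-2 + log(or_exp) * E + 0.8 * C))# C raises outcome odds
    fit <- glm(Y ~ E + C, family = binomial)
    summary(fit)$coefficients["E", "Pr(>|z|)"] < alpha
  }))
}

set.seed(13)
cc_power()
```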

15.
OBJECTIVE: To develop an optimal sampling strategy for tissue microarrays using automated digital analysis for androgen receptor (heterogeneous expression) and the cellular proliferation marker Ki-67 (homogeneous expression and evaluated by others using nonautomated methods). STUDY DESIGN: Tissue microarrays were constructed from 23 radical prostatectomy specimens and immunostained for androgen receptor expression and cellular proliferation. Automated digital image analysis was used, and the minimum number of cores necessary to capture variance change <3% was determined. Androgen receptor immunostaining was described by percent positive nuclei (PPN) and mean optical density (MOD). RESULTS: Androgen receptor PPN variance measurements showed that 5 cores should be obtained when a single block of a radical prostatectomy specimen contained cancer. If all of 15 blocks contained cancer, 2 cores should be obtained from each of 6 blocks. An optimal sampling strategy was developed for androgen receptor PPN, androgen receptor MOD and Ki-67 PPN. CONCLUSION: The selection of the number of cores to sample is a tradeoff between the number of cores available that contain cancer and the amount of work involved in the analysis. Sampling no fewer than 5 but no more than 12 cores per radical prostatectomy specimen can capture tissue heterogeneity.
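A resampling sketch of the core-count question — how fast does the variance of the mean PPN estimate fall as cores are added, and where does the relative change drop below 3%? Spot-level values below are simulated, not the study's data:

```r
## Variance of the mean PPN estimate as a function of cores sampled.
set.seed(21)
cores <- rnorm(40, mean = 60, sd = 18)   # PPN values for all available cores

var_of_mean <- sapply(1:12, function(k)
  var(replicate(2000, mean(sample(cores, k)))))

## Relative variance change per added core; pick the smallest k past
## which this drops below 0.03.
round(diff(var_of_mean) / var_of_mean[-length(var_of_mean)], 3)
```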

16.
MOTIVATION: There is no widely applicable method to determine the sample size for experiments basing statistical significance on the false discovery rate (FDR). RESULTS: We propose and develop the anticipated FDR (aFDR) as a conceptual tool for determining sample size. We derive mathematical expressions for the aFDR and the anticipated average statistical power. These expressions are used to develop a general algorithm to determine sample size. We provide specific details on how to implement the algorithm for k-group (k ≥ 2) comparisons. The algorithm performs well for k-group comparisons in a series of traditional simulations and in a real-data simulation conducted by resampling from a large, publicly available dataset. AVAILABILITY: Documented S-plus and R code libraries are freely available from www.stjuderesearch.org/depts/biostats.
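A minimal reading of the anticipated FDR (the paper's derivation is more general): with m0 true nulls tested at level α and m1 non-nulls detected with average power pw, aFDR ≈ m0·α / (m0·α + m1·pw); sample size is then the smallest n whose average power keeps the aFDR below the target. All counts below are invented:

```r
## Anticipated FDR and its use in a simple two-group sample-size search.
afdr <- function(m0, m1, alpha, pw) (m0 * alpha) / (m0 * alpha + m1 * pw)

pw_at <- function(n)                     # average per-gene power at n per group
  power.t.test(n = n, delta = 1, sd = 1, sig.level = 0.001)$power

sapply(c(5, 10, 15), function(n) afdr(9500, 500, 0.001, pw_at(n)))
```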

17.
THE POWER OF SENSORY DISCRIMINATION METHODS   (cited 8 times: 1 self-citation, 7 by others)
Difference testing methods are extensively used in a variety of applications, from small sensory evaluation tests to large-scale consumer tests. A central issue in the use of these tests is their statistical power: the probability that, if a specified difference exists, it will be demonstrated as a significant difference in a difference test. A general equation for the power of any discrimination method is given. A general equation for the sample size required to meet Type I and Type II error specifications is also given. Sample size tables for the 2-alternative forced choice (2-AFC), 3-AFC, duo-trio, and triangular methods are given, as are tables of the psychometric functions for the 2-AFC, 3-AFC, triangular, and duo-trio methods.
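For any forced-choice discrimination method the power calculation reduces to a binomial comparison between the guessing probability and the true proportion correct; a sketch for the triangular method (guessing probability 1/3; the assessor proportion pc is an invented example):

```r
## Exact binomial power for the triangular method.
triangle_power <- function(n, pc, alpha = 0.05) {
  crit <- qbinom(1 - alpha, n, 1/3)   # one-sided critical count under guessing
  1 - pbinom(crit, n, pc)             # P(count > crit | true proportion pc)
}

triangle_power(n = 60, pc = 0.5)      # e.g., 50% correct among assessors
```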

18.
19.

Background  

Much of the publicly accessible cancer microarray data is asymmetric, belonging to datasets containing no samples from normal tissue. Asymmetric data cannot be used in standard meta-analysis approaches (such as the inverse variance method) to obtain the large sample sizes needed for statistical power enrichment. Noting that plenty of normal-tissue microarray samples exist in studies not involving cancer, we investigated the viability and accuracy of an integrated microarray analysis approach based on significance analysis of microarrays (merged SAM), using a collection of data from separate diseased and normal samples.

20.
AMaCAID is an R program designed to analyse multilocus genotypic patterns in large samples. It allows (i) the computation of the number and frequency of the different multilocus patterns available in a molecular data set and (ii) the analysis of the discriminatory power of each combination of k markers among the n available. It thus enables the identification of the minimum number of markers required to distinguish all the observed genotypes and of the subset of markers that maximizes the number of distinct genotypes. AMaCAID can be used with any kind of molecular marker, on data sets mixing different kinds of markers, and also on qualitative characters such as morphological or taxonomic traits. AMaCAID was built primarily to select subsets of markers for identifying accessions and monitoring their genetic stability during regeneration cycles in an ex situ genebank. It can, however, also be used to screen any kind of data set that characterizes a set of individuals or species (e.g. taxonomic or phylogenetic studies) for discrimination purposes. The size of the assayed sample has no limitation, but the program only performs computations over all combinations of markers when there are fewer than 25 markers. For larger numbers of markers/characters, it is possible to ask AMaCAID to screen a large but limited number of combinations of markers. We apply AMaCAID to three data sets involving either molecular or taxonomic data and give some results on the computing time of the program with respect to the size of the data set.
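The core computation — for each combination of k markers, count the distinct multilocus patterns it resolves, and keep the best combination — fits in a few lines of base R (this is a sketch of the idea, not the AMaCAID program itself):

```r
## Find the k-marker combination that distinguishes the most genotypes.
best_k_markers <- function(geno, k) {
  combos <- combn(ncol(geno), k)                     # all k-marker combinations
  n_distinct <- apply(combos, 2, function(idx)
    nrow(unique(geno[, idx, drop = FALSE])))         # distinct multilocus patterns
  list(markers  = combos[, which.max(n_distinct)],
       distinct = max(n_distinct))
}

set.seed(17)
G <- matrix(sample(c("A", "B"), 30 * 8, replace = TRUE), 30)  # 30 accessions x 8 markers
best_k_markers(G, k = 3)
```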
