Similar literature (20 results)
1.

Background  

In the analysis of microarray data one generally produces a vector of p-values that give, for each gene, the probability of obtaining equally strong evidence of change by pure chance. The distribution of these p-values is a mixture of two components corresponding to the changed genes and the unchanged ones. The focus of this article is how to estimate the proportion of unchanged genes and the false discovery rate (FDR), and how to make inferences based on these concepts. Six published methods for estimating the proportion of unchanged genes are reviewed, two alternatives are presented, and all are tested on both simulated and real data. All estimates but one make do without any parametric assumptions concerning the distributions of the p-values. Furthermore, the estimation and use of the FDR and the closely related q-value are illustrated with examples. Five published estimates of the FDR and one new estimate are presented and tested. Implementations in R code are available.
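As an illustration of the two quantities discussed in this abstract, the following minimal sketch estimates the proportion of unchanged genes and the FDR at a p-value threshold. It assumes a Storey-type estimator with a tuning parameter `lam`; the abstract does not name the methods it reviews, so this is not necessarily one of them, and the function names are hypothetical.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the proportion of unchanged genes (pi0).

    Assumption: p-values of unchanged genes are uniform on [0, 1], so
    p-values above `lam` come almost entirely from unchanged genes.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    # Cap at 1.0 because a proportion cannot exceed one.
    return min(1.0, above / (m * (1.0 - lam)))


def estimate_fdr(pvalues, threshold, pi0):
    """Estimate the FDR when all p-values <= threshold are called significant."""
    m = len(pvalues)
    rejected = sum(1 for p in pvalues if p <= threshold)
    if rejected == 0:
        return 0.0
    # Expected number of false positives divided by the number of rejections.
    return min(1.0, pi0 * m * threshold / rejected)
```

Calling `estimate_pi0` first and passing its result to `estimate_fdr` gives a plug-in FDR estimate; the choice `lam=0.5` is only a common default, not a recommendation from the abstract.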

2.
Multiple testing (MT) with false discovery rate (FDR) control has been widely conducted in the “discrete paradigm” where p-values have discrete and heterogeneous null distributions. However, in this scenario existing FDR procedures often lose some power and may yield unreliable inference, and for this scenario there does not seem to be an FDR procedure that partitions hypotheses into groups, employs data-adaptive weights and is nonasymptotically conservative. We propose a weighted p-value-based FDR procedure, “weighted FDR (wFDR) procedure” for short, for MT in the discrete paradigm that efficiently adapts to both heterogeneity and discreteness of p-value distributions. We theoretically justify the nonasymptotic conservativeness of the wFDR procedure under independence, and show via simulation studies that, for MT based on p-values of binomial test or Fisher's exact test, it is more powerful than six other procedures. The wFDR procedure is applied to two examples based on discrete data, a drug safety study, and a differential methylation study, where it makes more discoveries than two existing methods.

3.
For multiple testing based on discrete p-values, we propose a false discovery rate (FDR) procedure “BH+” with proven conservativeness. BH+ is at least as powerful as the BH (i.e., Benjamini-Hochberg) procedure when they are applied to superuniform p-values. Further, when applied to mid-p-values, BH+ can be more powerful than when it is applied to conventional p-values. An easily verifiable necessary and sufficient condition for this is provided. BH+ is perhaps the first conservative FDR procedure applicable to mid-p-values and to p-values with general distributions. It is applied to multiple testing based on discrete p-values in a methylation study, an HIV study and a clinical safety study, where it makes considerably more discoveries than the BH procedure. In addition, we propose an adaptive version of the BH+ procedure, prove its conservativeness under certain conditions, and provide evidence on its excellent performance via simulation studies.
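For reference, the baseline BH (Benjamini-Hochberg) step-up procedure that this abstract compares against can be sketched as follows. This is the standard BH rule only, not the proposed BH+ procedure, whose details the abstract does not give; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses BH rejects.

    Standard step-up rule: find the largest rank k such that the k-th
    smallest p-value is <= k * alpha / m, then reject the k smallest.
    """
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected
```

The loop deliberately does not stop at the first failing rank, because the step-up rule rejects everything up to the largest passing rank.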

4.

Background  

Thousands of genes in a genome-wide data set are tested against some null hypothesis to detect differentially expressed genes in microarray experiments. The expected proportion of false positive genes in a set of genes, called the False Discovery Rate (FDR), has been proposed to measure the statistical significance of this set. Various procedures exist for controlling the FDR. However, the threshold (generally 5%) is arbitrary and a specific measure associated with each gene would be worthwhile.

5.

Background  

The evaluation of statistical significance has become a critical process in identifying differentially expressed genes in microarray studies. Classical p-value adjustment methods for multiple comparisons such as family-wise error rate (FWER) have been found to be too conservative in analyzing large-screening microarray data, and the False Discovery Rate (FDR), the expected proportion of false positives among all positives, has been recently suggested as an alternative for controlling false positives. Several statistical approaches have been used to estimate and control FDR, but these may not provide reliable FDR estimation when applied to microarray data sets with a small number of replicates.

6.
Beyond Bonferroni: less conservative analyses for conservation genetics (cited by: 1; self-citations: 0; cited by others: 1)
Studies in conservation genetics often attempt to determine genetic differentiation between two or more temporally or geographically distinct sample collections. Pairwise p-values from Fisher’s exact tests or contingency Chi-square tests are commonly reported with a Bonferroni correction for multiple tests. While the Bonferroni correction controls the experiment-wise α, this correction is very conservative and results in greatly diminished power to detect differentiation among pairs of sample collections. An alternative is to control the false discovery rate (FDR) that provides increased power, but this method only maintains experiment-wise α when none of the pairwise comparisons are significant. Recent modifications to the FDR method provide a moderate approach to determining significance level. Simulations reveal that critical values of multiple comparison tests with both the Bonferroni method and a modified FDR method approach a minimum asymptote very near zero as the number of tests gets large, but the Bonferroni method approaches zero much more rapidly than the modified FDR method. I compared pairwise significance from three published studies using three critical values corresponding to Bonferroni, FDR, and modified FDR methods. Results suggest that the modified FDR method may provide the most biologically important critical value for evaluating significance of population differentiation in conservation genetics. Ultimately, more thorough reporting of statistical significance is needed to allow interpretation of biological significance of genetic differentiation among populations. An erratum to this article can be found at
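The different rates at which the critical values shrink can be illustrated with a short sketch. It assumes that the "modified FDR method" is the Benjamini-Yekutieli correction, whose single critical value is α divided by the harmonic sum 1 + 1/2 + ... + 1/k; the abstract does not state the formula, so treat this as an assumption. Function names are hypothetical.

```python
def bonferroni_critical(alpha, k):
    """Bonferroni critical value for k tests: alpha / k."""
    return alpha / k


def by_critical(alpha, k):
    """Assumed 'modified FDR' critical value: alpha / (1 + 1/2 + ... + 1/k)."""
    harmonic = sum(1.0 / i for i in range(1, k + 1))
    return alpha / harmonic


if __name__ == "__main__":
    # Compare how fast each critical value approaches zero as k grows.
    for k in (1, 10, 100, 1000):
        print(k, bonferroni_critical(0.05, k), by_critical(0.05, k))
```

Because the harmonic sum grows only logarithmically in k, the second critical value decreases far more slowly than α/k, which matches the qualitative behaviour described in the abstract.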

7.

Background  

In microarray studies researchers are often interested in the comparison of relevant quantities between two or more similar experiments, involving different treatments, tissues, or species. Typically each experiment reports measures of significance (e.g. p-values) or other measures that rank its features (e.g. genes). Our objective is to find a list of features that are significant in all experiments, to be further investigated. In this paper we present an R package called sdef that allows the user to quantify the evidence of communality between the experiments using previously proposed statistical methods based on the ranked lists of p-values. sdef implements two approaches that address this objective. The first is a permutation test of the maximal ratio of observed to expected common features under the hypothesis of independence between the experiments. The second approach, set in a Bayesian framework, is more flexible as it takes into account the uncertainty in the number of genes differentially expressed in each experiment.

8.
Summary: In a microarray experiment, one experimental design is used to obtain expression measures for all genes. One popular analysis method involves fitting the same linear mixed model for each gene, obtaining gene‐specific p‐values for tests of interest involving fixed effects, and then choosing a threshold for significance that is intended to control false discovery rate (FDR) at a desired level. When one or more random factors have zero variance components for some genes, the standard practice of fitting the same full linear mixed model for all genes can result in failure to control FDR. We propose a new method that combines results from the fit of full and selected linear mixed models to identify differentially expressed genes and provide FDR control at target levels when the true underlying random effects structure varies across genes.

9.
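When the difference in gene expression between two groups of samples is small, or the sample size is small, the usual false discovery rate (FDR) control levels (e.g., 5% or 10%) may fail to identify enough differentially expressed genes for subsequent functional enrichment analysis. However, functional enrichment analysis is fairly robust to false discoveries among the differentially expressed genes. Therefore, identifying differentially expressed genes at a less stringent FDR control level (i.e., allowing a higher FDR) may still reliably reveal disease-related functions. This paper analyzed five gene expression profile datasets on breast cancer metastasis. Using the three datasets with stronger differential expression signals, we showed that the functional enrichment results remain highly robust even when the FDR of the differentially expressed genes reaches 25%. Then, in the other two datasets with weak differential expression signals, we selected differentially expressed genes at an FDR control level of 25% for functional enrichment analysis and compared the results with those of the three datasets above. The results show that selecting differentially expressed genes at a less stringent FDR control level can still reliably identify functions related to breast cancer metastasis. The analysis also suggests that, during breast cancer metastasis, some broadly defined biological processes (such as cell division, the cell cycle, and DNA replication) are perturbed as a whole, reflecting that breast cancer metastasis is a systemic disease involving widespread changes in gene expression.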

10.

Background

The q-value is a widely used statistical method for estimating the false discovery rate (FDR), which is a conventional significance measure in the analysis of genome-wide expression data. The q-value is a random variable and it may underestimate the FDR in practice. An underestimated FDR can lead to unexpected false discoveries in the follow-up validation experiments. This issue has not been well addressed in the literature, especially in the situation when a permutation procedure is necessary for p-value calculation.

Results

We proposed a statistical method for the conservative adjustment of q-value. In practice, it is usually necessary to calculate p-value by a permutation procedure. This was also considered in our adjustment method. We used simulation data as well as experimental microarray or sequencing data to illustrate the usefulness of our method.

Conclusions

The conservativeness of our approach has been mathematically confirmed in this study. We have demonstrated the importance of conservative adjustment of q-value, particularly in the situation that the proportion of differentially expressed genes is small or the overall differential expression signal is weak.
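To make the object of this abstract concrete, the sketch below computes ordinary q-values from a list of p-values. It shows the standard unadjusted q-value only, under the assumption that `pi0` (the proportion of true nulls) is supplied by the caller; it does not implement the conservative adjustment proposed by the authors, which the abstract does not specify. The function name is hypothetical.

```python
def q_values(pvalues, pi0=1.0):
    """Compute standard q-values for a list of p-values.

    The q-value of the i-th smallest p-value is the minimum, over all
    ranks j >= i, of pi0 * m * p_(j) / j. With pi0 = 1.0 this equals the
    Benjamini-Hochberg adjusted p-value.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the minimum is cumulative.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        q[idx] = running_min
    return q
```

The results are returned in the original input order, so each q-value lines up with the gene that produced the corresponding p-value.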

11.
The ordinary-, penalized-, and bootstrap t-test, least squares and best linear unbiased prediction were compared for their false discovery rates (FDR), i.e. the fraction of falsely discovered genes, which was empirically estimated in a duplicate of the data set. The bootstrap-t-test yielded up to 80% lower FDRs than the alternative statistics, and its FDR was always as good as or better than any of the alternatives. Generally, the predicted FDR from the bootstrapped P-values agreed well with their empirical estimates, except when the number of mRNA samples is smaller than 16. In a cancer data set, the bootstrap-t-test discovered 200 differentially regulated genes at an FDR of 2.6%, and in a knock-out gene expression experiment 10 genes were discovered at an FDR of 3.2%. It is argued that, in the case of microarray data, control of the FDR takes sufficient account of the multiple testing, whilst being less stringent than Bonferroni-type multiple testing corrections. Extensions of the bootstrap simulations to more complicated test-statistics are discussed.

12.
13.

Background  

Before conducting a microarray experiment, one important issue that needs to be determined is the number of arrays required in order to have adequate power to identify differentially expressed genes. This paper discusses some crucial issues in the problem formulation, parameter specifications, and approaches that are commonly proposed for sample size estimation in microarray experiments. Common methods for sample size estimation are formulated as the minimum sample size necessary to achieve a specified sensitivity (proportion of detected truly differentially expressed genes) on average at a specified false discovery rate (FDR) level and specified expected proportion (π1) of the truly differentially expressed genes in the array. Unfortunately, the probability of achieving the specified sensitivity in such a formulation can be low. We formulate the sample size problem as the number of arrays needed to achieve a specified sensitivity with 95% probability at the specified significance level. A permutation method using a small pilot dataset to estimate sample size is proposed. This method accounts for correlation and effect size heterogeneity among genes.

14.

Background  

Many studies have provided algorithms or methods to assess statistical significance in quantitative proteomics when multiple replicates for a protein sample and an LC/MS analysis are available. However, confidence is still lacking in using datasets for a biological interpretation without protein sample replicates. Although a fold-change is a conventional threshold that can be used when there are no sample replicates, it does not provide an assessment of statistical significance such as a false discovery rate (FDR), which is an important indicator of the reliability of identifying differentially expressed proteins. In this work, we investigate whether differentially expressed proteins can be detected with statistical significance from a pair of unlabeled protein samples without replicates and with only duplicate LC/MS injections per sample. An FDR is used to gauge the statistical significance of the differentially expressed proteins.

15.
Genome-wide association studies (GWAS) have identified thousands of genetic variants that are associated with complex traits. However, a stringent significance threshold is required to identify robust genetic associations. Leveraging relevant auxiliary covariates has the potential to boost statistical power to exceed the significance threshold. Particularly, abundant pleiotropy and the non-random distribution of SNPs across various functional categories suggest that leveraging GWAS test statistics from related traits and/or functional genomic data may boost GWAS discovery. While type 1 error rate control has become standard in GWAS, control of the false discovery rate can be a more powerful approach. The conditional false discovery rate (cFDR) extends the standard FDR framework by conditioning on auxiliary data to call significant associations, but current implementations are restricted to auxiliary data satisfying specific parametric distributions, typically GWAS p-values for related traits. We relax these distributional assumptions, enabling an extension of the cFDR framework that supports auxiliary covariates from arbitrary continuous distributions (“Flexible cFDR”). Our method can be applied iteratively, thereby supporting multi-dimensional covariate data. Through simulations we show that Flexible cFDR increases sensitivity whilst controlling FDR after one or several iterations. We further demonstrate its practical potential through application to an asthma GWAS, leveraging various functional genomic data to find additional genetic associations for asthma, which we validate in the larger, independent, UK Biobank data resource.

16.
False discovery rate, sensitivity and sample size for microarray studies (cited by: 10; self-citations: 0; cited by others: 10)
MOTIVATION: In microarray data studies most researchers are keenly aware of the potentially high rate of false positives and the need to control it. One key statistical shift is the move away from the well-known P-value to false discovery rate (FDR). Perhaps less discussion has been devoted to the sensitivity or the associated false negative rate (FNR). The purpose of this paper is to explain in simple ways why the shift from P-value to FDR for statistical assessment of microarray data is necessary, to elucidate the determining factors of FDR and, for a two-sample comparative study, to discuss its control via sample size at the design stage. RESULTS: We use a mixture model, involving differentially expressed (DE) and non-DE genes, that captures the most common problem of finding DE genes. Factors determining FDR are (1) the proportion of truly differentially expressed genes, (2) the distribution of the true differences, (3) measurement variability and (4) sample size. Many current small microarray studies are plagued with large FDR, but controlling FDR alone can lead to unacceptably large FNR. In evaluating a design of a microarray study, sensitivity or FNR curves should be computed routinely together with FDR curves. Under certain assumptions, the FDR and FNR curves coincide, thus simplifying the choice of sample size for controlling the FDR and FNR jointly.
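How the listed factors drive the FDR can be seen in a small sketch based on the usual two-component mixture approximation, FDR ≈ π0·α / (π0·α + (1 − π0)·power). The formula and the names below are assumptions for illustration, not taken from the paper: `pi0` is the proportion of non-DE genes, `alpha` the per-gene significance level, and `power` the average sensitivity, which in turn depends on effect size, measurement variability and sample size.

```python
def expected_fdr(pi0, alpha, power):
    """Approximate FDR under a two-component mixture of non-DE and DE genes.

    pi0   : proportion of non-DE genes (assumed known here)
    alpha : per-gene type I error rate
    power : average probability of detecting a truly DE gene
    """
    false_pos = pi0 * alpha            # expected share of false positives
    true_pos = (1.0 - pi0) * power     # expected share of true positives
    total = false_pos + true_pos
    if total == 0.0:
        return 0.0
    return false_pos / total


if __name__ == "__main__":
    # Lower power (e.g. from a smaller sample size) inflates the FDR
    # at the same per-gene alpha.
    for power in (0.9, 0.5, 0.2):
        print(power, expected_fdr(pi0=0.9, alpha=0.05, power=power))
```

The sketch makes the abstract's point visible: with few truly DE genes and low power, a conventional per-gene significance level can correspond to a large FDR.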

17.
Estimating the false discovery rate using nonparametric deconvolution (cited by: 1; self-citations: 0; cited by others: 1)
van de Wiel MA, Kim KI. Biometrics, 2007, 63(3): 806-815
Given a set of microarray data, the problem is to detect differentially expressed genes, using a false discovery rate (FDR) criterion. As opposed to common procedures in the literature, we do not base the selection criterion on statistical significance only, but also on the effect size. Therefore, we select only those genes that are significantly more differentially expressed than some f-fold (e.g., f = 2). This corresponds to use of an interval null domain for the effect size. Based on a simple error model, we discuss a naive estimator for the FDR, interpreted as the probability that the parameter of interest lies in the null-domain (e.g., μ < log₂(2) = 1) given that the test statistic exceeds a threshold. We improve the naive estimator by using deconvolution. That is, the density of the parameter of interest is recovered from the data. We study performance of the methods using simulations and real data.

18.
Zhao J, Boerwinkle E, Xiong M. Human genetics, 2007, 121(3-4): 357-367
Availability of a large collection of single nucleotide polymorphisms (SNPs) and efficient genotyping methods enables the extension of linkage and association studies for complex diseases from small genomic regions to the whole genome. Establishing global significance for linkage or association requires small P-values of the test. The original TDT statistic compares the difference in linear functions of the number of transmitted and nontransmitted alleles or haplotypes. In this report, we introduce a novel TDT statistic, which uses Shannon entropy as a nonlinear transformation of the frequencies of the transmitted or nontransmitted alleles (or haplotypes), to amplify the difference in the number of transmitted and nontransmitted alleles or haplotypes in order to increase statistical power with a large number of marker loci. The null distribution of the entropy-based TDT statistic and the type I error rates in both homogeneous and admixture populations are validated using a series of simulation studies. By analytical methods, we show that the power of the entropy-based TDT statistic is higher than the original TDT, and this difference increases with the number of marker loci. Finally, the new entropy-based TDT statistic is applied to two real data sets to test the association of the RET gene with Hirschsprung disease and the Fcγ receptor genes with systemic lupus erythematosus. Results show that the entropy-based TDT statistic can reach p-values that are small enough to establish genome-wide linkage or association analyses.

19.
In MS‐based quantitative proteomics, the FDR control (i.e. the limitation of the number of proteins that are wrongly claimed as differentially abundant between several conditions) is a major postanalysis step. It is classically achieved thanks to a specific statistical procedure that computes the adjusted p‐values of the putative differentially abundant proteins. Unfortunately, such adjustment is conservative only if the p‐values are well‐calibrated; the false discovery control being spuriously underestimated otherwise. However, well‐calibration is a property that can be violated in some practical cases. To overcome this limitation, we propose a graphical method to straightforwardly and visually assess the p‐value well‐calibration, as well as the R codes to embed it in any pipeline. All MS data have been deposited in the ProteomeXchange with identifier PXD002370 ( http://proteomecentral.proteomexchange.org/dataset/PXD002370 ).

20.
Li MX, Yeung JM, Cherny SS, Sham PC. Human genetics, 2012, 131(5): 747-756
Current genome-wide association studies (GWAS) use commercial genotyping microarrays that can assay over a million single nucleotide polymorphisms (SNPs). The number of SNPs is further boosted by advanced statistical genotype-imputation algorithms and large SNP databases for reference human populations. The testing of a huge number of SNPs needs to be taken into account in the interpretation of statistical significance in such genome-wide studies, but this is complicated by the non-independence of SNPs because of linkage disequilibrium (LD). Several previous groups have proposed the use of the effective number of independent markers (M_e) for the adjustment of multiple testing, but current methods of calculation for M_e are limited in accuracy or computational speed. Here, we report a more robust and fast method to calculate M_e. Applying this efficient method [implemented in a free software tool named Genetic type 1 error calculator (GEC)], we systematically examined the M_e, and the corresponding p-value thresholds required to control the genome-wide type 1 error rate at 0.05, for 13 Illumina or Affymetrix genotyping arrays, as well as for HapMap Project and 1000 Genomes Project datasets which are widely used in genotype imputation as reference panels. Our results suggested the use of a p-value threshold of ~10⁻⁷ as the criterion for genome-wide significance for early commercial genotyping arrays, but slightly more stringent p-value thresholds ~5 × 10⁻⁸ for current or merged commercial genotyping arrays, ~10⁻⁸ for all common SNPs in the 1000 Genomes Project dataset and ~5 × 10⁻⁸ for the common SNPs only within genes.
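The link between an effective number of independent markers and a significance threshold can be sketched with simple arithmetic. The example assumes a Bonferroni-style rule, threshold = α / M_e, which is a common way to use M_e but is not confirmed by the abstract as the exact rule in GEC; the M_e values below are illustrative numbers, not results from the paper.

```python
def genome_wide_threshold(m_e, alpha=0.05):
    """P-value threshold controlling the genome-wide type 1 error rate.

    Assumption: a Bonferroni-style correction on the effective number of
    independent markers, threshold = alpha / M_e.
    """
    if m_e <= 0:
        raise ValueError("effective number of markers must be positive")
    return alpha / m_e


if __name__ == "__main__":
    # Hypothetical M_e values chosen only to show the order of magnitude.
    for m_e in (5e5, 1e6, 5e6):
        print(m_e, genome_wide_threshold(m_e))
```

Under this assumed rule, an effective number of about one million independent markers corresponds to the familiar threshold of roughly 5 × 10⁻⁸.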
