Similar Articles
20 similar articles found
1.
MOTIVATION: Statistical tests for the detection of differentially expressed genes lead to a large collection of p-values, one for each gene comparison. Without further adjustment, these p-values may yield a large number of false positives, simply because the number of genes tested is huge, which can waste laboratory resources. To account for multiple hypotheses, the p-values are typically adjusted using a single-step or step-down method to achieve overall control of the error rate (the so-called familywise error rate). In many applications this strategy is overly conservative, and too few genes are flagged. RESULTS: In this paper we introduce a novel empirical Bayes screening (EBS) technique to inspect a large number of p-values in an effort to detect additional positive cases. In effect, each case borrows strength from an overall picture of the alternative hypotheses computed from all the p-values, while the entire procedure is calibrated by a step-down method so that the familywise error rate at the complete null hypothesis is still controlled. The EBS has substantially higher sensitivity than the standard step-down approach for multiple comparison, at the cost of a modest increase in the false discovery rate (FDR), and also compares favorably with existing FDR-controlling procedures for multiple testing. The EBS procedure is particularly useful when it is important to identify all potentially positive cases, which can then be subjected to further confirmatory testing to eliminate the false positives. We illustrate the screening procedure on a human colorectal cancer data set, where the EBS method detects additional colon cancer-related genes missed by other methods. This empirical Bayes procedure improves on our earlier empirical Bayes adjustments for three reasons: (i) it offers automatic screening of the p-values obtained from any univariate (i.e., gene-by-gene) analysis package, making it easy for a non-statistician to use; (ii) because it applies to p-values, the tests need not be t-tests; in particular, they could be F-tests arising in ANOVA formulations of expression data, or even nonparametric tests; (iii) the empirical Bayes adjustment uses nonparametric function estimation to estimate the marginal density of the transformed p-values rather than a parametric model for the prior distribution, and is therefore robust against model mis-specification. AVAILABILITY: R code for EBS is available from the authors upon request. SUPPLEMENTARY INFORMATION: http://www.stat.uga.edu/~datta/EBS/supp.htm
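The abstract does not spell out the EBS algorithm, but the empirical Bayes idea it describes (transform the p-values, estimate the marginal density of the transformed values nonparametrically, and let each gene borrow strength from the overall signal) can be sketched along the lines of an Efron-style local-FDR computation. This is an illustrative reading, not the authors' exact procedure; all names are hypothetical.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def eb_screen(pvals, pi0=None):
    """Empirical Bayes screening sketch: transform p-values to z-scores,
    estimate their marginal density with a kernel estimator, and return
    an estimated posterior probability that each case is a true signal."""
    p = np.clip(np.asarray(pvals), 1e-12, 1 - 1e-12)
    z = norm.ppf(1.0 - p)                 # transformed p-values
    f = gaussian_kde(z)(z)                # nonparametric marginal density
    f0 = norm.pdf(z)                      # theoretical null density
    if pi0 is None:                       # crude bound on the null proportion
        pi0 = min(1.0, 2.0 * np.mean(p > 0.5))
    local_fdr = np.clip(pi0 * f0 / f, 0.0, 1.0)
    return 1.0 - local_fdr                # posterior evidence for the alternative
```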

2.
Unsequenced bacterial strains can be characterized by comparing their genomic DNA to a sequenced reference genome of the same species. This comparative genomic approach, also called genomotyping, is leading to an increased understanding of bacterial evolution and pathogenesis. It is efficiently accomplished by comparative genomic hybridization on custom-designed cDNA microarrays. The microarray experiment yields fluorescence intensities for the reference and sample genome for each gene. The log-ratio of these intensities is usually compared to a cut-off, classifying each gene of the sample genome as a candidate for an absent or present gene with respect to the reference genome. Reducing the usually high rate of false positives in the list of candidate absent genes is decisive for both the time and cost of the experiment. We propose a novel method to improve the efficiency of genomotyping experiments in this sense, by rotating the normalized intensity data before setting up the list of candidate genes. We analyze simulated genomotyping data and also re-analyze an experimental data set for comparison and illustration. We approximately halve the proportion of false positives in the list of candidate absent genes for the example comparative genomic hybridization experiment as well as for the simulation experiments.
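The abstract does not give the rotation itself; one plausible reading, sketched below under that assumption, is to rotate the two-dimensional cloud of normalized log-intensities onto its principal axes, so the absence cut-off is applied orthogonally to the main "present gene" trend. The function name and the PCA-based choice of rotation are illustrative, not the paper's exact construction.

```python
import numpy as np

def rotate_intensities(log_ref, log_sam):
    """Rotate the 2-D cloud of normalized log-intensities (reference vs.
    sample) onto its principal axes, so the diagonal of present genes
    becomes one axis and the cut-off is applied along the other."""
    X = np.column_stack([log_ref, log_sam])
    Xc = X - X.mean(axis=0)
    # eigenvectors of the 2x2 covariance matrix define the rotation
    _, vecs = np.linalg.eigh(np.cov(Xc.T))
    return Xc @ vecs[:, ::-1]  # column 0: main axis, column 1: residual

# genes whose rotated residual coordinate falls below a cut-off
# become candidates for absent genes
```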

3.
We present a wrapper-based approach to estimate and control the false discovery rate for peptide identifications using the outputs from multiple commercially available MS/MS search engines. The approach can combine output from multiple search engines with sequence- and spectrum-derived features in a flexible classification model to produce a score associated with correct peptide identifications. The classification model score from a reversed-database search is taken as the null distribution for estimating p-values and false discovery rates using a simple and established statistical procedure. Results from 10 analyses of rat sera on an LTQ-FT mass spectrometer indicate that the method is well calibrated for controlling the proportion of false positives in a set of reported peptide identifications, while correctly identifying more peptides than rule-based methods using one search engine alone.

4.
Controlling the proportion of false positives in multiple dependent tests
Genome scan mapping experiments involve multiple tests of significance, so controlling the error rate in such experiments is important. Simple extension of classical concepts leads to attempts to control the genomewise error rate (GWER), i.e., the probability of even a single false positive among all tests. This results in very stringent comparisonwise error rates (CWER) and, consequently, low experimental power. Here we present an approach based on controlling the proportion of false positives (PFP) among all positive test results. Unlike in other approaches, the CWER needed to attain a desired PFP level does not depend on the correlation among the tests or on the number of tests. Estimating the PFP requires an estimate of the proportion of true null hypotheses; we show how this can be estimated directly from experimental results. The PFP approach is similar to the false discovery rate (FDR) and positive false discovery rate (pFDR) approaches. For a fixed CWER, we have estimated PFP, FDR, pFDR, and GWER through simulation under a variety of models to illustrate practical and philosophical similarities and differences among the methods.
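Using only the definitions in the abstract (PFP = E(V)/E(R), with E(V) approximated by m·π0·α at comparisonwise level α and E(R) by the observed rejection count), the PFP can be estimated directly from the p-values. The λ-based estimator of the proportion of true nulls shown here is a common choice, not necessarily the one used in the paper.

```python
import numpy as np

def estimate_pfp(pvals, alpha, lam=0.5):
    """Estimate the proportion of false positives among rejections at
    comparisonwise error rate alpha: PFP = E(V)/E(R)."""
    p = np.asarray(pvals)
    m = p.size
    # proportion of true nulls from the flat right tail of the p-values
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    R = max(1, int(np.sum(p <= alpha)))   # observed number of rejections
    return min(1.0, m * pi0 * alpha / R)  # expected false positives / R
```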

5.
Hao W, Golding GB. Gene 2008, 421(1-2):27-31
Methods for assessing gene presence and absence have been widely used to study bacterial genome evolution. A recent report by Zhaxybayeva et al. [Zhaxybayeva, O., Nesbo, C. L., and Doolittle, W. F., 2007. Systematic overestimation of gene gain through false diagnosis of gene absence. Genome Biol. 8, 402] suggests that false diagnosis of gene absence, or the presence of undetected truncated genes, leads to a systematic overestimation of gene gain. Here we argue (1) that these annotation errors can cause more complicated effects and are not necessarily systematic, (2) that current annotations (supplemented with BLAST searches) are the best way to consistently score gene presence/absence, and (3) that genome-wide estimates of gene gain/loss are not strongly affected by small differences in gene annotation, although the number of related gene families is strongly affected. We estimated the rates of gene insertion/deletion using a variety of cutoff thresholds and match lengths as a way to alter the recognition of genes and gene fragments. The results reveal that different cutoffs for match length cause only a small variation in the estimated insertion/deletion rates, and the rates on recent branches remain relatively high regardless of the match-length thresholds. Lastly, (4) the dynamic process of gene truncation needs to be further considered in genome comparison studies. The data presented suggest that gene truncation takes place preferentially in recently transferred genes, which supports a fast turnover of recently laterally transferred genes. The presence of truncated genes or false diagnosis of gene absence therefore does not significantly affect the estimation of gene insertion/deletion rates, but several other factors bias the results toward an under-estimation of the rate of gene insertion/deletion. All of these factors need to be considered.

6.
A Bayesian model-based clustering approach is proposed for identifying differentially expressed genes in meta-analysis. A Bayesian hierarchical model is used as a scientific tool for combining information from different studies, and a mixture prior is used to separate differentially expressed genes from non-differentially expressed genes. Posterior estimation of the parameters and missing observations is done using a simple Markov chain Monte Carlo method. From the estimated mixture model, useful significance measures such as the Bayesian false discovery rate (FDR), the local FDR (Efron et al., 2001), and the integration-driven discovery rate (IDR; Choi et al., 2003) can easily be computed. The model-based approach is also compared with commonly used permutation methods, and is shown to be superior when there is an excess of under-expressed genes relative to over-expressed genes, or vice versa. The proposed method is applied to four publicly available prostate cancer gene expression data sets and to simulated data sets.
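Given posterior probabilities of differential expression from the fitted mixture model, the Bayesian FDR of a reporting list is simply the average posterior probability of being null among the reported genes. A minimal sketch, assuming the posteriors are already available as an array; the function name is illustrative.

```python
import numpy as np

def bayesian_fdr(post_de, threshold):
    """Report genes whose posterior probability of differential
    expression meets the threshold, and estimate the Bayesian FDR of
    that list as the mean posterior probability of being null."""
    post_de = np.asarray(post_de)
    reported = post_de >= threshold
    if not reported.any():
        return 0.0, reported
    return float(np.mean(1.0 - post_de[reported])), reported
```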

7.
Tsai CA, Hsueh HM, Chen JJ. Biometrics 2003, 59(4):1071-1081
Testing for significance with gene expression data from DNA microarray experiments involves simultaneous comparisons of hundreds or thousands of genes. If R denotes the number of rejections (declared significant genes) and V denotes the number of false rejections, then V/R, if R > 0, is the proportion of falsely rejected hypotheses. This paper proposes a model for the distribution of the number of rejections and for the conditional distribution of V given R. Under the independence assumption, the distribution of R is a convolution of two binomials, and the distribution of V given R is noncentral hypergeometric. Under an equicorrelated model, the distributions are more complex and are also derived. Five false discovery rate probability error measures are considered: FDR = E(V/R), pFDR = E(V/R | R > 0) (positive FDR), cFDR = E(V/R | R = r) (conditional FDR), mFDR = E(V)/E(R) (marginal FDR), and eFDR = E(V)/r (empirical FDR). The pFDR, cFDR, and mFDR are shown to be equivalent under the Bayesian framework, in which the number of true null hypotheses is modeled as a random variable. We present a parametric and a bootstrap procedure to estimate the FDRs. Monte Carlo simulations were conducted to evaluate the performance of these two methods. The bootstrap procedure appears to perform reasonably well, even when the alternative hypotheses are correlated (rho = 0.25). An example from a toxicogenomic microarray experiment is presented for illustration.
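A small Monte Carlo check of these definitions under the independence model is straightforward; the sketch below estimates FDR = E(V/R) (with V/R set to 0 when R = 0), pFDR = E(V/R | R > 0), and mFDR = E(V)/E(R) for independent one-sided z-tests. Parameter values are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_measures(m=1000, m0=900, mu=3.0, alpha=0.05, reps=2000):
    """Monte Carlo estimates of FDR, pFDR and mFDR for m independent
    one-sided z-tests: the first m0 are true nulls, the rest shifted by mu."""
    vr, vr_pos, V_tot, R_tot = [], [], 0, 0
    for _ in range(reps):
        z = rng.normal(size=m)
        z[m0:] += mu                        # true alternatives
        reject = norm.sf(z) <= alpha        # one-sided p-values vs alpha
        R, V = int(reject.sum()), int(reject[:m0].sum())
        vr.append(V / R if R > 0 else 0.0)
        if R > 0:
            vr_pos.append(V / R)
        V_tot, R_tot = V_tot + V, R_tot + R
    return np.mean(vr), np.mean(vr_pos), V_tot / R_tot
```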

8.
MOTIVATION: Microarrays have been widely used to discover novel disease-related genes. Some types of microarray, such as cDNA arrays, usually contain a considerable portion of missing values. When missing value imputation and gene prioritization are conducted sequentially, it is necessary to consider the distribution of prioritization scores induced by the missing values. We propose an ensemble approach to address this issue: a bootstrap procedure generates a resampled multivariate distribution of the prioritization scores, from which the expected prioritization scores are obtained. RESULTS: We used a published two-sample microarray data set to illustrate our approach. We focused on the following issues after missing value imputation: (i) concordance of gene prioritization and (ii) control of true and false positives. We compared our approach with the traditional non-ensemble approach to missing value imputation, and also evaluated the performance of the non-imputation approach when the theoretical test distribution was available. The results showed that the ensemble imputation approach clearly improved the concordance of gene prioritization and the control of true/false positives, especially for sample sizes of about 5-10 per group and missing rates of about 10-20%, a common situation in cDNA microarray studies. AVAILABILITY: The Matlab codes are freely available at http://home.gwu.edu/~ylai/research/Missing.
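A minimal sketch of the ensemble idea, assuming NaN-coded missing values, gene-wise mean imputation as a stand-in for the study's imputation method, and an absolute t statistic as the prioritization score; the function name and these choices are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def ensemble_scores(X, Y, B=100):
    """Ensemble imputation sketch: on each bootstrap resample of the
    columns (samples), impute NaN entries with the gene-wise resample
    mean, score genes by |t|, and average the scores over resamples."""
    scores = np.zeros(X.shape[0])
    for _ in range(B):
        Xb = X[:, rng.integers(0, X.shape[1], X.shape[1])]
        Yb = Y[:, rng.integers(0, Y.shape[1], Y.shape[1])]
        for M in (Xb, Yb):                     # simple mean imputation
            means = np.nanmean(M, axis=1)
            idx = np.where(np.isnan(M))
            M[idx] = means[idx[0]]
        scores += np.abs(ttest_ind(Xb, Yb, axis=1).statistic)
    return scores / B                          # expected prioritization score
```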

9.
In the search for genes associated with disease, statistical analysis is key to reproducible results. To avoid a plethora of type I errors, classical gene selection procedures strike a balance between the magnitude and precision of observed effects in terms of p-values. Protecting the false discovery rate recovers some power but still ranks genes according to classical p-values. In contrast, we propose a selection procedure driven by the concern to detect well-specified, important alternatives. By summarizing evidence from the perspective of both the null and such an alternative hypothesis, genes line up in a substantially different order, with different genes yielding powerful signals. Our gene selection is determined by a cutoff point for a measure of relative evidence that balances the standard p-value, p0, against its counterpart, p1, derived from the perspective of the target alternative. We find the cutoff point that maximizes an expected specific gain. This yields an optimal decision which exploits gene-specific variances and thus allows different type I and type II errors across genes. We show the dramatic impact of this alternative perspective on the detection of differentially expressed genes in hereditary breast cancer. Our analysis does not rely on parametric assumptions about the data.
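In the simplest normal-model reading of this approach (an assumption made for illustration; the paper itself is nonparametric), p0 is the usual p-value computed under the null, p1 its counterpart computed under a target alternative shifted by delta, and genes are ranked by balancing the two.

```python
import numpy as np
from scipy.stats import norm

def evidence_pair(z, delta=3.0):
    """For each gene's z statistic, return the classical p-value p0
    (computed under the null) and its counterpart p1 (computed under a
    target alternative with mean delta); a relative-evidence ranking
    can then balance the two, e.g. by sorting on p1 / p0."""
    z = np.asarray(z)
    p0 = norm.sf(z)              # evidence against the null
    p1 = norm.cdf(z, loc=delta)  # evidence against the target alternative
    return p0, p1
```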

10.
11.
12.
False positive control and estimation in peptide identification by MS is of critical importance for reliable inference at the protein level and for downstream bioinformatics analysis. Approaches based on searching against decoy databases have become popular for their conceptual simplicity and easy implementation. Although various decoy search strategies have been proposed, few studies have investigated their differences in performance. With datasets collected on a mixture of model proteins, we demonstrate that a single search against the target database coupled with its reversed version offers a good balance between performance and simplicity. In particular, both the accuracy of the estimate of the number of false positives and the sensitivity are at least comparable to the other procedures examined in this study. It is also shown that scrambling while preserving the frequency of amino acid words can potentially improve the accuracy of the false positive estimate, though more studies are needed to investigate the optimal scrambling procedure for specific conditions and the variation of the estimate across repeated scramblings.
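The target-decoy estimate that the study evaluates has a compact form: at any score threshold, the number of hits to the reversed database estimates the number of incorrect hits to the target database. A minimal sketch (function name illustrative):

```python
import numpy as np

def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate the proportion of false positives among target-database
    peptide identifications scoring at or above a threshold, using hits
    to the reversed (decoy) database as the false-hit estimate."""
    n_target = int(np.sum(np.asarray(target_scores) >= threshold))
    n_decoy = int(np.sum(np.asarray(decoy_scores) >= threshold))
    return n_decoy / max(n_target, 1)
```

Sweeping the threshold over the observed scores then gives the score cutoff needed to reach any chosen false positive level.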

13.
In this article, we consider the probabilistic identification of amino acid positions that evolve under positive selection as a multiple hypothesis testing problem. The null hypothesis "H0,s: site s evolves under negative selection or under a neutral process of evolution" is tested at each codon site of the alignment of homologous coding sequences. Standard hypothesis testing is based on controlling the expected proportion of falsely rejected null hypotheses, or type-I error rate. As the number of tests increases, however, the power of an individual test may become unacceptably low. Recent advances in statistics have shown that the false discovery rate (in this case, the expected proportion of sites that do not evolve under positive selection among those estimated to evolve under this selection regime) is a quantity that can be controlled. Keeping the proportion of false positives low among the significant results generally leads to an increase in power. In this article, we show that controlling the false discovery rate is relevant when searching for positively selected sites. We also compare this new approach to traditional methods using extensive simulations.
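The abstract does not name the specific FDR-controlling procedure; the canonical choice for this setting is the Benjamini-Hochberg step-up rule, sketched here over per-site p-values (the authors may use a variant).

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: reject the per-site null
    'site s is not under positive selection' while controlling the
    false discovery rate at level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # step-up boundaries
    below = p[order] <= thresh
    k = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True                    # reject the k smallest
    return rejected
```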

14.
Improving false discovery rate estimation
MOTIVATION: Recent attempts to account for multiple testing in the analysis of microarray data have focused on controlling the false discovery rate (FDR). However, rigorous control of the FDR at a preselected level is often impractical. Consequently, it has been suggested that the q-value be used as an estimate of the proportion of false discoveries among a set of significant findings. However, such an interpretation of the q-value may be unwarranted, considering that the q-value is based on an unstable estimator of the positive FDR (pFDR). Another method proposes estimating the FDR by modeling p-values as arising from a beta-uniform mixture (BUM) distribution. Unfortunately, the BUM approach is reliable only in settings where the assumed model accurately represents the actual distribution of p-values. METHODS: A method called the spacings LOESS histogram (SPLOSH) is proposed for estimating the conditional FDR (cFDR), the expected proportion of false positives conditioned on having k 'significant' findings. SPLOSH is designed to be more stable than the q-value and applicable in a wider variety of settings than BUM. RESULTS: In a simulation study and a data analysis example, SPLOSH exhibits the desired characteristics relative to the q-value and BUM. AVAILABILITY: The Web site www.stjuderesearch.org/statistics/splosh.html has links to freely available S-plus code implementing the proposed procedure.
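For context, the q-value estimator that SPLOSH is compared against can be computed in a few lines. This is a standard Storey-style sketch, not the SPLOSH algorithm itself, whose details are not given in the abstract; pi0 = 1 is the conservative default.

```python
import numpy as np

def q_values(pvals, pi0=1.0):
    """Storey-style q-values: q(i) is the minimum estimated pFDR over
    all rejection regions that include p-value i."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    q = pi0 * m * p[order] / np.arange(1, m + 1)   # pFDR at each cutoff
    q = np.minimum.accumulate(q[::-1])[::-1]       # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out
```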

15.
Broberg P. Genome Biology 2002, 3(9):preprint00-23

Background: In the pharmaceutical industry and in academia, substantial efforts are made to make the best use of the promising microarray technology. The data generated by microarrays are more complex than most other biological data currently attracting attention. A method for finding an optimal test statistic with which to rank genes with respect to differential expression is outlined and tested. At the heart of the method lies an estimate of the false negative and false positive rates. Both investing in false positives and missing true positives waste resources, and the procedure sets out to minimise these errors. The false positive and negative rates are calculated by a simulation procedure.

16.
In collision-induced dissociation (CID) of peptides, rearrangement processes can take place that appear to permute/scramble the original primary structure, which may in principle adversely affect peptide identification. Here, an analysis of sequence permutation in tandem mass spectra is presented for a previously published proteomics study on P. aeruginosa (Scherl et al., J. Am. Soc. Mass Spectrom. 2008, 19, 891) conducted using an LTQ-Orbitrap. Overall, 4878 precursor ions are matched by considering the accurate mass (i.e., <5 ppm) of the precursor ion and at least one fragment ion that confirms the sequence. The peptides are then grouped into higher- and lower-confidence data sets, using five fragment ions as the cutoff for higher-confidence identification. The propensity for sequence permutation is shown to increase with the length of the tryptic peptide in both data sets. A higher charge state (i.e., 3+ vs. 2+) also appears to correlate with a higher incidence of permuted masses for larger peptides. The ratio of these permuted sequence ions to all tandem mass spectral peaks reaches ~25% in the higher-confidence data set, compared to an estimated maximum incidence of ~8% false positives for permuted masses, based on a null-hypothesis decoy data set.

17.
Gene expression array technology has reached the stage of being routinely used to study clinical samples in search of diagnostic and prognostic biomarkers. Due to the nature of array experiments, which examine the expression of tens of thousands of genes simultaneously, the number of null hypotheses is large. Hence, multiple testing correction is often necessary to control the number of false positives. However, multiple testing correction can lead to low statistical power in detecting genes that are truly differentially expressed. Filtering out non-informative genes allows for reduction in the number of null hypotheses. While several filtering methods have been suggested, the appropriate way to perform filtering is still debatable. We propose a new filtering strategy for Affymetrix GeneChips®, based on principal component analysis of probe-level gene expression data. Using a wholly defined spike-in data set and one from a diabetes study, we show that filtering by the proportion of variation accounted for by the first principal component (PVAC) provides increased sensitivity in detecting truly differentially expressed genes while controlling false discoveries. We demonstrate that PVAC exhibits equal or better performance than several widely used filtering methods. Furthermore, a data-driven approach that guides the selection of the filtering threshold value is also proposed.
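The PVAC statistic itself is easy to state: for each probe set, take the probes × arrays matrix of probe-level intensities and compute the proportion of variance captured by the first principal component. A sketch assuming probe-wise centering (the paper's exact preprocessing may differ):

```python
import numpy as np

def pvac(probe_matrix):
    """Proportion of variation accounted for by the first principal
    component of one probe set's probes x arrays matrix; probe sets
    whose probes do not co-vary get low PVAC and are filtered out."""
    X = probe_matrix - probe_matrix.mean(axis=1, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)   # singular values
    return float(s[0] ** 2 / np.sum(s ** 2))
```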

18.
Previously, we introduced a neural network system for predicting the locations of transmembrane helices (HTMs) based on evolutionary profiles (PHDhtm; Rost B, Casadio R, Fariselli P, Sander C, 1995, Protein Sci 4:521-533). Here, we describe an improvement and an extension of that system. The improvement is achieved by a dynamic programming-like algorithm that optimizes helices compatible with the neural network output. The extension is the prediction of topology (the orientation of the first loop region with respect to the membrane), obtained by applying to the refined prediction the observation that positively charged residues are more abundant in cytoplasmic regions. Furthermore, we introduce a method to reduce the number of false positives, i.e., proteins falsely predicted to contain membrane helices. The evaluation of prediction accuracy is based on cross-validation and a double-blind test set (131 proteins in total). The final method appears to be more accurate than other published methods: (1) For almost 89% (±3%) of the test proteins, all HTMs are predicted correctly. (2) For more than 86% (±3%) of the proteins, topology is predicted correctly. (3) We define reliability indices that correlate with prediction accuracy: for one half of the proteins, segment accuracy rises to 98%; and for two-thirds, the accuracy of topology prediction is 95%. (4) The rate of proteins for which HTMs are predicted falsely is below 2% (±1%). Finally, the method is applied to 1,616 sequences of Haemophilus influenzae. We predict 19% of the genome's sequences to contain one or more HTMs. This appears to be lower than what we predicted previously for yeast chromosome VIII (about 25%).
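The topology step can be sketched directly from the rule stated in the abstract: collect the loops on each side of the alternating helix/loop pattern, count Arg/Lys residues, and assign the more positively charged side to the cytoplasm. The function name and input conventions are illustrative.

```python
def orient_topology(sequence, helices):
    """Positive-inside rule sketch: given predicted helix spans as
    (start, end) index pairs, count Arg/Lys in the alternating loops;
    the loop set with more positive charge is taken as cytoplasmic,
    which fixes the orientation of the first loop."""
    bounds = [0] + [i for span in helices for i in span] + [len(sequence)]
    loops = [sequence[bounds[i]:bounds[i + 1]]
             for i in range(0, len(bounds) - 1, 2)]
    charge = [sum(res in "KR" for res in loop) for loop in loops]
    same_side, other_side = sum(charge[0::2]), sum(charge[1::2])
    return "first loop inside" if same_side >= other_side else "first loop outside"

# example: orient_topology("MKK...KRA", [(5, 25), (40, 60)])
```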

19.
The recent availability of the chicken genome sequence poses the question of whether there are human protein-coding genes conserved in chicken that are currently not included in the human gene catalog. Here, we show, using comparative gene finding followed by experimental verification of exon pairs by RT–PCR, that the addition to the multi-exonic subset of this catalog could be as little as 0.2%, suggesting that we may be closing in on the human gene set. Our protocol, however, has two shortcomings: (i) the bioinformatic screening of the predicted genes, applied to filter out false positives, cannot handle intronless genes; and (ii) the experimental verification could fail to identify expression at a specific developmental time. This highlights the importance of developing methods that could provide a reliable estimate of the number of these two types of genes.

20.