Similar Articles
20 similar articles found.
1.
As much of the focus of genetics and molecular biology has shifted toward the systems level, it has become increasingly important to accurately extract biologically relevant signal from thousands of related measurements. The common property among these high-dimensional biological studies is that the measured features have a rich and largely unknown underlying structure. One example of much recent interest is identifying differentially expressed genes in comparative microarray experiments. We propose a new approach aimed at optimally performing many hypothesis tests in a high-dimensional study. This approach estimates the optimal discovery procedure (ODP), which has recently been introduced and theoretically shown to optimally perform multiple significance tests. Whereas existing procedures essentially use data from only one feature at a time, the ODP approach uses the relevant information from the entire data set when testing each feature. In particular, we propose a generally applicable estimate of the ODP for identifying differentially expressed genes in microarray experiments. This microarray method consistently shows favorable performance over five highly used existing methods. For example, in testing for differential expression between two breast cancer tumor types, the ODP provides increases from 72% to 185% in the number of genes called significant at a false discovery rate of 3%. Our proposed microarray method is freely available to academic users in the open-source, point-and-click EDGE software package.
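To make the borrowing of strength concrete, here is a minimal sketch (not the EDGE implementation; the array names and the per-gene normal fits are our assumptions) of an estimated ODP statistic for a two-group comparison: each gene's data vector is scored under every gene's fitted null and alternative models, and the statistic is the ratio of summed likelihoods.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def odp_statistics(expr, groups):
    """Estimated ODP statistic per gene.

    expr   : (genes x arrays) expression matrix
    groups : 0/1 class label per array
    """
    expr = np.asarray(expr, float)
    groups = np.asarray(groups)
    m0, m1 = groups == 0, groups == 1

    # Per-gene ML fits: null model = one common mean,
    # alternative model = group-specific means.
    mu_null = np.tile(expr.mean(axis=1, keepdims=True), (1, expr.shape[1]))
    mu_alt = np.empty_like(expr)
    mu_alt[:, m0] = expr[:, m0].mean(axis=1, keepdims=True)
    mu_alt[:, m1] = expr[:, m1].mean(axis=1, keepdims=True)
    sd_null = np.maximum(expr.std(axis=1), 1e-8)
    sd_alt = np.maximum(np.sqrt(((expr - mu_alt) ** 2).mean(axis=1)), 1e-8)

    # ll[i, j] = log-likelihood of gene i's data under gene j's fit,
    # so every gene is scored against the whole dataset's models.
    ll_alt = norm.logpdf(expr[:, None, :], mu_alt[None], sd_alt[None, :, None]).sum(-1)
    ll_null = norm.logpdf(expr[:, None, :], mu_null[None], sd_null[None, :, None]).sum(-1)

    # ODP statistic: summed alternative vs. summed null likelihood.
    return np.exp(logsumexp(ll_alt, axis=1) - logsumexp(ll_null, axis=1))
```

Significance cutoffs for statistics of this kind are typically calibrated by recomputing them on data with permuted group labels; the published estimator also refines which genes enter each sum, which this sketch omits.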

2.
3.

Background  

Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption in the null distribution such as normality may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.
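A minimal sketch of the kind of resampling-based p-value the passage has in mind (a two-sample permutation test; the function and variable names are ours) makes the computational burden concrete: resolving small p-values needs many resamples, and this has to be repeated for every feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(x, y, n_perm=10_000):
    """Two-sample permutation p-value for a difference in means.

    No distributional assumption: the null is built by reshuffling
    group labels, so the resolution is limited to roughly 1/n_perm --
    hence the heavy cost when many small p-values must be estimated
    simultaneously.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    obs = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[:x.size].mean() - pooled[x.size:].mean()) >= obs
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```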

4.
C. Y. Meng, A. P. Dempster. Biometrics, 1987, 43(2): 301-311.
Statistical analyses of simple tumor rates from an animal experiment with one control and one treated group typically consist of hypothesis testing of many 2 × 2 tables, one for each tumor type or site. The multiplicity of significance tests may cause excessive overall false-positive rates. This paper presents a Bayesian approach to the problem of multiple significance testing. We develop a normal logistic model that accommodates the incidences of all tumor types or sites observed in the current experiment simultaneously as well as their historical control incidences. Exchangeable normal priors are assumed for certain linear terms in the model. Posterior means, standard deviations, and Bayesian P-values are computed for an average treatment effect as well as for the effects on individual tumor types or sites. Model assumptions are checked using probability plots and the sensitivity of the parameter estimates to alternative priors is studied. The method is illustrated using tumor data from a chronic animal experiment.
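The abstract does not give the model's exact parameterization; one generic form that an exchangeable-prior logistic model of this kind might take, for tumor site j with treatment indicator t ∈ {0, 1}, is:

```latex
\begin{align*}
  y_{jt} &\sim \mathrm{Binomial}(n_{jt},\, p_{jt}), &
  \operatorname{logit}(p_{jt}) &= \alpha_j + \beta_j\, t,\\
  \alpha_j &\stackrel{\text{iid}}{\sim} N(\mu_\alpha, \sigma_\alpha^2), &
  \beta_j &\stackrel{\text{iid}}{\sim} N(\mu_\beta, \sigma_\beta^2).
\end{align*}
```

Under such a model the exchangeable priors shrink the many per-site comparisons toward each other, a Bayesian P-value for site j can be taken as the posterior probability that β_j ≤ 0, and μ_β plays the role of the average treatment effect; historical control incidences would enter through the prior on the α_j, which this sketch leaves unspecified.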

5.
In a computer simulation, a neural network first received a simultaneous procedure in which the interstimulus interval (ISI) was 0 time-steps (ts). Output activations were near zero under this procedure. The network then received a forward-delay procedure in which the ISI was 8 ts. Output activations increased to the near-maximum level faster than those of a control network that had first received an explicitly unpaired procedure. Comparable results were obtained with rats that first received trials in which a retractable lever was presented for 3 s concurrently with access to water. Low rates of lever pressing were observed under this procedure. The rats then received trials in which the lever was followed 15 s later by water. Lever pressing emerged faster than in a control group that received the 15-s ISI after an explicitly unpaired procedure. The model used in the simulation explains these results as connection-weight increments that produce little output activation in a simultaneous procedure but facilitate acquisition at an optimal ISI.

6.
7.
Following the success of small-molecule high-throughput screening (HTS) in drug discovery, other large-scale screening techniques are currently revolutionizing the biological sciences. Powerful new statistical tools have been developed to analyze the vast amounts of data in DNA chip studies, but have not yet found their way into compound screening. In HTS, characterization of single-point hit lists is often done only in retrospect, after the results of confirmation experiments are available. However, for prioritization, for optimal use of resources, for quality control, and for comparison of screens, it would be extremely valuable to predict the rates of false positives and false negatives directly from the primary screening results. By making full use of the available information about compounds and controls contained in HTS results and replicated pilot runs, the Z score, and from it the p value, can be estimated for each measurement. Based on this consideration, we have applied the concept of p-value distribution analysis (PVDA), which was originally developed for gene expression studies, to HTS data. PVDA allowed prediction of all relevant error rates as well as the rate of true inactives, and excellent agreement with confirmation experiments was found.
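A sketch of the per-measurement Z score and p-value the passage describes (assuming, for illustration, plate-normalized activities and a set of neutral-control wells; the names are ours):

```python
import numpy as np
from scipy.stats import norm

def hts_z_and_p(signal, neg_controls):
    """Per-well Z score against the neutral controls of the same
    plate, and the one-sided normal p-value it implies."""
    neg = np.asarray(neg_controls, float)
    z = (np.asarray(signal, float) - neg.mean()) / neg.std(ddof=1)
    p = norm.sf(z)  # activation readout; use norm.cdf(z) for inhibition
    return z, p
```

PVDA then works from the histogram of these p-values: the flat component estimates the rate of true inactives, the excess near zero the actives, and from those the expected false-positive and false-negative rates at any hit threshold follow.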

8.
A simple procedure for estimating the false discovery rate   (Cited by: 1; self-citations: 0; other citations: 1)
MOTIVATION: The false discovery rate (FDR) is nowadays the most widely used criterion in microarray data analysis. In the framework of estimation procedures based on the marginal distribution of the p-values, without any assumption on gene expression changes, estimators of the FDR are necessarily conservatively biased: only an upper-bound estimate can be obtained for the key quantity pi0, the probability that a gene is unmodified. In this paper, we propose a novel family of estimators for pi0 that allows the calculation of the FDR. RESULTS: A very simple method for estimating pi0, called LBE (Location Based Estimator), is presented together with results on its variability. Simulation results indicate that the proposed estimator performs well in finite samples and has the best mean squared error in most cases when compared with the procedures QVALUE, BUM and SPLOSH. The different procedures are then applied to real datasets. AVAILABILITY: The R function LBE is available at http://ifr69.vjf.inserm.fr/lbe CONTACT: broet@vjf.inserm.fr
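The abstract does not reproduce the LBE formula itself; as a stand-in, the sketch below uses the widely known Storey-type estimator of pi0 together with the plug-in FDR estimate that any such pi0 feeds into (only the pi0 line would change under LBE):

```python
import numpy as np

def pi0_storey(pvals, lam=0.5):
    """Storey-type pi0 estimate: null p-values are uniform, so the
    density observed above `lam` estimates the unmodified fraction."""
    p = np.asarray(pvals, float)
    return min(1.0, (p > lam).mean() / (1.0 - lam))

def fdr_estimate(pvals, t, pi0):
    """Plug-in FDR estimate at p-value cutoff t: expected false
    positives pi0 * m * t over the observed number of rejections."""
    p = np.asarray(pvals, float)
    return pi0 * p.size * t / max((p <= t).sum(), 1)
```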

9.
Testing for simultaneous vicariance across comparative phylogeographic data sets is a notoriously difficult problem, hindered by mutational variance, coalescent variance, and variability across pairs of sister taxa in parameters that affect genetic divergence. We simulate vicariance to characterize the behaviour of several commonly used summary statistics across a range of divergence times, and to characterize this behaviour in comparative phylogeographic datasets having multiple taxon-pairs. We found Tajima's D to be relatively uncorrelated with other summary statistics across divergence times, and using simple hypothesis testing of simultaneous vicariance given variable population sizes, we counter-intuitively found that the variance across taxon pairs in Nei and Li's net nucleotide divergence (pi(net)), a common measure of population divergence, is often inferior to the variance in Tajima's D across taxon pairs as a test statistic for distinguishing ancient simultaneous vicariance from variable vicariance histories. The opposite, more intuitive pattern holds for testing more recent simultaneous vicariance, and overall we found that, depending on the timing of vicariance, one of these two test statistics can achieve high statistical power for rejecting simultaneous vicariance, given a reasonable number of intron loci (> 5 loci, 400 bp) and a range of conditions. These results suggest that components of these two composite summary statistics should be used in future simulation-based methods that can simultaneously use a pool of summary statistics to test the comparative phylogeographic hypotheses we consider here.
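For reference, a sketch of one of the two summary statistics whose across-taxon-pair variance the study turns into a test statistic; the constants follow Tajima's 1989 definition (the alignment encoding is our assumption):

```python
import numpy as np

def tajimas_d(seqs):
    """Tajima's D from an alignment (list of equal-length DNA strings):
    contrasts mean pairwise diversity (pi) with the diversity expected
    from the number of segregating sites S."""
    n = len(seqs)
    S = sum(len(set(col)) > 1 for col in zip(*seqs))
    if S == 0:
        return 0.0  # D is undefined without polymorphism; 0 by convention here
    # Mean number of pairwise differences.
    pi = sum(sum(a != b for a, b in zip(s1, s2))
             for i, s1 in enumerate(seqs) for s2 in seqs[i + 1:])
    pi /= n * (n - 1) / 2
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / np.sqrt(e1 * S + e2 * S * (S - 1))
```

The comparative test described above would compute D (or pi(net)) for each taxon pair and use the variance of those values across pairs as its statistic.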

10.
This paper addresses the question of biomarker discovery in proteomics. Given clinical data regarding a list of proteins for a set of individuals, the problem tackled is to extract a short subset of proteins whose concentrations are an indicator of the biological status (healthy or pathological). In this paper, it is formulated as a specific instance of variable selection. The originality is that the proteins are not investigated one after the other; rather, the best partition between discriminant and non-discriminant proteins is sought directly. In this way, correlations between the proteins are intrinsically taken into account in the decision. The strategy is derived in a Bayesian setting, and the decision is optimal in the sense that it minimizes a global mean error. It is finally based on the posterior probabilities of the partitions. The main difficulty is to calculate these probabilities, since they are based on the so-called evidence, which requires marginalization of all the unknown model parameters. Two models are presented that relate the status to the protein concentrations, depending on whether the latter are biomarkers or not. The first model accounts for biological variability by assuming that the concentrations are Gaussian distributed with a mean and a covariance matrix that depend on the status only for the biomarkers. The second is an extension that also takes into account the technical variability that may significantly impact the observed concentrations. The main contributions of the paper are: (1) a new Bayesian formulation of the biomarker selection problem, (2) the closed-form expression of the posterior probabilities in the noiseless case, and (3) a suitable approximate solution in the noisy case. The methods are numerically assessed and compared to state-of-the-art methods (t test, LASSO, Bhattacharyya distance, FOHSIC) on synthetic and real data from proteins quantified in human serum by mass spectrometry in selected reaction monitoring mode.
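As a much-reduced illustration of the evidence computation (noiseless case, a single protein, known noise variance, and a zero-mean normal prior on the class means; the paper's models are multivariate and account for correlations between proteins, which this sketch deliberately drops):

```python
import numpy as np

def log_marginal(x, sigma2=1.0, tau2=10.0):
    """Log evidence of samples x under x_i ~ N(mu, sigma2) with
    mu ~ N(0, tau2); analytic once mu is integrated out."""
    x = np.asarray(x, float)
    n, xbar = x.size, x.mean()
    ss = ((x - xbar) ** 2).sum()
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            + 0.5 * np.log(sigma2 / (sigma2 + n * tau2))
            - ss / (2 * sigma2)
            - n * xbar ** 2 / (2 * (sigma2 + n * tau2)))

def log_bayes_factor(x_healthy, x_path, **kw):
    """Evidence for 'biomarker' (class-specific means) over
    'non-biomarker' (one shared mean)."""
    pooled = np.concatenate([x_healthy, x_path])
    return (log_marginal(x_healthy, **kw) + log_marginal(x_path, **kw)
            - log_marginal(pooled, **kw))
```

Ranking proteins by such a Bayes factor is the single-variable shadow of the paper's approach, which scores whole partitions instead.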

11.
How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, typically chosen in practice on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. First, we motivate minimisation of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Second, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to that observed, and each of these is successively regarded as an observed dataset for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we inferred the scaled mutation and recombination parameters, both singly and jointly, from a population sample of DNA sequences. The computationally fast minimum-entropy algorithm showed a modest improvement over existing methods, while our two-stage procedure showed substantial and highly significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset-specific, suggesting that more generally there may be no globally optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged.
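For orientation, a minimal ABC rejection sampler (the names are ours); the algorithms in the paper sit one level above it, choosing which summaries to feed it:

```python
import numpy as np

def abc_rejection(obs_summ, simulate, prior_sample, n_sims=50_000, keep=500):
    """Basic ABC rejection sampler: draw parameters from the prior,
    simulate data summaries, and keep the draws whose summaries fall
    closest to the observed ones. Which summaries to use is exactly
    the choice the two algorithms above try to automate."""
    thetas = np.array([prior_sample() for _ in range(n_sims)])
    summs = np.array([simulate(t) for t in thetas])
    # Scale each summary so no single one dominates the distance.
    mu, sd = summs.mean(0), summs.std(0)
    dist = np.linalg.norm((summs - mu) / sd - (np.asarray(obs_summ) - mu) / sd,
                          axis=1)
    return thetas[np.argsort(dist)[:keep]]  # approximate posterior sample
```

The minimum-entropy algorithm would run this for candidate summary sets and keep the set whose accepted sample has the lowest estimated entropy.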

12.
13.
14.
SUMMARY: BAli-Phy is a Bayesian posterior sampler that employs Markov chain Monte Carlo to explore the joint space of alignment and phylogeny given molecular sequence data. Simultaneous estimation eliminates bias toward inaccurate alignment guide-trees, employs more sophisticated substitution models during alignment and automatically utilizes information in shared insertions/deletions to help infer phylogenies. AVAILABILITY: Software is available for download at http://www.biomath.ucla.edu/msuchard/bali-phy.

15.
This article explains estimation of gene frequencies from a Bayesian viewpoint using prior information. How to obtain Bayes estimators and the highest posterior density credible sets (Bayesian counterpart to classical confidence intervals) for gene frequencies is described. Tests of hypotheses are also discussed. A readily available mathematical application package is used to demonstrate the mathematical computations.
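A minimal worked example in the spirit of the article, for a biallelic locus under a conjugate Beta prior (the prior parameters and counts below are invented for illustration):

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize_scalar

def hpd_beta(a, b, mass=0.95):
    """Shortest interval holding `mass` posterior probability
    under a Beta(a, b) posterior (valid for a, b > 1)."""
    width = lambda lo: beta.ppf(lo + mass, a, b) - beta.ppf(lo, a, b)
    lo = minimize_scalar(width, bounds=(0.0, 1.0 - mass), method="bounded").x
    return beta.ppf(lo, a, b), beta.ppf(lo + mass, a, b)

# Prior Beta(2, 8) encodes a belief that the allele is rare; we then
# observe 12 copies of the allele among 100 sampled genes.
a, b = 2 + 12, 8 + (100 - 12)
post_mean = a / (a + b)          # Bayes estimator under squared-error loss
lo, hi = hpd_beta(a, b)
print(f"posterior mean {post_mean:.3f}, 95% HPD ({lo:.3f}, {hi:.3f})")
# A test of, say, H0: q >= 0.2 can report the posterior mass beta.sf(0.2, a, b).
```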

16.
17.
Statistical tests that detect and measure deviation from Hardy-Weinberg equilibrium (HWE) have been devised but are of limited use when testing for deviation at multiallelic DNA loci. Here we present the full Bayesian significance test (FBST) for HWE. This test depends neither on asymptotic results nor on the number of possible alleles at the particular locus being evaluated. The FBST is based on the computation of an evidence index in favor of the HWE hypothesis. A great deal of forensic inference based on DNA evidence assumes that HWE holds for the genetic loci being used. We applied the FBST to genotypes obtained at several multiallelic short tandem repeat loci during routine parentage testing; the locus Penta E exemplifies those clearly in HWE, while others such as D10S1214 and D19S253 do not appear to show this.
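A sketch of the FBST evidence computation for the biallelic case under a flat Dirichlet prior (an assumption for this sketch; the paper's point is that the same Monte Carlo recipe extends to multiallelic loci, where asymptotic tests struggle):

```python
import numpy as np
from scipy.stats import dirichlet

def fbst_hwe_evidence(n_aa, n_ab, n_bb, n_mc=100_000, grid=2001):
    """FBST evidence in favor of HWE at a biallelic locus.

    Flat Dirichlet prior on genotype probabilities; the HWE null is
    the curve p = (t^2, 2t(1-t), (1-t)^2). Evidence = 1 minus the
    posterior mass where the density exceeds its supremum on the null.
    """
    alpha = np.array([1 + n_aa, 1 + n_ab, 1 + n_bb], float)
    # Supremum of the posterior density along the HWE curve.
    t = np.linspace(1e-6, 1 - 1e-6, grid)
    on_null = np.vstack([t ** 2, 2 * t * (1 - t), (1 - t) ** 2])
    f_star = dirichlet.logpdf(on_null, alpha).max()
    # Posterior mass of the tangential set {p : f(p) > f_star}.
    draws = dirichlet.rvs(alpha, size=n_mc, random_state=0)
    above = dirichlet.logpdf(draws.T, alpha) > f_star
    return 1.0 - above.mean()  # small values are evidence against HWE
```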

18.
A sharper Bonferroni procedure for multiple tests of significance   (Cited by: 62; self-citations: 0; other citations: 62)
Hochberg, Yosef. Biometrika, 1988, 75(4): 800-802.
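Only the title is indexed here; as a reader aid, a minimal sketch of the step-up procedure the title refers to (reject the k smallest p-values, where k is the largest index with p_(k) ≤ α/(m − k + 1)):

```python
import numpy as np

def hochberg_reject(pvals, alpha=0.05):
    """Hochberg step-up procedure, a uniformly sharper variant of
    Bonferroni that still controls the familywise error rate.
    Returns a boolean rejection mask."""
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    ok = np.nonzero(p[order] <= alpha / (m - np.arange(1, m + 1) + 1))[0]
    reject = np.zeros(m, bool)
    if ok.size:
        reject[order[: ok.max() + 1]] = True
    return reject
```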

19.
An improved Bonferroni procedure for multiple tests of significance   (Cited by: 24; self-citations: 0; other citations: 24)
Simes, R. J. Biometrika, 1986, 73(3): 751-754.
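Again only the title is indexed; Simes' procedure rejects the global null at level α if p_(k) ≤ kα/m for any k, which is equivalent to the combined p-value sketched below being at most α:

```python
import numpy as np

def simes_global_p(pvals):
    """Simes' combined p-value for the global null hypothesis:
    min over k of m * p_(k) / k, with p_(1) <= ... <= p_(m)."""
    p = np.sort(np.asarray(pvals, float))
    m = p.size
    return (m * p / np.arange(1, m + 1)).min()
```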

20.
Motivation: We propose a Bayesian method for the problem of multiple hypothesis testing that is routinely encountered in bioinformatics research, such as differential gene expression analysis. Our algorithm is based on modeling the distributions of test statistics under both null and alternative hypotheses. We substantially reduce the complexity of the process of defining posterior model probabilities by modeling the test statistics directly instead of modeling the full data. Computationally, we apply a Bayesian FDR approach to control the number of rejections of null hypotheses. To check whether our model assumptions for the test statistics are valid for various bioinformatics experiments, we also propose a simple graphical model-assessment tool. Results: Using extensive simulations, we demonstrate the performance of our models and the utility of the model-assessment tool. In the end, we apply the proposed methodology to an siRNA screening and a gene expression experiment. Contact: yuanji@mdanderson.org Supplementary information: Supplementary data are available at Bioinformatics online. Associate Editor: Chris Stoeckert
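A generic sketch of the Bayesian FDR rule the abstract mentions (not the paper's exact model: it assumes the posterior null probabilities have already been computed, e.g. from a fitted two-component model of the test statistics):

```python
import numpy as np

def bayes_fdr_reject(post_null, target=0.05):
    """Reject the largest set of hypotheses whose average posterior
    probability of being null -- the expected FDR of the set -- stays
    at or below `target`."""
    p = np.asarray(post_null, float)
    order = np.argsort(p)
    running = np.cumsum(p[order]) / np.arange(1, p.size + 1)
    ok = np.nonzero(running <= target)[0]
    reject = np.zeros(p.size, bool)
    if ok.size:
        reject[order[: ok.max() + 1]] = True
    return reject
```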
