Similar Articles
 20 similar articles found (search time: 31 ms)
1.
Permutation tests are amongst the most commonly used statistical tools in modern genomic research, a process by which p-values are attached to a test statistic by randomly permuting the sample or gene labels. Yet permutation p-values published in the genomic literature are often computed incorrectly, understated by about 1/m, where m is the number of permutations. The same is often true in the more general situation when Monte Carlo simulation is used to assign p-values. Although the p-value understatement is usually small in absolute terms, the implications can be serious in a multiple testing context. The understatement arises from the intuitive but mistaken idea of using permutation to estimate the tail probability of the test statistic. We argue instead that permutation should be viewed as generating an exact discrete null distribution. The relevant literature, some of which is likely to have been relatively inaccessible to the genomic community, is reviewed and summarized. A computation strategy is developed for exact p-values when permutations are randomly drawn. The strategy is valid for any number of permutations and samples. Some simple recommendations are made for the implementation of permutation tests in practice.
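The fix this abstract advocates, counting the observed statistic as one of the permutations so that p = (b + 1)/(m + 1) rather than the understated b/m, can be sketched as follows (a minimal illustration with a difference-in-means statistic; the function name and statistic choice are ours, not the authors'):

```python
import random

def exact_perm_pvalue(x, y, n_perm=999, seed=0):
    """Two-sample permutation p-value for an absolute difference in means.

    Counts the observed statistic among the permuted ones, giving
    p = (b + 1) / (n_perm + 1) instead of the understated b / n_perm,
    so p can never be exactly zero.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n = len(x)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    b = 0  # permutations at least as extreme as the observed statistic
    for _ in range(n_perm):
        rng.shuffle(pooled)
        px, py = pooled[:n], pooled[n:]
        stat = abs(sum(px) / n - sum(py) / len(py))
        if stat >= observed:
            b += 1
    return (b + 1) / (n_perm + 1)
```

The smallest achievable p-value is 1/(n_perm + 1), which is exactly the point of the correction: the observed labeling is itself one draw from the exact discrete null distribution.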

2.
MOTIVATION: Statistical tests for the detection of differentially expressed genes lead to a large collection of p-values, one for each gene comparison. Without any further adjustment, these p-values may lead to a large number of false positives, simply because the number of genes to be tested is huge, potentially wasting laboratory resources. To account for multiple hypotheses, these p-values are typically adjusted using a single-step method or a step-down method in order to achieve overall control of the error rate (the so-called familywise error rate). In many applications, this may lead to an overly conservative strategy in which too few genes are flagged. RESULTS: In this paper we introduce a novel empirical Bayes screening (EBS) technique to inspect a large number of p-values in an effort to detect additional positive cases. In effect, each case borrows strength from an overall picture of the alternative hypotheses computed from all the p-values, while the entire procedure is calibrated by a step-down method so that the familywise error rate at the complete null hypothesis is still controlled. It is shown that the EBS has substantially higher sensitivity than the standard step-down approach for multiple comparison, at the cost of a modest increase in the false discovery rate (FDR). The EBS procedure also compares favorably with existing FDR control procedures for multiple testing. The EBS procedure is particularly useful in situations where it is important to identify all potentially positive cases, which can then be subjected to further confirmatory testing in order to eliminate the false positives.
We illustrate this screening procedure using a data set on human colorectal cancer, where we show that the EBS method detected additional genes related to colon cancer that were missed by other methods. This novel empirical Bayes procedure has several advantages over our earlier proposed empirical Bayes adjustments: (i) it offers automatic screening of the p-values the user may obtain from a univariate (i.e., gene-by-gene) analysis package, making it extremely easy to use for a non-statistician; (ii) since it applies to the p-values, the tests do not have to be t-tests; in particular, they could be F-tests arising in certain ANOVA formulations with expression data, or even nonparametric tests; (iii) the empirical Bayes adjustment uses nonparametric function estimation techniques to estimate the marginal density of the transformed p-values, rather than a parametric model for the prior distribution, and is therefore robust against model mis-specification. AVAILABILITY: R code for EBS is available from the authors upon request. SUPPLEMENTARY INFORMATION: http://www.stat.uga.edu/~datta/EBS/supp.htm
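The step-down calibration the abstract relies on can be illustrated with the classic Holm procedure for familywise error control (a generic sketch, not the EBS code; EBS layers an empirical Bayes screen on top of such a step-down method):

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values controlling the familywise
    error rate.

    The smallest p-value is scaled by m, the next by m - 1, and so on,
    enforcing monotonicity so a later hypothesis is never rejected at a
    stricter level than an earlier one.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running)
    return adjusted
```

Rejecting hypotheses whose adjusted p-values fall below alpha reproduces the usual step-down decision rule.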

3.
Estimating p-values in small microarray experiments
MOTIVATION: Microarray data typically have small numbers of observations per gene, which can result in low power for statistical tests. Test statistics that borrow information from data across all of the genes can improve power, but these statistics have non-standard distributions, and their significance must be assessed using permutation analysis. When sample sizes are small, the number of distinct permutations can be severely limited, and pooling the permutation-derived test statistics across all genes has been proposed. However, the null distribution of the test statistics under permutation is not the same for equally and differentially expressed genes. This can have a negative impact on both p-value estimation and the power of information-borrowing statistics. RESULTS: We investigate permutation-based methods for estimating p-values. One of these methods, which pools from a selected subset of the data, is shown to have the correct type I error rate and to provide accurate estimates of the false discovery rate (FDR). We provide guidelines for selecting an appropriate subset. We also demonstrate that information-borrowing statistics have substantially increased power compared to the t-test in small experiments.

4.
Classically, mouse embryonic stem (ES) cells are transfected by electroporation, a method that requires a large number of cells. Here we describe a protocol using a liposome-based transfection agent that is a very simple, rapid and cost-effective way of transiently transfecting very low numbers of ES cells. We found this method very useful for screening a large number of ES clones when working with inducible expression systems, in which at least two elements are required for regulated expression of the gene of interest. After stable transfection of the first component, clones can be easily and rapidly screened for expression of the gene of interest by transiently transfecting the second component of the system using this protocol.

5.
Most existing statistical methods developed for calling single nucleotide polymorphisms (SNPs) using next-generation sequencing (NGS) data are based on Bayesian frameworks, and no SNP caller produces p-values for calling SNPs in a frequentist framework. To fill this gap, we develop a new method, MAFsnp, a Multiple-sample based Accurate and Flexible algorithm for calling SNPs with NGS data. MAFsnp is based on an estimated likelihood ratio test (eLRT) statistic. In practical situations the parameter involved lies very close to the boundary of the parameter space, so standard large-sample theory is not suitable for evaluating the finite-sample distribution of the eLRT statistic. Observing that the distribution of the test statistic is a mixture of zero and a continuous part, we propose to model the test statistic with a novel two-parameter mixture distribution. Once the parameters in the mixture distribution are estimated, p-values can be easily calculated for detecting SNPs, and the multiple-testing corrected p-values can be used to control the false discovery rate (FDR) at any pre-specified level. With simulated data, MAFsnp is shown to have much better control of FDR than existing SNP callers. Through application to two real datasets, MAFsnp is also shown to outperform existing SNP callers in terms of calling accuracy. An R package "MAFsnp" implementing the new SNP caller is freely available at http://homepage.fudan.edu.cn/zhangh/softwares/.
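The two-part null described here, a point mass at zero plus a continuous tail, can be illustrated with a toy zero-inflated model (the continuous part is taken to be exponential purely for illustration; MAFsnp fits its own two-parameter mixture to the eLRT statistic):

```python
import math

def mixture_pvalue(t, pi0, lam):
    """P(T >= t) under a zero-inflated null: T = 0 with probability pi0,
    otherwise exponential with rate lam. For t <= 0 every draw is at
    least as large, so p = 1; for t > 0 only the continuous component
    contributes to the tail."""
    if t <= 0:
        return 1.0
    return (1.0 - pi0) * math.exp(-lam * t)
```

Once such a tail function is in hand, each site's observed statistic maps directly to a p-value, and any standard multiple-testing correction can be applied downstream.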

6.
Efficient inference of haplotypes from genotypes on a pedigree
We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This is the first complexity result concerning the problem to our knowledge. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very efficient and can be used for large pedigrees with a large number of markers, especially for data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero-recombinant assumption, and represents them as a system of linear equations over the cyclic group Z2. Using a simple method based on Gaussian elimination, all feasible haplotype configurations can then be obtained. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated and real data. The results show that the program performs very well on both types of data and will be useful for large-scale haplotype inference projects.
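The zero-recombinant step reduces to solving a linear system over Z2, which a plain Gaussian-elimination sketch makes concrete (illustrative code, not PedPhase):

```python
def solve_gf2(A, b):
    """Gaussian elimination over GF(2) (XOR arithmetic). A is a list of
    0/1 rows, b a 0/1 right-hand side. Returns one solution, or None if
    the constraints are inconsistent; free variables are set to 0."""
    rows = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n_vars = len(A[0])
    pivot_cols, r = [], 0
    for c in range(n_vars):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if pivot is None:
            continue  # no pivot in this column: free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [a ^ p for a, p in zip(rows[i], rows[r])]
        pivot_cols.append(c)
        r += 1
    if any(rows[i][-1] for i in range(r, len(rows))):
        return None  # a 0 = 1 row: no consistent configuration exists
    x = [0] * n_vars
    for i, c in enumerate(pivot_cols):
        x[c] = rows[i][-1]
    return x
```

Enumerating the free variables over {0, 1} would recover every feasible configuration, which is how the full solution set can be read off the reduced system.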

7.
Estimation of tetrad crossover frequency distributions from genetic recombination data is a classic problem dating back to Weinstein (1936, Genetics 21, 155-199). But a number of important issues, such as how to specify the maximum number of crossovers, how to construct confidence intervals for crossover probabilities, and how to obtain correct p-values for hypothesis tests, have never been adequately addressed. In this article, we obtain some properties of the maximum likelihood estimate (MLE) for crossover probabilities that imply guidelines for choosing the maximum number of crossovers. We give these results for both normal meiosis and meiosis with nondisjunction. We also develop an accelerated EM algorithm to find the MLE more efficiently. We propose bootstrap-based methods to find confidence intervals and p-values and conduct simulation studies to check the validity of the bootstrap approach.
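For the confidence-interval part, a generic percentile-bootstrap sketch conveys the idea (the paper resamples tetrad data under its fitted model; the helper below is a plain nonparametric stand-in with hypothetical names):

```python
import random

def percentile_ci(data, stat, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI: resample the data with replacement,
    recompute the statistic on each resample, and take the alpha/2 and
    1 - alpha/2 quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(n_boot))
    lo = reps[int(alpha / 2 * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The same resampling machinery yields bootstrap p-values by comparing the observed statistic with replicates simulated under the fitted null model.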

8.
Next-generation sequencing has made possible the detection of rare variant (RV) associations with quantitative traits (QT). Due to high sequencing cost, many studies can only sequence a modest number of selected samples with extreme QT. Therefore, association testing in individual studies can be underpowered. Besides the primary trait, many clinically important secondary traits are often measured. It is highly beneficial if multiple studies can be jointly analyzed to detect associations with commonly measured traits. However, analyzing secondary traits in selected samples can be biased if sample ascertainment is not properly modeled. Some methods exist for analyzing secondary traits in selected samples, in which some burden tests can be implemented. However, their p-values can only be evaluated analytically via asymptotic approximations, which may not be accurate. Additionally, potentially more powerful sequence kernel association tests, variable selection-based methods, and burden tests that require permutations cannot be incorporated. To overcome these limitations, we developed a unified method for analyzing secondary trait associations with RVs (STAR) in selected samples, incorporating all RV tests. Statistical significance can be evaluated either through permutations or analytically. STAR makes it possible to apply more powerful RV tests to analyze secondary trait associations. It also enables joint analysis of multiple cohorts ascertained under different study designs, which greatly boosts power. The performance of STAR and commonly used RV association tests was comprehensively evaluated using simulation studies. STAR was also applied to a dataset from the SardiNIA project in which samples with extreme low-density lipoprotein levels were sequenced. A significant association between LDLR and systolic blood pressure was identified, which is supported by pharmacogenetic studies.
In summary, for sequencing studies, STAR is an important tool for detecting secondary-trait RV associations.

9.
For pathway analysis of genomic data, the most common methods involve combining p-values from individual statistical tests. However, there are several multivariate statistical methods that can be used to test whether a pathway has changed. Because of the large number of variables and pathway sizes in genomics data, some of these statistics cannot be computed. In metabolomics data, however, the number of variables and pathway sizes are typically much smaller, making such computations feasible. Of particular interest is being able to detect changes in pathways that may not be detected for the individual variables. We compare the performance of both the p-value methods and multivariate statistics for self-contained tests with an extensive simulation study and a human metabolomics study. Permutation tests, rather than asymptotic results, are used to assess the statistical significance of the pathways. Furthermore, both one- and two-sided alternative hypotheses are examined. In the human metabolomic study, many pathways were statistically significant, although the majority of the individual variables in the pathway were not. Overall, the p-value methods perform at least as well as the multivariate statistics in these scenarios.

10.
Linear regression analysis is considered the least computationally demanding method for mapping quantitative trait loci (QTL). However, simultaneous search for multiple QTL, the use of permutations to obtain empirical significance thresholds, and larger experimental studies significantly increase the computational demand. This report describes an easily implemented parallel algorithm, which significantly reduces the computing time in both QTL mapping and permutation testing. In the example provided, the analysis time was reduced to less than 15% of that of a single-processor system by using 18 processors. We indicate how the efficiency of the analysis could be improved by distributing the computations more evenly across the processors, and how other ways of distributing the data facilitate the use of more processors. The use of parallel computing in QTL mapping makes it possible to routinely use permutations to obtain empirical significance thresholds for multiple traits and multiple-QTL models. It could also be used to improve the computational efficiency of more demanding QTL analysis methods.
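The parallelization idea, split the permutation replicates evenly across workers and pool the simulated maxima into one empirical threshold, can be sketched as follows (threads are used here only for brevity; a real QTL analysis would use processes or MPI, and `simulate_max_stat` is a hypothetical stand-in for one full permutation analysis of the genome scan):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def split_permutations(n_perm, n_workers):
    """Distribute n_perm permutation replicates as evenly as possible
    across n_workers, so no worker receives more than one extra."""
    base, extra = divmod(n_perm, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]

def empirical_threshold(simulate_max_stat, n_perm=1000, n_workers=4,
                        alpha=0.05):
    """Pooled (1 - alpha) quantile of the genome-wide maximum statistic,
    computed from chunks of replicates run in parallel workers."""
    chunks = split_permutations(n_perm, n_workers)

    def run(k, seed):
        rng = random.Random(seed)  # independent stream per worker
        return [simulate_max_stat(rng) for _ in range(k)]

    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        stats = [s for chunk in ex.map(run, chunks, range(n_workers))
                 for s in chunk]
    stats.sort()
    return stats[int((1 - alpha) * n_perm) - 1]
```

Because each replicate is independent, the speedup is limited mainly by how evenly `split_permutations` balances the load, which is the efficiency point the report makes.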

11.
MOTIVATION: Pairwise local sequence alignment is commonly used to search databases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software such as BLAST provides the user with a set of scoring matrices to choose from, and the literature sometimes recommends trying several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and database sizes enter the standard calculations of significance, it is much less common to take into account the use of several scoring matrices on the same sequences. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.
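The simple Bonferroni correction discussed above is just a scaling of the best p-value by the number of matrices tried (a sketch; the paper's final recommendation is the fitted extreme-value correction, not shown here):

```python
def bonferroni_min_pvalue(pvalues):
    """Bonferroni adjustment when only the best (smallest) of k
    scoring-matrix p-values is reported: p_adj = min(1, k * p_min).
    The Sidak form 1 - (1 - p_min)**k would be exact if the k
    alignment scores were independent, which they are not in general."""
    return min(1.0, len(pvalues) * min(pvalues))
```

Quoting the unadjusted minimum instead of this adjusted value is exactly the overestimation of significance the abstract warns about.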

12.
13.
P1-derived artificial chromosomes (PACs) and bacterial artificial chromosomes (BACs) have become very useful as tools to study gene expression and regulation in cells and in transgenic mice. They carry large fragments of genomic DNA (≥100 kb) and therefore may contain all of the cis-regulatory elements required for expression of a gene. Because of this, even when inserted randomly in the genome, they can emulate the native environment of a gene, resulting in a tightly regulated pattern of expression. Because these large genomic clones often contain DNA sequences which can manipulate chromatin at the local level, they become immune to the position effects that affect expression of smaller transgenes, and thus their expression is proportional to copy number. Transgenic mice containing large BACs and PACs have become excellent models for examining the regulation of gene expression. Their usefulness would certainly be increased if easy and efficient methods were developed to manipulate them. We describe herein a method to make deletion mutations reliably and efficiently using a novel modification of the Chi-stimulated homologous recombination method. Specifically, we generated and employed a Lox511 'floxed' CAM resistance marker that first affords selection for homologous recombination in Escherichia coli, and then can be easily deleted, leaving only a single Lox511 site as the footprint.

14.
Some case-control genome-wide association studies (CCGWASs) select promising single nucleotide polymorphisms (SNPs) by ranking corresponding p-values, rather than by applying the same p-value threshold to each SNP. For such a study, we define the detection probability (DP) for a specific disease-associated SNP as the probability that the SNP will be "T-selected," namely have one of the top T largest chi-square values (or smallest p-values) for trend tests of association. The corresponding proportion positive (PP) is the fraction of selected SNPs that are true disease-associated SNPs. We study DP and PP analytically and via simulations, both for fixed and for random effects models of genetic risk, that allow for heterogeneity in genetic risk. DP increases with genetic effect size and case-control sample size and decreases with the number of nondisease-associated SNPs, mainly through the ratio of T to N, the total number of SNPs. We show that DP increases very slowly with T, and the increment in DP per unit increase in T declines rapidly with T. DP is also diminished if the number of true disease SNPs exceeds T. For a genetic odds ratio per minor disease allele of 1.2 or less, even a CCGWAS with 1000 cases and 1000 controls requires T to be impractically large to achieve an acceptable DP, leading to PP values so low as to make the study futile and misleading. We further calculate the sample size of the initial CCGWAS that is required to minimize the total cost of a research program that also includes follow-up studies to examine the T-selected SNPs. A large initial CCGWAS is desirable if genetic effects are small or if the cost of a follow-up study is large.
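The definition of DP lends itself to a direct Monte Carlo sketch (a deliberately toy model with squared-normal statistics standing in for the chi-square trend tests; the parameterization is ours, not the paper's):

```python
import random

def detection_probability(effect_z, n_null, T, n_sim=2000, seed=0):
    """Monte Carlo estimate of DP: the chance that a disease SNP's
    statistic (a squared normal with mean effect_z) ranks among the
    top T of n_null independent null statistics."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        true_stat = rng.gauss(effect_z, 1.0) ** 2
        null_stats = sorted((rng.gauss(0.0, 1.0) ** 2
                             for _ in range(n_null)), reverse=True)
        if true_stat > null_stats[T - 1]:  # fewer than T nulls beat it
            hits += 1
    return hits / n_sim
```

Varying `effect_z`, `n_null`, and `T` in such a simulation reproduces the qualitative behavior the abstract describes: DP grows with effect size and shrinks as the pool of null SNPs grows relative to T.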

15.
A key priority in infectious disease research is to understand the ecological and evolutionary drivers of viral diseases from data on disease incidence as well as viral genetic and antigenic variation. We propose using a simulation-based, Bayesian method known as Approximate Bayesian Computation (ABC) to fit and assess phylodynamic models that simulate pathogen evolution and ecology against summaries of these data. We illustrate the versatility of the method by analyzing two spatial models describing the phylodynamics of interpandemic human influenza virus subtype A(H3N2). The first model captures antigenic drift phenomenologically with continuously waning immunity, and the second epochal evolution model describes the replacement of major, relatively long-lived antigenic clusters. Combining features of long-term surveillance data from the Netherlands with features of influenza A (H3N2) hemagglutinin gene sequences sampled in northern Europe, key phylodynamic parameters can be estimated with ABC. Goodness-of-fit analyses reveal that the irregularity in interannual incidence and H3N2's ladder-like hemagglutinin phylogeny are quantitatively only reproduced under the epochal evolution model within a spatial context. However, the concomitant incidence dynamics result in a very large reproductive number and are not consistent with empirical estimates of H3N2's population level attack rate. These results demonstrate that the interactions between the evolutionary and ecological processes impose multiple quantitative constraints on the phylodynamic trajectories of influenza A(H3N2), so that sequence and surveillance data can be used synergistically. ABC, one of several data synthesis approaches, can easily interface a broad class of phylodynamic models with various types of data but requires careful calibration of the summaries and tolerance parameters.
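The ABC idea itself is compact enough to sketch: draw from the prior, simulate summaries, keep draws close to the observed ones (a bare-bones rejection sampler; the paper's phylodynamic simulator and summary choices are far richer, and the tolerance calibration it stresses is not shown):

```python
import random

def abc_rejection(simulate, observed, prior_sample, distance,
                  n_draws=10000, tol=0.1, seed=0):
    """ABC rejection sampler: draw theta from the prior, simulate
    summary statistics, and keep draws whose summaries land within
    `tol` of the observed summaries. The accepted draws approximate
    the posterior distribution of theta."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if distance(simulate(theta, rng), observed) <= tol:
            accepted.append(theta)
    return accepted
```

Shrinking `tol` sharpens the posterior approximation at the cost of fewer acceptances, which is precisely the calibration trade-off the abstract's closing sentence refers to.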

16.
We consider the problem of using permutation-based methods to test for treatment–covariate interactions from randomized clinical trial data. Testing for interactions is common in the field of personalized medicine, as subgroups with enhanced treatment effects arise when treatment-by-covariate interactions exist. Asymptotic tests can often be performed for simple models, but in many cases more complex methods are used to identify subgroups, non-standard test statistics are proposed, and asymptotic results may be difficult to obtain. In such cases, it is natural to consider permutation-based tests, which shuffle selected parts of the data in order to remove one or more associations of interest; however, in the case of interactions, it is generally not possible to remove only the associations of interest by simple permutations of the data. We propose a number of alternative permutation-based methods designed to remove only the associations of interest while preserving other associations. These methods estimate the interaction term in a model, then create data that "looks like" the original data except that the interaction term has been permuted. The proposed methods are shown to outperform traditional permutation methods in a simulation study. In addition, the proposed methods are illustrated using data from a randomized clinical trial of patients with hypertension.
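The core trick, estimate the interaction contribution, permute only that component, and add it back, can be sketched for a single covariate (the linear-model form y = b0 + b1·treat + b2·x + b3·treat·x and all names here are our illustrative assumptions; the paper covers more general models):

```python
import random

def permute_interaction(y, treat, x, b3_hat, rng):
    """Subtract the fitted interaction term b3_hat * treat * x from each
    response, permute that component across subjects, and add it back.
    Main effects of treatment and covariate are preserved, while the
    treatment-covariate interaction is broken."""
    inter = [b3_hat * t * xi for t, xi in zip(treat, x)]
    resid = [yi - ii for yi, ii in zip(y, inter)]
    perm = inter[:]
    rng.shuffle(perm)
    return [ri + pi for ri, pi in zip(resid, perm)]
```

Recomputing the interaction test statistic on many such datasets yields a permutation null that targets only the interaction, which plain label-shuffling cannot do.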

17.
Wang H, Lu HH, Chueh TH. PLoS ONE 2011, 6(6): e20074
Networks are widely used in biology to represent the relationships between genes and gene functions. In Boolean biological models, it is mainly assumed that a gene has two states: on and off. It is typically assumed that the relationship between two genes can be characterized by two kinds of pairwise relationships: similarity and prerequisite. Many approaches have been proposed in the literature to reconstruct biological relationships. In this article, we propose a two-step method to reconstruct the biological pathway when the binary array data have measurement error. For a pair of genes in a sample, the first step of this approach is to assign counting numbers for every relationship and select relationships whose counting numbers exceed a threshold. The second step is to calculate the asymptotic p-values for hypotheses of possible relationships and select relationships with a large p-value. This new method has the advantages of easily computed counting numbers and simple closed forms for the p-values. A simulation study and a real data example show that the two-step counting method can accurately reconstruct the biological pathway and outperform existing methods, providing a more accurate and efficient alternative for reconstructing biological networks.

18.
Genome-wide association studies (GWAS) are now used routinely to identify SNPs associated with complex human phenotypes. In several cases, multiple variants within a gene contribute independently to disease risk. Here we introduce a novel Gene-Wide Significance (GWiS) test that uses greedy Bayesian model selection to identify the independent effects within a gene, which are combined to generate a stronger statistical signal. Permutation tests provide p-values that correct for the number of independent tests genome-wide and within each genetic locus. When applied to a dataset comprising 2.5 million SNPs in up to 8,000 individuals measured for various electrocardiography (ECG) parameters, this method identifies more validated associations than conventional GWAS approaches. The method also provides, for the first time, systematic assessments of the number of independent effects within a gene and the fraction of disease-associated genes housing multiple independent effects, observed at 35%-50% of loci in our study. This method can be generalized to other study designs, retains power for low-frequency alleles, and provides gene-based p-values that are directly compatible for pathway-based meta-analysis.

19.
The assessment of the effectiveness of a treatment in a clinical trial depends on calculating p-values. However, p-values are only indirect and partial indicators of a genuine effect. Particularly in situations where publication bias is very likely, assessment using a p-value of 0.05 may not be sufficiently cautious. In other situations it seems reasonable to believe that assessment based on p-values may be unduly conservative. Assessments could be improved by using prior information. This implies using a Bayesian approach to take account of prior probability. However, the use of prior information in the form of expert opinion can introduce bias. A method is given here that applies to assessments already included, or likely to be included, in the Cochrane Collaboration, excluding those reviews concerning new drugs. This method uses prior information and a Bayesian approach, but the prior information comes not from expert opinion but simply from the distribution of effectiveness apparent in a random sample of summary statistics in the Cochrane Collaboration. The method takes certain types of summary statistics and their confidence intervals and, with the help of a graph, translates these into probabilities that the treatments being trialled are effective.
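With a normal empirical prior and a normal likelihood for the trial's summary estimate, translating a result into a probability of effectiveness is a conjugate update (a sketch under those normality assumptions, with hypothetical names; the paper works graphically from Cochrane summary statistics and their confidence intervals):

```python
import math

def prob_effective(est, se, prior_mean, prior_sd):
    """Posterior probability that the true effect is beneficial (< 0,
    e.g. on a log odds-ratio scale): a conjugate normal-normal update
    combining an empirical prior with the trial's estimate and its
    standard error (recoverable from a confidence interval)."""
    w_prior = 1.0 / prior_sd ** 2   # precision of the empirical prior
    w_data = 1.0 / se ** 2          # precision of the trial estimate
    post_mean = (w_prior * prior_mean + w_data * est) / (w_prior + w_data)
    post_sd = (w_prior + w_data) ** -0.5
    # P(effect < 0) = Phi(-post_mean / post_sd)
    return 0.5 * math.erfc(post_mean / post_sd / math.sqrt(2.0))
```

A diffuse prior (large `prior_sd`) makes the output track the trial's own p-value, while a tight empirical prior centered at no effect shrinks weak results toward 0.5, the cautious behavior the abstract argues for.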

20.
MOTIVATION: The occurrence of false positives and false negatives in a microarray analysis could be easily estimated if the distribution of p-values were approximated and then expressed as a mixture of null and alternative densities. Essentially any distribution of p-values can be expressed as such a mixture by extracting a uniform density from it. AVAILABILITY: An S-plus function library is available from http://www.stjuderesearch.org/statistics.
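The uniform-extraction idea can be illustrated with a Storey-type estimate of the null mixing proportion pi0 (an illustrative stand-in, not the authors' S-plus library):

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the uniform (null) mixing proportion in the mixture
    f(p) = pi0 * 1 + (1 - pi0) * f1(p). P-values above `lam` are
    assumed to come almost entirely from the uniform component, so
    pi0 is about #{p > lam} / ((1 - lam) * m)."""
    m = len(pvalues)
    return min(1.0, sum(p > lam for p in pvalues) / ((1.0 - lam) * m))
```

Once pi0 is estimated, the expected number of false positives at any p-value cutoff alpha is roughly pi0 * m * alpha, which is exactly the false-positive accounting the abstract describes.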
