Similar Articles
20 similar articles retrieved (search time: 15 ms)
1.
MOTIVATION: We consider the detection of expressed genes and their comparison across experiments on high-density oligonucleotide microarrays. The results are summarized as detection calls and comparison calls, which should be robust against data outliers over a wide range of target concentrations. It is also helpful to provide parameters that the user can adjust to balance specificity and sensitivity under various experimental conditions. RESULTS: We present rank-based algorithms for making detection and comparison calls on expression microarrays. The detection call algorithm uses discrimination scores; the comparison call algorithm uses intensity differences. Both algorithms are based on Wilcoxon's signed-rank test. Several parameters in the algorithms can be adjusted by the user to alter the levels of specificity and sensitivity. The algorithms were developed and analyzed using spiked-in genes arrayed in a Latin square format. In the call process, p-values are calculated to give a confidence level for the pertinent hypotheses. For comparison calls made between two arrays, two primary normalization factors are defined. To overcome the difficulty that constant normalization factors do not fit all probe sets, we perturb these primary normalization factors and make increasing or decreasing calls only if all resulting p-values fall within a defined critical region. Our algorithms also automatically handle scanner saturation.
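A minimal sketch of such a rank-based detection call is given below, assuming a MAS 5.0-style discrimination score (PM - MM)/(PM + MM) for each probe pair; the threshold tau and the cutoffs alpha1/alpha2 are illustrative, user-adjustable parameters, not necessarily the algorithm's defaults.

    # Sketch of a detection call for one probe set (PM/MM intensity vectors).
    # Assumes the discrimination score (PM - MM)/(PM + MM); tau, alpha1 and
    # alpha2 are illustrative, user-adjustable parameters.
    detection.call <- function(pm, mm, tau = 0.015, alpha1 = 0.04, alpha2 = 0.06) {
      r <- (pm - mm) / (pm + mm)                       # discrimination scores
      # one-sided Wilcoxon signed-rank test of H0: median(r) = tau
      p <- wilcox.test(r, mu = tau, alternative = "greater", exact = FALSE)$p.value
      call <- if (p < alpha1) "Present" else if (p < alpha2) "Marginal" else "Absent"
      list(call = call, p.value = p)
    }

    # Example: 11 probe pairs whose PM intensities sit clearly above MM
    set.seed(1)
    pm <- rnorm(11, mean = 300, sd = 30)
    mm <- rnorm(11, mean = 150, sd = 30)
    detection.call(pm, mm)

Lowering alpha1 and alpha2 makes the Present call more specific; raising them makes it more sensitive, mirroring the user-adjustable trade-off described above.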

2.
MOTIVATION: Statistical tests for the detection of differentially expressed genes lead to a large collection of p-values, one for each gene comparison. Without any further adjustment, these p-values may lead to a large number of false positives, simply because the number of genes tested is huge, and thus to a waste of laboratory resources. To account for multiple hypotheses, these p-values are typically adjusted using a single-step method or a step-down method in order to achieve overall control of the error rate (the so-called familywise error rate). In many applications this strategy is overly conservative and too few genes are flagged. RESULTS: In this paper we introduce a novel empirical Bayes screening (EBS) technique to inspect a large number of p-values in an effort to detect additional positive cases. In effect, each case borrows strength from an overall picture of the alternative hypotheses computed from all the p-values, while the entire procedure is calibrated by a step-down method so that the familywise error rate at the complete null hypothesis is still controlled. It is shown that the EBS has substantially higher sensitivity than the standard step-down approach for multiple comparisons, at the cost of a modest increase in the false discovery rate (FDR). The EBS procedure also compares favorably with existing FDR control procedures for multiple testing. The EBS procedure is particularly useful in situations where it is important to identify all potentially positive cases, which can then be subjected to further confirmatory testing in order to eliminate the false positives. We illustrate this screening procedure using a data set on human colorectal cancer, where the EBS method detected additional genes related to colon cancer that were missed by other methods. This novel empirical Bayes procedure is advantageous over our earlier empirical Bayes adjustments for the following reasons: (i) it offers automatic screening of the p-values the user may obtain from a univariate (i.e., gene-by-gene) analysis package, making it extremely easy to use for a non-statistician; (ii) since it applies to the p-values, the tests do not have to be t-tests; in particular, they could be F-tests arising in certain ANOVA formulations with expression data, or even nonparametric tests; (iii) the empirical Bayes adjustment uses nonparametric function estimation techniques to estimate the marginal density of the transformed p-values, rather than a parametric model for the prior distribution, and is therefore robust against model mis-specification. AVAILABILITY: R code for EBS is available from the authors upon request. SUPPLEMENTARY INFORMATION: http://www.stat.uga.edu/~datta/EBS/supp.htm
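The abstract describes EBS only at a high level. The sketch below is a hedged illustration combining the two ingredients it names: a Holm step-down adjustment controlling the familywise error rate at the complete null, and a nonparametric (kernel) estimate of the marginal density of transformed p-values from which each gene "borrows strength". The probit transform, the density-ratio score and the screening cutoff are assumptions for illustration, not the authors' exact procedure.

    # Hedged sketch: step-down calibration plus an empirical-Bayes-style screen.
    # The probit transform, kernel density estimate and screening score are
    # illustrative assumptions, not the exact EBS procedure.
    ebs_screen_sketch <- function(p, fwer = 0.05, screen.cut = 2) {
      stepdown <- p.adjust(p, method = "holm") < fwer   # familywise-error control
      z <- qnorm(p, lower.tail = FALSE)                 # probit-transformed p-values
      f <- density(z)                                   # nonparametric marginal density
      fz <- approx(f$x, f$y, xout = z, rule = 2)$y
      score <- fz / dnorm(z)          # enrichment of the marginal over the N(0,1) null
      screened <- stepdown | (score > screen.cut & z > 0)
      data.frame(p = p, score = score, stepdown = stepdown, screened = screened)
    }

    # Example: 1000 null genes plus 50 genes with shifted test statistics
    set.seed(2)
    p <- c(runif(1000), pnorm(rnorm(50, mean = 3), lower.tail = FALSE))
    res <- ebs_screen_sketch(p)
    table(stepdown = res$stepdown, screened = res$screened)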

3.
We present a novel maximum-likelihood-based algorithm for estimating the distribution of alignment scores from the scores of unrelated sequences in a database search. Using a new method for measuring the accuracy of p-values, we show that our maximum-likelihood-based algorithm is more accurate than existing regression-based and lookup table methods. We explore a more sophisticated way of modeling and estimating the score distributions (using a two-component mixture model and expectation maximization), but conclude that this does not improve significantly over simply ignoring scores with small E-values during estimation. Finally, we measure the classification accuracy of p-values estimated in different ways and observe that inaccurate p-values can, somewhat paradoxically, lead to higher classification accuracy. We explain this paradox and argue that statistical accuracy, not classification accuracy, should be the primary criterion in comparisons of similarity search methods that return p-values that adjust for target sequence length.
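The abstract does not name the parametric family being fitted; local-alignment scores of unrelated sequences are classically modelled with an extreme-value (Gumbel) distribution, so the hedged sketch below fits a Gumbel by maximum likelihood with optim and converts scores to p-values. The choice of family and the starting values are assumptions, not the paper's exact model.

    # Hedged sketch: maximum-likelihood Gumbel fit to unrelated-sequence scores,
    # followed by p-value lookup. The Gumbel family is a common assumption for
    # local-alignment scores, not necessarily the model used in the paper.
    fit_gumbel <- function(scores) {
      negloglik <- function(par) {
        mu <- par[1]; beta <- exp(par[2])        # beta kept positive on the log scale
        z <- (scores - mu) / beta
        sum(log(beta) + z + exp(-z))             # negative log Gumbel likelihood
      }
      est <- optim(c(median(scores), log(sd(scores))), negloglik)$par
      c(mu = est[1], beta = exp(est[2]))
    }

    gumbel_pvalue <- function(score, fit) {
      # P(S >= score) = 1 - exp(-exp(-(score - mu)/beta))
      unname(1 - exp(-exp(-(score - fit["mu"]) / fit["beta"])))
    }

    # Example: simulated null scores, then the p-value of an observed score of 60
    set.seed(3)
    null.scores <- 20 + 8 * (-log(-log(runif(5000))))   # synthetic Gumbel null scores
    fit <- fit_gumbel(null.scores)
    gumbel_pvalue(60, fit)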

4.

Background

Gender differences in gene expression were estimated in liver samples from 9 males and 9 females. The study tested 31,110 genes for a gender difference using a design that adjusted for sources of variation associated with cDNA arrays, normalization, hybridizations and processing conditions.

Results

The genes were split into 2,800 that were clearly expressed ('expressed' genes) and 28,310 whose expression levels were in the background range ('not expressed' genes). The distribution of p-values from the 'not expressed' group was consistent with no gender differences. The distribution of p-values from the 'expressed' group suggested that 8% of these genes differed by gender, but the estimated fold-changes (expression in males / expression in females) were small. The largest observed fold-change was 1.55. The 95% confidence bounds on the estimated fold-changes were below 1.4-fold for 79.3% of these genes, and few (1.1%) exceeded 2-fold.

Conclusion

Observed gender differences in gene expression were small. When genes with gender differences were selected on the basis of their p-values, false discovery rates exceeded 80% for any selected set of genes, making it essentially impossible to identify specific genes with a gender difference.

5.
6.
In the search for genes associated with disease, statistical analysis is key to reproducible results. To avoid a plethora of type I errors, classical gene selection procedures strike a balance between the magnitude and the precision of observed effects in terms of p-values. Controlling the false discovery rate recovers some power but still ranks genes according to classical p-values. In contrast, we propose a selection procedure driven by the concern to detect well-specified, important alternatives. By summarizing evidence from the perspective of both the null and such an alternative hypothesis, genes line up in a substantially different order, with different genes yielding powerful signals. Gene selection is determined by a cutoff point on a measure of relative evidence that balances the standard p-value, p0, against its counterpart, p1, derived from the perspective of the target alternative. We find the cutoff point that maximizes an expected specific gain. This yields an optimal decision which exploits gene-specific variances and thus involves different type I and type II errors across genes. We show the dramatic impact of this alternative perspective on the detection of differentially expressed genes in hereditary breast cancer. Our analysis does not rely on parametric assumptions about the data.
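As a hedged illustration of the idea, the sketch below computes, for a one-sided per-gene z-statistic, the classical p-value p0 under the null (effect = 0) and a counterpart p1 under a fixed target alternative (effect = delta), and combines them into a simple relative-evidence score. The log-ratio used here and the fixed delta are illustrative assumptions; the paper derives its own measure and an optimal cutoff.

    # p0: evidence against the null; p1: counterpart under the target alternative.
    # The log-ratio 'evidence' and the fixed target delta are illustrative only.
    relative_evidence <- function(estimate, se, delta = 1) {
      z  <- estimate / se
      p0 <- pnorm(z, lower.tail = FALSE)        # classical one-sided p-value under H0
      p1 <- pnorm(z, mean = delta / se)         # counterpart from the target alternative
      data.frame(p0 = p0, p1 = p1, evidence = log(p1 / p0))
    }

    # Two genes with the same estimated effect but different variances rank
    # differently once the alternative perspective is taken into account.
    relative_evidence(estimate = c(0.8, 0.8), se = c(0.3, 0.6), delta = 1)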

7.
Wu X, Naiman DQ. Human Heredity 2005;59(4):190-200
A standard approach to the calculation of critical values for affected-sib-pair multiple testing is based on: (a) fully informative markers, (b) Haldane map function assumptions leading to a Markov chain model for inheritance vectors, (c) a central limit approximation to averages of sampled inheritance vectors leading to an Ornstein-Uhlenbeck process approximation, and (d) simple approximations to the maximum of such a process. Under these assumptions, and with equispaced or nearly equispaced markers, an approximation is available for large samples that is easy to calculate and performs well. However, for small sample sizes, a large number of markers, and small p-values, there is good reason to be cautious about the use of the Gaussian approximation. We develop an algorithm for the calculation of multiple-testing p-values based on the standard Markov chain model, avoiding the Gaussian (large-sample) approximation. We illustrate the use of this algorithm by demonstrating some inadequacies of the Gaussian approximation.
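One simple, simulation-based way to avoid the Gaussian approximation is to draw the null distribution of the maximum statistic directly from the Markov chain model for IBD sharing. The hedged sketch below does this for affected sib pairs under the Haldane map function (per-parent sharing switches between adjacent markers with probability 2*theta*(1-theta)); the paper's algorithm computes the p-value without simulation, so this only illustrates the underlying model.

    # Null distribution of the maximum standardized IBD-sharing statistic for
    # affected sib pairs, simulated from the Markov chain model (Haldane map).
    # This is a Monte Carlo illustration, not the paper's direct algorithm.
    simulate_max_stat <- function(n.pairs, d.cM, n.sim = 2000) {
      theta <- (1 - exp(-2 * d.cM / 100)) / 2    # Haldane: cM -> recombination fraction
      flip  <- 2 * theta * (1 - theta)           # per-parent sharing switch probability
      n.mark <- length(d.cM) + 1
      one.parent <- function() {
        s <- matrix(0L, n.pairs, n.mark)
        s[, 1] <- rbinom(n.pairs, 1, 0.5)
        for (j in 2:n.mark) {
          sw <- rbinom(n.pairs, 1, flip[j - 1]) == 1
          s[, j] <- ifelse(sw, 1L - s[, j - 1], s[, j - 1])
        }
        s
      }
      replicate(n.sim, {
        ibd <- one.parent() + one.parent()       # IBD count in {0, 1, 2} per pair/marker
        z <- (colMeans(ibd) - 1) / sqrt(0.5 / n.pairs)
        max(z)
      })
    }

    # Example: 100 sib pairs, 11 markers spaced 10 cM apart; multiple-testing
    # p-value for an observed maximum statistic of 3.2
    null.max <- simulate_max_stat(n.pairs = 100, d.cM = rep(10, 10))
    mean(null.max >= 3.2)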

8.
MOTIVATION: Expressed sequence tag (EST) data reflect variation in gene expression, but previous methods for finding coexpressed genes in EST data are subject to bias and vastly overstate the statistical significance of putatively coexpressed genes. RESULTS: We introduce a new method (LNP) that reports reasonable p-values and also detects more biological relationships in human dbEST than previous methods do. In simulations with human dbEST library sizes, previous methods report p-values as low as 10^-30 for 1 in 1000 uncorrelated pairs, while LNP reports significance correctly. We validate the analysis on real human genes by comparing coexpressed pairs with Gene Ontology annotations and find that LNP is more sensitive than the three previous methods. We also find a small but statistically significant level of coexpression between interacting proteins relative to randomized controls. The LNP method is based on a log-normal prior on the distribution of expression levels.

9.
DNA microarrays are widely used in biochemical research. This technology allows the expression levels of thousands of genes to be monitored simultaneously. Often, the goal of a microarray study is to find differentially expressed genes in two different types of tissue, for example normal and cancerous tissue. Multiple hypothesis testing is a useful statistical tool for such studies. One approach based on multiple hypothesis testing is the nonparametric analysis of replicated microarray experiments. In this paper we present an improved version of this method. We also show how p-values are calculated for all significant genes detected with this testing procedure. All algorithms were implemented in an R package, and instructions on its use are included. The package can be downloaded at http://www.statistik.unidortmund.de/de/content/einrichtungen/lehrstuehle/personen/jung.html
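The package implements a specific nonparametric procedure; as a generic, hedged illustration of per-gene p-values for replicated two-group data, the sketch below uses a label-permutation test with a Benjamini-Hochberg adjustment.

    # Generic per-gene permutation test for replicated two-group microarray data.
    # This is an illustration only, not the procedure implemented in the package.
    permutation_pvalues <- function(expr, group, B = 1000) {
      # expr: genes x samples matrix; group: vector with two labels
      diff.stat <- function(x, g) abs(diff(tapply(x, g, mean)))
      obs <- apply(expr, 1, diff.stat, g = group)
      exceed <- numeric(nrow(expr))
      for (b in 1:B) {
        g <- sample(group)                                # permute sample labels
        exceed <- exceed + (apply(expr, 1, diff.stat, g = g) >= obs)
      }
      p <- (exceed + 1) / (B + 1)                         # permutation p-values
      data.frame(p = p, p.adj = p.adjust(p, method = "BH"))
    }

    # Example: 200 genes, 5 normal and 5 tumour samples, 10 genes truly shifted
    set.seed(4)
    expr <- matrix(rnorm(200 * 10), nrow = 200)
    expr[1:10, 6:10] <- expr[1:10, 6:10] + 2
    group <- rep(c("normal", "tumour"), each = 5)
    head(permutation_pvalues(expr, group, B = 500))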

10.
11.
We analyzed 12 combined mitochondrial and nuclear gene datasets in seven orders of insects using both equal-weights parsimony (to evaluate phylogenetic utility) and Bayesian methods (to investigate substitution patterns). For the Bayesian analyses we used relatively complex models (e.g., general time-reversible models with rate variation) that allowed us to quantitatively compare relative rates among genes and codon positions, patterns of rate variation among genes, and substitution patterns within genes. Our analyses indicate that nuclear and mitochondrial genes differ in a number of important ways, some of which are correlated with phylogenetic utility. First and most obviously, nuclear genes generally evolve more slowly than mitochondrial genes (except in one case), making them better markers for deep divergences. Second, nuclear genes showed universally high values of the consistency index (CI) and generally contributed more to overall tree resolution than mitochondrial genes (as measured by partitioned Bremer support). Third, nuclear genes show more homogeneous patterns of among-site rate variation (higher values of alpha) than mitochondrial genes. Finally, nuclear genes show more symmetrical transformation rate matrices than mitochondrial genes. The combination of low values of alpha and highly asymmetrical transformation rate matrices may explain the overall poor performance of mitochondrial genes when compared with nuclear genes in the same analysis. Our analyses indicate that some parameters are highly correlated. For example, A/T bias was positively and significantly associated with relative rate, and CI was positively and significantly associated with alpha (the shape parameter of the gamma distribution). These results provide important insights into the substitution patterns that might characterize high-quality genes for phylogenetic analysis: high values of alpha, unbiased base composition, and symmetrical transformation rate matrices. We argue that insect molecular systematists should increasingly focus on nuclear rather than mitochondrial gene datasets, because nuclear genes do not suffer from the same substitutional biases that characterize mitochondrial genes.

12.
Filtering is a common practice used to simplify the analysis of microarray data by removing from further consideration probe sets believed to be unexpressed. The m/n filter, which is widely used in the analysis of Affymetrix data, removes all probe sets having fewer than m present calls among a set of n chips. The m/n filter has been widely used without consideration of its statistical properties; here, its level and power are derived. Two alternative filters, the pooled p-value filter and the error-minimizing pooled p-value filter, are proposed. The pooled p-value filter combines information from the present-absent p-values into a single summary p-value, which is then compared with a selected significance threshold. We show that the pooled p-value filter is the uniformly most powerful statistical test under a reasonable beta model and that it exhibits greater power than the m/n filter in all scenarios considered in a simulation study. The error-minimizing pooled p-value filter compares the summary p-value with a threshold determined to minimize a total-error criterion based on a partition of the distribution of all probes' summary p-values. The pooled p-value and error-minimizing pooled p-value filters clearly perform better than the m/n filter in a case-study analysis. The case-study analysis also demonstrates a proposed method for estimating the number of differentially expressed probe sets excluded by filtering and the subsequent impact on the final analysis. The filter impact analysis shows that the use of even the best filter may hinder, rather than enhance, the ability to discover interesting probe sets or genes. S-plus and R routines implementing the pooled p-value and error-minimizing pooled p-value filters have been developed and are available from www.stjuderesearch.org/depts/biostats/index.html.
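As a hedged illustration of the pooling idea, the sketch below combines a probe set's per-chip detection p-values into one summary p-value with Fisher's method and compares the resulting filter with an m/n present-call filter. The paper derives a specific pooled statistic under a beta model; Fisher's combination is used here only as a simple stand-in.

    # Pooled p-value filter (Fisher combination as a stand-in) versus m/n filter.
    pooled_filter <- function(detect.p, threshold = 0.05) {
      # detect.p: probe sets x chips matrix of detection (present/absent) p-values
      stat <- -2 * rowSums(log(detect.p))
      pooled.p <- pchisq(stat, df = 2 * ncol(detect.p), lower.tail = FALSE)
      pooled.p < threshold                       # TRUE = retain the probe set
    }

    mn_filter <- function(detect.p, m = 3, alpha = 0.05) {
      rowSums(detect.p < alpha) >= m             # at least m present calls out of n
    }

    # Example: one clearly expressed and one borderline probe set on 5 chips
    detect.p <- rbind(expressed  = c(0.001, 0.002, 0.010, 0.005, 0.020),
                      borderline = c(0.040, 0.200, 0.030, 0.300, 0.060))
    cbind(pooled = pooled_filter(detect.p), m.of.n = mn_filter(detect.p))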

13.
Groupwise functional analysis of gene variants is becoming standard in next-generation sequencing studies. As the function of many genes is unknown and their classification into pathways is scant, functional associations between genes are often inferred from large-scale omics data. Such data types, including protein-protein interactions and gene co-expression networks, are used to examine the interrelations of the implicated genes. Statistical significance is assessed by comparing the interconnectedness of the mutated genes with that of random gene sets. However, interconnectedness can be affected by confounding bias, potentially resulting in false positive findings. We show that genes implicated through de novo sequence variants are biased towards longer coding sequences and that longer genes tend to cluster together, which leads to exaggerated significance in functional studies; we present here an integrative method that addresses this bias. To discern molecular pathways relevant to complex disease, we have inferred functional associations between human genes from diverse data types and assessed them with a novel phenotype-based method. Examining the functional associations between de novo gene variants, we control for the heretofore unexplored confounding bias in coding-sequence length. We test different data types and networks and find that the disease-associated genes cluster more significantly in an integrated phenotypic-linkage network than in other gene networks. We present a tool of superior power for identifying functional associations among genes mutated in the same disease, even after accounting for significant sequencing-study bias, and demonstrate the suitability of this method for functionally clustering the variant genes underlying polygenic disorders.

14.
Robust estimation of the false discovery rate
MOTIVATION: Presently available methods that use p-values to estimate or control the false discovery rate (FDR) implicitly assume that the p-values are continuously distributed and based on two-sided tests. Therefore, it is difficult to reliably estimate the FDR when p-values are discrete or based on one-sided tests. RESULTS: A simple and robust method to estimate the FDR is proposed. The proposed method does not rely on the implicit assumptions that tests are two-sided or yield continuously distributed p-values. The method is proven to be conservative and to have desirable large-sample properties. In addition, it was among the best performers across a series of 'real data simulations' comparing the performance of five currently available methods. AVAILABILITY: Libraries of S-plus and R routines to implement the method are freely available from www.stjuderesearch.org/depts/biostats.
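For orientation, the sketch below shows the simplest conservative plug-in FDR estimate, FDR(t) = pi0 * m * t / #{p_i <= t}, with pi0 fixed at 1. The contribution of the paper is a refined, robust estimator that remains reliable for discrete or one-sided p-values; that refinement is not reproduced here.

    # Simple conservative plug-in FDR estimate (pi0 fixed at 1); illustration
    # only, not the robust estimator proposed in the paper.
    fdr_plugin <- function(p, t, pi0 = 1) {
      m <- length(p)
      r <- max(1, sum(p <= t))                   # rejections at threshold t
      min(1, pi0 * m * t / r)
    }

    # Example: estimated FDR when rejecting all p-values below 0.01
    set.seed(5)
    p <- c(runif(950), rbeta(50, 0.5, 20))       # mostly null, some small p-values
    fdr_plugin(p, t = 0.01)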

15.
16.

Background  

This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, (a) two-way clustering and (b) combined adaptive ranking, to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are selected using FDR analysis. The third module performs a three-fold validation of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using false discovery rate analysis. In addition, a clustering-based validation of the DEGs is performed by employing an adaptive subspace-based clustering algorithm on the training and test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework.
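As a hedged illustration of the fusion step only, the sketch below combines two per-gene p-values (one from each ranking algorithm) with Fisher's omnibus statistic and selects DEGs by an FDR cutoff; the conversion of ranks to p-values via the R-test is not reproduced here.

    # Fuse two per-gene p-values with Fisher's omnibus criterion and select DEGs
    # by FDR. The upstream rank-to-p-value conversion is not shown.
    fisher_fuse <- function(p1, p2, fdr = 0.05) {
      stat <- -2 * (log(p1) + log(p2))           # Fisher's omnibus statistic
      p.fused <- pchisq(stat, df = 4, lower.tail = FALSE)
      data.frame(p.fused = p.fused, deg = p.adjust(p.fused, method = "BH") < fdr)
    }

    # Example: a gene ranked highly by both algorithms versus one that is not
    fisher_fuse(p1 = c(0.0004, 0.30), p2 = c(0.002, 0.45))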

17.
We present a new method to efficiently estimate very large numbers of p-values using empirically constructed null distributions of a test statistic. The need to evaluate very large numbers of p-values is increasingly common with modern genomic data, and when interaction effects are of interest the number of tests can easily run into the billions. When the asymptotic distribution is not readily available, permutations are typically used to obtain p-values, but these can be computationally infeasible in large problems. Our method constructs a prediction model to obtain a first approximation to the p-values and uses Bayesian methods to choose a fraction of these to be refined by permutations. We apply and evaluate our method on the study of association between two-way interactions of genetic markers and colorectal cancer, using data from the first phase of a large genome-wide case-control study. The results show enormous computational savings compared with evaluating a full set of permutations, with little decrease in accuracy.
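The prediction model and the Bayesian selection rule are not described in the abstract; the hedged sketch below only illustrates the two-stage idea, using an asymptotic chi-square p-value as the cheap first pass and a simple cutoff (in place of the Bayesian choice) to decide which tests are refined by permutation.

    # Two-stage p-values: cheap asymptotic approximation for every test,
    # permutations only for the tests flagged for refinement. The cutoff rule
    # stands in for the paper's prediction model and Bayesian selection.
    two_stage_pvalues <- function(stat, perm_fun, cutoff = 0.05, B = 200) {
      p.approx <- pchisq(stat, df = 1, lower.tail = FALSE)
      refine <- which(p.approx < cutoff)
      p.final <- p.approx
      for (i in refine) {
        null.stat <- perm_fun(i, B)                       # permutation null for test i
        p.final[i] <- (sum(null.stat >= stat[i]) + 1) / (B + 1)
      }
      data.frame(p.approx = p.approx, p.final = p.final,
                 refined = seq_along(stat) %in% refine)
    }

    # Toy example: case/control status versus 200 binary markers
    set.seed(6)
    n <- 500
    status <- rbinom(n, 1, 0.5)
    markers <- matrix(rbinom(n * 200, 1, 0.3), nrow = n)
    chisq_stat <- function(x, y) suppressWarnings(chisq.test(table(y, x), correct = FALSE)$statistic)
    stat <- apply(markers, 2, chisq_stat, y = status)
    perm_fun <- function(i, B) replicate(B, chisq_stat(markers[, i], sample(status)))
    head(two_stage_pvalues(stat, perm_fun))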

18.
In the Haseman-Elston approach, the squared phenotypic difference of a sib pair is regressed on the proportion of alleles shared identical by descent (IBD) in order to map a quantitative trait to a genetic marker. In applications, the IBD distribution is estimated and usually cannot be determined uniquely owing to incomplete marker information. At Genetic Analysis Workshop (GAW) 13, Jacobs et al. [BMC Genet 2003, 4(Suppl 1):S82] proposed to improve the power of the Haseman-Elston algorithm by weighting for the information available from the marker genotypes. The authors did not, however, show the validity of the employed asymptotic distribution. In this paper, we use the simulated data provided for GAW 14 and show that weighting Haseman-Elston by marker information results in increased type I error rates. Specifically, we demonstrate that the number of significant findings across the chromosome is significantly increased with the weighting schemes, whereas the classical Haseman-Elston method keeps its nominal significance level when applied to the same data. We therefore recommend using Haseman-Elston with marker-informativity weights only in conjunction with empirical p-values. Whether this approach in fact yields an increase in power needs to be investigated further.
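For reference, the classical Haseman-Elston regression named above is sketched below: the squared sib-pair phenotype difference is regressed on the estimated IBD proportion, and linkage corresponds to a negative slope (one-sided test). The simulated data are for illustration only.

    # Classical Haseman-Elston regression: squared phenotype difference on IBD
    # proportion; a negative slope indicates linkage (one-sided test).
    haseman_elston <- function(y1, y2, pihat) {
      fit <- lm(I((y1 - y2)^2) ~ pihat)
      coefs <- summary(fit)$coefficients["pihat", ]
      c(slope = unname(coefs["Estimate"]),
        p.one.sided = unname(pt(coefs["t value"], df = fit$df.residual)))
    }

    # Example: 300 sib pairs whose trait correlation increases with IBD sharing
    set.seed(7)
    n <- 300
    pihat <- sample(c(0, 0.5, 1), n, replace = TRUE, prob = c(0.25, 0.5, 0.25))
    shared <- rnorm(n)
    g1 <- sqrt(pihat) * shared + sqrt(1 - pihat) * rnorm(n)
    g2 <- sqrt(pihat) * shared + sqrt(1 - pihat) * rnorm(n)
    haseman_elston(y1 = g1 + rnorm(n), y2 = g2 + rnorm(n), pihat = pihat)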

19.
20.
alpha(1)-Proteinase inhibitor (alpha(1)-PI) is a member of the serpin superfamily of serine proteinase inhibitors, which function in maintaining homeostasis through regulation of numerous proteolytic processes. In laboratory mice (Mus musculus domesticus), alpha(1)-PI occurs in multiple isoforms encoded by a family of three to five genes that are polymorphic among inbred strains and that are located at the Serpina1 locus on chromosome 12. In the present study, we have characterized the alpha(1)-PI gene family of inbred mice in more detail. We show that mice express seven isoforms, all of which are encoded by genes that map to the Serpina1 locus. In addition, polymorphism at the locus is defined by three haplotypes (Serpina1(b), Serpina1(c), and Serpina1(l)) that differ with regard to both the number and identity of alpha(1)-PI genes. Finally, we present the complete sequence of an 84-kb region of Serpina1 containing a tandem repeat of two alpha(1)-PI genes.

