Similar documents
20 similar documents found (search time: 15 ms)
1.
Recently, Efron (2007) provided methods for assessing the effect of correlation on the false discovery rate (FDR) in large-scale testing problems in the context of microarray data. Although the FDR procedure does not require independence of the tests, the presence of correlation can grossly under- or overestimate the number of critical genes. Here, we briefly review Efron's method and apply it to a relatively small mass spectrometry proteomics data set. We show that even here correlation can affect the FDR values and the number of proteins declared critical.
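A minimal simulation sketch of the phenomenon described above (not Efron's empirical-null method itself): all hypotheses are null, and a shared latent factor induces correlation rho between z-values. The count of Benjamini-Hochberg rejections becomes far more variable as rho grows, so any single data set can grossly under- or overestimate the number of critical features. All parameter values are illustrative.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
m, reps = 1000, 200                      # number of tests, Monte Carlo replicates
for rho in (0.0, 0.5):                   # independence vs. strong common correlation
    counts = []
    for _ in range(reps):
        shared = rng.standard_normal()   # latent factor shared by all tests
        z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal(m)
        p = 2 * stats.norm.sf(np.abs(z))
        reject = multipletests(p, alpha=0.10, method="fdr_bh")[0]
        counts.append(reject.sum())      # every hypothesis is null, so all are false
    print(f"rho={rho}: mean rejections={np.mean(counts):.1f}, SD={np.std(counts):.1f}")
```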

2.
In biostatistical practice, it is common to use information criteria as a guide for model selection. We propose new versions of the focused information criterion (FIC) for variable selection in logistic regression. The FIC gives, depending on the quantity to be estimated, possibly different sets of selected variables. The standard version of the FIC measures the mean squared error of the estimator of the quantity of interest in the selected model. In this article, we propose more general versions of the FIC, allowing other risk measures such as one based on the L_p error. When prediction of an event is important, as is often the case in medical applications, we construct an FIC using the error rate as a natural risk measure. The advantages of using an information criterion that depends on both the quantity of interest and the selected risk measure are illustrated by means of a simulation study and an application to a study on diabetic retinopathy.
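The FIC formula itself depends on the focus parameter and is not reproduced here; the sketch below only illustrates the underlying idea of scoring candidate logistic submodels by a user-chosen risk measure — here the cross-validated error rate, the natural risk for event prediction mentioned above. The data and all settings are synthetic stand-ins.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data: 6 candidate predictors, binary outcome
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

best = None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        err = 1 - cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, list(subset)], y, cv=5, scoring="accuracy").mean()
        if best is None or err < best[0]:
            best = (err, subset)
print("lowest CV error rate %.3f with variables %s" % best)
```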

3.
Mass spectrometric profiling approaches such as MALDI-TOF and SELDI-TOF are increasingly being used in disease marker discovery, particularly in the lower molecular weight proteome. However, little consideration has been given to the issue of sample size in experimental design. The aim of this study was to develop a protocol for the use of sample size calculations in proteomic profiling studies using MS. These sample size calculations can be based on a simple linear mixed model, which allows the inclusion of estimates of the biological and technical variation inherent in the experiment. The use of a pilot experiment to estimate these components of variance is investigated and is shown to work well when compared with larger studies. Examination of data from a number of studies using different sample types and different chromatographic surfaces shows the need for sample- and preparation-specific sample size calculations.
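A sketch of the kind of calculation described: under a linear mixed model with biological variance sigma2_bio and technical variance sigma2_tech estimated from a pilot experiment, a subject-level average over m_tech technical replicates has variance sigma2_bio + sigma2_tech/m_tech, which plugs into the standard two-group normal sample size formula. The variance components and effect size below are hypothetical.

```python
import math

from scipy import stats

def samples_per_group(delta, sigma2_bio, sigma2_tech, m_tech=2, alpha=0.05, power=0.80):
    """Biological replicates per group for a two-group comparison of peak intensity.

    Assumes normality and pilot-based variance components; averaging m_tech
    technical replicates per subject gives per-subject variance
    sigma2_bio + sigma2_tech / m_tech.
    """
    z = stats.norm.ppf
    var_eff = sigma2_bio + sigma2_tech / m_tech
    n = 2 * var_eff * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2
    return math.ceil(n)

# hypothetical pilot estimates: biological SD 0.8, technical SD 0.4, shift of 0.5 to detect
print(samples_per_group(delta=0.5, sigma2_bio=0.64, sigma2_tech=0.16, m_tech=2))
```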

4.
Yang X, Belin TR, Boscardin WJ. Biometrics 2005, 61(2):498-506.
Across multiply imputed data sets, variable selection methods such as stepwise regression and other criterion-based strategies that include or exclude particular variables typically result in models with different selected predictors, thus presenting a problem for combining the results from separate complete-data analyses. Here, drawing on a Bayesian framework, we propose two alternative strategies to address the problem of choosing among linear regression models when there are missing covariates. One approach, which we call "impute, then select" (ITS), involves initially performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. A second strategy is to conduct Bayesian variable selection and missing data imputation simultaneously within one Gibbs sampling process, which we call "simultaneously impute and select" (SIAS). The methods are implemented and evaluated using the Bayesian procedure known as stochastic search variable selection for multivariate normal data sets, but both strategies offer general frameworks within which different Bayesian variable selection algorithms could be used for other types of data sets. A study of mental health services utilization among children in foster care programs is used to illustrate the techniques. Simulation studies show that both ITS and SIAS outperform complete-case analysis with stepwise variable selection and that SIAS slightly outperforms ITS.
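A rough sketch of the "impute, then select" idea, with two loud substitutions: scikit-learn's IterativeImputer stands in for proper multiple imputation, and a cross-validated lasso stands in for Bayesian stochastic search variable selection. The combination rule (a majority vote over the variables selected per imputed data set) is likewise only illustrative, not the authors' procedure.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

def impute_then_select(X, y, n_imputations=5, threshold=0.5):
    """ITS sketch: vote over the variables selected in each imputed data set."""
    votes = np.zeros(X.shape[1])
    for seed in range(n_imputations):
        # sample_posterior=True yields distinct stochastic imputations per seed
        X_imp = IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        votes += LassoCV(cv=5).fit(X_imp, y).coef_ != 0
    return np.flatnonzero(votes / n_imputations >= threshold)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X[:, 0] - 2 * X[:, 3] + rng.standard_normal(200)
X[rng.random(X.shape) < 0.1] = np.nan        # 10% of covariate values missing at random
print(impute_then_select(X, y))              # ideally recovers [0, 3]
```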

5.
Cheung YK. Biometrics 2008, 64(3):940-949.
Summary. In situations where many regimens are possible candidates for a large phase III study but too few resources are available to evaluate each relative to the standard, conducting a multi-armed randomized selection trial is a useful strategy for removing inferior treatments from further consideration. When the study has a relatively quick endpoint, such as an imaging-based lesion volume change in acute stroke patients, frequent interim monitoring of the trial is ethically and practically appealing to clinicians. In this article, I propose a class of sequential selection boundaries for multi-armed clinical trials, in which the objective is to select a treatment with a clinically significant improvement over the control group, or to declare futility if no such treatment exists. The proposed boundaries are easy to implement in a blinded fashion and can be applied on a flexible monitoring schedule in terms of calendar time. Design calibration with respect to prespecified levels of confidence is simple and can be accomplished even when the response rate of the control group is known only up to an interval. One of the proposed methods is applied to redesign a selection trial with an imaging endpoint in acute stroke patients and is compared to an optimal two-stage design via simulations: the proposed method requires a smaller sample size on average than the two-stage design, and this advantage is substantial when there is in fact a treatment superior to the control group.
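The boundaries themselves are specific to the paper; the skeleton below only illustrates the general shape of such a design — arms are compared with the control at interim looks, unpromising arms are dropped, and the best survivor is selected at the end if it beats the control by a margin. The futility rule, margin, and response rates are hypothetical placeholders, not the proposed boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)
p_control, p_arms = 0.30, [0.30, 0.35, 0.50]    # hypothetical response rates
delta, n_look, n_looks = 0.15, 20, 4            # selection margin; patients/arm/look

n_sel = 0
for _ in range(2000):
    active = list(range(len(p_arms)))
    x_ctl, x_arm, n = 0, [0] * len(p_arms), 0
    for _ in range(n_looks):
        n += n_look
        x_ctl += rng.binomial(n_look, p_control)
        for i in active:
            x_arm[i] += rng.binomial(n_look, p_arms[i])
        active = [i for i in active if x_arm[i] >= x_ctl]   # placeholder futility rule
        if not active:
            break                                           # declare futility
    if active:
        best = max(active, key=lambda i: x_arm[i])
        if (x_arm[best] - x_ctl) / n > delta:
            n_sel += best == 2                              # arm 2 is truly superior here
print("P(select the superior arm):", n_sel / 2000)
```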

6.
In health services and outcomes research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider maximizing the likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via coordinate descent. Furthermore, statistical properties, including standard error formulae, are provided. A simulation study shows that the new algorithm not only gives more accurate, or at least comparable, estimates but is also more robust than traditional stepwise variable selection. The proposed methods are applied to analyze the health care demand in Germany using the open-source R package mpath.
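The penalized fits themselves live in the R package mpath, as the abstract notes. As a Python point of reference, the sketch below fits an unpenalized ZINB model with statsmodels on simulated zero-inflated counts; it shows the two model parts (count and zero-inflation) that the proposed EM/coordinate-descent algorithm penalizes, not the proposed selection method itself.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(0)
n = 800
X = sm.add_constant(rng.standard_normal((n, 2)))
mu = np.exp(0.3 + 0.8 * X[:, 1])                 # count-part mean
y = rng.poisson(mu * rng.gamma(2.0, 0.5, n))     # gamma mixing -> overdispersed counts
y[rng.random(n) < 0.3] = 0                       # structural (excess) zeros

# p=2 selects the NB2 parameterization; exog_infl drives the zero-inflation logit
model = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X, p=2)
res = model.fit(method="bfgs", maxiter=500, disp=False)
print(res.params)
```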

7.
Bondell HD, Reich BJ. Biometrics 2008, 64(1):115-123.
Summary. Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In this article, a new method called OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables and group them into predictive clusters. In addition to improving prediction accuracy and interpretation, the resulting groups can be investigated further to discover what contributes to their similar behavior. The technique is based on penalized least squares with a geometrically intuitive penalty function that shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably to existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information.
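A toy evaluation of the OSCAR objective: penalized least squares with an L1 term plus a pairwise-maximum term. A generic derivative-free optimizer is used purely for illustration; unlike the authors' algorithm, it will not land on exact zeros or exact coefficient ties, which are the whole point of the penalty's geometry. The tuning values are arbitrary.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import minimize

def oscar_objective(beta, X, y, lam1, lam2):
    """0.5*||y - X beta||^2 + lam1*sum_j |beta_j| + lam2*sum_{j<k} max(|beta_j|, |beta_k|)."""
    resid = y - X @ beta
    pairs = sum(max(abs(beta[j]), abs(beta[k]))
                for j, k in combinations(range(len(beta)), 2))
    return 0.5 * resid @ resid + lam1 * np.abs(beta).sum() + lam2 * pairs

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 1.0, 0.0, 0.0, -0.5])   # two equal effects, two nulls
y = X @ beta_true + rng.standard_normal(n)

fit = minimize(oscar_objective, np.zeros(p), args=(X, y, 1.0, 0.5), method="Powell")
print(np.round(fit.x, 3))
```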

8.
Liu Z, Tan M. Biometrics 2008, 64(4):1155-1161.
Summary. In medical diagnosis, the diseased and nondiseased classes are usually unbalanced, and one class may be more important than the other depending on the diagnostic purpose. Most standard classification methods, however, are designed to maximize overall accuracy and cannot incorporate different costs for different classes explicitly. In this article, we propose a novel nonparametric method to directly maximize the weighted specificity and sensitivity of the receiver operating characteristic curve. Combining advances in machine learning, optimization theory, and statistics, the proposed method has excellent generalization properties and assigns different error costs to different classes explicitly. We present experiments comparing the proposed algorithms with support vector machines and regularized logistic regression using data from a study on HIV-1 protease as well as six publicly available datasets. Our main conclusion is that the performance of the proposed algorithm is significantly better in most cases than that of the other classifiers tested. A software package in MATLAB is available upon request.
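The proposed method is not publicly packaged (MATLAB on request); the sketch below conveys only the core idea of assigning asymmetric error costs to the two classes, using a class-weighted SVM from scikit-learn as a stand-in for the authors' nonparametric algorithm. Increasing the weight on the rare diseased class trades specificity for sensitivity.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# unbalanced synthetic data: ~85% nondiseased (class 0), ~15% diseased (class 1)
X, y = make_classification(n_samples=600, weights=[0.85], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

for w in (1, 5, 10):     # heavier cost for misclassifying the diseased class
    clf = SVC(class_weight={1: w}).fit(Xtr, ytr)
    tn, fp, fn, tp = confusion_matrix(yte, clf.predict(Xte)).ravel()
    print(f"weight {w:2d}: sensitivity={tp / (tp + fn):.2f}  specificity={tn / (tn + fp):.2f}")
```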

11.
Defining the target population based on predictive biomarkers plays an important role during clinical development. After establishing a relationship between a biomarker candidate and response to treatment in exploratory phases, a subsequent confirmatory trial ideally involves only subjects with a high potential of benefiting from the new compound. In order to identify those subjects in the case of a continuous biomarker, a cut-off is needed. Usually, a cut-off is chosen that resulted in a subgroup with a large observed treatment effect in an exploratory trial. However, such a data-driven selection may lead to overoptimistic expectations for the subsequent confirmatory trial. Treatment effect estimates, probability of success, and posterior probabilities are useful measures for deciding whether or not to conduct a confirmatory trial enrolling the biomarker-defined population. These measures need to be adjusted for selection bias. We extend previously introduced Approximate Bayesian Computation techniques for the adjustment of subgroup selection bias to a time-to-event setting with cut-off selection. Challenges in this setting are that treatment effects become time-dependent and that subsets are defined by the biomarker distribution. Simulation studies show that the proposed method provides adjusted statistical measures that are superior to naïve maximum likelihood estimators as well as simple shrinkage estimators.
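A deliberately simplified sketch of the ABC idea (normal means rather than time-to-event, and a common true effect across cut-offs): the exploratory trial's selection rule — report the best subgroup estimate over the candidate cut-offs — is replayed inside an ABC rejection sampler, so the accepted prior draws automatically account for selection bias. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_selected = 0.8       # hypothetical observed effect in the best cut-off subgroup
n_cuts, se = 4, 0.25     # number of candidate cut-offs; SE of each subgroup estimate
draws, eps = 100_000, 0.02

accepted = []
for _ in range(draws):
    theta = rng.normal(0.3, 0.5)                     # prior draw of the true effect
    est = theta + se * rng.standard_normal(n_cuts)   # estimates across cut-offs
    if abs(est.max() - obs_selected) < eps:          # replay the selection rule
        accepted.append(theta)
print(f"bias-adjusted effect: {np.mean(accepted):.2f}  (naive estimate: {obs_selected})")
```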

12.
Detection of positive Darwinian selection has become ever more important with the rapid growth of genomic data sets. Recent branch-site models of codon substitution account for variation of selective pressure over branches on the tree and across sites in the sequence and provide a means to detect short episodes of molecular adaptation affecting just a few sites. In likelihood ratio tests based on such models, the branches to be tested for positive selection have to be specified a priori. In the absence of a biological hypothesis to designate so-called foreground branches, one may test many branches, but a correction for multiple testing becomes necessary. In this paper, we employ computer simulation to evaluate the performance of 6 multiple test correction procedures when the branch-site models are used to test every branch on the phylogeny for positive selection. Four of the methods control the familywise error rate (FWER), whereas the other 2 control the false discovery rate (FDR). We found that all correction procedures achieved acceptable FWER except for extremely divergent sequences and serious model violations, when the test may become unreliable. The power of the test to detect positive selection is influenced by the strength of selection and the sequence divergence, with the highest power observed at intermediate divergences. The 4 correction procedures that control the FWER had similar power. We recommend Rom's procedure for its slightly higher power, but the simple Bonferroni correction is usable as well. The 2 correction procedures that control the FDR had slightly more power and also higher FWER. We demonstrate the multiple test procedures by analyzing gene sequences from the extracellular domain of the cluster of differentiation 2 (CD2) gene from 10 mammalian species. Both our simulation and real data analysis suggest that the multiple test procedures are useful when multiple branches have to be tested on the same data set.
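A quick illustration of applying such corrections to per-branch p-values with statsmodels. Rom's procedure is not implemented there, so Hochberg's step-up procedure (a close, slightly more conservative relative of Rom's) is shown alongside Bonferroni, Holm, and the FDR-controlling Benjamini-Hochberg; the p-values are invented for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_branch = np.array([0.0004, 0.009, 0.015, 0.18, 0.33, 0.47, 0.61, 0.92])  # one per branch
for method in ("bonferroni", "holm", "simes-hochberg", "fdr_bh"):
    reject, p_adj = multipletests(p_branch, alpha=0.05, method=method)[:2]
    print(f"{method:>14}: {reject.sum()} branches flagged, adj. p = {np.round(p_adj, 4)}")
```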

13.
Tsai CA, Hsueh HM, Chen JJ. Biometrics 2003, 59(4):1071-1081.
Testing for significance with gene expression data from DNA microarray experiments involves simultaneous comparisons of hundreds or thousands of genes. If R denotes the number of rejections (declared significant genes) and V denotes the number of false rejections, then V/R, if R > 0, is the proportion of falsely rejected hypotheses. This paper proposes a model for the distribution of the number of rejections R and for the conditional distribution of V given R. Under the independence assumption, the distribution of R is a convolution of two binomials, and the conditional distribution of V given R is noncentral hypergeometric. Under an equicorrelated model, the distributions are more complex and are also derived. Five false discovery rate probability error measures are considered: FDR = E(V/R), pFDR = E(V/R | R > 0) (positive FDR), cFDR = E(V/R | R = r) (conditional FDR), mFDR = E(V)/E(R) (marginal FDR), and eFDR = E(V)/r (empirical FDR). The pFDR, cFDR, and mFDR are shown to be equivalent under the Bayesian framework, in which the number of true null hypotheses is modeled as a random variable. We present a parametric and a bootstrap procedure to estimate the FDRs. Monte Carlo simulations were conducted to evaluate the performance of these two methods. The bootstrap procedure appears to perform reasonably well, even when the alternative hypotheses are correlated (ρ = 0.25). An example from a toxicogenomic microarray experiment is presented for illustration.
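The quantities above are easy to approximate by Monte Carlo. The sketch below simulates m0 = 900 true nulls and m1 = 100 alternatives, applies Benjamini-Hochberg (a stand-in for the paper's parametric and bootstrap estimators), and tabulates FDR, pFDR, and mFDR from the resulting (V, R) pairs.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
m0, m1, reps = 900, 100, 500
Vs, Rs, ratios = [], [], []
for _ in range(reps):
    z = np.concatenate([rng.standard_normal(m0), rng.normal(3.0, 1.0, m1)])
    p = 2 * stats.norm.sf(np.abs(z))
    reject = multipletests(p, alpha=0.05, method="fdr_bh")[0]
    V, R = reject[:m0].sum(), reject.sum()   # false rejections / total rejections
    Vs.append(V); Rs.append(R); ratios.append(V / R if R > 0 else 0.0)
Vs, Rs, ratios = map(np.array, (Vs, Rs, ratios))
print("FDR  = E(V/R)         =", ratios.mean())
print("pFDR = E(V/R | R > 0) =", ratios[Rs > 0].mean())
print("mFDR = E(V)/E(R)      =", Vs.mean() / Rs.mean())
```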

14.
Bayes factors comparing two or more competing hypotheses are often estimated by constructing a Markov chain Monte Carlo (MCMC) sampler to explore the joint space of the hypotheses. To obtain efficient Bayes factor estimates, Carlin and Chib (1995, Journal of the Royal Statistical Society, Series B 57, 473-484) suggest adjusting the prior odds of the competing hypotheses so that the posterior odds are approximately one, then estimating the Bayes factor by simple division. A byproduct is that one often produces several independent MCMC chains, only one of which is actually used for estimation. We extend this approach to incorporate output from multiple chains by proposing three statistical models. The first assumes independent sampler draws and models the hypothesis indicator function using logistic regression for various choices of the prior odds. The two more complex models relax the independence assumption by allowing for higher-lag dependence within the MCMC output. These models allow us to estimate the uncertainty in our Bayes factor calculation and to fully use several different MCMC chains even when the prior odds of the hypotheses vary from chain to chain. We apply these methods to calculate Bayes factors for tests of monophyly in two phylogenetic examples. The first example explores the relationship of an unknown pathogen to a set of known pathogens. Identification of the unknown's monophyletic relationship may affect antibiotic choice in a clinical setting. The second example focuses on HIV recombination detection. For potential clinical application, these types of analyses must be completed as efficiently as possible.
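A sketch of the first (independence) model described above: if chains are run at several prior odds settings and the draws were independent, the hypothesis indicator follows a logistic model whose intercept — with the log prior odds entered as an offset — is the log Bayes factor. Simulated indicator draws with a known log BF of 1.5 stand in for MCMC output; real chains are autocorrelated, which the abstract's two richer models address.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
log_bf_true = 1.5
# three chains run at different prior log odds, 2000 draws each
prior_log_odds = np.repeat([-2.0, -1.0, 0.0], 2000)
p_h1 = 1.0 / (1.0 + np.exp(-(prior_log_odds + log_bf_true)))
indicator = rng.binomial(1, p_h1)                 # hypothesis indicator draws

glm = sm.GLM(indicator, np.ones((len(indicator), 1)),
             family=sm.families.Binomial(), offset=prior_log_odds).fit()
print(f"log Bayes factor estimate: {glm.params[0]:.3f} (truth {log_bf_true})")
```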

16.
Polley MY, Cheung YK. Biometrics 2008, 64(1):232-241.
Summary. We deal with the design problem of early-phase dose-finding clinical trials with monotone biologic endpoints, such as biological measurements, laboratory values of serum levels, and gene expression. A specific objective of this type of trial is to identify the minimum dose that exhibits adequate drug activity by shifting the mean of the endpoint away from that at zero dose — the so-called minimum effective dose. Stepwise test procedures for dose finding have been well studied in the context of nonhuman studies, where the sampling plan is done in one stage. In this article, we extend the notion of stepwise testing to a two-stage enrollment plan in an attempt to reduce the potential sample size requirement by shutting down unpromising doses at a futility interim. In particular, we examine four two-stage designs and apply them to design a statin trial with four doses and a placebo in patients with Hodgkin's disease. We discuss the calibration of the design parameters and the implementation of these proposed methods. In the context of the statin trial, a calibrated two-stage design can reduce the average total sample size by up to 38% (from 125 to 78) relative to a one-stage step-down test, while maintaining comparable error rates and probability of correct selection. The price for the reduction in average sample size is a slight increase in the maximum total sample size, from 125 to 130.
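A generic simulation skeleton of the savings mechanism (not the paper's calibrated designs): stage 1 enrolls all doses, an interim futility rule shuts down doses that do not beat placebo, and only survivors proceed to stage 2, shrinking the average total sample size below the one-stage maximum. Effect sizes, stage sizes, and the futility rule are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
means = [0.0, 0.0, 0.3, 0.6, 0.9]     # placebo + four doses (standardized endpoint means)
n1, n2 = 12, 13                        # per-arm stage-1 and stage-2 sample sizes

totals = []
for _ in range(5000):
    stage1 = [rng.normal(m, 1.0, n1) for m in means]
    # placeholder futility rule: keep doses whose stage-1 mean exceeds placebo's
    keep = [i for i in range(1, len(means)) if stage1[i].mean() > stage1[0].mean()]
    totals.append(len(means) * n1 + (1 + len(keep)) * n2)  # stage 2: placebo + survivors
print(f"average total n: {np.mean(totals):.1f}   one-stage n: {len(means) * (n1 + n2)}")
```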

18.
In the analysis of data generated by change-point processes, one critical challenge is to determine the number of change-points. The classic Bayes information criterion (BIC) statistic does not work well here because of irregularities in the likelihood function. By asymptotic approximation of the Bayes factor, we derive a modified BIC for the model of Brownian motion with changing drift. The modified BIC is similar to the classic BIC in the sense that the first term consists of the log likelihood, but it differs in the terms that penalize for model dimension. As an example of application, this new statistic is used to analyze array-based comparative genomic hybridization (array-CGH) data. Array-CGH measures the number of chromosome copies at each genome location of a cell sample and is useful for finding the regions of genome deletion and amplification in tumor cells. The modified BIC performs well compared to existing methods in accurately choosing the number of regions of changed copy number. Unlike existing methods, it does not rely on tuning parameters or intensive computing. Thus it is impartial and easier to understand and to use.
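A sketch of BIC-style selection of the number of change-points on simulated mean-shift data, using greedy binary segmentation. The penalty below, (3/2)·k·log n, is only a generic stand-in; the paper's contribution is the refined penalty derived for the Brownian-motion-with-drift model, which is not reproduced here.

```python
import numpy as np

def rss(x):
    return ((x - x.mean()) ** 2).sum()

def greedy_segmentation(x, max_k):
    """Total RSS after 0..max_k greedy binary splits (candidate change-points)."""
    bounds, out = [0, len(x)], [rss(x)]
    for _ in range(max_k):
        best = None
        for a, b in zip(bounds[:-1], bounds[1:]):
            for t in range(a + 2, b - 1):
                gain = rss(x[a:b]) - rss(x[a:t]) - rss(x[t:b])
                if best is None or gain > best[0]:
                    best = (gain, t)
        bounds = sorted(bounds + [best[1]])
        out.append(out[-1] - best[0])
    return out

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 80), rng.normal(1.5, 1, 60), rng.normal(-0.5, 1, 60)])
n = len(x)
crit = [n / 2 * np.log(r / n) + 1.5 * k * np.log(n)     # generic BIC-style criterion
        for k, r in enumerate(greedy_segmentation(x, 5))]
print("estimated number of change-points:", int(np.argmin(crit)))  # truth: 2
```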

19.
Kuan PF, Chiang DY. Biometrics 2012, 68(3):774-783.
Summary. DNA methylation has emerged as an important hallmark of epigenetics. Numerous platforms, including tiling arrays and next-generation sequencing, and experimental protocols are available for profiling DNA methylation. Like other tiling array data, DNA methylation data exhibit an inherent correlation structure among nearby probes. However, unlike gene expression or protein-DNA binding data, the varying CpG density, which gives rise to the definitions of CpG islands, shores, and shelves, provides exogenous information for detecting differential methylation. This article introduces a robust testing and probe-ranking procedure based on a nonhomogeneous hidden Markov model that incorporates the above-mentioned features for detecting differential methylation. We revisit the seminal work of Sun and Cai (2009, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 393-424) and propose modeling the nonnull distribution using a nonparametric symmetric distribution in two-sided hypothesis testing. We show that this model improves probe ranking and is robust to model misspecification, based on extensive simulation studies. We further illustrate that the proposed framework achieves good operating characteristics compared with commonly used methods on real DNA methylation data in detecting differential methylation sites.
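As a simplified point of reference (homogeneous rather than nonhomogeneous transitions, Gaussian emissions, and no CpG-density covariate), a two-state HMM from the hmmlearn package can smooth probe-level z-scores and rank probes by the posterior probability of the non-null state — the flavor of ranking the abstract discusses, minus its key exogenous-information and nonparametric-nonnull components.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# z-scores along the genome: null probes ~ N(0,1); a differentially methylated
# block of 50 probes ~ N(2,1)
z = np.concatenate([rng.normal(0, 1, 300), rng.normal(2, 1, 50), rng.normal(0, 1, 300)])

hmm = GaussianHMM(n_components=2, n_iter=100, random_state=0).fit(z.reshape(-1, 1))
post = hmm.predict_proba(z.reshape(-1, 1))       # posterior state probabilities per probe
nonnull = int(np.argmax(hmm.means_.ravel()))     # state with the larger mean = "differential"
ranking = np.argsort(-post[:, nonnull])          # rank probes by posterior probability
print("top-ranked probes:", ranking[:10])        # should fall in the 300..349 block
```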

20.
Uncertainty is a crucial issue in statistics and can be considered from different points of view. One type of uncertainty, typically referred to as sampling uncertainty, arises through the variability of results obtained when the same analysis strategy is applied to different samples. Another type of uncertainty arises through the variability of results obtained when using the same sample but different analysis strategies addressing the same research question. We denote this latter type of uncertainty as method uncertainty. It results from all the choices to be made for an analysis, for example, decisions related to data preparation, method choice, or model selection. In the medical sciences, a large part of omics research is focused on the identification of molecular biomarkers, which can be performed either through ranking or by selection from among a large number of candidates. In this paper, we introduce a general resampling-based framework to quantify and compare sampling and method uncertainty. For illustration, we apply this framework to different scenarios related to the selection and ranking of omics biomarkers in the context of acute myeloid leukemia: variable selection in multivariable regression using different types of omics markers, the ranking of biomarkers according to their predictive performance, and the identification of differentially expressed genes from RNA-seq data. For all three scenarios, our findings suggest highly unstable results when the same analysis strategy is applied to two independent samples, indicating high sampling uncertainty and a comparatively smaller, but non-negligible, method uncertainty that strongly depends on the methods being compared.
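A miniature version of such a resampling comparison, with lasso and elastic net as two stand-in selection methods and Jaccard agreement of the selected variable sets as the stability measure: the same method applied to two bootstrap samples quantifies sampling uncertainty, while two methods applied to the same sample quantify method uncertainty. This illustrates the idea only, not the authors' full framework.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.utils import resample

X, y = make_regression(n_samples=150, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

def selected(est, X, y):
    """Indices of variables with nonzero coefficients after fitting."""
    return frozenset(np.flatnonzero(est.fit(X, y).coef_))

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

samp, meth = [], []
for seed in range(20):
    Xb, yb = resample(X, y, random_state=seed)
    Xb2, yb2 = resample(X, y, random_state=1000 + seed)
    s_lasso = selected(LassoCV(cv=5), Xb, yb)
    # same method, two bootstrap samples -> sampling uncertainty
    samp.append(jaccard(s_lasso, selected(LassoCV(cv=5), Xb2, yb2)))
    # same sample, two methods -> method uncertainty
    meth.append(jaccard(s_lasso, selected(ElasticNetCV(cv=5, l1_ratio=0.5), Xb, yb)))
print(f"sampling stability: {np.mean(samp):.2f}   method stability: {np.mean(meth):.2f}")
```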

