Similar Literature
20 similar records found.
1.
The bootstrap method has become a widely used tool applied in diverse areas where results based on asymptotic theory are scarce. It can be applied, for example, for assessing the variance of a statistic or a quantile of interest, or for significance testing by resampling from the null hypothesis. Recently, some approaches have been proposed in the biometrical field where hypothesis testing or model selection is performed on a bootstrap sample as if it were the original sample. P-values computed from bootstrap samples have been used, for example, in the statistics and bioinformatics literature for ranking genes with respect to their differential expression, for estimating the variability of p-values, and for model stability investigations. Procedures which make use of bootstrapped information criteria are often applied in model stability investigations and model averaging approaches, as well as when estimating the error of model selection procedures which involve tuning parameters. From the literature, however, there is evidence that p-values and model selection criteria evaluated on bootstrap data sets do not represent what would be obtained on the original data or on new data drawn from the overall population. We explain the reasons for this and, through the use of a real data set and simulations, we assess the practical impact on procedures relevant to biometrical applications in cases where it has not yet been studied. Moreover, we investigate the behavior of subsampling (i.e., drawing from a data set without replacement) as a potential alternative to the bootstrap for these procedures.
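A minimal sketch of the contrast discussed above, using numpy/scipy and a purely hypothetical sample (not the paper's data): a one-sample t-test p-value is recomputed on bootstrap samples (drawn with replacement) and on subsamples (drawn without replacement, size m < n).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=100)    # hypothetical original sample

def pval(sample):
    # two-sided one-sample t-test of H0: mean == 0
    return stats.ttest_1samp(sample, popmean=0.0).pvalue

B, m = 1000, 50                                  # number of replicates, subsample size
boot_p = [pval(rng.choice(x, size=x.size, replace=True)) for _ in range(B)]
sub_p = [pval(rng.choice(x, size=m, replace=False)) for _ in range(B)]

print("p-value on original data:          %.4f" % pval(x))
print("median p-value, bootstrap:         %.4f" % np.median(boot_p))
print("median p-value, subsampling m < n: %.4f" % np.median(sub_p))
```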

2.
Many approaches for variable selection with multiply imputed (MI) data in the development of a prognostic model have been proposed. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or lasso, with the final model fitted on the most frequently selected variables over all MI and bootstrap data sets; (II) model selection on the original MI data, using lasso. In class II, the final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in at least 50% of the MI data sets; (iii) performing lasso on the stacked MI data; or (iv) as in (iii) but using individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods to a real data set of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, applying lasso selection with the 1-se penalty shows the best performance in both approach I and approach II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
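A sketch of the stacking idea in approaches (iii)/(iv), assuming hypothetical imputed data sets and using scikit-learn's L1-penalized logistic regression as a stand-in for the lasso implementation used in the paper; the weighting schemes (1/m versus (1 − f_j)/m, with f_j the fraction of missingness for subject j) are the only parts taken from the abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
m, n, p = 5, 200, 10                      # imputations, subjects, candidate predictors

# Hypothetical stand-ins: m completed (imputed) data sets for the same n subjects,
# and the per-subject fraction of originally missing values.
imputed = [rng.normal(size=(n, p)) for _ in range(m)]
y = rng.integers(0, 2, size=n)
frac_missing = rng.uniform(0.0, 0.5, size=n)

# Stack the m completed data sets into one long data set.
X_stacked = np.vstack(imputed)
y_stacked = np.tile(y, m)

# Approach (iii): every row weighted 1/m; approach (iv): down-weight subjects
# with more missing information, w_j = (1 - f_j) / m.
w_iii = np.full(n * m, 1.0 / m)
w_iv = np.tile((1.0 - frac_missing) / m, m)

lasso = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
lasso.fit(X_stacked, y_stacked, sample_weight=w_iv)
selected = np.flatnonzero(lasso.coef_.ravel() != 0.0)
print("selected predictor indices:", selected)
```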

3.
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. The theory of statistical models is well established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with a number of candidate variables in the range of 10-30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise the stability of a final model, the unbiasedness of regression coefficients, and the validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on the application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.
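A small numpy sketch of the kind of resampling-based quantity suggested in the last sentence: repeat a selection procedure (here, AIC-based backward elimination for a Gaussian linear model) on bootstrap resamples and report bootstrap inclusion frequencies. Data, model, and selection rule are illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)   # only x0, x1 truly matter

def aic(X, y, cols):
    # Gaussian linear model AIC: n*log(RSS/n) + 2*(number of parameters)
    Z = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * (len(cols) + 1)

def backward_aic(X, y):
    cols = list(range(X.shape[1]))
    while cols:
        current = aic(X, y, cols)
        candidates = [(aic(X, y, [c for c in cols if c != j]), j) for j in cols]
        best, drop = min(candidates)
        if best < current:
            cols.remove(drop)          # dropping this variable improves AIC
        else:
            break
    return cols

B = 200
freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap resample of rows
    for j in backward_aic(X[idx], y[idx]):
        freq[j] += 1
print("bootstrap inclusion frequencies:", np.round(freq / B, 2))
```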

4.
The recently developed statistical method known as the "bootstrap" can be used to place confidence intervals on phylogenies. It involves resampling points from one's own data, with replacement, to create a series of bootstrap samples of the same size as the original data. Each of these is analyzed, and the variation among the resulting estimates is taken to indicate the size of the error involved in making estimates from the original data. In the case of phylogenies, it is argued that the proper method of resampling is to keep all of the original species while sampling characters with replacement, under the assumption that the characters have been independently drawn by the systematist and have evolved independently. Majority-rule consensus trees can be used to construct a phylogeny showing all of the inferred monophyletic groups that occurred in a majority of the bootstrap samples. If a group shows up 95% of the time or more, the evidence for it is taken to be statistically significant. Existing computer programs can be used to analyze different bootstrap samples by using weights on the characters, the weight of a character being how many times it was drawn in bootstrap sampling. When all characters are perfectly compatible, as envisioned by Hennig, bootstrap sampling becomes unnecessary; the bootstrap method would then show significant evidence for a group if it is defined by three or more characters.
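A toy sketch of the character-weight representation of bootstrap replicates described above: each replicate is a multinomial draw of character weights, and group support is the fraction of replicates in which the group is recovered. The group_supported function is a hypothetical placeholder for a real phylogenetic analysis, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n_chars, B = 40, 1000

def group_supported(weights):
    # Hypothetical stand-in for a full phylogenetic analysis: here we simply ask
    # whether the three characters assumed to define the group of interest
    # received positive weight, i.e. survived the resampling.
    defining = [0, 1, 2]
    return all(weights[c] > 0 for c in defining)

support = 0
for _ in range(B):
    # Bootstrap replicate expressed as character weights: weights[i] is how many
    # times character i was drawn when sampling n_chars characters with replacement.
    weights = rng.multinomial(n_chars, np.full(n_chars, 1.0 / n_chars))
    support += group_supported(weights)

print("bootstrap support for the group: %.1f%%" % (100 * support / B))
print("significant at the 95% level:", support / B >= 0.95)
```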

5.
The bootstrap is a tool that allows for efficient evaluation of the prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because the number of observations there is often limited. To avoid overoptimism, the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner it would be used on new data. This includes a selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using the latter, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Potential remedies for this complexity selection bias, such as using a fixed level of complexity or sampling without replacement, are investigated, and it is shown that the latter works well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and censored time-to-event data with .632+ prediction error curve estimates. The latter, with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.
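For reference, a sketch of the .632+ combination of the apparent error and the out-of-bootstrap error, following the published Efron–Tibshirani formula rather than the authors' code; the example inputs are made-up Brier scores.

```python
import numpy as np

def err_632plus(err_apparent, err_boot, gamma):
    """Efron/Tibshirani .632+ estimator.

    err_apparent : error of the model on the data it was fitted to
    err_boot     : average error on observations left out of the bootstrap samples
    gamma        : no-information error rate (error under random label permutation)
    """
    err_boot = min(err_boot, gamma)
    R = (err_boot - err_apparent) / (gamma - err_apparent) if gamma > err_apparent else 0.0
    R = np.clip(R, 0.0, 1.0)                      # relative overfitting rate
    w = 0.632 / (1.0 - 0.368 * R)
    return (1.0 - w) * err_apparent + w * err_boot

# e.g. apparent Brier score 0.10, out-of-bootstrap 0.25, no-information rate 0.33
print(round(err_632plus(0.10, 0.25, 0.33), 3))
```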

6.
Variable selection is usually performed to increase interpretability, as sparser models are easier to understand than full models. However, a focus on sparsity is not always suitable, for example, when features are related due to contextual similarities or high correlations. Here, it may be more appropriate to identify groups and their predictive members, a task that can be accomplished with bi-level selection procedures. To investigate whether such techniques lead to increased interpretability, group exponential LASSO (GEL), sparse group LASSO (SGL), and the composite minimax concave penalty (cMCP), with the least absolute shrinkage and selection operator (LASSO) as a reference method, were used to select predictors in time-to-event, regression, and classification tasks in bootstrap samples from a cohort of 1001 patients. Different groupings based on prior knowledge, correlation structure, and random assignment were compared in terms of selection relevance, group consistency, and collinearity tolerance. The results show that bi-level selection methods are superior to LASSO in all criteria. The cMCP demonstrated superiority in selection relevance, while SGL was convincing in group consistency. An all-round capacity was achieved by GEL: the approach jointly selected correlated and content-related predictors while maintaining high selection relevance. This method seems recommendable when variables are grouped and interpretation is of primary interest.

7.
For independent data, the non-parametric bootstrap is realised by resampling the data with replacement. This approach fails for dependent data such as time series. If the data-generating process is at least stationary and mixing, the blockwise bootstrap, which draws subsamples or blocks of the data, preserves the approach. For the blockwise bootstrap, a blocklength has to be selected. We propose a method for selecting the optimal blocklength. To improve the finite-sample properties of the blockwise bootstrap, studentised statistics are considered. If the statistic can be represented as a smooth function model, this studentisation can be approximated efficiently. The studentised blockwise bootstrap method is applied to testing hypotheses on medical time series.
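A minimal moving-block bootstrap resampler, assuming a stationary series and a user-supplied blocklength (the quantity whose optimal choice is the subject of the paper); the AR(1) series is a hypothetical stand-in for a medical time series.

```python
import numpy as np

def block_bootstrap(series, block_length, rng):
    """Moving-block bootstrap: concatenate randomly chosen overlapping blocks
    of length `block_length` until the resampled series has the original length."""
    n = len(series)
    n_blocks = int(np.ceil(n / block_length))
    starts = rng.integers(0, n - block_length + 1, size=n_blocks)
    blocks = [series[s:s + block_length] for s in starts]
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(4)
# Hypothetical AR(1) series standing in for a medical time series.
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + rng.normal()

boot_means = [block_bootstrap(x, block_length=20, rng=rng).mean() for _ in range(1000)]
print("blockwise bootstrap SE of the mean: %.3f" % np.std(boot_means))
```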

8.
Two variable selection procedures are evaluated for classification problems: a forward stepwise discrimination procedure, and a stepwise procedure preceded by a preliminary screening of variables on the basis of individual t statistics. The expected probability of correct classification is used as the measure of performance. The procedures are compared using samples from multivariate normal populations and from several nonnormal populations. The study demonstrated some situations where the use of all variables is preferable to the use of a stepwise discriminant procedure stopping after a few steps, though usually the latter procedure was superior in performance. However, where the stepwise procedure performed better than using all variables, the modified stepwise procedure performed better still. The use of modified stepwise procedures in which not all the covariances of the problem need be estimated seems promising.
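A sketch of the modified procedure on simulated data: variables are first screened by their individual two-sample t statistics, and a discriminant analysis is then fitted to the survivors. Plain LDA on the top-k screened variables stands in for the stepwise step, and all data are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
n, p, k = 100, 30, 5
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :3] += 0.8                      # only the first 3 variables separate the groups

# Preliminary screening: keep the k variables with the largest |t| statistics.
t_stats = stats.ttest_ind(X[y == 0], X[y == 1], axis=0).statistic
keep = np.argsort(-np.abs(t_stats))[:k]
print("variables passing the t screen:", np.sort(keep))

# Discriminant analysis restricted to the screened variables
# (a stepwise routine would be applied here in the procedure studied in the paper).
lda = LinearDiscriminantAnalysis().fit(X[:, keep], y)
print("resubstitution accuracy: %.2f" % lda.score(X[:, keep], y))
```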

9.

Background  

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.
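A small null simulation in the spirit of the problem described above: the response is independent of all predictors, but the predictors differ in their number of categories, and impurity-based random forest importances (scikit-learn) then tend to favour the many-category and continuous variables. This merely illustrates the phenomenon; it does not reproduce the authors' analysis or their proposed remedy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(6)
n = 500
# Null scenario: predictors differ only in type / number of categories
# (treated numerically here, for simplicity).
X = np.column_stack([
    rng.integers(0, 2, size=n),      # binary predictor
    rng.integers(0, 4, size=n),      # 4 categories
    rng.integers(0, 20, size=n),     # 20 categories
    rng.normal(size=n),              # continuous
])
y = rng.integers(0, 2, size=n)       # response independent of all predictors

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
perm = permutation_importance(rf, X, y, n_repeats=20, random_state=0)

print("impurity-based importance:", np.round(rf.feature_importances_, 3))
print("permutation importance:   ", np.round(perm.importances_mean, 3))
```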

10.
In this article, we address a missing data problem that occurs in transplant survival studies. Recipients of organ transplants are followed up from transplantation and their survival times recorded, together with various explanatory variables. Due to differences in data collection procedures in different centers or over time, a particular explanatory variable (or set of variables) may only be recorded for certain recipients, which results in this variable being missing for a substantial number of records in the data. The variable may also turn out to be an important predictor of survival, and so it is important to handle this missing-by-design problem appropriately. Consensus in the literature is to handle this problem with complete case analysis, as the missing data are assumed to arise under an appropriate missing at random mechanism that gives consistent estimates here. Specifically, the missing values can reasonably be assumed not to be related to the survival time. In this article, we investigate the potential for multiple imputation to handle this problem in a relevant study on survival after kidney transplantation, and show that it comprehensively outperforms complete case analysis on a range of measures. This is a particularly important finding in the medical context as imputing large amounts of missing data is often viewed with scepticism.
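A generic sketch of multiple imputation followed by pooling, with scikit-learn's IterativeImputer and a logistic regression standing in for the imputation procedure and survival model used in the study; the data and the missing-by-design pattern are simulated assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p, m = 300, 4, 10
X_full = rng.normal(size=(n, p))
y = (X_full[:, 0] + rng.normal(size=n) > 0).astype(int)

# Make one covariate missing "by design" for a subset of records.
X = X_full.copy()
X[rng.random(n) < 0.4, 2] = np.nan

coefs = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_imp = imputer.fit_transform(X)
    fit = LogisticRegression(max_iter=1000).fit(X_imp, y)
    coefs.append(fit.coef_.ravel())

coefs = np.array(coefs)
pooled = coefs.mean(axis=0)              # Rubin's rules: pooled point estimate
between_var = coefs.var(axis=0, ddof=1)  # between-imputation variance B; the full rule
                                         # adds the average within-imputation variance:
                                         # T = W_bar + (1 + 1/m) * B
print("pooled coefficients:        ", np.round(pooled, 2))
print("between-imputation variance:", np.round(between_var, 3))
```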

11.
The development of clinical prediction models requires the selection of suitable predictor variables. Techniques to perform objective Bayesian variable selection in the linear model are well developed and have been extended to the generalized linear model setting as well as to the Cox proportional hazards model. Here, we consider discrete time-to-event data with competing risks and propose methodology to develop a clinical prediction model for the daily risk of acquiring a ventilator-associated pneumonia (VAP) attributed to P. aeruginosa (PA) in intensive care units. The competing events for a PA VAP are extubation, death, and VAP due to other bacteria. Baseline variables are potentially important to predict the outcome at the start of ventilation, but may lose some of their predictive power after a certain time. Therefore, we use a landmark approach for dynamic Bayesian variable selection where the set of relevant predictors depends on the time already spent at risk. We finally determine the direct impact of a variable on each competing event through cause-specific variable selection.

12.
S. M. Snapinn, J. D. Knoke. Biometrics, 1989, 45(1): 289-299
Accurate estimation of misclassification rates in discriminant analysis with selection of variables by, for example, a stepwise algorithm, is complicated by the large optimistic bias inherent in standard estimators such as those obtained by the resubstitution method. Application of a bootstrap adjustment can reduce the bias of the resubstitution method; however, the bootstrap technique requires the variable selection procedure to be repeated many times and is therefore computationally expensive. In this paper we propose a smoothed estimator that requires relatively little computation and which, on the basis of a Monte Carlo sampling study, is found to perform generally at least as well as the bootstrap method.
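A sketch of why the bootstrap adjustment is expensive: the (here, toy) variable selection step must be repeated inside every bootstrap sample before evaluating on the left-out observations. The selection rule and data are illustrative; the paper's smoothed estimator is not reproduced.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(8)
n, p, k = 60, 50, 3

def select_and_fit(X, y):
    # Toy selection rule standing in for a stepwise algorithm: keep the k variables
    # with the largest absolute mean difference between the two classes.
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    cols = np.argsort(-diff)[:k]
    return cols, LinearDiscriminantAnalysis().fit(X[:, cols], y)

X = rng.normal(size=(n, p))               # no variable is truly informative
y = rng.integers(0, 2, size=n)

cols, model = select_and_fit(X, y)
resub_err = 1.0 - model.score(X[:, cols], y)

# Bootstrap adjustment: repeat selection + fitting on each bootstrap sample and
# evaluate on the observations left out of that sample.
B, oob_errs = 200, []
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)
    if len(oob) == 0 or len(np.unique(y[idx])) < 2:
        continue
    c_b, m_b = select_and_fit(X[idx], y[idx])
    oob_errs.append(1.0 - m_b.score(X[oob][:, c_b], y[oob]))

print("resubstitution error (optimistic): %.2f" % resub_err)
print("bootstrap out-of-sample error:     %.2f" % np.mean(oob_errs))
```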

13.
Recent genomic evaluation studies using real data and predicting genetic gain by modeling breeding programs have reported moderate expected benefits from the replacement of classic selection schemes by genomic selection (GS) in small ruminants. The objectives of this study were to compare the cost, monetary genetic gain and economic efficiency of classic selection and GS schemes in the meat sheep industry. Deterministic methods were used to model selection based on multi-trait indices from a sheep meat breeding program. Decisional variables related to male selection candidates and progeny testing were optimized to maximize the annual monetary genetic gain (AMGG), that is, a weighted sum of the annual genetic gains for meat and maternal traits. For GS, a reference population of 2000 individuals was assumed and genomic information was available for the evaluation of male candidates only. In the classic selection scheme, males' breeding values were estimated from own and offspring phenotypes. In GS, different scenarios were considered, differing by the information used to select males (genomic only, genomic + own performance, genomic + offspring phenotypes). The results showed that all GS scenarios were associated with higher total variable costs than classic selection (assuming a genotyping cost of 123 euros per animal). In terms of AMGG and economic returns, GS scenarios were found to be superior to classic selection only if genomic information was combined with males' own meat phenotypes (GS-Pheno) or with their progeny test information. The predicted economic efficiency, defined as returns (proportional to the number of expressions of AMGG in the nucleus and commercial flocks) minus total variable costs, showed that the best GS scenario (GS-Pheno) was up to 15% more efficient than classic selection. For all selection scenarios, optimization increased the overall AMGG, returns and economic efficiency. In conclusion, our study shows that some forms of GS strategies are more advantageous than classic selection, provided that GS is already initiated (i.e. the initial reference population is available). Optimizing decisional variables of the classic selection scheme could be of greater benefit than including genomic information in optimized designs.

14.
Latent class model diagnosis
E. S. Garrett, S. L. Zeger. Biometrics, 2000, 56(4): 1055-1067
In many areas of medical research, such as psychiatry and gerontology, latent class variables are used to classify individuals into disease categories, often with the intention of hierarchical modeling. Problems arise when it is not clear how many disease classes are appropriate, creating a need for model selection and diagnostic techniques. Previous work has shown that the Pearson χ² statistic and the log-likelihood ratio G² statistic are not valid test statistics for evaluating latent class models. Other methods, such as information criteria, provide decision rules without providing explicit information about where discrepancies occur between a model and the data. Identifiability issues further complicate these problems. This paper develops procedures for assessing Markov chain Monte Carlo convergence and model diagnosis, and for selecting the number of categories for the latent variable based on evidence in the data, using Markov chain Monte Carlo techniques. Simulations and a psychiatric example are presented to demonstrate the effective use of these methods.

15.
G. Tutz, H. Binder. Biometrics, 2006, 62(4): 961-971
The use of generalized additive models in statistical data analysis suffers from the restriction to few explanatory variables and from the problem of selecting smoothing parameters. Generalized additive model boosting circumvents these problems by means of stagewise fitting of weak learners. A fitting procedure is derived which works for all simple exponential family distributions, including binomial, Poisson, and normal response variables. The procedure combines the selection of variables and the determination of the appropriate amount of smoothing. Penalized regression splines and the newly introduced penalized stumps are considered as weak learners. Estimates of standard deviations and stopping criteria, which are notorious problems in iterative procedures, are based on an approximate hat matrix. The method is shown to be a strong competitor to common procedures for the fitting of generalized additive models. In particular, in high-dimensional settings with many nuisance predictor variables it performs very well.
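A bare-bones componentwise boosting sketch for a Gaussian response with simple (unpenalized) stumps as weak learners, to illustrate the stagewise selection of one variable per step; the penalized splines/stumps, hat-matrix-based stopping criteria, and exponential-family losses of the paper are not implemented, and the data are simulated.

```python
import numpy as np

def fit_stump(x, r):
    """Best single split on one variable for squared-error loss on residuals r."""
    best = (np.inf, None)
    for cut in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left, right = r[x <= cut], r[x > cut]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (cut, left.mean(), right.mean()))
    return best

rng = np.random.default_rng(9)
n, p = 300, 6
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

nu, n_steps = 0.1, 200                      # shrinkage and number of boosting steps
pred = np.full(n, y.mean())
for _ in range(n_steps):
    r = y - pred                            # residuals = negative gradient for L2 loss
    # Componentwise step: fit a stump to every variable, update only the best one.
    fits = [fit_stump(X[:, j], r) for j in range(p)]
    j = int(np.argmin([f[0] for f in fits]))
    cut, mu_l, mu_r = fits[j][1]
    pred += nu * np.where(X[:, j] <= cut, mu_l, mu_r)

print("training MSE after boosting: %.3f" % np.mean((y - pred) ** 2))
```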

16.
N. Meinshausen. Biometrika, 2008, 95(2): 265-278
A frequently encountered challenge in high-dimensional regression is the detection of relevant variables. Variable selection suffers from instability, and the power to detect relevant variables is typically low if predictor variables are highly correlated. When taking the multiplicity of the testing problem into account, the power diminishes even further. To gain power and insight, it can be advantageous to look for influence not at the level of individual variables but rather at the level of clusters of highly correlated variables. We propose a hierarchical approach. Variable importance is first tested at the coarsest level, corresponding to the global null hypothesis. The method then tries to attribute any effect to smaller subclusters or even individual variables. The smallest possible clusters which still exhibit a significant influence on the response variable are retained. It is shown that the proposed testing procedure controls the familywise error rate at a prespecified level, simultaneously over all resolution levels. The method has power comparable to the Bonferroni–Holm procedure at the level of individual variables and dramatically larger power for coarser resolution levels. The best resolution level is selected adaptively.

17.
C. R. Bilder, T. M. Loughin. Biometrics, 2002, 58(1): 200-208
Survey respondents are often prompted to pick any number of responses from a set of possible responses. Categorical variables that summarize this kind of data are called pick any/c variables. Counts from surveys that contain a pick any/c variable along with a group variable (r levels) and a stratification variable (q levels) can be marginally summarized into an r × c × q contingency table. A question that may naturally arise from this setup is to determine if the group and pick any/c variables are marginally independent given the stratification variable. A test for conditional multiple marginal independence (CMMI) can be used to answer this question. Since subjects may pick any number out of c possible responses, the Cochran (1954, Biometrics 10, 417-451) and Mantel and Haenszel (1959, Journal of the National Cancer Institute 22, 719-748) tests cannot be used directly because they assume that units in the contingency table are independent of each other. Therefore, new testing methods are developed. Cochran's test statistic is extended to r × 2 × q tables, and a modified version of this statistic is proposed to test CMMI. Its sampling distribution can be approximated through bootstrapping. Other CMMI testing methods discussed are bootstrap p-value combination methods and Bonferroni adjustments. Simulation findings suggest that the proposed bootstrap procedures and the Bonferroni adjustments consistently hold the correct size and provide power against various alternatives.

18.
Variable selection for clustering is an important and challenging problem in high-dimensional data analysis. Existing variable selection methods for model-based clustering select informative variables in a "one-in-all-out" manner; that is, a variable is selected if at least one pair of clusters is separable by this variable and removed if it cannot separate any of the clusters. In many applications, however, it is of interest to further establish exactly which clusters are separable by each informative variable. To address this question, we propose a pairwise variable selection method for high-dimensional model-based clustering. The method is based on a new pairwise penalty. Results on simulated and real data show that the new method performs better than alternative approaches that use ℓ1 and ℓ∞ penalties and offers better interpretation.

19.
Most theoretical research in sexual selection has focused on indirect selection. However, empirical studies have not strongly supported indirect selection. A well-established finding is that direct benefits and costs exert a strong influence on the evolution of mate choice. We present an analytical model in which unilateral mate choice evolves solely by direct sexual selection on choosiness. We show this is sufficient to generate the evolution of all possible levels of choosiness, because of the fundamental trade-off between mating rate and mating benefits. We further identify the relative searching time (RST, i.e. the proportion of lifetime devoted to searching for mates) as a predictor of the effect of any variable affecting the mating rate on the evolution of choosiness. We show that the RST: (i) allows one to make predictions about the evolution of choosiness across a wide variety of mating systems; (ii) encompasses all alternative variables proposed thus far to explain the evolution of choosiness by direct sexual selection; and (iii) can be empirically used to infer qualitative differences in choosiness.

20.
The problem of variable selection in generalized linear mixed models (GLMMs) is pervasive in statistical practice. For the purpose of variable selection, many methodologies for determining the best subset of explanatory variables currently exist, according to the model complexity and differences between applications. In this paper, we develop a "higher posterior probability model with bootstrap" (HPMB) approach to select explanatory variables without fitting all possible GLMMs involving a small or moderate number of explanatory variables. Furthermore, to save computational load, we propose an efficient approximation approach with Laplace's method and Taylor's expansion to approximate intractable integrals in GLMMs. Simulation studies and an application to HapMap data provide evidence that this selection approach is computationally feasible and reliable for exploring true candidate genes and gene–gene associations, after adjusting for complex structures among clusters.
