Categorical variables with many categories are preferentially selected in bootstrap‐based model selection procedures for multivariable regression models

Authors:	Anne‐Laure Boulesteix

Institution:	Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Germany

Abstract:	Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap‐based approach. The idea is to apply the variable selection procedure to a large number of bootstrap samples successively and to examine the resulting models, for instance, in terms of the inclusion of specific predictor variables. In this paper, we investigate a particularly important problem affecting this method in the case of categorical predictor variables with different numbers of categories, and we give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables, even if none of them has any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
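
The resampling scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names `resample` and `inclusion_frequencies`, the `select` callback (a stand-in for any automated selection procedure, such as backward elimination with likelihood ratio tests), and the default subsample fraction of 0.632 (the expected share of distinct observations in a bootstrap sample, 1 - 1/e) are all assumptions made for the example.

```python
import random
from collections import Counter


def resample(rows, scheme, rng, fraction=0.632):
    """Return one resampled dataset drawn from `rows` (a list of observations)."""
    n = len(rows)
    if scheme == "bootstrap":
        # n draws WITH replacement: an observation can appear several times.
        return [rows[rng.randrange(n)] for _ in range(n)]
    if scheme == "subsample":
        # m < n draws WITHOUT replacement: each observation appears at most once.
        # The fraction is an assumed default, not a value taken from the abstract.
        m = max(1, round(fraction * n))
        return rng.sample(rows, m)
    raise ValueError(f"unknown scheme: {scheme!r}")


def inclusion_frequencies(rows, select, n_rep=200, scheme="subsample",
                          seed=0, fraction=0.632):
    """Run `select` on `n_rep` resampled datasets and report, for each
    variable name it returns, the share of repetitions that included it.

    `select` is a hypothetical callback: it receives a resampled dataset and
    returns an iterable of selected variable names. Variables that are never
    selected do not appear in the result.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rep):
        sample = resample(rows, scheme, rng, fraction)
        counts.update(set(select(sample)))
    return {name: count / n_rep for name, count in counts.items()}
```

Calling `inclusion_frequencies` once with `scheme="bootstrap"` and once with `scheme="subsample"`, using the same `select` procedure, gives the two sets of inclusion frequencies that the abstract compares.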

Keywords:	Automated selection procedures; Bootstrap samples; Categorical variables; Likelihood ratio test; Model selection
