Analyzing large datasets with bootstrap penalization
Authors: Kuangnan Fang, Shuangge Ma
Institutions: 1. Department of Statistics, Xiamen University, Xiamen, Fujian, China; 2. Department of Biostatistics, Yale University, New Haven, CT, USA
Abstract: Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. Applying penalization directly to large datasets demands a “big computer” with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a “small computer”. The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks; the approach then proceeds in two sequential steps, where in each step and for each bootstrap sample we select blocks of covariates and run penalization, and the results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then take a weighted average over multiple bootstrap samples. For data with both a large p and a large n, the two strategies are naturally combined. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package implementing the proposed methods has been developed.
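The large-n strategy in the abstract (bootstrap a small number of subjects, run penalized estimation on each subsample, then average) can be illustrated with a short sketch. The code below is not the authors' R package: it uses glmnet's lasso as one concrete choice of penalization, simulated data, equal weights in the final average, and illustrative values of the subsample size m and replicate count B, all of which are assumptions rather than the paper's exact specification.

```r
## Minimal sketch of the large-n bootstrap penalization strategy,
## assuming a lasso penalty via glmnet and equal averaging weights.
library(glmnet)

set.seed(1)
n <- 10000; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))        # 5 true signals
y <- as.numeric(X %*% beta + rnorm(n))

B <- 50     # number of bootstrap subsamples
m <- 500    # size of each "small computer" subsample

fits <- replicate(B, {
  idx <- sample(n, m, replace = TRUE)       # bootstrap a small subsample
  cv  <- cv.glmnet(X[idx, ], y[idx], alpha = 1)   # lasso on the subsample
  as.numeric(coef(cv, s = "lambda.min"))[-1]      # drop the intercept
})

beta_hat <- rowMeans(fits)   # equal-weight average over the B subsamples
```

Because the B subsample fits are mutually independent, they can be distributed across cores or machines (for example with parallel::mclapply), which is the source of the "highly parallel" computational gain the abstract describes.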
Keywords: Bootstrap; Computational feasibility; Large datasets; Penalization