Similar Documents
A total of 20 similar documents were found (search time: 31 ms).
1.
We consider selecting both fixed and random effects in a general class of mixed effects models using maximum penalized likelihood (MPL) estimation along with the smoothly clipped absolute deviation (SCAD) and adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions. The MPL estimates are shown to possess consistency, sparsity, and asymptotic normality properties. A model selection criterion, called the ICQ statistic, is proposed for selecting the penalty parameters (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658). The variable selection procedure based on ICQ is shown to consistently select important fixed and random effects. The methodology is very general and can be applied to numerous situations involving random effects, including generalized linear mixed models. Simulation studies and a real data set from a Yale infant growth study are used to illustrate the proposed methodology.
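For readers who want the exact form of the SCAD penalty referred to above, a minimal NumPy sketch is given below. It uses the standard closed form of Fan and Li (2001); the function name and the conventional default a = 3.7 are illustrative choices, not anything specified in the abstract.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty evaluated elementwise on a coefficient vector.

    Linear (lasso-like) near zero, quadratic in a transition region,
    and constant beyond a*lam, so large coefficients are not over-shrunk.
    """
    b = np.abs(np.asarray(beta, dtype=float))
    small = b <= lam
    mid = (b > lam) & (b <= a * lam)
    pen = np.full_like(b, lam ** 2 * (a + 1) / 2)        # constant region
    pen[small] = lam * b[small]                           # linear region
    pen[mid] = (2 * a * lam * b[mid] - b[mid] ** 2 - lam ** 2) / (2 * (a - 1))
    return pen
```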

2.
This article presents a novel algorithm that efficiently computes L1 penalized (lasso) estimates of parameters in high-dimensional models. The lasso has the property that it simultaneously performs variable selection and shrinkage, which makes it very useful for finding interpretable prediction rules in high-dimensional data. The new algorithm is based on a combination of gradient ascent optimization with the Newton–Raphson algorithm. It is described for a general likelihood function and can be applied in generalized linear models and other models with an L1 penalty. The algorithm is demonstrated in the Cox proportional hazards model, predicting survival of breast cancer patients using gene expression data, and its performance is compared with competing approaches. An R package, penalized, which implements the method, is available on CRAN.
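The article's own algorithm combines gradient ascent with Newton–Raphson steps for a general likelihood. As a hedged, generic illustration of what an L1-penalized fit involves, the sketch below implements plain coordinate descent with soft-thresholding for squared-error loss; it is not the algorithm of the penalized package, and all names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the basic building block of L1 optimization."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norms = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / col_norms[j]
    return beta
```

In practice one would rely on an existing implementation (such as the penalized package mentioned above); the sketch is only meant to make the role of the soft-thresholding step concrete.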

3.
Yan, J. and Huang, J. (2012). Biometrics 68(2), 419–428.
Cox models with time-varying coefficients offer great flexibility in capturing the temporal dynamics of covariate effects on right-censored failure times. Because not all covariate coefficients are time varying, model selection for such models presents an additional challenge: distinguishing covariates with time-varying coefficients from those with time-independent coefficients. We propose an adaptive group lasso method that not only selects important variables but also selects between time-independent and time-varying specifications of their presence in the model. Each covariate effect is partitioned into a time-independent part and a time-varying part, the latter of which is characterized by a group of coefficients of basis splines without intercept. Model selection and estimation are carried out through a fast, iterative group shooting algorithm. Our approach is shown to have good properties in a simulation study that mimics realistic situations with up to 20 variables. A real example illustrates the utility of the method.

4.
Errors-in-variables models in high-dimensional settings pose two challenges in application. First, the number of observed covariates is larger than the sample size, while only a small number of covariates are true predictors under an assumption of model sparsity. Second, the presence of measurement error can result in severely biased parameter estimates, and also affects the ability of penalized methods such as the lasso to recover the true sparsity pattern. A new estimation procedure called SIMulation-SELection-EXtrapolation (SIMSELEX) is proposed. This procedure makes double use of lasso methodology. First, the lasso is used to estimate sparse solutions in the simulation step, after which a group lasso is implemented to do variable selection. The SIMSELEX estimator is shown to perform well in variable selection, and has significantly lower estimation error than naive estimators that ignore measurement error. SIMSELEX can be applied in a variety of errors-in-variables settings, including linear models, generalized linear models, and Cox survival models. It is furthermore shown in the Supporting Information how SIMSELEX can be applied to spline-based regression models. A simulation study is conducted to compare the SIMSELEX estimators to existing methods in the linear and logistic model settings, and to evaluate performance compared to naive methods in the Cox and spline models. Finally, the method is used to analyze a microarray dataset that contains gene expression measurements of favorable histology Wilms tumors.
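SIMSELEX itself combines a lasso simulation step, a group-lasso selection step, and an extrapolation step. The sketch below shows only the generic SIMEX skeleton with a lasso fit at each added-noise level and a quadratic extrapolation back to noise level -1; the function name, the noise levels, and the fixed lasso tuning parameter alpha are all illustrative assumptions, not the article's specification.

```python
import numpy as np
from sklearn.linear_model import Lasso

def simex_lasso(W, y, sigma_u, noise_levels=(0.5, 1.0, 1.5, 2.0),
                n_sim=50, alpha=0.1, seed=0):
    """Generic SIMEX skeleton with a lasso fit at each added-noise level.

    W: observed (error-prone) covariates; sigma_u: assumed measurement-error SD.
    Coefficients are averaged per level and extrapolated back to level -1.
    """
    rng = np.random.default_rng(seed)
    n, p = W.shape
    level_means = []
    for lvl in noise_levels:
        fits = [Lasso(alpha=alpha).fit(
                    W + np.sqrt(lvl) * sigma_u * rng.standard_normal((n, p)), y).coef_
                for _ in range(n_sim)]
        level_means.append(np.mean(fits, axis=0))
    design = np.vander(np.asarray(noise_levels), 3)           # columns: lvl^2, lvl, 1
    coefs = np.linalg.lstsq(design, np.array(level_means), rcond=None)[0]
    return np.array([1.0, -1.0, 1.0]) @ coefs                 # quadratic extrapolation to -1
```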

5.
In this paper, we propose a frequentist model averaging method for quantile regression with high-dimensional covariates. Although research on these subjects has proliferated as separate approaches, no study has considered them in conjunction. Our method entails reducing the covariate dimensions by ranking the covariates based on marginal quantile utilities. The second step of our method implements model averaging on the models containing the covariates that survive the screening of the first step. We use a delete-one cross-validation method to select the model weights, and prove that the resultant estimator possesses an optimal asymptotic property uniformly over quantile indices in any compact subset of (0, 1). Our proof, which relies on empirical process theory, is arguably more challenging than proofs of similar results in other contexts owing to the high-dimensional nature of the problem and our relaxation of the conventional assumption that the weights sum to one. Our investigation of finite-sample performance demonstrates that the proposed method exhibits very favorable properties compared to the least absolute shrinkage and selection operator (LASSO) and smoothly clipped absolute deviation (SCAD) penalized regression methods. The method is applied to a microarray gene expression data set.

6.
We consider variable selection in the Cox regression model (Cox, 1975, Biometrika 62, 269–276) with covariates missing at random. We investigate the smoothly clipped absolute deviation penalty and adaptive least absolute shrinkage and selection operator (LASSO) penalty, and propose a unified model selection and estimation procedure. A computationally attractive algorithm is developed, which simultaneously optimizes the penalized likelihood function and penalty parameters. We also optimize a model selection criterion, called the ICQ statistic (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658), to estimate the penalty parameters and show that it consistently selects all important covariates. Simulations are performed to evaluate the finite sample performance of the penalty estimates. Also, two lung cancer data sets are analyzed to demonstrate the proposed methodology.

7.
Cross-validation is the standard method for hyperparameter tuning, or calibration, of machine learning algorithms. The adaptive lasso is a popular class of penalized approaches based on weighted L1-norm penalties, with weights derived from an initial estimate of the model parameter. Although it violates the paramount principle of cross-validation, according to which no information from the hold-out test set should be used when constructing the model on the training set, a “naive” cross-validation scheme is often implemented for the calibration of the adaptive lasso. The unsuitability of this naive cross-validation scheme in this context has not been well documented in the literature. In this work, we recall why the naive scheme is theoretically unsuitable and how proper cross-validation should be implemented in this particular context. Using both synthetic and real-world examples and considering several versions of the adaptive lasso, we illustrate the flaws of the naive scheme in practice. In particular, we show that it can lead to the selection of adaptive lasso estimates that perform substantially worse than those selected via a proper scheme in terms of both support recovery and prediction error. In other words, our results show that the theoretical unsuitability of the naive scheme translates into suboptimality in practice, and call for abandoning it.
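The practical point of the abstract, that the initial estimator defining the adaptive weights must be refitted inside each training fold rather than on the full data, can be made concrete with the sketch below. The ridge initial fit, the rescaling trick, and all names are illustrative choices under stated assumptions, not the exact schemes studied in the article.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import KFold

def adaptive_lasso_proper_cv(X, y, alphas, gamma=1.0, n_splits=5):
    """Proper CV for the adaptive lasso: the initial estimate that defines the
    penalty weights is recomputed on each training fold, so the held-out fold
    never informs the weights."""
    cv_err = np.zeros(len(alphas))
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        init = Ridge(alpha=1.0).fit(X[train], y[train]).coef_   # fold-specific initial fit
        w = 1.0 / (np.abs(init) ** gamma + 1e-8)                # adaptive weights
        Xw = X[train] / w                                        # rescaling trick: lasso on X / w
        for k, a in enumerate(alphas):
            fit = Lasso(alpha=a).fit(Xw, y[train])
            beta = fit.coef_ / w                                 # map back to the original scale
            pred = X[test] @ beta + fit.intercept_
            cv_err[k] += np.mean((y[test] - pred) ** 2)
    return alphas[np.argmin(cv_err)]
```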

8.
Shortreed and Ertefaie introduced a clever propensity score variable selection approach for estimating average causal effects, namely, the outcome adaptive lasso (OAL). OAL aims to select desirable covariates, that is, confounders and predictors of the outcome, to build an unbiased and statistically efficient propensity score estimator. Due to its design, a potential limitation of OAL is how it handles the collinearity problem, which is often encountered in high-dimensional data. As seen in Shortreed and Ertefaie, OAL's performance degrades as the correlation between covariates increases. In this note, we propose the generalized OAL (GOAL), which combines the strengths of the adaptively weighted L1 penalty and the elastic net to better handle the selection of correlated covariates. Two versions of GOAL, which differ in their algorithm, are proposed. We compared OAL and GOAL in simulation scenarios that mimic those examined by Shortreed and Ertefaie. Although all approaches performed equivalently with independent covariates, we found that both GOAL versions outperformed OAL in low and high dimensions with correlated covariates.

9.
The Buckley–James (BJ) model is a typical semiparametric accelerated failure time model that is closely related to the ordinary least squares method and easy to construct. However, the traditional BJ model is built on a linearity assumption and captures only simple linear relationships, so it has difficulty handling nonlinear problems. To overcome this difficulty, in this paper we develop a novel regression model for right-censored survival data within the learning framework of the BJ model, based on random survival forests (RSF), the extreme learning machine (ELM), and the L2 boosting algorithm. The proposed method, referred to as the ELM-based BJ boosting model, first employs RSF for covariate imputation, then develops a new ensemble of ELMs (an ELM-based boosting algorithm for regression built on the L2 boosting ensemble scheme), and finally uses the output function of the proposed ELM-based boosting model to replace the linear combination of covariates in the BJ model. Because the logarithm of survival time is fitted to the covariates by the nonparametric ELM-based boosting method instead of the least squares method, the ELM-based BJ boosting model can capture both linear and nonlinear covariate effects. In both simulation studies and real data applications, in terms of the concordance index and the integrated Brier score, the proposed ELM-based BJ boosting model outperforms the traditional BJ model, two kinds of BJ boosting models proposed by Wang et al., RSF, and the Cox proportional hazards model.

10.
Zhiguo Li, Peter Gilbert, and Bin Nan (2008). Biometrics 64(4), 1247–1255.
Grouped failure time data arise often in HIV studies. In a recent preventive HIV vaccine efficacy trial, immune responses generated by the vaccine were measured from a case-cohort sample of vaccine recipients, who were subsequently evaluated for the study endpoint of HIV infection at prespecified follow-up visits. Gilbert et al. (2005, Journal of Infectious Diseases 191, 666–677) and Forthal et al. (2007, Journal of Immunology 178, 6596–6603) analyzed the association between the immune responses and HIV incidence with a Cox proportional hazards model, treating the HIV infection diagnosis time as a right-censored random variable. The data, however, are of the form of grouped failure time data with case-cohort covariate sampling, and we propose an inverse selection probability-weighted likelihood method for fitting the Cox model to these data. The method allows covariates to be time dependent, and uses multiple imputation to accommodate covariate data that are missing at random. We establish asymptotic properties of the proposed estimators, and present simulation results showing their good finite sample performance. We apply the method to the HIV vaccine trial data, showing that higher antibody levels are associated with a lower hazard of HIV infection.
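As a rough illustration of the inverse selection probability weighting idea only (not the grouped-failure-time likelihood or the multiple imputation used in the article), the sketch below fits a weighted continuous-time Cox model. The lifelines package, the column names, and the single subcohort sampling probability are all assumptions made for the example.

```python
import numpy as np
from lifelines import CoxPHFitter

def case_cohort_weighted_cox(df, subcohort_prob):
    """Fit a Cox model with inverse-selection-probability weights for a
    case-cohort sample: cases get weight 1, sampled non-case subcohort
    members get 1 / sampling probability. The "time" and "event" column
    names are hypothetical."""
    df = df.copy()
    df["ipw"] = np.where(df["event"] == 1, 1.0, 1.0 / subcohort_prob)
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event",
            weights_col="ipw", robust=True)      # robust variance for the weighted fit
    return cph
```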

11.
Prognostic models based on survival data frequently make use of the Cox proportional hazards model. Developing reliable Cox models with few events relative to the number of predictors can be challenging, even in low-dimensional datasets with many more observations than variables. In such a setting we examined the performance of methods used to estimate a Cox model, including (i) the full model using all available predictors and estimated by standard techniques, (ii) backward elimination (BE), (iii) ridge regression, (iv) the least absolute shrinkage and selection operator (lasso), and (v) the elastic net. Based on a prospective cohort of patients with manifest coronary artery disease (CAD), we performed a simulation study to compare the predictive accuracy, calibration, and discrimination of these approaches. The candidate predictors for incident cardiovascular events included clinical variables, biomarkers, and a selection of genetic variants associated with CAD. The penalized methods, i.e., ridge, lasso, and elastic net, showed comparable performance in terms of predictive accuracy, calibration, and discrimination, and outperformed BE and the full model. Excessive shrinkage was observed in some cases for the penalized methods, mostly in the simulation scenarios with the lowest ratio of the number of events to the number of variables. We conclude that in similar settings these three penalized methods can be used interchangeably. The full model and backward elimination are not recommended in rare event scenarios.
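To make the three penalized Cox variants in the comparison concrete, here is a small sketch using the lifelines package on simulated toy data. The package choice, the toy data, and the fixed penalizer value are illustrative assumptions; in practice the penalty strength would be tuned, for example by cross-validation, as in the study.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Toy data: 200 subjects, 10 covariates, event times depending on x0 only (illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
df = pd.DataFrame(X, columns=[f"x{j}" for j in range(10)])
df["time"] = rng.exponential(scale=np.exp(-0.5 * X[:, 0]))
df["event"] = rng.integers(0, 2, size=200)

# Ridge, lasso, and elastic net Cox fits differ only in the L1/L2 mixing ratio.
models = {"ridge": 0.0, "lasso": 1.0, "elastic net": 0.5}
for name, l1_ratio in models.items():
    cph = CoxPHFitter(penalizer=0.5, l1_ratio=l1_ratio)
    cph.fit(df, duration_col="time", event_col="event")
    print(name, cph.params_.round(2).to_dict())
```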

12.
The popularity of penalized regression in high-dimensional data analysis has led to a demand for new inferential tools for these models. False discovery rate control is widely used in high-dimensional hypothesis testing, but has only recently been considered in the context of penalized regression. Almost all of this work, however, has focused on lasso-penalized linear regression. In this paper, we derive a general method for controlling the marginal false discovery rate that can be applied to any penalized likelihood-based model, such as logistic regression and Cox regression. Our approach is fast, flexible, and can be used with a variety of penalty functions including lasso, elastic net, MCP, and MNet. We derive theoretical results under which the proposed method is valid, and use simulation studies to demonstrate that the approach is reasonably robust, albeit slightly conservative, when these assumptions are violated. Despite being conservative, we show that our method often offers more power to select causally important features than existing approaches. Finally, the practical utility of the method is demonstrated on gene expression datasets with binary and time-to-event outcomes.

13.
This paper focuses on the problems of estimation and variable selection in the functional linear regression model (FLM) with functional response and scalar covariates. To this end, two different types of regularization (L1 and L2) are considered. On the one hand, a sample approach for the functional LASSO, in terms of a basis representation of the sample values of the response variable, is proposed. On the other hand, we propose a penalized version of the FLM by introducing a P-spline penalty in the least squares fitting criterion. Our aim is to propose P-splines as a powerful tool for simultaneous variable selection and functional parameter estimation. In that sense, the importance of smoothing the response variable before fitting the model is also studied. In summary, penalized (L1 and L2) and nonpenalized regression are combined with a presmoothing of the response variable sample curves, based on regression splines or P-splines, providing a total of six approaches to be compared in two simulation schemes. Finally, the most competitive approach is applied to a real data set on graft-versus-host disease, which is one of the most frequent complications (30%–50%) in allogeneic hematopoietic stem-cell transplantation.
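The L2 (P-spline) part of the penalty can be written as a difference penalty on adjacent basis coefficients, giving a closed-form ridge-type fit. A minimal sketch of that fit is below; it covers only the P-spline side, not the functional lasso, and the function name is illustrative.

```python
import numpy as np

def pspline_fit(B, y, lam, order=2):
    """Penalized least-squares fit with a P-spline (difference) penalty:
    beta = (B'B + lam * D'D)^{-1} B'y, where D takes order-th differences
    of adjacent basis coefficients and B is a spline basis matrix."""
    k = B.shape[1]
    D = np.diff(np.eye(k), n=order, axis=0)      # difference penalty matrix
    return np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
```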

14.
Survival prediction from a large number of covariates is a current focus of statistical and medical research. In this paper, we study a methodology known as compound covariate prediction performed under univariate Cox proportional hazards models. We demonstrate via simulations and real data analysis that the compound covariate method generally competes well with ridge regression and lasso methods, both already well-studied methods for predicting survival outcomes with a large number of covariates. Furthermore, we develop a refinement of the compound covariate method by incorporating likelihood information from multivariate Cox models. The new proposal is an adaptive method that borrows information contained in both the univariate and multivariate Cox regression estimators. We show that the new proposal has a theoretical justification from statistical large sample theory and is naturally interpreted as a shrinkage-type estimator, a popular class of estimators in the statistical literature. Two datasets, the primary biliary cirrhosis of the liver data and the non-small-cell lung cancer data, are used for illustration. The proposed method is implemented in the R package “compound.Cox”, available on CRAN at http://cran.r-project.org/.
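A sketch of the basic compound covariate idea follows: univariate Cox coefficients combined into a single risk score. The article's refinement that also borrows multivariate Cox information, and its R implementation in compound.Cox, are not reproduced here; the lifelines package and the column names are assumptions made for the example.

```python
import numpy as np
from lifelines import CoxPHFitter

def compound_covariate_score(df, covariates, duration_col="time", event_col="event"):
    """Basic compound covariate predictor: a risk score built from univariate
    Cox coefficients, score_i = sum_j beta_hat_j * x_ij."""
    betas = []
    for cov in covariates:
        cph = CoxPHFitter().fit(df[[cov, duration_col, event_col]],
                                duration_col=duration_col, event_col=event_col)
        betas.append(cph.params_[cov])           # univariate log hazard ratio for this covariate
    return df[covariates].values @ np.array(betas)
```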

15.
The standard Cox model is perhaps the most commonly used model for regression analysis of failure time data, but it has some limitations, such as the assumption of linear covariate effects. To relax this, the nonparametric additive Cox model, which allows for nonlinear covariate effects, is often employed, and this paper discusses variable selection and structure estimation for this general model. For the problem, we propose a penalized sieve maximum likelihood approach using Bernstein polynomial approximation and group penalization. To implement the proposed method, an efficient group coordinate descent algorithm is developed that can be easily carried out for both low- and high-dimensional scenarios. Furthermore, a simulation study is performed to assess the performance of the presented approach and suggests that it works well in practice. The proposed method is applied to an Alzheimer's disease study for identifying important and relevant genetic factors.
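A minimal sketch of the Bernstein polynomial sieve mentioned above: each covariate, rescaled to [0, 1], would be expanded in this basis, and the basis coefficients belonging to one covariate would form one group for the group penalty. The function name and degree handling are illustrative.

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(t, degree):
    """Bernstein polynomial basis of a given degree evaluated at points t in [0, 1].

    Returns an (n, degree + 1) matrix whose k-th column is C(degree, k) t^k (1 - t)^(degree - k).
    """
    t = np.asarray(t, dtype=float)
    k = np.arange(degree + 1)
    return comb(degree, k) * t[:, None] ** k * (1 - t[:, None]) ** (degree - k)
```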

16.
Habitat-selection analysis lacks an appropriate measure of the ecological significance of the statistical estimates: a practical interpretation of the magnitude of the selection coefficients. There is a need for a standard approach that allows relating the strength of selection to a change in habitat conditions across space, a quantification of the estimated effect size that can be compared both within and across studies. We offer a solution, based on the epidemiological risk ratio, which we term the relative selection strength (RSS). For a “used-available” design with an exponential selection function, the RSS provides an appropriate interpretation of the magnitude of the estimated selection coefficients, conditional on all other covariates being fixed. This is similar to the interpretation of the regression coefficients in any multivariable regression analysis. Although technically correct, the conditional interpretation may be inappropriate when attempting to predict habitat use across a given landscape. Hence, we also provide a simple graphical tool that communicates both the conditional and average effect of the change in one covariate. The average-effect plot answers the question: What is the average change in the space use probability as we change the covariate of interest, while averaging over possible values of other covariates? We illustrate an application of the average-effect plot for the average effect of distance to road on space use for elk (Cervus elaphus) during the hunting season. We provide a list of potentially useful RSS expressions and discuss the utility of the RSS in the context of common ecological applications.
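For an exponential (log-linear) selection function, the RSS has a simple closed form, exp(beta' (x1 - x2)), comparing two locations x1 and x2 with all other covariates held fixed. A one-line sketch follows; the coefficient value and distances in the example are made up for illustration.

```python
import numpy as np

def relative_selection_strength(beta, x1, x2):
    """RSS(x1, x2) = exp(beta' (x1 - x2)): how many times more likely location
    x1 is to be selected than location x2, all other covariates held fixed."""
    return float(np.exp(np.dot(beta, np.asarray(x1) - np.asarray(x2))))

# Made-up example: a coefficient of -0.5 per km of distance to road implies that a
# point 2 km from the road is selected about exp(-0.5) ~= 0.61 times as often as
# a point 1 km from the road.
print(relative_selection_strength([-0.5], [2.0], [1.0]))
```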

17.
Gene expression measurements have successfully been used for building prognostic signatures, i.e., for identifying a short list of important genes that can predict patient outcome. Mostly microarray measurements have been considered, and there is little advice available for building multivariable risk prediction models from RNA-Seq data. We specifically consider penalized regression techniques, such as the lasso and componentwise boosting, which can simultaneously consider all measurements and provide both multivariable regression models for prediction and automated variable selection. However, they might be affected by the typical skewness, mean-variance dependency, or extreme values of RNA-Seq covariates and therefore could benefit from transformations of the latter. In an analytical part, we highlight preferential selection of covariates with large variances, which is problematic due to the mean-variance dependency of RNA-Seq data. In a simulation study, we compare different transformations of RNA-Seq data for potentially improving detection of important genes. Specifically, we consider standardization, the log transformation, a variance-stabilizing transformation, the Box-Cox transformation, and rank-based transformations. In addition, the prediction performance for real data from patients with kidney cancer and acute myeloid leukemia is considered. We show that signature size, identification performance, and prediction performance critically depend on the choice of a suitable transformation. Rank-based transformations perform well in all scenarios and can even outperform complex variance-stabilizing approaches. Generally, the results illustrate that the distribution and potential transformations of RNA-Seq data need to be considered as a critical step when building risk prediction models by penalized regression techniques.
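Several of the transformations compared in the abstract are easy to state explicitly. Below is an illustrative column-wise sketch of three of them (shifted log, standardization, and a rank-based normal-scores transform); the pseudo-count of 1 and the other constants are placeholders, and this is not the exact variance-stabilizing transformation evaluated in the article.

```python
import numpy as np
from scipy.stats import rankdata, norm

def transform_counts(X, method="rank"):
    """Apply a covariate transformation column-wise to an RNA-Seq matrix X
    (samples x genes). Illustrative only."""
    if method == "log":
        return np.log2(X + 1.0)                                   # shifted log
    if method == "standardize":
        return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)     # zero mean, unit variance
    if method == "rank":
        n = X.shape[0]
        ranks = np.apply_along_axis(rankdata, 0, X)
        return norm.ppf(ranks / (n + 1.0))                        # rank-based normal scores
    raise ValueError(method)
```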

18.
This paper introduces a flexible and adaptive nonparametric method for estimating the association between multiple covariates and power spectra of multiple time series. The proposed approach uses a Bayesian sum of trees model to capture complex dependencies and interactions between covariates and the power spectrum, which are often observed in studies of biomedical time series. Local power spectra corresponding to terminal nodes within trees are estimated nonparametrically using Bayesian penalized linear splines. The trees are considered to be random and fit using a Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm that sequentially considers tree modifications via reversible-jump MCMC techniques. For high-dimensional covariates, a sparsity-inducing Dirichlet hyperprior on tree splitting proportions is considered, which provides sparse estimation of covariate effects and efficient variable selection. By averaging over the posterior distribution of trees, the proposed method can recover both smooth and abrupt changes in the power spectrum across multiple covariates. Empirical performance is evaluated via simulations to demonstrate the proposed method's ability to accurately recover complex relationships and interactions. The proposed methodology is used to study gait maturation in young children by evaluating age-related changes in power spectra of stride interval time series in the presence of other covariates.

19.
For regression with covariates missing not at random, where the missingness depends on the missing covariate values, complete-case (CC) analysis leads to consistent estimation when the missingness is independent of the response given all covariates, but it may not have the desired level of efficiency. We propose a general empirical likelihood framework to improve estimation efficiency over the CC analysis. We expand on methods in Bartlett et al. (2014, Biostatistics 15, 719–730) and Xie and Zhang (2017, Int J Biostat 13, 1–20) that improve efficiency by modeling the missingness probability conditional on the response and fully observed covariates, by allowing the possibility of modeling other data distribution-related quantities. We also give guidelines on what quantities to model and demonstrate that our proposal has the potential to yield smaller biases than existing methods when the missingness probability model is incorrect. Simulation studies are presented, as well as an application to data collected from the US National Health and Nutrition Examination Survey.

20.
Dietary assessment of episodically consumed foods gives rise to nonnegative data that have excess zeros and measurement error. Tooze et al. (2006, Journal of the American Dietetic Association 106, 1575–1587) describe a general statistical approach (the National Cancer Institute method) for modeling such food intakes reported on two or more 24-hour recalls (24HRs) and demonstrate its use to estimate the distribution of the food's usual intake in the general population. In this article, we propose an extension of this method to predict individual usual intake of such foods and to evaluate the relationships of usual intakes with health outcomes. Following the regression calibration approach for measurement error correction, individual usual intake is generally predicted as the conditional mean intake given 24HR-reported intake and other covariates in the health model. One feature of the proposed method is that additional covariates potentially related to usual intake may be used to increase the precision of estimates of usual intake and of diet-health outcome associations. Applying the method to data from the Eating at America's Table Study, we quantify the increased precision obtained from including reported frequency of intake on a food frequency questionnaire (FFQ) as a covariate in the calibration model. We then demonstrate the method in evaluating the linear relationship between log blood mercury levels and fish intake in women by using data from the National Health and Nutrition Examination Survey, and show increased precision when including the FFQ information. Finally, we present simulation results evaluating the performance of the proposed method in this context.

