Similar articles
 20 similar articles found (search time: 437 ms)
1.
2.
Summary. We consider variable selection in the Cox regression model (Cox, 1975, Biometrika 62, 269–276) with covariates missing at random. We investigate the smoothly clipped absolute deviation (SCAD) penalty and the adaptive least absolute shrinkage and selection operator (LASSO) penalty, and propose a unified model selection and estimation procedure. A computationally attractive algorithm is developed, which simultaneously optimizes the penalized likelihood function and penalty parameters. We also optimize a model selection criterion, called the ICQ statistic (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648–1658), to estimate the penalty parameters and show that it consistently selects all important covariates. Simulations are performed to evaluate the finite sample performance of the penalty estimates. Also, two lung cancer data sets are analyzed to demonstrate the proposed methodology.
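
The abstract names the two penalties but does not give their formulas. As a rough illustration only, the sketch below evaluates the commonly used form of the SCAD penalty and an adaptive LASSO penalty for a single coefficient. The function names, the default `a = 3.7`, and the `gamma` exponent are assumptions made for this example; they are not taken from the paper, and the code does not implement the paper's missing-covariate algorithm.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """SCAD penalty for one coefficient (standard piecewise form).

    lam is the penalty level; a > 2 controls where the penalty flattens.
    The default a = 3.7 is a conventional choice, assumed here.
    """
    if lam < 0 or a <= 2:
        raise ValueError("require lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        # Small coefficients: identical to the LASSO penalty.
        return lam * t
    if t <= a * lam:
        # Middle range: quadratic piece that joins the two regimes smoothly.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


def adaptive_lasso_penalty(theta: float, lam: float,
                           initial_estimate: float, gamma: float = 1.0) -> float:
    """Adaptive LASSO penalty: an L1 penalty weighted by 1/|initial estimate|^gamma.

    initial_estimate is a hypothetical consistent pilot estimate of the coefficient.
    """
    if initial_estimate == 0:
        raise ValueError("initial_estimate must be nonzero")
    return lam * abs(theta) / abs(initial_estimate) ** gamma
```

Both penalties treat small coefficients like the ordinary LASSO; they differ in how they stop over-shrinking large coefficients (a flat penalty for SCAD, a smaller weight for the adaptive LASSO).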

3.
Statistical models are simple mathematical rules derived from empirical data describing the association between an outcome and several explanatory variables. In a typical modeling situation statistical analysis often involves a large number of potential explanatory variables and frequently only partial subject-matter knowledge is available. Therefore, selecting the most suitable variables for a model in an objective and practical manner is usually a non-trivial task. We briefly revisit the purposeful variable selection procedure suggested by Hosmer and Lemeshow which combines significance and change-in-estimate criteria for variable selection and critically discuss the change-in-estimate criterion. We show that using a significance-based threshold for the change-in-estimate criterion reduces to a simple significance-based selection of variables, as if the change-in-estimate criterion is not considered at all. Various extensions to the purposeful variable selection procedure are suggested. We propose to use backward elimination augmented with a standardized change-in-estimate criterion on the quantity of interest usually reported and interpreted in a model for variable selection. Augmented backward elimination has been implemented in a SAS macro for linear, logistic and Cox proportional hazards regression. The algorithm and its implementation were evaluated by means of a simulation study. Augmented backward elimination tends to select larger models than backward elimination and approximates the unselected model up to negligible differences in point estimates of the regression coefficients. On average, regression coefficients obtained after applying augmented backward elimination were less biased relative to the coefficients of correctly specified models than after backward elimination. 
In summary, we propose augmented backward elimination as a reproducible variable selection algorithm that gives the analyst more flexibility in adapting model selection to a specific statistical modeling situation.

4.
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well‐established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10–30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change‐in‐estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise stability of a final model, unbiasedness of regression coefficients, and validity of p‐values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on application of variable selection methods in general (low‐dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.

5.
A recurring methodological problem in the evaluation of the predictive validity of selection methods is that the values of the criterion variable are available for selected applicants only. This so-called range restriction problem causes biased population estimates. Correction methods for direct and indirect range restriction scenarios have been widely studied for continuous criterion variables but not for dichotomous ones. The few existing approaches are inapplicable because they do not consider the unknown base rate of success. Hence, there is a lack of scientific research on suitable correction methods and on the systematic analysis of their accuracy in the case of a naturally or artificially dichotomous criterion. We aim to overcome this deficiency by viewing the range restriction problem as a missing data mechanism. We used multiple imputation by chained equations to generate complete criterion data before estimating the predictive validity and the base rate of success. Monte Carlo simulations were conducted to investigate the accuracy of the proposed correction as a function of selection ratio, predictive validity, and base rate of success in an experimental design. In addition, we compared our proposed missing data approach with Thorndike’s well-known correction formulas, which have so far been used only in the case of continuous criterion variables. The results show that the missing data approach is more accurate in estimating the predictive validity than Thorndike’s correction formulas. The accuracy of our proposed correction increases as the selection ratio and the correlation between predictor and criterion increase. Furthermore, the missing data approach provides a valid estimate of the unknown base rate of success. On the basis of our findings, we argue for the use of multiple imputation by chained equations in the evaluation of the predictive validity of selection methods when the criterion is dichotomous.
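
For orientation, the classical correction for direct range restriction on the predictor (often called Thorndike's Case II) can be written in a few lines. This is a minimal sketch of that textbook formula for a continuous criterion, not of the multiple-imputation approach the abstract proposes; the function and argument names are assumptions made for this example.

```python
import math


def thorndike_case2(r_restricted: float, sd_unrestricted: float,
                    sd_restricted: float) -> float:
    """Correct a correlation for direct range restriction on the predictor.

    r_restricted    : predictor-criterion correlation in the selected group
    sd_unrestricted : predictor standard deviation among all applicants
    sd_restricted   : predictor standard deviation among selected applicants
    """
    if not -1.0 <= r_restricted <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted  # ratio > 1 when selection narrows the range
    r = r_restricted
    return r * u / math.sqrt(1 - r * r + r * r * u * u)
```

When the two standard deviations are equal the correlation is returned unchanged; the more selection narrows the predictor's spread, the larger the upward correction.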

6.
Nonparametric mixed effects models for unequally sampled noisy curves   (cited 7 times: 0 self-citations, 7 by others)
Rice JA, Wu CO. Biometrics 2001, 57(1): 253-259
We propose a method of analyzing collections of related curves in which the individual curves are modeled as spline functions with random coefficients. The method is applicable when the individual curves are sampled at variable and irregularly spaced points. This produces a low-rank, low-frequency approximation to the covariance structure, which can be estimated naturally by the EM algorithm. Smooth curves for individual trajectories are constructed as best linear unbiased predictor (BLUP) estimates, combining data from that individual and the entire collection. This framework leads naturally to methods for examining the effects of covariates on the shapes of the curves. We use model selection techniques--Akaike information criterion (AIC), Bayesian information criterion (BIC), and cross-validation--to select the number of breakpoints for the spline approximation. We believe that the methodology we propose provides a simple, flexible, and computationally efficient means of functional data analysis.
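
To make the AIC/BIC comparison concrete, the sketch below scores candidate numbers of breakpoints from a residual sum of squares. It is a deliberately simplified illustration: it assumes independent Gaussian errors and an ordinary least-squares spline fit rather than the mixed-effects likelihood used in the paper, and the function names and the example numbers are hypothetical.

```python
import math


def gaussian_aic_bic(rss: float, n: int, k: int) -> tuple[float, float]:
    """AIC and BIC for a least-squares fit with Gaussian errors.

    rss : residual sum of squares of the fit
    n   : number of observations
    k   : number of estimated mean parameters (e.g. spline coefficients);
          the error variance is counted as one additional parameter.
    """
    if rss <= 0 or n <= 0:
        raise ValueError("rss and n must be positive")
    # Maximized Gaussian log-likelihood with the variance estimated as rss / n.
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    n_params = k + 1
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n) - 2 * loglik
    return aic, bic


def select_breakpoints(fits: dict[int, tuple[float, int]], n: int) -> tuple[int, int]:
    """Return the number of breakpoints minimizing AIC and BIC, respectively.

    fits maps a candidate number of breakpoints to (rss, k) for that fit.
    """
    scores = {m: gaussian_aic_bic(rss, n, k) for m, (rss, k) in fits.items()}
    best_aic = min(scores, key=lambda m: scores[m][0])
    best_bic = min(scores, key=lambda m: scores[m][1])
    return best_aic, best_bic
```

For example, with `n = 100` and hypothetical fits `{1: (50.0, 5), 2: (40.0, 6), 3: (39.9, 7)}`, the third breakpoint barely reduces the residual sum of squares, so both criteria favour two breakpoints.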

7.
Boos DD, Stefanski LA, Wu Y. Biometrics 2009, 65(3): 692-700
Summary. A new version of the false selection rate variable selection method of Wu, Boos, and Stefanski (2007, Journal of the American Statistical Association 102, 235–243) is developed that requires no simulation. This version allows the tuning parameter in forward selection to be estimated simply by hand calculation from a summary table of output even for situations where the number of explanatory variables is larger than the sample size. Because of the computational simplicity, the method can be used in permutation tests and inside bagging loops for improved prediction. Illustration is provided in clinical trials for linear regression, logistic regression, and Cox proportional hazards regression.

8.
Donohue MC, Overholser R, Xu R, Vaida F. Biometrika 2011, 98(3): 685-700
We study model selection for clustered data, when the focus is on cluster specific inference. Such data are often modelled using random effects, and conditional Akaike information was proposed in Vaida & Blanchard (2005) and used to derive an information criterion under linear mixed models. Here we extend the approach to generalized linear and proportional hazards mixed models. Outside the normal linear mixed models, exact calculations are not available and we resort to asymptotic approximations. In the presence of nuisance parameters, a profile conditional Akaike information is proposed. Bootstrap methods are considered for their potential advantage in finite samples. Simulations show that the performance of the bootstrap and the analytic criteria are comparable, with bootstrap demonstrating some advantages for larger cluster sizes. The proposed criteria are applied to two cancer datasets to select models when the cluster-specific inference is of interest.

9.
A new method for choosing the variables with the greatest discriminatory power in the location model for mixed variable discriminant analysis is presented in the paper. The procedure, based on a multivariate discriminatory measure, enables a simultaneous reduction of the number of discrete and continuous variables. The introduced criterion can be used for either optimal or stepwise selection of a variable subset. As an example, the results of stepwise variable selection for some medical data are presented.

10.
Sliced inverse regression with regularizations   (cited 2 times: 0 self-citations, 2 by others)
Li L, Yin X. Biometrics 2008, 64(1): 124-131
Summary. In high-dimensional data analysis, sliced inverse regression (SIR) has proven to be an effective dimension reduction tool and has enjoyed wide applications. The usual SIR, however, cannot work with problems where the number of predictors, p, exceeds the sample size, n, and can suffer when there is high collinearity among the predictors. In addition, the reduced dimensional space consists of linear combinations of all the original predictors and no variable selection is achieved. In this article, we propose a regularized SIR approach based on the least-squares formulation of SIR. The L2 regularization is introduced, and an alternating least-squares algorithm is developed, to enable SIR to work with n < p and highly correlated predictors. The L1 regularization is further introduced to achieve simultaneous reduction estimation and predictor selection. Both simulations and the analysis of a microarray expression data set demonstrate the usefulness of the proposed method.

11.
Xu S. Biometrics 2007, 63(2): 513-521
Summary. The genetic variance of a quantitative trait is often controlled by the segregation of multiple interacting loci. Linear model regression analysis is usually applied to estimating and testing effects of these quantitative trait loci (QTL). Including all the main effects and the effects of interaction (epistatic effects), the dimension of the linear model can be extremely high. Variable selection via stepwise regression or stochastic search variable selection (SSVS) is the common procedure for epistatic effect QTL analysis. These methods are computationally intensive, yet they may not be optimal. The LASSO (least absolute shrinkage and selection operator) method is computationally more efficient than the above methods. As a result, it has been widely used in regression analysis for large models. However, LASSO has never been applied to genetic mapping for epistatic QTL, where the number of model effects is typically many times larger than the sample size. In this study, we developed an empirical Bayes method (E-BAYES) to map epistatic QTL under the mixed model framework. We also tested the feasibility of using LASSO to estimate epistatic effects, examined the fully Bayesian SSVS, and reevaluated the penalized likelihood (PENAL) methods in mapping epistatic QTL. Simulation studies showed that all the above methods performed satisfactorily well. However, E-BAYES appears to outperform all other methods in terms of minimizing the mean-squared error (MSE) with relatively short computing time. Application of the new method to real data was demonstrated using a barley dataset.
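
Part of the reason LASSO is computationally cheap is that each coefficient update reduces to a soft-thresholding step. The sketch below shows a plain coordinate-descent LASSO for the objective (1/(2n))·||y − Xβ||² + λ·||β||₁. It is a generic illustration and not the E-BAYES method or the implementation used in the study; the function names, the objective scaling, and the fixed iteration count are assumptions made for this example.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam, returning exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X: list[list[float]], y: list[float],
                             lam: float, n_iter: int = 100) -> list[float]:
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows with p columns each; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of column j with the partial residual
            z = 0.0    # mean squared value of column j
            for i in range(n):
                # Prediction from all columns except j.
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            # An all-zero column carries no information; keep its coefficient at 0.
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta
```

Coefficients whose marginal signal falls below `lam` are set exactly to zero, which is how the method performs variable selection and estimation in a single pass over the model effects.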

12.
Wang L, Li R. Biometrics 2009, 65(2): 564-571
Summary. Shrinkage-type variable selection procedures have recently seen increasing applications in biomedical research. However, their performance can be adversely influenced by outliers in either the response or the covariate space. This article proposes a weighted Wilcoxon-type smoothly clipped absolute deviation (WW-SCAD) method, which deals with robust variable selection and robust estimation simultaneously. The new procedure can be conveniently implemented with the statistical software R. We establish that the WW-SCAD correctly identifies the set of zero coefficients with probability approaching one and estimates the nonzero coefficients with the rate n^(-1/2). Moreover, with appropriately chosen weights the WW-SCAD is robust with respect to outliers in both the x and y directions. The important special case with constant weights yields an oracle-type estimator with high efficiency in the presence of heavier-tailed random errors. The robustness of the WW-SCAD is partly justified by its asymptotic performance under local shrinking contamination. We propose a Bayesian information criterion type tuning parameter selector for the WW-SCAD. The performance of the WW-SCAD is demonstrated via simulations and by an application to a study that investigates the effects of personal characteristics and dietary factors on plasma beta-carotene level.

13.
Sun W, Li L. Biometrics 2012, 68(1): 12-22
Despite the recent flourish of proposals on variable selection, genome-wide multiple loci mapping remains challenging. The majority of existing variable selection methods impose a model, often the homoscedastic linear model, prior to selection. However, the true association between the phenotypical trait and the genetic markers is rarely known a priori, and the presence of epistatic interactions makes the association more complex than a linear relation. Model-free variable selection offers a useful alternative in this context, but the fact that the number of markers p often far exceeds the number of experimental units n renders all the existing model-free solutions that require n > p inapplicable. In this article, we examine a number of model-free variable selection methods for small-n-large-p regressions in the context of genome-wide multiple loci mapping. We propose and advocate a multivariate group-wise adaptive penalization solution, which requires no model prespecification and thus works for complex trait-marker association, and which handles one variable at a time so that it works for n < p. Effectiveness of the new method is demonstrated through both intensive simulations and a comprehensive real data analysis across 6100 gene expression traits.

14.
Bayesian multimodel inference for geostatistical regression models   (cited 2 times: 0 self-citations, 2 by others)
Johnson DS, Hoeting JA. PLoS ONE 2011, 6(11): e25677
The problem of simultaneous covariate selection and parameter inference for spatial regression models is considered. Previous research has shown that failure to take spatial correlation into account can influence the outcome of standard model selection methods. A Markov chain Monte Carlo (MCMC) method is investigated for the calculation of parameter estimates and posterior model probabilities for spatial regression models. The method can accommodate normal and non-normal response data and a large number of covariates. Thus the method is very flexible and can be used to fit spatial linear models, spatial linear mixed models, and spatial generalized linear mixed models (GLMMs). The Bayesian MCMC method also allows a priori unequal weighting of covariates, which is not possible with many model selection methods such as Akaike's information criterion (AIC). The proposed method is demonstrated on two data sets. The first is the whiptail lizard data set, which has been previously analyzed by other researchers investigating model selection methods. Our results confirmed the previous analysis suggesting that sandy soil and ant abundance were strongly associated with lizard abundance. The second data set concerned pollution tolerant fish abundance in relation to several environmental factors. Results indicate that abundance is positively related to Strahler stream order and a habitat quality index. Abundance is negatively related to percent watershed disturbance.

15.
We develop a new method for variable selection in a nonlinear additive function-on-scalar regression (FOSR) model. Existing methods for variable selection in FOSR have focused on the linear effects of scalar predictors, which can be a restrictive assumption in the presence of multiple continuously measured covariates. We propose a computationally efficient approach for variable selection in existing linear FOSR using functional principal component scores of the functional response and extend this framework to a nonlinear additive function-on-scalar model. The proposed method provides a unified and flexible framework for variable selection in FOSR, allowing nonlinear effects of the covariates. Numerical analysis using a simulation study illustrates the advantages of the proposed method over existing variable selection methods in FOSR even when the underlying covariate effects are all linear. The proposed procedure is demonstrated on accelerometer data from the 2003–2004 cohorts of the National Health and Nutrition Examination Survey (NHANES) in understanding the association between diurnal patterns of physical activity and demographic, lifestyle, and health characteristics of the participants.

16.
Semiparametric Regression in Size-Biased Sampling   (cited 1 time: 0 self-citations, 1 by others)
Ying Qing Chen. Biometrics 2010, 66(1): 149-158
Summary. Size-biased sampling arises when a positive-valued outcome variable is sampled with selection probability proportional to its size. In this article, we propose a semiparametric linear regression model to analyze size-biased outcomes. In our proposed model, the regression parameters of covariates are of major interest, while the distribution of random errors is unspecified. Under the proposed model, we discover that regression parameters are invariant regardless of size-biased sampling. Following this invariance property, we develop a simple estimation procedure for inferences. Our proposed methods are evaluated in simulation studies and applied to two real data analyses.

17.
Summary. The present article deals with informative missing (IM) exposure data in matched case–control studies. When the missingness mechanism depends on the unobserved exposure values, modeling the missing data mechanism is inevitable. Therefore, a full likelihood-based approach for handling IM data has been proposed by positing a model for selection probability, and a parametric model for the partially missing exposure variable among the control population along with a disease risk model. We develop an EM algorithm to estimate the model parameters. Three special cases are discussed in detail: (a) a binary exposure variable, (b) a normally distributed exposure variable, and (c) a lognormally distributed exposure variable. The method is illustrated by analyzing a real matched case–control data set with a missing exposure variable. The performance of the proposed method is evaluated through simulation studies, and the robustness of the proposed method to violations of different types of model assumptions has been considered.

18.
The problem of variable selection in generalized linear‐mixed models (GLMMs) is pervasive in statistical practice. For the purpose of variable selection, many methodologies for determining the best subset of explanatory variables currently exist according to the model complexity and differences between applications. In this paper, we develop a “higher posterior probability model with bootstrap” (HPMB) approach to select explanatory variables without fitting all possible GLMMs involving a small or moderate number of explanatory variables. Furthermore, to save computational load, we propose an efficient approximation approach with Laplace's method and Taylor's expansion to approximate intractable integrals in GLMMs. Simulation studies and an application to HapMap data provide evidence that this selection approach is computationally feasible and reliable for exploring true candidate genes and gene–gene associations, after adjusting for complex structures among clusters.

19.
This paper focuses on the problems of estimation and variable selection in the functional linear regression model (FLM) with functional response and scalar covariates. To this end, two different types of regularization (L1 and L2) are considered. On the one hand, a sample approach for functional LASSO in terms of basis representation of the sample values of the response variable is proposed. On the other hand, we propose a penalized version of the FLM by introducing a P-spline penalty in the least squares fitting criterion. Our aim is to propose P-splines as a powerful tool for simultaneous variable selection and functional parameter estimation. In that sense, the importance of smoothing the response variable before fitting the model is also studied. In summary, penalized (L1 and L2) and nonpenalized regression are combined with a presmoothing of the response variable sample curves, based on regression splines or P-splines, providing a total of six approaches to be compared in two simulation schemes. Finally, the most competitive approach is applied to a real data set on graft-versus-host disease, which is one of the most frequent complications (30%–50%) in allogeneic hematopoietic stem-cell transplantation.

20.
Economic weights have been estimated in two breeds (Latxa and Manchega) using economic and technical data collected in 41 Latxa and 12 Manchega dairy sheep flocks. The traits considered were fertility (lambing per year), prolificacy (number of lambs), milk yield (litres) and longevity (as productive life, in years). A linear function was used, relating these traits to the different costs in the flock. The variable costs involved in the profit function were feed and labour. From this function, economic weights were obtained. Labour is considered in the Latxa breed to be a constraint. Moreover, farm profits are unusually high, which probably means that some costs were not included according to the economic theory. For that reason, a rescaling procedure was applied constraining total labour time at the farm. Genetic gains were estimated with the resulting economic weights to test whether they make any practical difference. Using milk yield alone as the selection criterion was also considered. The medians of the estimated economic weights for fertility, prolificacy, milk yield and longevity were 138.60 € per lambing, 40.00 € per lamb, 1.18 € per l, 1.66 € per year, and 137.66 € per lambing, 34.17 € per lamb, 0.73 € per l, 2.16 € per year under the linear approach in the Latxa and Manchega breeds respectively. Most differences between breeds can be related to differences in production systems. As for the genetic gains, they were very similar for all economic weights, except when only milk yield was considered, where a correlated decrease in fertility led to a strong decrease in profit. It is concluded that the estimates are robust for practical purposes and that breeding programmes should consider inclusion of fertility. More research is needed to include other traits such as somatic cell score, milk composition and udder traits.
