Similar Documents
20 similar documents found (search time: 31 ms)
1.
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. The theory of statistical models is well established if the set of independent variables to consider is fixed and small; we can then assume that effect estimates are unbiased and that the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and we are often confronted with 10–30 candidate variables, a number often too large to be considered in a statistical model. We provide an overview of the available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and later transferred to generalized linear models and models for censored survival data. Variable selection, in particular when used in explanatory modeling where effect estimates are of central interest, can compromise the stability of a final model, the unbiasedness of regression coefficients, and the validity of p-values or confidence intervals. We therefore give pragmatic recommendations for the practicing statistician on applying variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities, based on resampling the entire variable selection process, that should be routinely reported by software packages offering automated variable selection algorithms.
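The resampling-based stability investigation recommended above can be illustrated with a small sketch: resample the data, rerun the entire selection procedure, and report how often each candidate variable is selected. The data and the crude significance-based selector below are purely hypothetical, not the authors' procedure.

```python
import numpy as np

def select_by_significance(X, y, threshold=2.0):
    """Crude significance-based selection: fit OLS, keep variables with |t| > threshold."""
    n, p = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = beta / np.sqrt(np.diag(cov))
    return np.abs(t) > threshold

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.standard_normal(n)  # only x0 and x1 matter

B = 200
freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, n)            # bootstrap resample of the rows
    freq += select_by_significance(X[idx], y[idx])
freq /= B                                  # bootstrap inclusion frequencies
print(np.round(freq, 2))
```

True predictors should show inclusion frequencies near one; pure noise variables should hover near the test's size, which is exactly the kind of quantity the abstract suggests software should report.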

2.
In health services and outcomes research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the maximum likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties, including standard error formulae, are provided. A simulation study shows that the new algorithm not only gives more accurate, or at least comparable, estimates but is also more robust than traditional stepwise variable selection. The proposed methods are applied to analyze health care demand in Germany using the open-source R package mpath.
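The LASSO, SCAD, and MCP penalties enter coordinate descent through their univariate thresholding operators. A minimal sketch of the three operators in their standard textbook forms follows; this is illustrative only, not code from the paper or from mpath.

```python
import numpy as np

def soft(z, lam):
    """LASSO (soft) threshold: shrinks every coefficient toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad(z, lam, a=3.7):
    """SCAD threshold (a > 2): soft near zero, linear blend, identity for |z| > a*lam."""
    az = np.abs(z)
    return np.where(az <= 2 * lam, soft(z, lam),
           np.where(az <= a * lam,
                    ((a - 1) * z - np.sign(z) * a * lam) / (a - 2), z))

def mcp(z, lam, gamma=3.0):
    """MCP threshold (gamma > 1): rescaled soft threshold, identity for |z| > gamma*lam."""
    az = np.abs(z)
    return np.where(az <= gamma * lam, soft(z, lam) / (1 - 1 / gamma), z)

z = np.array([-3.0, -0.5, 0.2, 1.2, 5.0])
print(soft(z, 1.0))   # every entry shrunk by 1 (bias even for large |z|)
print(scad(z, 1.0))   # entries beyond a*lam left untouched
print(mcp(z, 1.0))    # entries beyond gamma*lam left untouched
```

The contrast visible in the output is the usual motivation for SCAD and MCP: unlike the LASSO, they leave large coefficients unbiased.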

3.
Summary: Gene co-expressions have been widely used in the analysis of microarray gene expression data. However, the co-expression patterns between two genes can be mediated by cellular states, as reflected by the expression of other genes, single nucleotide polymorphisms, and the activity of protein kinases. In this article, we introduce a bivariate conditional normal model for identifying the variables that can mediate the co-expression patterns between two genes. Based on this model, we introduce a likelihood ratio (LR) test and a penalized likelihood procedure for identifying the mediators that affect gene co-expression patterns. We propose an efficient computational algorithm based on iteratively reweighted least squares and cyclic coordinate descent, and show that when the tuning parameter in the penalized likelihood is appropriately selected, the procedure has the oracle property in selecting the variables. We present simulation results showing that the LR-based approach can perform similarly to or better than the existing method of liquid association, and that the penalized likelihood procedure can be quite effective in selecting the mediators. We apply the proposed method to yeast gene expression data in order to identify the kinases or single nucleotide polymorphisms that mediate the co-expression patterns between genes.

4.
Bondell HD, Reich BJ. Biometrics 2008, 64(1):115-123
Summary: Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In this article, a new method called OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables and group them into predictive clusters. In addition to improving prediction accuracy and interpretation, the resulting groups can be investigated further to discover what contributes to their similar behavior. The technique is based on penalized least squares with a geometrically intuitive penalty function that shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably to existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information.
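The OSCAR penalty itself is simple to evaluate: lam1 times the L1 norm plus lam2 times the sum of pairwise maxima of absolute coefficients, and the pairwise-max sum reduces to a rank-weighted sum of the sorted absolute values. A small illustrative sketch of the penalty value (not the fitting algorithm):

```python
import numpy as np

def oscar_penalty(beta, lam1, lam2):
    """OSCAR penalty: lam1 * sum_j |b_j| + lam2 * sum_{j<k} max(|b_j|, |b_k|).
    With |beta| sorted ascending, the pairwise-max sum equals
    sum_j rank_j * |b|_(j) using 0-based ranks."""
    a = np.sort(np.abs(beta))          # ascending sorted absolute values
    ranks = np.arange(len(a))          # 0, 1, ..., p-1
    return lam1 * a.sum() + lam2 * (ranks * a).sum()

beta = np.array([0.0, -1.0, 2.0])
# pairwise maxima: max(0,1)=1, max(0,2)=2, max(1,2)=2, so the pairwise sum is 5
print(oscar_penalty(beta, 1.0, 1.0))   # 3 + 5 = 8.0
```

The max terms are what induce exact equality of coefficients, the clustering behavior described in the abstract.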

5.
MOTIVATION: In the context of sample (e.g. tumor) classifications with microarray gene expression data, many methods have been proposed. However, almost all the methods ignore existing biological knowledge and treat all the genes equally a priori. On the other hand, because some genes have been identified by previous studies to have biological functions or to be involved in pathways related to the outcome (e.g. cancer), incorporating this type of prior knowledge into a classifier can potentially improve both the predictive performance and interpretability of the resulting model. RESULTS: We propose a simple and general framework to incorporate such prior knowledge into building a penalized classifier. As two concrete examples, we apply the idea to two penalized classifiers, nearest shrunken centroids (also called PAM) and penalized partial least squares (PPLS). Instead of treating all the genes equally a priori as in standard penalized methods, we group the genes according to their functional associations based on existing biological knowledge or data, and adopt group-specific penalty terms and penalization parameters. Simulated and real data examples demonstrate that, if prior knowledge on gene grouping is indeed informative, our new methods perform better than the two standard penalized methods, yielding higher predictive accuracy and screening out more irrelevant genes.

6.
Prenatal exposure to carcinogenic polycyclic aromatic hydrocarbons (c-PAHs) through maternal inhalation induces higher risk for a wide range of fetotoxic effects. However, the most health-relevant dose function from chronic gestational exposure remains unclear, and whether there is a gestational window during which the human embryo/fetus is particularly vulnerable to PAHs has not been examined thoroughly. We consider a longitudinal semiparametric mixed-effect model to characterize the individual prenatal PAH exposure trajectory, in which a nonparametric cyclic smooth function plus a linear function models the time effect and random effects account for the within-subject correlation. We propose a penalized least squares approach to estimate the parametric regression coefficients and the nonparametric function of time. The smoothing parameter and variance components are selected using the generalized cross-validation (GCV) criterion. The estimated subject-specific trajectory of prenatal exposure is linked to the birth outcomes through a set of functional linear models, where the coefficient of log PAH exposure is a fully nonparametric function of gestational age. This allows the effect of PAH exposure on each birth outcome to vary across gestational ages, and a window associated with a significant adverse effect is identified as a vulnerable prenatal window for PAH effects on fetal growth. We minimize the penalized sum of squared errors using a spline-based expansion of the nonparametric coefficient function to draw statistical inferences, with the smoothing parameter again chosen through GCV.
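The GCV criterion used to select the smoothing parameter has a simple generic form for any linear smoother with hat matrix H: GCV(lam) = n * RSS / (n - tr(H))^2. A sketch with a ridge-type smoother on simulated data (illustrative; not the authors' semiparametric mixed-effect model):

```python
import numpy as np

def gcv(y, X, lam):
    """Generalized cross-validation score for a ridge-type linear smoother:
    GCV(lam) = n * RSS / (n - tr(H))^2, with H = X (X'X + lam I)^(-1) X'."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return n * (resid @ resid) / (n - np.trace(H)) ** 2

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = X[:, 0] + 0.5 * rng.standard_normal(100)

# Pick the smoothing parameter by minimizing GCV over a grid
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lams, key=lambda l: gcv(y, X, l))
print(best)
```

The same criterion applies to penalized spline smoothers once H is built from the basis and penalty matrices.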

7.
Yuanjia Wang, Huaihou Chen. Biometrics 2012, 68(4):1113-1125
Summary: We examine a generalized F-test of a nonparametric function through penalized splines and a linear mixed effects model representation. With a mixed effects model representation of penalized splines, we embed the test of an unspecified function into a test of some fixed effects and a variance component in a linear mixed effects model with nuisance variance components under the null. The procedure can be used to test a nonparametric function or varying coefficient with clustered data, compare two spline functions, test the significance of an unspecified function in an additive model with multiple components, and test a row or a column effect in a two-way analysis of variance model. Through a spectral decomposition of the residual sum of squares, we provide a fast algorithm for computing the null distribution of the test, which significantly improves computational efficiency over the bootstrap. The spectral representation reveals a connection between the likelihood ratio test (LRT) in a multiple variance components model and in a single component model. We examine our methods through simulations, where we show that the power of the generalized F-test may be higher than that of the LRT, depending on the hypothesis of interest and the true model under the alternative. We apply these methods to compute the genome-wide critical value and p-value of a genetic association test in a genome-wide association study (GWAS), where the usual bootstrap is computationally intensive (up to 10^8 simulations) and the asymptotic approximation may be unreliable and conservative.

8.

Aim

There is enormous interest in applying connectivity modelling to resistance surfaces for identifying corridors for conservation action. However, the multiple analytical approaches used to estimate resistance surfaces and predict connectivity across resistance surfaces have not been rigorously compared, and it is unclear what methods provide the best inferences about population connectivity. Using a large empirical data set on puma (Puma concolor), we are the first to compare several of the most common approaches for estimating resistance and modelling connectivity and validate them with dispersal data.

Location

Southern California, USA.

Methods

We estimate resistance using presence‐only data, GPS telemetry data from puma home ranges and genetic data using a variety of analytical methods. We model connectivity with cost distance and circuit theory algorithms. We then measure the ability of each data type and connectivity algorithm to capture GPS telemetry points of dispersing pumas.

Results

We found that resource selection functions based on GPS telemetry points and paths outperformed species distribution models when applied using cost distance connectivity algorithms. Point and path selection functions were not statistically different in their performance, but point selection functions were more sensitive to the transformation used to convert relative probability of use to resistance. Point and path selection functions and landscape genetics outperformed other methods when applied with cost distance; no methods outperformed one another with circuit theory.

Main conclusions

We conclude that path or point selection functions, or landscape genetic models, should be used to estimate landscape resistance for wildlife. In cases where resource limitations prohibit the collection of GPS collar or genetic data, our results suggest that species distribution models, while weaker, may still be sufficient for resistance estimation. We recommend the use of cost distance‐based approaches, such as least‐cost corridors and resistant kernels, for estimating connectivity and identifying functional corridors for terrestrial wildlife.
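Cost-distance connectivity of the kind recommended here is, at its core, Dijkstra's algorithm run over a resistance surface: accumulated cost from a source cell to every other cell. A minimal sketch on a toy 4-connected grid (hypothetical resistance values; real corridor analyses use dedicated GIS tools):

```python
import heapq

def cost_distance(resistance, source):
    """Dijkstra cost-distance over a 4-connected resistance grid.
    Moving into a cell costs that cell's resistance value."""
    rows, cols = len(resistance), len(resistance[0])
    INF = float("inf")
    dist = [[INF] * cols for _ in range(rows)]
    dist[source[0]][source[1]] = 0.0
    pq = [(0.0, source)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if d > dist[r][c]:
            continue                      # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + resistance[nr][nc]
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(pq, (nd, (nr, nc)))
    return dist

grid = [[1, 1, 10],          # 1 = permeable habitat, 10 = near-barrier
        [10, 1, 10],
        [10, 1, 1]]
d = cost_distance(grid, (0, 0))
print(d[2][2])               # cheapest accumulated cost to the far corner: 4.0
```

A least-cost corridor is then the set of cells whose source-plus-destination cost distance falls below a chosen threshold.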

9.
This article presents a novel algorithm that efficiently computes L1-penalized (lasso) estimates of parameters in high-dimensional models. The lasso has the property that it simultaneously performs variable selection and shrinkage, which makes it very useful for finding interpretable prediction rules in high-dimensional data. The new algorithm is based on a combination of gradient ascent optimization and the Newton-Raphson algorithm. It is described for a general likelihood function and can be applied in generalized linear models and other models with an L1 penalty. The algorithm is demonstrated in the Cox proportional hazards model, predicting survival of breast cancer patients using gene expression data, and its performance is compared with competing approaches. An R package, penalized, that implements the method is available on CRAN.

10.
Summary: Recent studies have shown that grassland birds are declining more rapidly than any other group of terrestrial birds. Current methods of estimating avian age-specific nest survival rates require knowing the ages of nests, assuming nests are homogeneous in terms of survival rates, or treating the hazard function as a piecewise step function. In this article, we propose a Bayesian hierarchical model with nest-specific covariates to estimate age-specific daily survival probabilities without the above requirements. The model provides a smooth estimate of the nest survival curve and identifies the factors that are related to nest survival. It can handle irregular visiting schedules and has the least restrictive assumptions among existing methods. Without assuming proportional hazards, we use a multinomial semiparametric logit model to specify a direct relation between age-specific nest failure probability and nest-specific covariates. An intrinsic autoregressive prior is employed for the nest age effect; this nonparametric prior provides a more flexible alternative to parametric assumptions. The Bayesian computation is efficient because the full conditional posterior distributions either have closed forms or are log-concave. We use the method to analyze a Missouri dickcissel dataset and find that (1) nest survival is not homogeneous during the nesting period, reaching its lowest point at the transition from incubation to nestling; and (2) nest survival is related to grass cover and vegetation height in the study area.

11.
Errors‐in‐variables models in high‐dimensional settings pose two challenges in application. First, the number of observed covariates is larger than the sample size, while only a small number of covariates are true predictors under an assumption of model sparsity. Second, the presence of measurement error can result in severely biased parameter estimates, and also affects the ability of penalized methods such as the lasso to recover the true sparsity pattern. A new estimation procedure called SIMulation‐SELection‐EXtrapolation (SIMSELEX) is proposed. This procedure makes double use of lasso methodology. First, the lasso is used to estimate sparse solutions in the simulation step, after which a group lasso is implemented to do variable selection. The SIMSELEX estimator is shown to perform well in variable selection, and has significantly lower estimation error than naive estimators that ignore measurement error. SIMSELEX can be applied in a variety of errors‐in‐variables settings, including linear models, generalized linear models, and Cox survival models. It is furthermore shown in the Supporting Information how SIMSELEX can be applied to spline‐based regression models. A simulation study is conducted to compare the SIMSELEX estimators to existing methods in the linear and logistic model settings, and to evaluate performance compared to naive methods in the Cox and spline models. Finally, the method is used to analyze a microarray dataset that contains gene expression measurements of favorable histology Wilms tumors.
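The extrapolation idea behind SIMEX-type estimators such as SIMSELEX can be shown in a few lines: add extra measurement error at increasing multiples lam of the error variance, track how the naive estimate degrades, then extrapolate the trend back to lam = -1 (no error). A sketch for a simple linear slope, illustrative only; SIMSELEX itself combines this simulation step with lasso and group-lasso selection.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.standard_normal(n)                      # true covariate
sigma_u = 0.5                                   # measurement-error sd (assumed known)
w = x + sigma_u * rng.standard_normal(n)        # observed error-prone covariate
y = 1.0 * x + 0.3 * rng.standard_normal(n)      # true slope is 1.0

def slope(w, y):
    """Naive least-squares slope of y on w (attenuated by measurement error)."""
    return np.cov(w, y)[0, 1] / np.var(w)

# Simulation step: add extra error with variance lam * sigma_u^2, average over reps
lams = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
est = [np.mean([slope(w + np.sqrt(lam) * sigma_u * rng.standard_normal(n), y)
                for _ in range(50)]) for lam in lams]

# Extrapolation step: fit a quadratic in lam and evaluate at lam = -1
coef = np.polyfit(lams, est, 2)
simex = np.polyval(coef, -1.0)
print(round(slope(w, y), 2), round(simex, 2))   # naive (attenuated) vs SIMEX
```

The naive slope is attenuated toward zero; the extrapolated value recovers most of the bias.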

12.
Summary: We consider penalized linear regression, especially for "large p, small n" problems, for which the relationships among predictors are described a priori by a network. A class of motivating examples includes modeling a phenotype through gene expression profiles while accounting for coordinated functioning of genes in the form of biological pathways or networks. To incorporate the prior knowledge that neighboring predictors in a network have similar effect sizes, we propose a grouped penalty based on the Lγ-norm that smoothes the regression coefficients of the predictors over the network. The main feature of the proposed method is its ability to automatically realize grouped variable selection and exploit grouping effects. We also discuss the effects of the choice of γ and of the weights inside the Lγ-norm. Simulation studies demonstrate the superior finite-sample performance of the proposed method compared to the lasso, the elastic net, and a recently proposed network-based method; the new method performs best in variable selection across all simulation set-ups considered. For illustration, the method is applied to a microarray dataset to predict survival times of glioblastoma patients using gene expression data and a gene network compiled from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.

13.
Krafty RT, Gimotty PA, Holtz D, Coukos G, Guo W. Biometrics 2008, 64(4):1023-1031
SUMMARY: In this article we develop a nonparametric estimation procedure for the varying coefficient model when the within-subject covariance is unknown. Extending the idea of iterative reweighted least squares to the functional setting, we iterate between estimating the coefficients conditional on the covariance and estimating the functional covariance conditional on the coefficients. Smoothing splines for correlated errors are used to estimate the functional coefficients with smoothing parameters selected via the generalized maximum likelihood. The covariance is nonparametrically estimated using a penalized estimator with smoothing parameters chosen via a Kullback-Leibler criterion. Empirical properties of the proposed method are demonstrated in simulations and the method is applied to the data collected from an ovarian tumor study in mice to analyze the effects of different chemotherapy treatments on the volumes of two classes of tumors.
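Iteratively reweighted least squares, the building block extended to the functional setting above, is easiest to see in its classic form for logistic regression: alternate between computing weights mu(1-mu) and solving a weighted least-squares problem. A minimal sketch of that standard algorithm on simulated data (not the authors' functional procedure):

```python
import numpy as np

def irls_logistic(X, y, iters=25):
    """Iteratively reweighted least squares for logistic regression:
    repeatedly solve a weighted least-squares problem with weights mu*(1-mu)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        W = mu * (1 - mu)
        z = X @ beta + (y - mu) / W                    # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
true = np.array([-0.5, 1.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true))).astype(float)
print(np.round(irls_logistic(X, y), 2))   # should be close to (-0.5, 1.5)
```

Each IRLS update is exactly a Newton step on the log-likelihood, which is why the loop converges in a handful of iterations.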

14.
Hokeun Sun, Hongzhe Li. Biometrics 2012, 68(4):1197-1206
Summary: Gaussian graphical models have been widely used as an effective method for studying the conditional independence structure among genes and for constructing genetic networks. However, gene expression data typically have heavier tails or more outlying observations than the standard Gaussian distribution, and such outliers can lead to wrong inference on the dependency structure among the genes. We propose an l1-penalized estimation procedure for sparse Gaussian graphical models that is robustified against possible outliers. The likelihood function is weighted according to how far each observation deviates, with the deviation measured by the observation's own likelihood. An efficient computational algorithm based on the coordinate gradient descent method is developed to obtain the minimizer of the negative penalized robustified likelihood, where the nonzero elements of the concentration matrix represent the graphical links among the genes. After the graphical structure is obtained, we re-estimate the positive definite concentration matrix using an iterative proportional fitting algorithm. Through simulations, we demonstrate that the proposed robust method performs much better than the graphical lasso in terms of both graph structure selection and estimation when outliers are present. We apply the robust estimation procedure to an analysis of yeast gene expression data and show that the resulting graph has better biological interpretation than that obtained from the graphical lasso.

15.
MOTIVATION: Discriminant analysis for high-dimensional and low-sample-sized data has become a hot research topic in bioinformatics, mainly motivated by its importance and challenge in applications to tumor classification with high-dimensional microarray data. Two of the popular methods are the nearest shrunken centroids method, also called predictive analysis of microarrays (PAM), and shrunken centroids regularized discriminant analysis (SCRDA). Both methods modify classic linear discriminant analysis (LDA) in two aspects tailored to high-dimensional, low-sample-sized data: one is the regularization of the covariance matrix, and the other is variable selection through shrinkage. In spite of their usefulness, each method has potential limitations. The main concern is that both PAM and SCRDA are possibly too extreme: the covariance matrix in the former is restricted to be diagonal, while in the latter there is barely any restriction. Based on the biology of gene functions and the features of the data, it may be beneficial to estimate the covariance matrix as an intermediate between the two; furthermore, more effective shrinkage schemes may be possible. RESULTS: We propose modified LDA methods that integrate biological knowledge of gene functions (or variable groups) into the classification of microarray data. Instead of simply treating all the genes independently or imposing no restriction on the correlations among the genes, we group the genes according to their biological functions extracted from existing biological knowledge or data, and propose regularized covariance estimators that encourage between-group gene independence and within-group gene correlation while maintaining the flexibility of any general covariance structure. Furthermore, we propose a shrinkage scheme on groups of genes that tends to retain or remove a whole group of genes altogether, in contrast to the standard shrinkage on individual genes. We show that one of the proposed methods performed better than PAM and SCRDA in a simulation study and several real data examples.
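The "intermediate" covariance idea can be caricatured in one line: keep within-group covariances and shrink between-group ones toward zero. The sketch below, with a hypothetical 3-gene example, is purely illustrative; `block_shrunk_cov` and its shrinkage rule are not the paper's exact estimator.

```python
import numpy as np

def block_shrunk_cov(S, groups, alpha=0.3):
    """Intermediate covariance estimator between diagonal (PAM-like) and
    unrestricted (SCRDA-like): within-group entries of S are kept,
    between-group entries are multiplied by alpha in [0, 1].
    `groups` maps each variable to a group id."""
    g = np.asarray(groups)
    within = g[:, None] == g[None, :]      # True inside a gene group
    return np.where(within, S, alpha * S)

S = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
groups = [0, 0, 1]                         # genes 1-2 share a function; gene 3 is separate
print(block_shrunk_cov(S, groups, alpha=0.0))   # pure block-diagonal limit
```

With alpha = 0 the estimator enforces between-group independence; with alpha = 1 it is the unrestricted sample covariance, so alpha interpolates between the two extremes the abstract criticizes.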

16.
This paper focuses on the problems of estimation and variable selection in the functional linear regression model (FLM) with functional response and scalar covariates. To this end, two different types of regularization (L1 and L2) are considered. On the one hand, a sample approach for the functional LASSO, in terms of a basis representation of the sample values of the response variable, is proposed. On the other hand, we propose a penalized version of the FLM by introducing a P-spline penalty into the least squares fitting criterion; our aim is to establish P-splines as a tool for simultaneous variable selection and functional parameter estimation. In that sense, the importance of smoothing the response variable before fitting the model is also studied. In summary, penalized (L1 and L2) and nonpenalized regression are combined with a presmoothing of the response variable sample curves, based on regression splines or P-splines, providing a total of six approaches to be compared in two simulation schemes. Finally, the most competitive approach is applied to a real data set on graft-versus-host disease, one of the most frequent complications (30%-50%) in allogeneic hematopoietic stem-cell transplantation.
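The essence of a P-spline-type penalty is a least squares fit plus a squared second-difference penalty on the coefficients. A minimal discrete version (a Whittaker-style smoother with an identity basis; illustrative only, not the paper's functional implementation):

```python
import numpy as np

def whittaker(y, lam):
    """Discrete P-spline-type smoother: minimize ||y - z||^2 + lam * ||D2 z||^2,
    where D2 takes second differences. Closed form: solve (I + lam D'D) z = y."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)    # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

t = np.linspace(0, 1, 100)
true = np.sin(2 * np.pi * t)
y = true + 0.3 * np.random.default_rng(4).standard_normal(100)
z = whittaker(y, lam=10.0)
print(round(float(np.mean((z - true) ** 2)), 3))   # smoothed error
```

A full P-spline replaces the identity basis with B-splines, but the penalty term and the ridge-type linear solve are the same shape.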

17.
Hazard regression for interval-censored data with penalized spline
Cai T, Betensky RA. Biometrics 2003, 59(3):570-579
This article introduces a new approach for estimating the hazard function for possibly interval- and right-censored survival data. We weakly parameterize the log-hazard function with a piecewise-linear spline and provide a smoothed estimate of the hazard function by maximizing the penalized likelihood through a mixed model-based approach. We also provide a method to estimate the amount of smoothing from the data. We illustrate our approach with two well-known interval-censored data sets. Extensive numerical studies are conducted to evaluate the efficacy of the new procedure.

18.
Summary: We consider variable selection in the Cox regression model (Cox, 1975, Biometrika 62, 269-276) with covariates missing at random. We investigate the smoothly clipped absolute deviation (SCAD) penalty and the adaptive least absolute shrinkage and selection operator (LASSO) penalty, and propose a unified model selection and estimation procedure. A computationally attractive algorithm is developed, which simultaneously optimizes the penalized likelihood function and penalty parameters. We also optimize a model selection criterion, called the ICQ statistic (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648-1658), to estimate the penalty parameters and show that it consistently selects all important covariates. Simulations are performed to evaluate the finite sample performance of the penalty estimates. Also, two lung cancer data sets are analyzed to demonstrate the proposed methodology.

19.
After variable selection, standard inferential procedures for regression parameters may not be uniformly valid; there is no finite sample size at which a standard test is guaranteed to approximately attain its nominal size. This problem is exacerbated in high-dimensional settings, where variable selection becomes unavoidable. This has prompted a flurry of activity in developing uniformly valid hypothesis tests for a low-dimensional regression parameter (e.g., the causal effect of an exposure A on an outcome Y) in high-dimensional models. So far there has been limited focus on model misspecification, although it is inevitable in high-dimensional settings. We propose tests of the null that are uniformly valid under sparsity conditions weaker than those typically invoked in the literature, assuming that working models for the exposure and outcome are both correctly specified. When one of the models is misspecified, our tests continue to be valid after amending the procedure for estimating the nuisance parameters; hence, they are doubly robust. Our proposals are straightforward to implement using existing software for penalized maximum likelihood estimation and do not require sample splitting. We illustrate them in simulations and in an analysis of data obtained from the Ghent University intensive care unit.

20.
Genitalia are among the most variable of morphological traits, and recent research suggests that this variability may be the result of sexual selection. For example, large bacula may undergo post‐copulatory selection by females as a signal of male size and age. This should lead to positive allometry in baculum size. In addition to hyperallometry, sexually selected traits that undergo strong directional selection should exhibit high phenotypic variation. Nonetheless, in species in which pre‐copulatory selection predominates over post‐copulatory selection (such as those with male‐biased sexual size dimorphism), baculum allometry may be isometric or exhibit negative allometry. We tested this hypothesis using data collected from two highly dimorphic species of the Mustelidae, the American marten (Martes americana) and the fisher (Martes pennanti). Allometric relationships were weak, with only 4.5–10.1% of the variation in baculum length explained by body length. Because of this weak relationship, there was a large discrepancy in slope estimates derived from ordinary least squares and reduced major axis regression models. We conclude that stabilizing selection rather than sexual selection is the evolutionary force shaping variation in baculum length because allometric slopes were less than one (using the ordinary least squares regression model), a very low proportion of variance in baculum length was explained by body length, and there was low phenotypic variability in baculum length relative to other traits. We hypothesize that this pattern occurs because post‐copulatory selection plays a smaller role than pre‐copulatory selection (manifested as male‐biased sexual size dimorphism). We suggest a broader analysis of baculum allometry and sexual size dimorphism in the Mustelidae, and other taxonomic groups, coupled with a comparative analysis and with phylogenetic contrasts to test our hypothesis. 
© 2011 The Linnean Society of London, Biological Journal of the Linnean Society, 2011, 104, 955-963.
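The OLS/RMA slope discrepancy noted above follows directly from the formulas: the OLS slope is r * sy/sx while the reduced major axis slope is sign(r) * sy/sx, so RMA = OLS/|r| and the two diverge exactly when r^2 is small, as in the 4.5-10.1% reported here. A sketch with simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
body = rng.standard_normal(n)
# deliberately weak relationship, mimicking the low r^2 for baculum ~ body length
baculum = 0.25 * body + rng.standard_normal(n)

r = np.corrcoef(body, baculum)[0, 1]
ols = r * baculum.std() / body.std()            # OLS slope = r * sy/sx
rma = np.sign(r) * baculum.std() / body.std()   # RMA slope = sign(r) * sy/sx
print(round(ols, 2), round(rma, 2), round(r ** 2, 3))
```

Because RMA divides the OLS slope by |r|, a weak correlation inflates the RMA estimate relative to OLS, which is why the two regression models give such different allometric slopes here.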


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.) | 京ICP备09084417号