Similar Articles
20 similar articles found (search time: 31 ms)
1.
Sparse sufficient dimension reduction
Li, Lexin. Biometrika, 2007, 94(3): 603-613
Existing sufficient dimension reduction methods suffer from the fact that each dimension reduction component is a linear combination of all the original predictors, so that it is difficult to interpret the resulting estimates. We propose a unified estimation strategy, which combines a regression-type formulation of sufficient dimension reduction methods and shrinkage estimation, to produce sparse and accurate solutions. The method can be applied to most existing sufficient dimension reduction methods such as sliced inverse regression, sliced average variance estimation and principal Hessian directions. We demonstrate the effectiveness of the proposed method by both simulations and real data analysis.
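Several entries in this list build on sliced inverse regression, so a hedged sketch of the basic SIR estimator may help orient the reader. This is a minimal plain SIR (not the sparse/shrinkage estimator proposed in the paper above); the function name and defaults are my own.

```python
import numpy as np

def sir_directions(X, y, n_slices=5, n_dirs=1):
    """Basic sliced inverse regression: the dimension-reduction directions
    are the leading eigenvectors of the covariance of slice means of the
    standardized predictors, mapped back to the original scale."""
    n, p = X.shape
    # Standardize the predictors: Z = (X - mu) Sigma^{-1/2}
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    Z = (X - mu) @ Sigma_inv_sqrt
    # Slice observations by the order of y and average Z within each slice
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of the slice-mean covariance, back-transformed
    evals, evecs = np.linalg.eigh(M)
    B = Sigma_inv_sqrt @ evecs[:, ::-1][:, :n_dirs]
    return B / np.linalg.norm(B, axis=0)
```

In a toy model with y depending on x1 + x2 only, the leading estimated direction should align with (1, 1, 0, ..., 0) up to sign.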

2.
In data analysis using dimension reduction methods, the main goal is to summarize how the response is related to the covariates through a few linear combinations. One key issue is to determine the number of independent, relevant covariate combinations, which is the dimension of the sufficient dimension reduction (SDR) subspace. In this work, we propose an easily applied approach to conduct inference for the dimension of the SDR subspace, based on augmentation of the covariate set with simulated pseudo-covariates. Applying the partitioning principle to the possible dimensions, we use rigorous sequential testing to select the dimensionality, by comparing the strength of the signal arising from the actual covariates to that appearing to arise from the pseudo-covariates. We show that under a “uniform direction” condition, our approach can be used in conjunction with several popular SDR methods, including sliced inverse regression. In these settings, the test statistic asymptotically follows a beta distribution and therefore is easily calibrated. Moreover, the family-wise type I error rate of our sequential testing is rigorously controlled. Simulation studies and an analysis of newborn anthropometric data demonstrate the robustness of the proposed approach, and indicate that the power is comparable to or greater than the alternatives.
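The core idea of pseudo-covariate augmentation is that simulated noise columns calibrate what "no signal" looks like under the same fitting procedure. The following is only a toy analogue of that idea, using ordinary least squares as a stand-in for an SDR fit; the paper's actual test statistic and beta-distribution calibration are different, and all names here are mine.

```python
import numpy as np

def augmentation_check(X, y, n_pseudo=20, seed=0):
    """Append simulated pseudo-covariates that carry no signal by
    construction, fit a single regression on the augmented set, and
    return coefficient magnitudes for real vs pseudo columns."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = np.column_stack([X, rng.normal(size=(n, n_pseudo))])
    # Centered least squares as a stand-in for an SDR estimator
    beta, *_ = np.linalg.lstsq(A - A.mean(axis=0), y - y.mean(), rcond=None)
    # Pseudo coefficients calibrate the size of pure-noise 'signal'
    return np.abs(beta[:p]), np.abs(beta[p:])
```

A real covariate with genuine signal should dominate every pseudo-covariate's coefficient.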

3.
Research has shown that high blood glucose levels are important predictors of incident diabetes. However, they are also strongly associated with other cardiometabolic risk factors such as high blood pressure, adiposity, and cholesterol, which are also highly correlated with one another. The aim of this analysis was to ascertain how these highly correlated cardiometabolic risk factors might be associated with high levels of blood glucose in adults aged 50 or older from wave 2 of the English Longitudinal Study of Ageing (ELSA). Due to the high collinearity of predictor variables and our interest in extreme values of blood glucose, we propose a new method, called quantile profile regression, to answer this question. Profile regression, a Bayesian nonparametric model for clustering responses and covariates simultaneously, is a powerful tool to model the relationship between a response variable and covariates, but the standard approach of using a mixture of Gaussian distributions for the response model will not identify the underlying clusters correctly, particularly when the data contain outliers or the response has a heavy-tailed distribution. Therefore, we propose quantile profile regression to model the response variable with an asymmetric Laplace distribution, allowing us to model asymmetric clusters more accurately and to predict more accurately for extreme values of the response variable and/or outliers. In simulations, our new method performs more accurately than the normal profile regression approach and remains robust when outliers are present in the data. We conclude with an analysis of the ELSA data.

4.
5.
Colorectal cancer (CRC) is one of the deadliest diseases in Western countries. In order to develop prognostic biomarkers for CRC aggressiveness, we retrospectively analyzed 267 CRC patients via a novel, multidimensional biomarker platform. Using nanofluidic technology for qPCR analysis and quantitative fluorescent immunohistochemistry for protein analysis, we assessed 33 microRNAs, 124 mRNAs and 9 protein antigens. Analysis was conducted in each single dimension (microRNA, gene or protein) using both the multivariate Cox model and the Kaplan-Meier method. Thereafter, we simplified the censored survival data into binary response data (aggressive vs. non-aggressive cancer). Subsequently, we integrated the data into a diagnostic score using sliced inverse regression for sufficient dimension reduction. Accuracy was assessed using the area under the receiver operating characteristic curve (AUC). Single-dimension analysis led to the discovery of individual factors that were significant predictors of outcome, including seven specific microRNAs, four genes, and one protein. When these factors were quantified individually as predictors of aggressive disease, the highest demonstrable AUC was 0.68. By contrast, when all results from single dimensions were combined into integrated biomarkers, AUCs were dramatically increased, with values approaching and even exceeding 0.9. Single-dimension analysis generates statistically significant predictors, but their predictive strengths are suboptimal for clinical utility. A novel, multidimensional integrated approach overcomes these deficiencies. Newly derived integrated biomarkers have the potential to meaningfully guide the selection of therapeutic strategies for individual patients while elucidating the molecular mechanisms driving disease progression.

6.
Lu W, Li L. Biometrics, 2011, 67(2): 513-523
Methodology of sufficient dimension reduction (SDR) has offered an effective means to facilitate regression analysis of high-dimensional data. When the response is censored, however, most existing SDR estimators cannot be applied, or require some restrictive conditions. In this article, we propose a new class of inverse censoring probability weighted SDR estimators for censored regressions. Moreover, regularization is introduced to achieve simultaneous variable selection and dimension reduction. Asymptotic properties and empirical performance of the proposed methods are examined.
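Inverse-censoring-probability weighting, the mechanism underlying this class of estimators, reweights uncensored observations by the inverse of the estimated probability of remaining uncensored. A minimal sketch of the weight computation, assuming no ties between event and censoring times (function name and conventions are mine):

```python
import numpy as np

def ipcw_weights(time, event):
    """Inverse-probability-of-censoring weights: w_i = event_i / G(t_i-),
    where G is the Kaplan-Meier estimate of the censoring survival
    function (censorings, event == 0, play the role of 'events')."""
    order = np.argsort(time, kind="stable")
    t, d = np.asarray(time)[order], np.asarray(event)[order]
    n = len(t)
    G_left = np.ones(n)   # G evaluated just before each observed time
    surv = 1.0
    for i in range(n):
        G_left[i] = surv
        if d[i] == 0:                 # a censoring updates the censoring KM
            surv *= 1.0 - 1.0 / (n - i)
    w = np.where(d == 1, 1.0 / G_left, 0.0)  # censored cases get weight 0
    out = np.empty(n)
    out[order] = w                    # restore the original ordering
    return out
```

With no censoring every weight is 1, and in general the weights of the uncensored observations sum (approximately) to the sample size, which is what makes downstream weighted estimating equations unbiased.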

7.
The analysis of global gene expression data from microarrays is breaking new ground in genetics research, while confronting modelers and statisticians with many critical issues. In this paper, we consider data sets in which a categorical or continuous response is recorded, along with gene expression, on a given number of experimental samples. Data of this type are usually employed to create a prediction mechanism for the response based on gene expression, and to identify a subset of relevant genes. This defines a regression setting characterized by a dramatic under-resolution with respect to the predictors (genes), whose number exceeds by orders of magnitude the number of available observations (samples). We present a dimension reduction strategy that, under appropriate assumptions, allows us to restrict attention to a few linear combinations of the original expression profiles, and thus to overcome under-resolution. These linear combinations can then be used to build and validate a regression model with standard techniques. Moreover, they can be used to rank original predictors, and ultimately to select a subset of them through comparison with a background 'chance scenario' based on a number of independent randomizations. We apply this strategy to publicly available data on leukemia classification.

8.
We propose a constrained maximum partial likelihood estimator for dimension reduction in integrative (e.g., pan-cancer) survival analysis with high-dimensional predictors. We assume that for each population in the study, the hazard function follows a distinct Cox proportional hazards model. To borrow information across populations, we assume that each of the hazard functions depends only on a small number of linear combinations of the predictors (i.e., “factors”). We estimate these linear combinations using an algorithm based on “distance-to-set” penalties. This allows us to impose both low-rankness and sparsity on the regression coefficient matrix estimator. We derive asymptotic results that reveal that our estimator is more efficient than fitting a separate proportional hazards model for each population. Numerical experiments suggest that our method outperforms competitors under various data generating models. We use our method to perform a pan-cancer survival analysis relating protein expression to survival across 18 distinct cancer types. Our approach identifies six linear combinations, depending on only 20 proteins, which explain survival across the cancer types. Finally, to validate our fitted model, we show that our estimated factors can lead to better prediction than competitors on four external datasets.

9.
Dimension reduction methods have been proposed for regression analysis with predictors of high dimension, but have not received much attention on the problems with censored data. In this article, we present an iterative imputed spline approach based on principal Hessian directions (PHD) for censored survival data in order to reduce the dimension of predictors without requiring a prespecified parametric model. Our proposal is to replace the right-censored survival time with its conditional expectation for adjusting the censoring effect by using the Kaplan-Meier estimator and an adaptive polynomial spline regression in the residual imputation. A sparse estimation strategy is incorporated in our approach to enhance the interpretation of variable selection. This approach can be implemented in not only PHD, but also other methods developed for estimating the central mean subspace. Simulation studies with right-censored data are conducted for the imputed spline approach to PHD (IS-PHD) in comparison with sliced inverse regression, minimum average variance estimation, and a naive PHD that ignores censoring. The results demonstrate that the proposed IS-PHD method is particularly useful for survival time responses approximating symmetric or bending structures. Illustrative applications to two real data sets are also presented.
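The imputation step described above replaces a censored time c by its conditional expectation E[T | T > c] under the Kaplan-Meier curve. A hedged, simplified sketch of that step (restricted to the largest observed event time, ignoring the paper's spline residual adjustment; function names are mine):

```python
import numpy as np

def km_survival(time, event):
    """Kaplan-Meier survival curve, returned as step-function knots
    (event times and the survival values just after them)."""
    order = np.argsort(time, kind="stable")
    t, d = np.asarray(time)[order], np.asarray(event)[order]
    n = len(t)
    surv, times, S = 1.0, [], []
    for i in range(n):
        if d[i] == 1:
            surv *= 1.0 - 1.0 / (n - i)
            times.append(t[i])
            S.append(surv)
    return np.array(times), np.array(S)

def impute_censored(time, event):
    """Replace each censored time c by E[T | T > c] = c + int_c S(t)dt / S(c),
    with the integral restricted to the largest observed event time."""
    times, S = km_survival(time, event)
    out = np.asarray(time, dtype=float).copy()
    for i in np.where(np.asarray(event) == 0)[0]:
        c = time[i]
        Sc = S[times <= c][-1] if np.any(times <= c) else 1.0
        if Sc == 0 or not np.any(times > c):
            continue
        grid = np.concatenate([[c], times[times > c]])
        svals = np.concatenate([[Sc], S[times > c]])
        # Integral of the survival step function from c to the last event
        integral = np.sum(svals[:-1] * np.diff(grid))
        out[i] = c + integral / Sc
    return out
```

For example, with observed times (1, 2, 3) where only the middle one is censored, the only remaining event after time 2 is at time 3, so the censored time is imputed as 3.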

10.
A model-free approach to combining biomarkers
For most diseases, single biomarkers do not have adequate sensitivity or specificity for practical purposes. We present an approach to combine several biomarkers into a composite marker score without assuming a model for the distribution of the predictors. Using sufficient dimension reduction techniques, we replace the original markers with a lower-dimensional version, obtained through linear transformations of markers that contain sufficient information for regression of the predictors on the outcome. We combine the linear transformations using their asymptotic properties into a scalar diagnostic score via the likelihood ratio statistic. The performance of this score is assessed by the area under the receiver operating characteristic (ROC) curve, a popular summary measure of the discriminatory ability of a single continuous diagnostic marker for binary disease outcomes. An asymptotic chi-squared test for assessing individual biomarker contribution to the diagnostic score is also derived.
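To make the combining-and-scoring idea concrete: below is a hedged sketch using Fisher's LDA direction as one classical way to collapse several markers into a scalar score, evaluated by the empirical (Mann-Whitney) AUC. This is not the paper's SDR-plus-likelihood-ratio construction, just the simplest linear-combination baseline; function names are mine.

```python
import numpy as np

def auc(score, label):
    """Empirical AUC as a Mann-Whitney statistic (ties count one half)."""
    pos, neg = score[label == 1], score[label == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def lda_score(X, y):
    """Combine markers into a scalar score via Fisher's direction,
    pooled-covariance^{-1} (mu1 - mu0)."""
    X1, X0 = X[y == 1], X[y == 0]
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    n1, n0 = len(X1), len(X0)
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False)
          + (n0 - 1) * np.cov(X0, rowvar=False)) / (n1 + n0 - 2)
    w = np.linalg.solve(Sp, mu1 - mu0)
    return X @ w
```

On two-class Gaussian data with a moderate mean shift, the combined score should produce an AUC well above 0.5.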

11.
Sliced inverse regression with regularizations
Li L, Yin X. Biometrics, 2008, 64(1): 124-131
In high-dimensional data analysis, sliced inverse regression (SIR) has proven to be an effective dimension reduction tool and has enjoyed wide applications. The usual SIR, however, cannot work with problems where the number of predictors, p, exceeds the sample size, n, and can suffer when there is high collinearity among the predictors. In addition, the reduced dimensional space consists of linear combinations of all the original predictors and no variable selection is achieved. In this article, we propose a regularized SIR approach based on the least-squares formulation of SIR. The L2 regularization is introduced, and an alternating least-squares algorithm is developed, to enable SIR to work with n < p and highly correlated predictors. The L1 regularization is further introduced to achieve simultaneous reduction estimation and predictor selection. Both simulations and the analysis of a microarray expression data set demonstrate the usefulness of the proposed method.

12.
Asymmetric regression is an alternative to conventional linear regression that allows us to model the relationship between predictor variables and the response variable while accommodating skewness. Advantages of asymmetric regression include incorporating realistic ecological patterns observed in data, robustness to model misspecification and less sensitivity to outliers. Bayesian asymmetric regression relies on asymmetric distributions such as the asymmetric Laplace (ALD) or asymmetric normal (AND) in place of the normal distribution used in classic linear regression models. Asymmetric regression concepts can be used for process and parameter components of hierarchical Bayesian models and have a wide range of applications in data analyses. In particular, asymmetric regression allows us to fit more realistic statistical models to skewed data and pairs well with Bayesian inference. We first describe asymmetric regression using the ALD and AND. Second, we show how the ALD and AND can be used for Bayesian quantile and expectile regression for continuous response data. Third, we consider an extension to generalize Bayesian asymmetric regression to survey data consisting of counts of objects. Fourth, we describe a regression model using the ALD, and show that it can be applied to add needed flexibility, resulting in better predictive models compared to Poisson or negative binomial regression. We demonstrate concepts by analyzing a data set consisting of counts of Henslow’s sparrows following prescribed fire and provide annotated computer code to facilitate implementation. Our results suggest Bayesian asymmetric regression is an essential component of a scientist’s statistical toolbox.
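The link between the asymmetric Laplace distribution and quantile estimation, which both this entry and entry 3 rely on, is that maximizing an ALD likelihood in the location parameter is equivalent (up to constants) to minimizing the check ("pinball") loss, whose minimizer is the tau-quantile. A minimal, hedged illustration (brute-force search over observed values, just to show the equivalence; names are mine):

```python
import numpy as np

def pinball_loss(u, tau):
    """Check (pinball) loss; up to constants, the negative log-likelihood
    of an asymmetric Laplace distribution with skewness parameter tau."""
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

def fit_quantile(y, tau):
    """The empirical tau-quantile minimizes the mean pinball loss, so a
    brute-force search over the observed values recovers it."""
    candidates = np.sort(y)
    losses = [pinball_loss(y - c, tau).mean() for c in candidates]
    return candidates[int(np.argmin(losses))]
```

With tau = 0.5 the minimizer is the median; with tau = 0.9 it sits near the 90th percentile of the data.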

13.
Flexible estimation of multiple conditional quantiles is of interest in numerous applications, such as studying the effect of pregnancy-related factors on low and high birth weight. We propose a Bayesian nonparametric method to simultaneously estimate noncrossing, nonlinear quantile curves. We expand the conditional distribution function of the response in I-spline basis functions where the covariate-dependent coefficients are modeled using neural networks. By leveraging the approximation power of splines and neural networks, our model can approximate any continuous quantile function. Compared to existing models, our model estimates all rather than a finite subset of quantiles, scales well to high dimensions, and accounts for estimation uncertainty. While the model is arbitrarily flexible, interpretable marginal quantile effects are estimated using accumulative local effect plots and variable importance measures. A simulation study shows that our model can better recover quantiles of the response distribution when the data are sparse, and an analysis of birth weight data is presented.

14.
Accurate prognostic prediction using molecular information is a challenging area of research, which is essential to develop precision medicine. In this paper, we develop translational models to identify major actionable proteins that are associated with clinical outcomes, like the survival time of patients. There are considerable statistical and computational challenges due to the large dimension of the problems. Furthermore, data are available for different tumor types; hence data integration for various tumors is desirable. Having censored survival outcomes adds a further level of complexity to the inferential procedure. We develop Bayesian hierarchical survival models, which accommodate all the challenges mentioned here. We use the hierarchical Bayesian accelerated failure time model for survival regression. Furthermore, we assume a sparse horseshoe prior distribution for the regression coefficients to identify the major proteomic drivers. We borrow strength across tumor groups by introducing a correlation structure among the prior distributions. The proposed methods have been used to analyze data from the recently curated “The Cancer Proteome Atlas” (TCPA), which contains reverse-phase protein array-based high-quality protein expression data as well as detailed clinical annotation, including survival times. Our simulation and the TCPA data analysis illustrate the efficacy of the proposed integrative model, which links different tumors with the correlated prior structures.

15.
Classification tree models are flexible analysis tools which have the ability to evaluate interactions among predictors as well as generate predictions for responses of interest. We describe Bayesian analysis of a specific class of tree models in which binary response data arise from a retrospective case-control design. We are also particularly interested in problems with potentially very many candidate predictors. This scenario is common in studies concerning gene expression data, which is a key motivating example context. Innovations here include the introduction of tree models that explicitly address and incorporate the retrospective design, and the use of nonparametric Bayesian models involving Dirichlet process priors on the distributions of predictor variables. The model specification influences the generation of trees through Bayes factor-based tests of association that determine significant binary partitions of nodes during a process of forward generation of trees. We describe this constructive process and discuss questions of generating and combining multiple trees via Bayesian model averaging for prediction. Additional discussion of parameter selection and sensitivity is given in the context of an example which concerns prediction of breast tumour status utilizing high-dimensional gene expression data; the example demonstrates the exploratory/explanatory uses of such models as well as their primary utility in prediction. Shortcomings of the approach and comparison with alternative tree modelling algorithms are also discussed, as are issues of modelling and computational extensions.

16.
Kinney SK, Dunson DB. Biometrics, 2007, 63(3): 690-698
We address the problem of selecting which variables should be included in the fixed and random components of logistic mixed effects models for correlated data. A fully Bayesian variable selection is implemented using a stochastic search Gibbs sampler to estimate the exact model-averaged posterior distribution. This approach automatically identifies subsets of predictors having nonzero fixed effect coefficients or nonzero random effects variance, while allowing uncertainty in the model selection process. Default priors are proposed for the variance components and an efficient parameter expansion Gibbs sampler is developed for posterior computation. The approach is illustrated using simulated data and an epidemiologic example.

17.
We consider a functional linear Cox regression model for characterizing the association between time-to-event data and a set of functional and scalar predictors. The functional linear Cox regression model incorporates a functional principal component analysis for modeling the functional predictors and a high-dimensional Cox regression model to characterize the joint effects of both functional and scalar predictors on the time-to-event data. We develop an algorithm to calculate the maximum approximate partial likelihood estimates of unknown finite and infinite dimensional parameters. We also systematically investigate the rate of convergence of the maximum approximate partial likelihood estimates and a score test statistic for testing the nullity of the slope function associated with the functional predictors. We demonstrate our estimation and testing procedures by using simulations and the analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) data. Our real data analyses show that high-dimensional hippocampus surface data may be an important marker for predicting time to conversion to Alzheimer's disease. Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu).

18.
Multivariate meta-analysis is becoming more commonly used. Methods for fitting the multivariate random effects model include maximum likelihood, restricted maximum likelihood, Bayesian estimation and multivariate generalisations of the standard univariate method of moments. Here, we provide a new multivariate method of moments for estimating the between-study covariance matrix with the properties that (1) it allows for either complete or incomplete outcomes and (2) it allows for covariates through meta-regression. Further, for complete data, it is invariant to linear transformations. Our method reduces to the usual univariate method of moments, proposed by DerSimonian and Laird, in a single dimension. We illustrate our method and compare it with some of the alternatives using a simulation study and a real example.
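The univariate DerSimonian-Laird method of moments that this entry's multivariate estimator reduces to is short enough to sketch directly; this is the standard textbook form, not the paper's multivariate generalization:

```python
import numpy as np

def dersimonian_laird(y, v):
    """Univariate DerSimonian-Laird moment estimate of the between-study
    variance tau^2, plus the random-effects pooled mean.
    y: study effect estimates; v: their within-study variances."""
    w = 1.0 / v
    ybar = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - ybar) ** 2)           # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(y) - 1)) / c)   # moment estimate, truncated at 0
    w_star = 1.0 / (v + tau2)                 # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    return tau2, mu
```

When all studies report identical effects, Q = 0 and the estimate is truncated to tau^2 = 0, recovering the fixed-effect pooled mean.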

19.
Bayesian inference allows the transparent communication and systematic updating of model uncertainty as new data become available. When applied to material flow analysis (MFA), however, Bayesian inference is undermined by the difficulty of defining proper priors for the MFA parameters and quantifying the noise in the collected data. We start to address these issues by first deriving and implementing an expert elicitation procedure suitable for generating MFA parameter priors. Second, we propose to learn the data noise concurrent with the parametric uncertainty. These methods are demonstrated using a case study on the 2012 US steel flow. Eight experts are interviewed to elicit distributions on steel flow uncertainty from raw materials to intermediate goods. The experts' distributions are combined and weighted according to the expertise demonstrated in response to seeding questions. These aggregated distributions form our model parameters' informative priors. Sensible, weakly informative priors are adopted for learning the data noise. Bayesian inference is then performed to update the parametric and data noise uncertainty given MFA data collected from the United States Geological Survey and the World Steel Association. The results show a reduction in MFA parametric uncertainty when incorporating the collected data. Only a modest reduction in data noise uncertainty was observed using 2012 data; however, greater reductions were achieved when using data from multiple years in the inference. These methods generate transparent MFA and data noise uncertainties learned from data rather than pre-assumed data noise levels, providing a more robust basis for decision-making that affects the system.

20.
This paper demonstrates the advantages of sharing information about unknown features of covariates across multiple model components in various nonparametric regression problems including multivariate, heteroscedastic, and semicontinuous responses. In this paper, we present a methodology which allows for information to be shared nonparametrically across various model components using Bayesian sum-of-tree models. Our simulation results demonstrate that sharing of information across related model components is often very beneficial, particularly in sparse high-dimensional problems in which variable selection must be conducted. We illustrate our methodology by analyzing medical expenditure data from the Medical Expenditure Panel Survey (MEPS). To facilitate the Bayesian nonparametric regression analysis, we develop two novel models for analyzing the MEPS data using Bayesian additive regression trees—a heteroskedastic log-normal hurdle model with a “shrink-toward-homoskedasticity” prior and a gamma hurdle model.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (京ICP备09084417号)