Similar Articles (20 results)
1.
Microarray studies aimed at identifying genes associated with an outcome of interest usually produce noisy measurements for a large number of gene expression features from a small number of subjects. One common approach to analyzing such high-dimensional data is to use linear errors-in-variables (EIV) models; however, current methods for fitting such models are computationally expensive. In this paper, we present two efficient screening procedures, namely corrected penalized marginal screening (PMSc) and corrected sure independence screening (SISc), to reduce the number of variables for final model building. Both screening procedures are based on fitting corrected marginal regression models relating the outcome to each contaminated covariate separately, which can be computed efficiently even with a large number of features. Under mild conditions, we show that these procedures achieve screening consistency and reduce the number of features substantially, even when the number of covariates grows exponentially with sample size. In addition, if the true covariates are weakly correlated, we show that PMSc can achieve full variable selection consistency. Through a simulation study and an analysis of gene expression data for bone mineral density of Norwegian women, we demonstrate that the two new screening procedures make estimation of linear EIV models computationally scalable in high-dimensional settings, and improve finite-sample estimation and selection performance compared with estimators that do not employ a screening stage.
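As a rough illustration of the screening idea (not the authors' exact PMSc/SISc procedures), the sketch below ranks features by marginal slopes corrected for classical additive measurement error with a known error variance; the simulated data and the cutoff d are purely illustrative.

```python
import numpy as np

def corrected_sis(W, y, sigma_u2, d):
    """Rank features by corrected marginal slopes under classical
    additive measurement error W = X + U, Var(U_j) = sigma_u2[j]."""
    n, p = W.shape
    Wc = W - W.mean(axis=0)
    yc = y - y.mean()
    cov_wy = Wc.T @ yc / n                      # cov(W_j, y)
    var_w = (Wc ** 2).mean(axis=0)              # var(W_j)
    # undo the attenuation: var(X_j) = var(W_j) - sigma_u^2
    beta_corr = cov_wy / np.maximum(var_w - sigma_u2, 1e-8)
    keep = np.argsort(-np.abs(beta_corr))[:d]   # top-d features survive screening
    return keep, beta_corr

# toy example: 100 subjects, 5000 noisy features, 3 truly active
rng = np.random.default_rng(0)
n, p = 100, 5000
X = rng.normal(size=(n, p))
sigma_u2 = np.full(p, 0.5)
W = X + rng.normal(scale=np.sqrt(sigma_u2), size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=n)
keep, _ = corrected_sis(W, y, sigma_u2, d=20)
print(sorted(keep.tolist()))
```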

2.
After variable selection, standard inferential procedures for regression parameters may not be uniformly valid; there is no finite sample size at which a standard test is guaranteed to approximately attain its nominal size. This problem is exacerbated in high-dimensional settings, where variable selection becomes unavoidable. This has prompted a flurry of activity in developing uniformly valid hypothesis tests for a low-dimensional regression parameter (e.g., the causal effect of an exposure A on an outcome Y) in high-dimensional models. So far there has been limited focus on model misspecification, although it is inevitable in high-dimensional settings. We propose tests of the null hypothesis that are uniformly valid under sparsity conditions weaker than those typically invoked in the literature, assuming the working models for the exposure and the outcome are both correctly specified. When one of the models is misspecified, our tests continue to be valid after amending the procedure for estimating the nuisance parameters; hence, they are doubly robust. Our proposals are straightforward to implement using existing software for penalized maximum likelihood estimation and do not require sample splitting. We illustrate them in simulations and in an analysis of data obtained from the Ghent University intensive care unit.
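For intuition only, the sketch below implements the related post-double-selection recipe (Belloni et al.) rather than the authors' doubly robust proposal: select controls predictive of the outcome and of the exposure with the lasso, then run ordinary least squares for the exposure effect with robust standard errors. Data and tuning choices are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def double_selection_test(y, a, X):
    """Post-double-selection inference for the coefficient of exposure a."""
    sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)   # controls predictive of Y
    sel_a = np.flatnonzero(LassoCV(cv=5).fit(X, a).coef_)   # controls predictive of A
    keep = np.union1d(sel_y, sel_a)
    design = sm.add_constant(np.column_stack([a, X[:, keep]]))
    fit = sm.OLS(y, design).fit(cov_type="HC3")             # heteroskedasticity-robust SEs
    return fit.params[1], fit.bse[1], fit.pvalues[1]        # estimate, SE, p-value for A

rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.normal(size=(n, p))
a = X[:, 0] + rng.normal(size=n)                # exposure depends on confounder X_1
y = 0.5 * a + X[:, 0] + rng.normal(size=n)      # true exposure effect is 0.5
print(double_selection_test(y, a, X))
```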

3.
Hundreds of 'molecular signatures' have been proposed in the literature to predict patient outcome in clinical settings from high-dimensional data, many of which eventually failed to be validated. Validation of such molecular research findings is thus becoming an increasingly important branch of clinical bioinformatics. Moreover, in practice well-known clinical predictors are often already available. From a statistical and bioinformatics point of view, little attention has been given to evaluating the added predictive value of a molecular signature when clinical predictors or an established index are already available. This article reviews procedures that assess and validate the added predictive value of high-dimensional molecular data. It critically surveys approaches for constructing combined prediction models from both clinical and molecular data, for validating added predictive value on independent data, and for assessing added predictive value within a single data set.
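A minimal sketch of one way to assess added predictive value on a single data set: compare the cross-validated AUC of a clinical-only model with that of a combined clinical-plus-molecular model. The simulated data, ridge penalty, and AUC metric are illustrative choices, not the specific procedures surveyed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 300
clinical = rng.normal(size=(n, 3))                 # e.g. age, stage, grade
molecular = rng.normal(size=(n, 1000))             # high-dimensional signature
y = (clinical[:, 0] + molecular[:, 0] + rng.normal(size=n) > 0).astype(int)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l2", C=0.1, max_iter=5000))
auc_clin = cross_val_score(model, clinical, y, cv=5, scoring="roc_auc").mean()
auc_both = cross_val_score(model, np.hstack([clinical, molecular]), y,
                           cv=5, scoring="roc_auc").mean()
# the gain in cross-validated AUC is one simple measure of added predictive value
print(f"clinical only: {auc_clin:.3f}   clinical + molecular: {auc_both:.3f}")
```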

4.
Diagnostic plots in Cox's regression model.
C. H. Chen, P. C. Wang. Biometrics, 1991, 47(3): 841-850.
Two diagnostic plots are presented for validating the fitting of a Cox proportional hazards model. The added variable plot is developed to assess the effect of adding a covariate to the model. The constructed variable plot is applied to detect nonlinearity of a fitted covariate. Both plots are also useful for identifying influential observations on the issues of interest. The methods are illustrated on examples of multiple myeloma and lung cancer data.
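The added variable and constructed variable plots themselves need bespoke code, but a closely related and widely used diagnostic is to plot martingale residuals from a model that omits a candidate covariate against that covariate. A sketch with the lifelines package (simulated data; the residual-extraction call and its output column name reflect lifelines' API as I understand it):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 400
age = rng.normal(60, 10, n)
marker = rng.uniform(0, 4, n)
# event times generated with a quadratic marker effect, unknown to the analyst
t = rng.exponential(1.0 / np.exp(0.02 * age + 0.6 * (marker - 2) ** 2))
c = rng.exponential(2.0, n)
df = pd.DataFrame({"age": age, "time": np.minimum(t, c), "event": (t <= c).astype(int)})

# fit a Cox model omitting the candidate covariate, then inspect its martingale residuals
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
res = cph.compute_residuals(df, kind="martingale")
plt.scatter(marker, res["martingale"], s=8)
plt.xlabel("candidate covariate (marker)")
plt.ylabel("martingale residual")
plt.show()   # a curved trend here suggests the marker should enter nonlinearly
```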

5.
There are a number of applied settings where a response is measured repeatedly over time, and the impact of a stimulus at one time is distributed over several subsequent response measures. In the motivating application the stimulus is an air pollutant such as airborne particulate matter and the response is mortality. However, several other variables (e.g. daily temperature) impact the response in a possibly non-linear fashion. To quantify the effect of the stimulus in the presence of covariate data we combine two established regression techniques: generalized additive models and distributed lag models. Generalized additive models extend multiple linear regression by allowing continuous covariates to be modeled as smooth, but otherwise unspecified, functions. Distributed lag models aim to relate the outcome variable to lagged values of a time-dependent predictor in a parsimonious fashion. The resulting models, which we call generalized additive distributed lag models, are seen to effectively quantify the so-called 'mortality displacement effect' in environmental epidemiology, as illustrated through air pollution/mortality data from Milan, Italy.
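A minimal sketch of the modeling idea on simulated data: an unconstrained distributed lag for the pollutant combined with a regression spline standing in for the GAM smooth of temperature, fitted as a Poisson GLM in statsmodels. This is an illustration only, not the authors' exact formulation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
days = 1000
df = pd.DataFrame({
    "pm": rng.gamma(3.0, 15.0, days),                              # daily particulate matter
    "temp": 15 + 10 * np.sin(2 * np.pi * np.arange(days) / 365),   # daily temperature
})
for k in range(4):                                                 # lags 0..3 of the pollutant
    df[f"pm_lag{k}"] = df["pm"].shift(k)
df = df.dropna().copy()
mu = np.exp(2.0 + 0.002 * df["pm_lag0"] + 0.001 * df["pm_lag1"]
            + 0.02 * np.abs(df["temp"] - 20))
df["deaths"] = rng.poisson(mu)

# Poisson regression: distributed lag terms for the pollutant, spline for temperature
fit = smf.glm("deaths ~ pm_lag0 + pm_lag1 + pm_lag2 + pm_lag3 + bs(temp, df=4)",
              data=df, family=sm.families.Poisson()).fit()
print(fit.params.filter(like="pm_lag"))   # estimated lag coefficients (the distributed lag curve)
```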

6.
Capturing complex dependence structures between outcome variables (e.g., study endpoints) is of high relevance in contemporary biomedical data problems and medical research. Distributional copula regression provides a flexible tool to model the joint distribution of multiple outcome variables by disentangling the marginal response distributions and their dependence structure. In a regression setup, each parameter of the copula model, that is, the marginal distribution parameters and the copula dependence parameters, can be related to covariates via structured additive predictors. We propose a framework to fit distributional copula regression via model-based boosting, a modern estimation technique that incorporates useful features like an intrinsic variable selection mechanism, parameter shrinkage, and the capability to fit regression models in high-dimensional data settings, that is, situations with more covariates than observations. Thus, model-based boosting not only complements existing Bayesian and maximum-likelihood-based estimation frameworks for this model class but also provides intrinsic mechanisms that can be helpful in many applied problems. The performance of our boosting algorithm for copula regression models with continuous margins is evaluated in simulation studies that cover low- and high-dimensional data settings and situations with and without dependence between the responses. Moreover, distributional copula boosting is used to jointly analyze and predict the length and the weight of newborns conditional on sonographic measurements of the fetus before delivery together with other clinical variables.
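The full model links every marginal and dependence parameter to covariates and fits them by boosting; the toy sketch below only illustrates the "disentangling" step, estimating a constant Gaussian-copula dependence parameter from rank-transformed outcomes (simulated birth length/weight; no covariates, no boosting).

```python
import numpy as np
from scipy.stats import norm, kendalltau

rng = np.random.default_rng(12)
n = 500
# two correlated outcomes, e.g. birth length and birth weight
u = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
length = 50 + 2 * u[:, 0]
weight = 3400 + 450 * u[:, 1]

def pseudo_obs(x):
    """Empirical ranks rescaled to (0, 1) -- the marginal step of a copula fit."""
    return (np.argsort(np.argsort(x)) + 1) / (len(x) + 1)

# dependence step: correlation of the normal scores = Gaussian copula parameter
z = norm.ppf(np.column_stack([pseudo_obs(length), pseudo_obs(weight)]))
rho_copula = np.corrcoef(z.T)[0, 1]
tau, _ = kendalltau(length, weight)             # rank correlation, for comparison
print(f"copula correlation: {rho_copula:.2f}, Kendall's tau: {tau:.2f}")
```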

7.
MOTIVATION: An important application of microarray technology is to relate gene expression profiles to various clinical phenotypes of patients. Success has been demonstrated in molecular classification of cancer, in which the gene expression data serve as predictors and different types of cancer serve as a categorical outcome variable. However, there has been less research in linking gene expression profiles to censored survival data such as patients' overall survival time or time to cancer relapse. It would be desirable to have models with good prediction accuracy and parsimony. RESULTS: We propose to use L1-penalized estimation for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for future prediction. The computational difficulty associated with the estimation in the high-dimensional and low-sample-size settings can be efficiently solved by using the recently developed least-angle regression (LARS) method. Our simulation studies and application to real datasets on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed procedure, which we call the LARS-Cox procedure, can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. The LARS-Cox regression gives better predictive performance than L2-penalized regression and a few other dimension-reduction based methods. CONCLUSIONS: We conclude that the proposed LARS-Cox procedure can be very useful in identifying genes relevant to survival phenotypes and in building a parsimonious predictive model that can be used for classifying future patients into clinically relevant high- and low-risk groups based on the gene expression profile and survival times of previous patients.
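LARS itself needs a dedicated implementation, but the same idea, an L1-penalized Cox fit that selects a small set of genes, can be sketched with the lifelines package (simulated data; the penalizer value and zero threshold are illustrative, and lifelines uses a smooth approximation of the L1 penalty rather than LARS).

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(5)
n, p = 120, 200                                      # few patients, many genes
X = rng.normal(size=(n, p))
risk = np.exp(0.8 * X[:, 0] - 0.8 * X[:, 1])         # only genes 0 and 1 affect survival
t = rng.exponential(1 / risk)
c = rng.exponential(2.0, n)
df = pd.DataFrame(X, columns=[f"g{j}" for j in range(p)])
df["time"], df["event"] = np.minimum(t, c), (t <= c).astype(int)

cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)       # lasso-type penalty
cph.fit(df, duration_col="time", event_col="event")
coefs = cph.params_
print("genes kept:", (coefs.abs() > 1e-3).sum())     # treat tiny coefficients as zero
print(coefs.abs().sort_values(ascending=False).head())
```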

8.
It has become increasingly common in epidemiological studies to pool specimens across subjects to achieve accurate quantitation of biomarkers and certain environmental chemicals. In this article, we consider the problem of fitting a binary regression model when an important exposure is subject to pooling. We take a regression calibration approach and derive several methods, including plug-in methods that use a pooled measurement and other covariate information to predict the exposure level of an individual subject, and normality-based methods that make further adjustments by assuming normality of calibration errors. Within each class we propose two ways to perform the calibration (covariate augmentation and imputation). These methods are shown in simulation experiments to effectively reduce the bias associated with the naive method that simply substitutes a pooled measurement for all individual measurements in the pool. In particular, the normality-based imputation method performs reasonably well in a variety of settings, even under skewed distributions of calibration errors. The methods are illustrated using data from the Collaborative Perinatal Project.
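A toy sketch of the plug-in regression calibration idea, not the paper's estimators: predict each subject's exposure from the pooled assay and the subject's own covariate, then run an ordinary logistic regression on the calibrated exposure. The pool structure, the pool-mean assay, and the linear calibration step are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, k = 600, 3                                    # 600 subjects pooled in groups of 3
pool = np.repeat(np.arange(n // k), k)
z = rng.normal(size=n)                           # individual-level covariate
x = 0.8 * z + rng.normal(size=n)                 # true exposure, never assayed individually
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.0 * x))))
w_pool = np.bincount(pool, weights=x) / k        # assay on the pooled specimen (pool mean)
zbar = np.bincount(pool, weights=z) / k          # pool-averaged covariate

# calibration step: relate the pooled measurement to the pool-averaged covariate
slope = sm.OLS(w_pool, sm.add_constant(zbar)).fit().params[1]
# plug-in prediction of each subject's exposure from the pool value and own covariate
x_hat = w_pool[pool] + slope * (z - zbar[pool])
# final model: ordinary logistic regression with the calibrated exposure
fit = sm.GLM(y, sm.add_constant(np.column_stack([x_hat, z])),
             family=sm.families.Binomial()).fit()
print(fit.params)                                # compare the exposure coefficient with the true 1.0
```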

9.
In this work, we introduce an entirely data-driven and automated approach to reveal disease-associated biomarker and risk factor networks from heterogeneous and high-dimensional healthcare data. Our workflow is based on Bayesian networks, which are a popular tool for analyzing the interplay of biomarkers. Usually, data require extensive manual preprocessing and dimension reduction to allow for effective learning of Bayesian networks. For heterogeneous data, this preprocessing is hard to automate and typically requires domain-specific prior knowledge. Here we combine Bayesian network learning with hierarchical variable clustering in order to detect groups of similar features and learn interactions between them in a fully automated way. We present an optimization algorithm for the adaptive refinement of such group Bayesian networks to account for a specific target variable, like a disease. The combination of Bayesian networks, clustering, and refinement yields low-dimensional but disease-specific interaction networks. These networks provide easily interpretable, yet accurate models of biomarker interdependencies. We test our method extensively on simulated data, as well as on data from the Study of Health in Pomerania (SHIP-TREND), and demonstrate its effectiveness using non-alcoholic fatty liver disease and hypertension as examples. We show that the group network models outperform available biomarker scores, while at the same time, they provide an easily interpretable interaction network.
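A rough sketch of the cluster-then-learn idea on simulated biomarkers. Note the substitution: a sparse Gaussian graphical model (graphical lasso) stands in here for the Bayesian network learning and refinement steps, purely to keep the example short and self-contained.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(13)
n, p = 400, 30
X = rng.normal(size=(n, p))
X[:, 10:20] += X[:, [0]]                          # two blocks of correlated biomarkers
X[:, 20:30] += X[:, [1]]

# step 1: hierarchical clustering of features on a correlation-based distance
dist = 1 - np.abs(np.corrcoef(X.T))
Z = linkage(dist[np.triu_indices(p, 1)], method="average")
groups = fcluster(Z, t=0.7, criterion="distance")

# step 2: one representative (mean of standardized members) per feature group
Xs = (X - X.mean(0)) / X.std(0)
reps = np.column_stack([Xs[:, groups == g].mean(axis=1) for g in np.unique(groups)])

# step 3: sparse interaction network among the group representatives
net = GraphicalLassoCV().fit(reps)
print(np.round(net.precision_, 2))                # nonzero off-diagonals = estimated interactions
```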

10.
Stratified Cox regression models with a large number of strata and small stratum sizes are useful in many settings, including matched case-control family studies. In the presence of measurement error in covariates and a large number of strata, we show that extensions of existing methods fail either to reduce the bias or to correct the bias under nonsymmetric distributions of the true covariate or the error term. We propose a nonparametric correction method for the estimation of regression coefficients, and show that the estimators are asymptotically consistent for the true parameters. Small sample properties are evaluated in a simulation study. The method is illustrated with an analysis of Framingham data.

11.
M. Tsujitani, G. G. Koch. Biometrics, 1991, 47(3): 1135-1141.
This article describes graphical diagnostic methods for log odds ratio regression models. To study the effects of an additional covariate on log odds ratio regression analysis, three types of residual plots based on weighted least squares (WLS) are discussed: (i) added variable plot (partial regression plot), (ii) partial residual plot, and (iii) augmented partial residual plot. These plots provide diagnostic procedures for identifying heterogeneity of error variances, outliers, or nonlinearity of the model. They are especially useful for clarifying whether including a covariate as a linear term is appropriate, or whether quadratic or other nonlinear transformations are preferable. A well-known data set for case-control studies is analyzed to illustrate the residual plots.
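The paper's plots are built from WLS fits of log odds ratio models; the sketch below shows the analogous partial residual construction for an ordinary logistic GLM, where curvature in the smoothed plot flags a covariate that should not enter linearly (simulated data).

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(14)
n = 800
x1 = rng.normal(size=n)
x2 = rng.uniform(-2, 2, n)
eta = -0.3 + 0.8 * x1 + 1.2 * x2**2               # x2 truly enters quadratically
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# fit a model that (incorrectly) includes x2 as a linear term
Xd = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.GLM(y, Xd, family=sm.families.Binomial()).fit()
mu = fit.fittedvalues                             # predicted probabilities
working_resid = (y - mu) / (mu * (1 - mu))        # working residuals for the logit link
partial_resid = working_resid + fit.params[2] * x2

plt.scatter(x2, partial_resid, s=6, alpha=0.4)
smooth = lowess(partial_resid, x2, frac=0.4)
plt.plot(smooth[:, 0], smooth[:, 1], color="red") # curvature => nonlinear term needed
plt.xlabel("x2"); plt.ylabel("partial residual")
plt.show()
```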

12.
L. Tian, W. Wang, L. J. Wei. Biometrics, 2003, 59(4): 1008-1015.
Suppose that the response variable in a well-executed clinical or observational study to evaluate a treatment is the time to a certain event, and a set of baseline covariates or predictors was collected for each study patient. Furthermore, suppose that a significant number of study patients had nontrivial, long-term adverse effects from the treatment. A commonly posed question is how to use these covariates from the study to identify future patients who would (or would not) benefit from the treatment. In this article, we present "point" and "interval" estimates for the set of covariate or predictor vectors associated with a specific patient survival status, e.g., long- (or short-) term survival, in the presence of censoring. These estimates can be easily displayed on a two-dimensional plane, even for the case with high-dimensional covariate vectors. These simple numerical and graphical procedures provide useful information for patient management and/or the design of future studies, which are key issues in pharmacogenomics with genetic markers. The new proposal is illustrated with a data set from a cancer study for treating multiple myeloma.

13.
T. L. Lai, M. C. Shih, S. P. Wong. Biometrics, 2006, 62(1): 159-167.
To circumvent the computational complexity of likelihood inference in generalized mixed models that assume linear or more general additive regression models of covariate effects, Laplace's approximations to multiple integrals in the likelihood have been commonly used without addressing the issue of adequacy of the approximations for individuals with sparse observations. In this article, we propose a hybrid estimation scheme to address this issue. The likelihoods for subjects with sparse observations use Monte Carlo approximations involving importance sampling, while Laplace's approximation is used for the likelihoods of other subjects that satisfy a certain diagnostic check on the adequacy of Laplace's approximation. Because of its computational tractability, the proposed approach allows flexible modeling of covariate effects by using regression splines and model selection procedures for knot and variable selection. Its computational and statistical advantages are illustrated by simulation and by application to longitudinal data from a fecundity study of fruit flies, for which overdispersion is modeled via a double exponential family.
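A self-contained illustration of the two approximations being compared, for a single subject in a random-intercept logistic model: Laplace's approximation to the subject's marginal likelihood versus a Monte Carlo importance-sampling estimate. The paper's diagnostic check and hybrid estimator are more involved; everything below is a simplified sketch with illustrative values.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def log_integrand(b, y, eta, sigma):
    """Bernoulli log-likelihood at random effect b plus the normal prior term."""
    p = 1 / (1 + np.exp(-(eta + b)))
    return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p)) + norm.logpdf(b, 0, sigma)

def laplace_and_is(y, eta, sigma, n_mc=2000, seed=0):
    rng = np.random.default_rng(seed)
    h = lambda b: -log_integrand(b, y, eta, sigma)
    opt = minimize_scalar(h)                      # mode of the integrand
    b_hat, eps = opt.x, 1e-4
    h2 = (h(b_hat + eps) - 2 * h(b_hat) + h(b_hat - eps)) / eps**2   # curvature at the mode
    log_laplace = -opt.fun + 0.5 * np.log(2 * np.pi / h2)
    # importance sampling with a normal proposal centred at the mode
    sd = np.sqrt(1 / h2)
    draws = rng.normal(b_hat, sd, n_mc)
    logw = np.array([log_integrand(b, y, eta, sigma) for b in draws]) \
           - norm.logpdf(draws, b_hat, sd)
    log_is = np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
    return log_laplace, log_is

# a subject with very sparse data (2 observations): the two approximations can differ
y, eta, sigma = np.array([1, 1]), np.array([0.2, -0.1]), 1.5
print(laplace_and_is(y, eta, sigma))
```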

14.
Proportional hazards regression for cancer studies
D. Ghosh. Biometrics, 2008, 64(1): 141-148.
There has been some recent work in the statistical literature for modeling the relationship between the size of cancers and probability of detecting metastasis, i.e., aggressive disease. Methods for assessing covariate effects in these studies are limited. In this article, we formulate the problem as assessing covariate effects on a right-censored variable subject to two types of sampling bias. The first is the length-biased sampling that is inherent in screening studies; the second is the two-phase design in which a fraction of tumors are measured. We construct estimation procedures for the proportional hazards model that account for these two sampling issues. In addition, a Nelson–Aalen type estimator is proposed as a summary statistic. Asymptotic results for the regression methodology are provided. The methods are illustrated by application to data from an observational cancer study as well as to simulated data.

15.
D. Liu, X. Lin, D. Ghosh. Biometrics, 2007, 63(4): 1079-1088.
We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and that the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations.
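A rough sketch of a score-type test for the pathway effect: residuals from the covariate-only model are combined with a kernel matrix over the pathway genes. The paper derives the null distribution from the mixed-model formulation; here a permutation reference and a Gaussian kernel are used purely for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def pathway_score_test(y, X, Z, n_perm=999, seed=7):
    """Score-type statistic Q = r'Kr for a pathway effect, with a permutation p-value."""
    rng = np.random.default_rng(seed)
    Xd = np.column_stack([np.ones(len(y)), X])
    H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)     # hat matrix of the covariate-only null model
    r = y - H @ y                                 # residuals under H0: no pathway effect
    K = rbf_kernel(Z)                             # Gaussian kernel over the pathway's genes
    q_obs = r @ K @ r
    q_perm = np.empty(n_perm)
    for b in range(n_perm):
        rp = rng.permutation(r)
        q_perm[b] = rp @ K @ rp
    return q_obs, (1 + np.sum(q_perm >= q_obs)) / (n_perm + 1)

rng = np.random.default_rng(8)
n = 150
X = rng.normal(size=(n, 2))                       # clinical covariates
Z = rng.normal(size=(n, 20))                      # expressions of genes in one pathway
y = X @ np.array([1.0, -0.5]) + np.sin(Z[:, 0]) * Z[:, 1] + rng.normal(size=n)
print(pathway_score_test(y, X, Z))                # (statistic, permutation p-value)
```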

16.
In the analysis of gene expression by microarrays there are usually few subjects but high-dimensional data. Using techniques such as the theory of spherical tests or suitable permutation tests, it is possible to sort the endpoints or to assign weights to them according to criteria determined by the data, while controlling the multiple type I error rate. The procedures developed so far are based on a sequential analysis of weighted p-values (corresponding to the endpoints), including the most extreme situation in which the weighting leads to a complete ordering of the p-values. When the data for the endpoints have approximately equal variances, these procedures show good power properties. In this paper, we consider an alternative procedure, which is based on completely sorting the endpoints, but smoothed in the sense that some perturbations in the sequence of the p-values are allowed. The procedure is relatively easy to perform, but has high power under the same restrictions as the weight-based procedures.

17.
Glioblastoma multiforme (GBM) is the most common and aggressive adult primary brain cancer, with <10% of patients surviving for more than 3 years. Demographic and clinical factors (e.g. age) and individual molecular biomarkers have been associated with prolonged survival in GBM patients. However, comprehensive systems-level analyses of molecular profiles associated with long-term survival (LTS) in GBM patients are still lacking. We present an integrative study of molecular data and clinical variables in long-term survivors (LTSs; patients surviving >3 years) to identify biomarkers associated with prolonged survival, and to assess the possible similarity of molecular characteristics between lower-grade glioma (LGG) and LTS GBM. We analyzed the relationship between multivariable molecular data and LTS in GBM patients from The Cancer Genome Atlas (TCGA), including germline and somatic point mutation, gene expression, DNA methylation, copy number variation (CNV) and microRNA (miRNA) expression, using logistic regression models. The molecular relationship between GBM LTS and LGG tumors was examined through cluster analysis. We identified 13, 94, 43, 29, and 1 significant predictors of LTS using Lasso logistic regression from the somatic point mutation, gene expression, DNA methylation, CNV, and miRNA expression data sets, respectively. Individually, DNA methylation provided the best prediction performance (AUC = 0.84). Combining multiple classes of molecular data into joint regression models did not improve prediction accuracy, but did identify additional genes that were not significantly predictive in individual models. PCA and clustering analyses showed that GBM LTS typically had gene expression profiles similar to non-LTS GBM. Furthermore, cluster analysis did not identify a close affinity between LTS GBM and LGG, nor did we find a significant association between LTS and secondary GBM. The absence of unique LTS profiles and the lack of similarity between LTS GBM and LGG indicate that there are multiple genetic and epigenetic pathways to LTS in GBM patients.
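A minimal sketch of the per-data-type analysis: an L1-penalized (Lasso) logistic regression for the long-term-survivor label with cross-validated AUC and a count of selected features. Data are simulated and all tuning choices are illustrative, not those used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
n, p = 150, 500                                    # e.g. patients x methylation features
X = rng.normal(size=(n, p))
lts = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)   # long-term survivor indicator

lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=5, penalty="l1", solver="liblinear",
                         scoring="roc_auc", cv=5, max_iter=5000),
)
auc = cross_val_score(lasso_logit, X, lts, cv=3, scoring="roc_auc").mean()
lasso_logit.fit(X, lts)
n_selected = int(np.sum(lasso_logit[-1].coef_ != 0))
print(f"cross-validated AUC: {auc:.2f}, features selected: {n_selected}")
```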

18.
MOTIVATION: New application areas of survival analysis, for example those based on microarray expression data, call for novel tools able to handle high-dimensional data. While classical (semi-)parametric techniques based on likelihood or partial likelihood functions are omnipresent in clinical studies, they are often inadequate for modeling when there are fewer observations than features in the data. Support vector machines (SVMs) and their extensions are in general found particularly useful for such cases, conceptually (a non-parametric approach), computationally (boiling down to a convex program which can be solved efficiently), theoretically (through their intrinsic relation with learning theory), and empirically. This article discusses such an extension of SVMs which is tuned towards survival data. A particularly useful feature is that this method can incorporate additional structure such as additive models, positivity constraints on the parameters, or regression constraints. RESULTS: Besides a discussion of the proposed methods, an empirical case study is conducted on both clinical and microarray gene expression data in the context of cancer studies. Results are expressed in terms of the logrank statistic, concordance index and hazard ratio. The reported performances indicate that the present method yields better models for high-dimensional data, while giving results comparable to those of classical techniques based on a proportional hazards model for clinical data.
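As an illustration of the general approach (not the authors' implementation), the scikit-survival package provides a ranking-based survival SVM; a sketch on simulated high-dimensional data, evaluated with the concordance index:

```python
import numpy as np
from sksurv.svm import FastSurvivalSVM
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

rng = np.random.default_rng(10)
n, p = 200, 500                                    # more features than observations
X = rng.normal(size=(n, p))
risk = 0.7 * X[:, 0] - 0.7 * X[:, 1]
t = rng.exponential(np.exp(-risk))
c = rng.exponential(2.0, n)
y = Surv.from_arrays(event=t <= c, time=np.minimum(t, c))

svm = FastSurvivalSVM(alpha=1.0, rank_ratio=1.0, max_iter=100, random_state=0)
svm.fit(X, y)
pred = svm.predict(X)                              # ranking scores, passed directly to the C-index
cindex = concordance_index_censored(y["event"], y["time"], pred)[0]
print(f"concordance index (training data): {cindex:.2f}")
```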

19.
We present a novel method for finding low-dimensional views of high-dimensional data: Targeted Projection Pursuit. The method proceeds by finding projections of the data that best approximate a target view. Two versions of the method are introduced; one version based on Procrustes analysis and one based on an artificial neural network. These versions are capable of finding orthogonal or non-orthogonal projections, respectively. The method is quantitatively and qualitatively compared with other dimension reduction techniques. It is shown to find 2D views that display the classification of cancers from gene expression data with a visual separation equal to, or better than, existing dimension reduction techniques. AVAILABILITY: source code, additional diagrams, and original data are available from http://computing.unn.ac.uk/staff/CGJF1/tpp/bioinf.html
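A minimal numpy sketch of the core idea, finding an orthonormal projection aligned with a target 2-D view via an SVD-based (Procrustes-type) solution; the published method refines this further and also offers the neural-network variant for non-orthogonal projections.

```python
import numpy as np

def targeted_projection(X, target):
    """Orthonormal p x 2 projection aligning X with a 2-D target view.

    Maximises trace(P' X' T) subject to P'P = I via the SVD of X'T,
    a Procrustes-type solution (the published method iterates further)."""
    Xc = X - X.mean(axis=0)
    U, _, Vt = np.linalg.svd(Xc.T @ target, full_matrices=False)
    P = U @ Vt                                      # p x 2 with orthonormal columns
    return P, Xc @ P

rng = np.random.default_rng(11)
n, p = 90, 2000                                     # e.g. 90 samples, 2000 genes
X = rng.normal(size=(n, p))
labels = np.repeat([0, 1, 2], 30)
X[:, 0] += 3 * labels                               # class signal hidden in one feature

# target view: spread the three classes out in the plane
target = np.column_stack([labels - 1.0, (labels == 1).astype(float)])
P, view = targeted_projection(X, target)
print(view[:3])                                     # 2-D coordinates approximating the target layout
```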

20.
Gene therapy is a very attractive strategy in experimental cancer therapy. Ideally, the approach aims to deliver therapeutic genes selectively to cancer cells. However, progress in the improvement of gene therapy formulations has been hampered by difficulties in measuring transgene delivery and in quantifying transgene expression in vivo. In clinical trials, endpoints rely almost exclusively on the analysis of biopsies by molecular and histopathological methods, which provide limited information. Therefore, to ensure the rational development of gene therapy, a crucial issue is the utilisation of technologies for the non-invasive monitoring of spatial and temporal gene expression in vivo upon administration of a gene delivery vector. Such imaging technologies would allow the generation of quantitative information about gene expression and the assessment of cancer gene therapy efficacy. In the past decade, progress has been made in the field of in vivo molecular imaging. This review highlights the various methods currently being developed in preclinical models.
