Similar Documents
 20 similar documents found (search took 15 ms)
1.
Bayesian lasso for semiparametric structural equation models   (Cited by: 1; self-citations: 0; citations by others: 1)
Guo R, Zhu H, Chow SM, Ibrahim JG. Biometrics, 2012, 68(2): 567-577
There has been great interest in developing nonlinear structural equation models and associated statistical inference procedures, including estimation and model selection methods. In this paper, a general semiparametric structural equation model (SSEM) is developed in which the structural equation is composed of nonparametric functions of exogenous latent variables and fixed covariates acting on a set of latent endogenous variables. A basis representation is used to approximate these nonparametric functions in the structural equation, and the Bayesian lasso method coupled with a Markov chain Monte Carlo (MCMC) algorithm is used for simultaneous estimation and model selection. The proposed method is illustrated using a simulation study and data from the Affective Dynamics and Individual Differences (ADID) study. Results demonstrate that our method can accurately estimate the unknown parameters and correctly identify the true underlying model.
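The core of the Bayesian lasso is a Laplace (double-exponential) prior on regression coefficients, which induces lasso-like shrinkage within MCMC. Below is a minimal PyMC sketch of that idea on a plain linear model, not the authors' full SSEM with latent variables and basis expansions; the data are simulated and all hyperparameters are illustrative.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5] + [0.0] * (p - 2))
y = X @ beta_true + rng.normal(scale=0.5, size=n)

with pm.Model():
    lam = pm.HalfCauchy("lam", beta=1.0)                     # global shrinkage scale
    beta = pm.Laplace("beta", mu=0.0, b=1.0 / lam, shape=p)  # lasso-type prior
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```

Posterior summaries of `beta` concentrated near zero flag terms that the basis-expansion version of the model would effectively drop.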

2.
A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a generally accepted rule, these methods are grouped into filters, wrappers, and embedded methods. More recently, a new group of methods has been added to the general framework of FS: ensemble techniques. The focus in this survey is on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery. We present them in a unified framework, using standardized notations in order to reveal their technical details and to highlight their common characteristics as well as their particularities.
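As a sketch of the filter paradigm the survey covers, here is marginal F-test ranking with scikit-learn, one standard filter for DEG discovery; the data, the number of informative genes, and k are all illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2000))     # 60 samples x 2000 "genes"
y = rng.integers(0, 2, size=60)     # two phenotype classes
X[y == 1, :10] += 1.0               # 10 truly differential genes

# Score each gene marginally with an ANOVA F-test and keep the top k.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
top_genes = np.argsort(selector.scores_)[::-1][:10]
print(top_genes)
```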

3.
Stochastic search variable selection (SSVS) is a Bayesian variable selection method that employs covariate-specific discrete indicator variables to select which covariates (e.g., molecular markers) are included in or excluded from the model. We present a new variant of SSVS where, instead of discrete indicator variables, we use continuous-scale weighting variables (which also take values between zero and one) to select covariates into the model. The improved model performance is shown and compared to standard SSVS using simulated and real quantitative trait locus mapping datasets. In our SSVS variant, decisions about phenotype-genotype associations are based on the median of the posterior distribution or on Bayes factors. We also show that using continuous-scale weighting variables can substantially improve the mixing properties of Markov chain Monte Carlo sampling compared to standard SSVS, and that the separation of association signals from nonsignals (control of the noise level) appears more efficient as well. Thus, the novel method provides an efficient new framework for SSVS analysis that additionally delivers the whole posterior distribution of the pseudo-indicators, which carries more information and may help in decision making.
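A minimal PyMC sketch of the continuous-weight idea, under the assumption that each pseudo-indicator follows a Beta prior and multiplies a normal effect size; this is a simplified stand-in for the authors' exact prior structure, and the data are simulated.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - X[:, 1] + rng.normal(size=n)

with pm.Model():
    w = pm.Beta("w", alpha=1.0, beta=1.0, shape=p)   # continuous pseudo-indicators in (0, 1)
    theta = pm.Normal("theta", mu=0.0, sigma=2.0, shape=p)
    beta = pm.Deterministic("beta", w * theta)       # weighted covariate effect
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=2)

# Inclusion decisions are then read off the posterior median of each w_j.
```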

4.
Yan J, Huang J. Biometrics, 2012, 68(2): 419-428
Cox models with time-varying coefficients offer great flexibility in capturing the temporal dynamics of covariate effects on right-censored failure times. Because not all covariate coefficients are time varying, model selection for such models presents an additional challenge: distinguishing covariates with time-varying coefficients from those with time-independent coefficients. We propose an adaptive group lasso method that not only selects important variables but also selects between time-independent and time-varying specifications of their presence in the model. Each covariate effect is partitioned into a time-independent part and a time-varying part, the latter of which is characterized by a group of coefficients of basis splines without intercept. Model selection and estimation are carried out through a fast, iterative group shooting algorithm. Our approach is shown to have good properties in a simulation study that mimics realistic situations with up to 20 variables. A real example illustrates the utility of the method.
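The decomposition can be sketched directly: each covariate effect is split into a constant column and a group of spline-basis interaction columns, and the group-lasso operation that zeroes out a whole spline group (declaring the effect time-independent) is a group soft-threshold. The sketch below builds only the design matrix and the thresholding operator; the iterative group shooting algorithm and the Cox partial likelihood are omitted, and all data are simulated.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0.1, 5.0, size=200))   # observed event/censoring times
z = rng.normal(size=200)                        # one covariate

# Time-varying part: B-spline basis without intercept, so that
# beta(t) = beta0 + sum_k gamma_k * B_k(t).
spline = SplineTransformer(n_knots=5, degree=3, include_bias=False)
B = spline.fit_transform(t.reshape(-1, 1))

# Design columns: z itself (time-independent part) and z * B_k(t)
# (the grouped time-varying part).
design = np.column_stack([z, z[:, None] * B])

def group_soft_threshold(gamma, lam):
    """Group-lasso shrinkage of the spline-coefficient group; an exactly
    zero group declares the covariate effect time-independent."""
    norm = np.linalg.norm(gamma)
    return np.zeros_like(gamma) if norm <= lam else (1.0 - lam / norm) * gamma
```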

5.
VanderWeele TJ, Shpitser I. Biometrics, 2011, 67(4): 1406-1413
We propose a new criterion for confounder selection when the underlying causal structure is unknown and only limited knowledge is available. We assume all covariates being considered are pretreatment variables and that for each covariate it is known (i) whether the covariate is a cause of treatment, and (ii) whether the covariate is a cause of the outcome. The causal relationships the covariates have with one another are assumed unknown. We propose that control be made for any covariate that is either a cause of treatment or of the outcome or both. We show that irrespective of the actual underlying causal structure, if any subset of the observed covariates suffices to control for confounding, then the set of covariates chosen by our criterion will also suffice. We show that other, commonly used criteria for confounding control do not have this property. We use formal theory concerning causal diagrams to prove our result, but the application of the result does not rely on familiarity with causal diagrams. An investigator simply needs to ask, "Is the covariate a cause of the treatment?" and "Is the covariate a cause of the outcome?" If the answer to either question is "yes," then the covariate is included for confounder control. We discuss some additional covariate selection results that preserve unconfoundedness and that may be of interest when used with our criterion.
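The criterion itself reduces to a one-line rule, sketched here as a helper function; the covariate names and the two-flag encoding are illustrative, not from the paper.

```python
def select_confounders(covariates):
    """VanderWeele-Shpitser criterion: control for any pretreatment
    covariate that is a cause of treatment, the outcome, or both.
    `covariates` maps name -> (causes_treatment, causes_outcome)."""
    return [name for name, (causes_trt, causes_out) in covariates.items()
            if causes_trt or causes_out]

covariates = {
    "age":       (True,  True),
    "site":      (True,  False),
    "severity":  (False, True),
    "eye_color": (False, False),   # neither -> excluded
}
print(select_confounders(covariates))  # ['age', 'site', 'severity']
```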

6.
Microarray studies, in order to identify genes associated with an outcome of interest, usually produce noisy measurements for a large number of gene expression features from a small number of subjects. One common approach to analyzing such high-dimensional data is to use linear errors-in-variables (EIV) models; however, current methods for fitting such models are computationally expensive. In this paper, we present two efficient screening procedures, namely, corrected penalized marginal screening (PMSc) and corrected sure independence screening (SISc), to reduce the number of variables for final model building. Both screening procedures are based on fitting corrected marginal regression models relating the outcome to each contaminated covariate separately, which can be computed efficiently even with a large number of features. Under mild conditions, we show that these procedures achieve screening consistency and reduce the number of features substantially, even when the number of covariates grows exponentially with sample size. In addition, if the true covariates are weakly correlated, we show that PMSc can achieve full variable selection consistency. Through a simulation study and an analysis of gene expression data for bone mineral density of Norwegian women, we demonstrate that the two new screening procedures make estimation of linear EIV models computationally scalable in high-dimensional settings, and improve finite sample estimation and selection performance compared with estimators that do not employ a screening stage.
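A sketch of the corrected marginal screening idea, assuming additive measurement error with known variance: the naive marginal slope cov(W_j, y)/var(W_j) is corrected by replacing the denominator with var(W_j) - sigma_u^2. This is simplified relative to PMSc/SISc, which additionally penalize and iterate; the data and the retained-feature count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma_u = 100, 5000, 0.5
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(size=n)
W = X + rng.normal(scale=sigma_u, size=(n, p))   # contaminated covariates

# Corrected marginal slopes: divide by the error-free variance estimate
# var(W_j) - sigma_u^2 instead of var(W_j); clip to guard against
# negative corrected variances in small samples.
cov_wy = (W - W.mean(axis=0)).T @ (y - y.mean()) / (n - 1)
var_x = np.maximum(W.var(axis=0, ddof=1) - sigma_u**2, 1e-3)
scores = np.abs(cov_wy / var_x)

keep = np.argsort(scores)[::-1][:50]   # features retained for final model building
```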

7.
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. The theory of statistical models is well established if the set of independent variables to consider is fixed and small: we can then assume that effect estimates are unbiased and that the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and we are often confronted with 10 to 30 candidate variables, a number frequently too large for all of them to enter a statistical model. We provide an overview of the available variable selection methods, which are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise the stability of a final model, the unbiasedness of regression coefficients, and the validity of p-values or confidence intervals. We therefore give pragmatic recommendations for the practicing statistician on applying variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities, based on resampling the entire variable selection process, to be routinely reported by software packages offering automated variable selection algorithms.
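The recommended stability investigation, resampling the entire selection process and reporting inclusion frequencies, can be sketched as follows with significance-based backward elimination; the 0.157 threshold roughly mimics AIC-based selection for one degree of freedom, and the data are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 8)), columns=[f"x{j}" for j in range(8)])
y = 1.5 * X["x0"].to_numpy() - X["x1"].to_numpy() + rng.normal(size=n)

def backward_eliminate(X, y, alpha=0.157):
    """Drop the least significant variable until all remaining
    p-values fall below alpha."""
    keep = list(X.columns)
    while keep:
        fit = sm.OLS(y, sm.add_constant(X[keep])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            break
        keep.remove(worst)
    return keep

# Resample the entire selection process and report inclusion frequencies.
freq = pd.Series(0.0, index=X.columns)
n_boot = 200
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)                         # bootstrap sample
    freq[backward_eliminate(X.iloc[idx].reset_index(drop=True), y[idx])] += 1
print((freq / n_boot).sort_values(ascending=False))           # stability per variable
```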

8.
In cardiovascular disease studies, a large number of risk factors are measured, but it often remains unknown whether all of them are relevant variables and whether their impact changes with time or remains constant. In addition, more than one kind of cardiovascular event can be observed in the same patient, and events of different types are possibly correlated. Different kinds of events are expected to be associated with different covariates, and the forms of the covariate effects may also vary between event types. To tackle these problems, we propose a multistate modeling framework for the joint analysis of multitype recurrent events and a terminal event. Model structure selection is performed to identify covariates with time-varying coefficients, time-independent coefficients, and null effects. This helps in understanding the disease process, as it detects relevant covariates and identifies the temporal dynamics of the covariate effects; it also provides a more parsimonious model that achieves better risk prediction. The performance of the proposed model and selection method is evaluated in numerical studies and illustrated on a real dataset from the Atherosclerosis Risk in Communities study.
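As a greatly simplified sketch of the setting (not the authors' multistate model or its structure selection), one can fit cause-specific Cox models per event type with lifelines and compare covariate effects across types; the time-varying versus time-independent classification would then be layered on top via penalized spline expansions as in the paper. The data below are simulated and the covariates illustrative.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(6)
n = 300
df = pd.DataFrame({
    "age": rng.normal(65, 8, n),
    "sbp": rng.normal(140, 15, n),
    "time": rng.exponential(scale=5.0, size=n),
    "event": rng.integers(0, 2, size=n),
    "type": rng.choice(["MI", "stroke", "death"], size=n),
})

# One cause-specific Cox model per event type.
models = {}
for ev_type, sub in df.groupby("type"):
    models[ev_type] = CoxPHFitter().fit(
        sub[["time", "event", "age", "sbp"]],
        duration_col="time", event_col="event")

# Comparing models[t].params_ across types shows which covariates
# matter for which event type, motivating the joint formulation.
```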

9.
Meta-regression is widely used in systematic reviews to investigate sources of heterogeneity and the association of study-level covariates with treatment effectiveness. Existing meta-regression approaches are successful in adjusting for baseline covariates, which include real study-level covariates (e.g., publication year) that are invariant within a study and aggregated baseline covariates (e.g., mean age) that differ for each participant but are measured before randomization within a study. However, these methods have several limitations in adjusting for post-randomization variables. Although post-randomization variables share a handful of similarities with baseline covariates, they differ in several aspects. First, baseline covariates can be aggregated at the study level presumably because they are assumed to be balanced by the randomization, while post-randomization variables are not balanced across arms within a study and are commonly aggregated at the arm level. Second, post-randomization variables may interact dynamically with the primary outcome. Third, unlike baseline covariates, post-randomization variables are themselves often important outcomes under investigation. In light of these differences, we propose a Bayesian joint meta-regression approach adjusting for post-randomization variables. The proposed method simultaneously estimates the treatment effect on the primary outcome and on the post-randomization variables. It takes into consideration both between- and within-study variability in post-randomization variables. Studies with missing data in either the primary outcome or the post-randomization variables are included in the joint model to improve estimation. Our method is evaluated by simulations and a real meta-analysis of major depressive disorder treatments.
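A minimal PyMC sketch of the joint idea: study-level random effects are shared between the submodel for the aggregated post-randomization variable and the submodel for the primary outcome, which captures between-study correlation. The model structure, likelihood scales, and priors below are illustrative assumptions, not the authors' specification, and missing-data handling is omitted.

```python
import numpy as np
import pymc as pm

# Simulated study-level data: y = primary outcome, m = aggregated
# post-randomization variable (e.g., mean dose actually received).
rng = np.random.default_rng(7)
K = 12                                    # number of studies
u = rng.normal(0.0, 0.2, K)               # true study effects
m = 0.5 + u + rng.normal(0.0, 0.1, K)
y = 0.3 + 0.8 * m + u + rng.normal(0.0, 0.15, K)

with pm.Model():
    tau = pm.HalfNormal("tau", sigma=0.5)
    u_k = pm.Normal("u_k", mu=0.0, sigma=tau, shape=K)   # shared study effects
    a_m = pm.Normal("a_m", mu=0.0, sigma=1.0)
    a_y = pm.Normal("a_y", mu=0.0, sigma=1.0)
    b = pm.Normal("b", mu=0.0, sigma=1.0)                # effect of m on y
    mu_m = a_m + u_k                                     # latent "true" m per study
    pm.Normal("m_obs", mu=mu_m, sigma=0.1, observed=m)
    pm.Normal("y_obs", mu=a_y + b * mu_m, sigma=0.15, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=7)
```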

10.
Dynamic treatment regimes (DTRs) consist of a sequence of decision rules, one per stage of intervention, that aim to recommend effective treatments for individual patients according to patient information history. DTRs can be estimated from models which include interactions between treatment and a (typically small) number of covariates which are often chosen a priori. However, with increasingly large and complex data being collected, it can be difficult to know which prognostic factors might be relevant in the treatment rule. Therefore, a more data-driven approach to selecting these covariates might improve the estimated decision rules and simplify models to make them easier to interpret. We propose a variable selection method for DTR estimation using penalized dynamic weighted least squares. Our method has the strong heredity property, that is, an interaction term can be included in the model only if the corresponding main terms have also been selected. We show theoretically that our method has both the double robustness property and the oracle property, and the newly proposed method compares favorably with other variable selection approaches in numerical studies. We further illustrate the proposed method on data from the Sequenced Treatment Alternatives to Relieve Depression study.
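A sketch of the dynamic weighted least squares backbone, assuming the balancing weights |A - pi(X)| from the dWOLS literature: treatment-covariate interaction columns define the decision rule, and the proposed method penalizes them under the strong heredity constraint. The penalization step itself is omitted here, and the data-generating model is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 500
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))     # confounded treatment
y = X[:, 0] + A * (1.0 + 0.8 * X[:, 1]) + rng.normal(size=n)

# Balancing weights |A - pi(X)| used by dynamic weighted least squares.
pi_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
w = np.abs(A - pi_hat)

# Weighted least squares with treatment-covariate interactions; the
# interaction coefficients (columns 5-7) define the decision rule and are
# the targets of the heredity-constrained penalization.
D = np.column_stack([np.ones(n), X, A, A * X[:, 0], A * X[:, 1], A * X[:, 2]])
sw = np.sqrt(w)
beta = np.linalg.lstsq(D * sw[:, None], y * sw, rcond=None)[0]
```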

11.
In this paper, a novel approach is introduced for modeling and clustering gene expression time series. Radial basis function neural networks are used to produce a generalized and smooth characterization of the expression time series. A co-expression coefficient is defined to evaluate the similarity of the models based on their temporal shapes and the distribution of the time points. The profiles are grouped using a fuzzy clustering algorithm incorporating the proposed co-expression coefficient metric. Results on artificial and real data are presented to illustrate the advantages of the metric and the method in grouping temporal profiles. The proposed metric has also been compared with the commonly used correlation coefficient under the same procedures, and the results show that the proposed method produces more biologically relevant clusters.
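A sketch of the smoothing step using SciPy's RBF interpolator, followed by correlating the smoothed curves on a fine grid as a simple stand-in for the paper's co-expression coefficient; the fuzzy clustering step is omitted, and the data, kernel, and smoothing level are illustrative.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(9)
t = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0])[:, None]      # sampling times
genes = np.sin(t.ravel())[None, :] + rng.normal(0, 0.1, (5, 6))

# Smooth each expression profile with an RBF model, then compare the
# model-based temporal shapes on a fine grid.
grid = np.linspace(0.0, 8.0, 100)[:, None]
smooth = np.array([
    RBFInterpolator(t, g, kernel="thin_plate_spline", smoothing=0.1)(grid)
    for g in genes
])
co_expr = np.corrcoef(smooth)   # similarity matrix of smoothed profiles
```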

12.
We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose, where we first modify the data to account for censoring. Three approaches to handling right-censored data (reweighting, mean imputation, and multiple imputation) are considered. Their performances are examined in a detailed simulation study and compared with those of full-data PLS and LASSO had there been no censoring. A major objective of this article is to investigate the performance of PLS and LASSO in the context of microarray data, where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100, with 10,000 covariates), LASSO performed better than a no-covariate model (i.e., noise-based prediction). The mean imputation method appears to best track the performance of the full-data PLS or LASSO. The mean imputation scheme is applied to an existing dataset on lung cancer. This reanalysis using mean-imputed PLS and LASSO identifies a number of genes that previous studies have linked to cancer or tumor activity.
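A sketch of the mean-imputation-then-fit pipeline: censored log-times are replaced by a crude empirical conditional mean (the average of observed event times exceeding the censoring time, a simplified stand-in for the paper's imputation), after which off-the-shelf PLS and lasso are fit. Dimensions and noise levels are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(10)
n, p = 100, 1000
X = rng.normal(size=(n, p))
log_t = X[:, :5].sum(axis=1) + rng.normal(size=n)   # true log failure times
c = X[:, 0] + rng.normal(loc=1.0, size=n)            # log censoring times
obs = np.minimum(log_t, c)
delta = (log_t <= c).astype(int)                     # 1 = event observed

# Mean imputation for censored observations.
y = obs.copy()
events = obs[delta == 1]
for i in np.where(delta == 0)[0]:
    larger = events[events > obs[i]]
    if larger.size:
        y[i] = larger.mean()

pls = PLSRegression(n_components=3).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)   # prediction errors of the two fits are then compared
```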

13.
Daniel R. Kowal. Biometrics, 2023, 79(3): 1853-1867
Linear mixed models (LMMs) are instrumental for regression analysis with structured dependence, such as grouped, clustered, or multilevel data. However, selection among the covariates, while accounting for this structured dependence, remains a challenge. We introduce a Bayesian decision analysis for subset selection with LMMs. Using a Mahalanobis loss function that incorporates the structured dependence, we derive optimal linear coefficients for (i) any given subset of variables and (ii) all subsets of variables that satisfy a cardinality constraint. Crucially, these estimates inherit shrinkage or regularization and uncertainty quantification from the underlying Bayesian model, and apply to any well-specified Bayesian LMM. More broadly, our decision analysis strategy deemphasizes the role of a single "best" subset, which is often unstable and limited in its information content, and instead favors a collection of near-optimal subsets. This collection is summarized by key member subsets and variable-specific importance metrics. Customized subset search and out-of-sample approximation algorithms are provided for more scalable computing. These tools are applied to simulated data and a longitudinal physical activity dataset, and demonstrate excellent prediction, estimation, and selection ability.
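For any given subset, the optimal linear coefficients under a Mahalanobis loss with matrix Sigma are a generalized-least-squares projection of the posterior predictive mean. A sketch follows; here Sigma and the predictive mean are synthetic stand-ins for quantities that would come from a fitted Bayesian LMM.

```python
import numpy as np
from itertools import combinations

def optimal_subset_coefs(X, y_hat, Sigma_inv, subset):
    """GLS projection of the posterior predictive mean y_hat onto the
    columns in `subset`, under the Mahalanobis loss defined by Sigma."""
    Xs = X[:, subset]
    return np.linalg.solve(Xs.T @ Sigma_inv @ Xs, Xs.T @ Sigma_inv @ y_hat)

rng = np.random.default_rng(11)
n, p = 50, 6
X = rng.normal(size=(n, p))
Sigma = 0.5 * np.eye(n) + 0.1                 # structured (e.g., grouped) dependence
Sigma_inv = np.linalg.inv(Sigma)
y_hat = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])   # posterior predictive mean

# Evaluate every subset of size 2 and keep a collection of near-optimal ones,
# rather than a single "best" subset.
losses = {}
for S in combinations(range(p), 2):
    b = optimal_subset_coefs(X, y_hat, Sigma_inv, list(S))
    r = y_hat - X[:, list(S)] @ b
    losses[S] = float(r @ Sigma_inv @ r)
near_optimal = sorted(losses, key=losses.get)[:3]
```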

14.
Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap-based approach: the variable selection procedure is applied to a large number of bootstrap samples successively, and the obtained models are examined, for instance, in terms of the inclusion of specific predictor variables. In this paper, we investigate a particularly important problem affecting this method in the case of categorical predictor variables with different numbers of categories and give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test, using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables, even if none of them have any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
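The comparison can be sketched directly: run likelihood-ratio-test selection on bootstrap samples (with replacement) and on subsamples (without replacement), and compare the inclusion frequencies of a many-category factor with no effect against a metric variable with no effect. The subsample fraction 0.632 and replication count are illustrative choices.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
n = 300
df = pd.DataFrame({
    "y": rng.normal(size=n),
    "x": rng.normal(size=n),                          # metric, no effect
    "g": rng.integers(0, 8, size=n).astype(str),      # 8 categories, no effect
})

def lr_selects(sub, term):
    """Keep `term` if the LR test against the null model is significant at 5%."""
    full = smf.ols(f"y ~ {term}", data=sub).fit()
    null = smf.ols("y ~ 1", data=sub).fit()
    return full.compare_lr_test(null)[1] < 0.05

for label, replace in [("bootstrap", True), ("subsample", False)]:
    m = n if replace else int(0.632 * n)
    hits = {"x": 0, "C(g)": 0}
    for _ in range(200):
        sub = df.iloc[rng.choice(n, size=m, replace=replace)]
        for term in hits:
            hits[term] += lr_selects(sub, term)
    print(label, {k: v / 200 for k, v in hits.items()})
```

With bootstrap resampling, the duplicated observations inflate apparent evidence for the many-category factor, which is the distortion the paper documents; subsampling avoids it.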

15.
Several penalization approaches have been developed to identify homogeneous subgroups based on a regression model with subject-specific intercepts in subgroup analysis. These methods typically apply concave penalty functions to pairwise comparisons of the intercepts, so that subjects with similar intercept values are assigned to the same group, closely mirroring the penalization approaches used for variable selection. Since Bayesian methods are commonly used in variable selection, it is worth considering corresponding approaches to subgroup analysis in the Bayesian framework. In this paper, a Bayesian hierarchical model with appropriate prior structures is developed for the pairwise differences of the intercepts in a regression model with subject-specific intercepts, which can automatically detect and identify homogeneous subgroups. A Gibbs sampling algorithm is also provided to select the hyperparameter and estimate the intercepts and covariate coefficients simultaneously; for large sample sizes, it handles the pairwise comparisons more efficiently than the time-consuming parameter estimation procedures of the penalization methods (e.g., the alternating direction method of multipliers). The effectiveness and usefulness of the proposed Bayesian method are evaluated through simulation studies and an analysis of a Cleveland Heart Disease dataset.
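A rough PyMC sketch of the fusion idea: subject-specific intercepts receive Laplace-type shrinkage on all pairwise differences, added here as a model potential. This is a simplified stand-in, not the paper's hierarchical prior or Gibbs sampler: the fusion strength is fixed rather than selected, and |d| is written as sqrt(d^2 + eps) for smooth sampling.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(13)
n = 40
true_group = rng.integers(0, 2, n)                  # two latent subgroups
alpha_true = np.where(true_group == 0, -1.0, 1.0)   # subject-specific intercepts
x = rng.normal(size=n)
y = alpha_true + 0.5 * x + rng.normal(0.0, 0.3, n)

with pm.Model():
    alpha = pm.Normal("alpha", mu=0.0, sigma=5.0, shape=n)
    beta = pm.Normal("beta", mu=0.0, sigma=1.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    # Laplace-type shrinkage on all pairwise intercept differences.
    d = alpha[:, None] - alpha[None, :]
    lam = 2.0   # fusion strength (fixed here; selected by Gibbs sampling in the paper)
    pm.Potential("fusion", -lam * pm.math.sum(pm.math.sqrt(d**2 + 1e-8)) / 2)
    pm.Normal("obs", mu=alpha + beta * x, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=13)

# Subjects whose posterior-mean intercepts coincide form a subgroup.
```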

16.
Errors-in-variables models in high-dimensional settings pose two challenges in application. First, the number of observed covariates is larger than the sample size, while only a small number of covariates are true predictors under an assumption of model sparsity. Second, the presence of measurement error can result in severely biased parameter estimates, and also affects the ability of penalized methods such as the lasso to recover the true sparsity pattern. A new estimation procedure called SIMulation-SELection-EXtrapolation (SIMSELEX) is proposed. This procedure makes double use of lasso methodology: first, the lasso is used to estimate sparse solutions in the simulation step, after which a group lasso is implemented to perform variable selection. The SIMSELEX estimator is shown to perform well in variable selection and has significantly lower estimation error than naive estimators that ignore measurement error. SIMSELEX can be applied in a variety of errors-in-variables settings, including linear models, generalized linear models, and Cox survival models. It is furthermore shown in the Supporting Information how SIMSELEX can be applied to spline-based regression models. A simulation study is conducted to compare the SIMSELEX estimators to existing methods in the linear and logistic model settings, and to evaluate performance compared to naive methods in the Cox and spline models. Finally, the method is used to analyze a microarray dataset that contains gene expression measurements of favorable histology Wilms tumors.
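The simulation and extrapolation steps can be sketched with a plain lasso standing in for SIMSELEX's lasso-plus-group-lasso pair: extra measurement error is added at increasing levels lambda, the estimator is averaged over replicates, and a quadratic in lambda is extrapolated back to lambda = -1, the error-free limit. Known error variance is assumed, and the grid, penalty, and replicate count are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(14)
n, p, sigma_u = 200, 50, 0.4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n)
W = X + rng.normal(scale=sigma_u, size=(n, p))   # error-contaminated covariates

# Simulation step: add extra error at levels lambda and average lasso fits.
lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
coefs = np.zeros((len(lambdas), p))
for i, lam in enumerate(lambdas):
    fits = [Lasso(alpha=0.1).fit(
                W + rng.normal(scale=np.sqrt(lam) * sigma_u, size=W.shape), y
            ).coef_ for _ in range(20)]
    coefs[i] = np.mean(fits, axis=0)

# Extrapolation step: quadratic in lambda per coefficient, evaluated at -1.
beta_simex = np.array([np.polyval(np.polyfit(lambdas, coefs[:, j], 2), -1.0)
                       for j in range(p)])
```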

17.
18.
When the explanatory variables of a linear model are split into two groups, two notions of collinearity can be defined: collinearity among the variables within each group, whose average is called residual collinearity, and collinearity between the two groups, called explained collinearity. Canonical correlation analysis provides information about both: large canonical correlation coefficients correspond to some of the small eigenvalues and eigenvectors of the correlation matrix and characterize the explained collinearity, while the other small eigenvalues of this matrix correspond to the residual collinearity. Predictors can then be selected from the canonical variables according to their partial correlation coefficients with the response variable. In the proposed application, the results obtained by selecting canonical variables are better than those given by classical regression and by principal component regression.
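A sketch with scikit-learn's CCA: large canonical correlations between the two predictor groups flag explained collinearity, and canonical variables can then be screened by their correlation with the response. Plain rather than partial correlation is used here for brevity, and the data are simulated.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(15)
n = 200
X1 = rng.normal(size=(n, 3))                              # first predictor group
X2 = 0.9 * X1[:, :2] + 0.1 * rng.normal(size=(n, 2))      # second group, collinear with the first
y = X1[:, 0] + rng.normal(size=n)

cca = CCA(n_components=2).fit(X1, X2)
U, V = cca.transform(X1, X2)

# Large canonical correlations reveal explained collinearity between groups.
canon_corr = [np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(2)]

# Screen canonical variables by their correlation with the response.
relevance = [abs(np.corrcoef(U[:, k], y)[0, 1]) for k in range(2)]
print(canon_corr, relevance)
```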

19.
20.
MOTIVATION: In a typical gene expression profiling study, the prime objective is to identify the genes that are differentially expressed between samples from two different tissue types. Commonly, standard analysis of variance (ANOVA)/regression is used to estimate the relative effects of these genes across the two types of samples from their respective arrays of expression levels. However, this technique becomes fundamentally flawed when there are unaccounted-for sources of variability in these arrays (latent variables attributable to different biological, environmental, or other factors relevant in the context). These factors distort the true picture of differential gene expression between the two tissue types and introduce spurious signals of expression heterogeneity. As a result, many genes that are actually differentially expressed are not detected, whereas many others are falsely identified as positives. Moreover, these distortions can differ from gene to gene, so it is not possible to remove them by simple array normalizations. Errors in both directions can lead to a serious loss of sensitivity and specificity, causing severe inefficiency in the underlying multiple testing problem. In this work, we identify the hidden effects of the underlying latent factors in a gene expression profiling study by partial least squares (PLS) and apply the ANCOVA technique, with the PLS-identified signatures of these hidden effects as covariates, to identify the genes that are truly differentially expressed between the two tissue types. RESULTS: We compare the performance of our method, SVA-PLS, with standard ANOVA and the relatively recent technique of surrogate variable analysis (SVA) on a wide variety of simulation settings (incorporating different effects of the hidden variable, under situations with varying signal intensities and gene groupings). In all settings, our method yields the highest sensitivity while maintaining reasonable values for the specificity, false discovery rate, and false nondiscovery rate. Application of our method to gene expression profiling for acute megakaryoblastic leukemia shows that it detects six additional genes that are missed by both standard ANOVA and SVA but may be relevant to this disease, as suggested by the existing literature.
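A sketch of the two-stage idea: remove the group (tissue-type) effect, extract a signature of the residual hidden variation, and include it as a covariate in a per-gene ANCOVA. The paper extracts the signature with PLS; a truncated SVD of the residual matrix is used below as a simpler stand-in that plays the same role, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(16)
n, G = 40, 500
group = np.repeat([0, 1], n // 2)                  # two tissue types
hidden = rng.normal(size=n)                         # unobserved latent factor
E = rng.normal(size=(n, G))                         # expression matrix
E[:, :20] += group[:, None] * 1.0                   # truly differential genes
E += 0.8 * np.outer(hidden, rng.normal(size=G))     # hidden-factor distortion

# Stage 1: remove the tissue-type effect, then extract the dominant
# signature of the residual (hidden) variation.
resid = E - np.array([E[group == g].mean(axis=0) for g in (0, 1)])[group]
U, s, Vt = np.linalg.svd(resid, full_matrices=False)
surrogate = U[:, 0]

# Stage 2: per-gene ANCOVA with the signature as a covariate.
D = np.column_stack([np.ones(n), group, surrogate])
beta, *_ = np.linalg.lstsq(D, E, rcond=None)   # beta[1, :] = adjusted group effects
```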
