Similar Literature
1.
Preprocessing high-dimensional censored datasets, such as microarray data, is generally considered an important technique for gaining stability by reducing potential noise in the data. When variable selection including inference is carried out with high-dimensional censored data, the objective is to obtain a smaller subset of variables and then perform the inferential analysis using model estimates based on the selected subset. This two-stage inferential analysis is prone to circularity bias because of the noise that may still remain in the dataset. In this work, I propose an adaptive preprocessing technique that uses the sure independence screening (SIS) idea to accomplish variable selection and reduce circularity bias, combined with several well-known refined high-dimensional methods such as the elastic net, adaptive elastic net, weighted elastic net, elastic net-AFT, and two greedy variable selection methods, TCS and PC-simple, all implemented with accelerated failure time (AFT) models. The proposed technique addresses several features, including collinearity between important and some unimportant covariates, which is often the case in high-dimensional settings under a variable selection framework, and different levels of censoring. Simulation studies, along with an empirical analysis of a real microarray dataset on mantle cell lymphoma, are carried out to demonstrate the performance of the adaptive preprocessing technique.
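
The screening step can be illustrated with a minimal sketch of plain sure independence screening: rank covariates by absolute marginal correlation with the response and keep the top n/log(n). This toy version ignores censoring (the paper's adaptive version works with AFT models), and all data and names are hypothetical:

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Rank covariates by absolute marginal correlation with the
    response and keep the top d (n/log(n) is a common default)."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))          # common default screening size
    # Marginal utility: |corr(X_j, y)| for each column j.
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    score = np.abs(Xc.T @ yc) / n
    keep = np.argsort(score)[::-1][:d]  # indices of the d top-ranked covariates
    return np.sort(keep)

# Toy use: 100 samples, 1000 covariates, 3 true signals.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=100)
print(sis_screen(X, y)[:10])
```

In the paper's pipeline, the retained covariates would then be passed to a refined method such as the elastic net.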

2.
We consider matched case-control familial studies, which match a group of patients, called "case probands," with a group of disease-free subjects, called "control probands," using a set of family-level matching variables. Family members of each proband are then recruited into the study. Of interest here are the familial aggregation of the response variable and the effects of subject-specific covariates on the response. We propose an estimating equation approach to jointly estimate the main effects and intrafamilial correlations for matched family studies with a continuous outcome. Only knowledge of the first two joint moments of the response variable is required. The induced estimators of the main effects and intrafamilial correlations are consistent and asymptotically normally distributed. We apply the proposed method to sleep apnea data. A simulation study demonstrates the usefulness of our approach.
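
As a rough off-the-shelf analogue (not the authors' estimator, which requires only the first two joint moments), a GEE with an exchangeable working correlation also yields joint estimates of main effects and a common intrafamilial correlation; the data below are simulated stand-ins:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical long-format family data: one row per subject,
# 'family' groups relatives, 'y' is the continuous outcome.
rng = np.random.default_rng(1)
fam = np.repeat(np.arange(50), 4)             # 50 families of size 4
u = rng.normal(size=50)[fam]                  # shared family effect
x = rng.normal(size=200)
y = 0.5 * x + u + rng.normal(size=200)
df = pd.DataFrame({"family": fam, "x": x, "y": y})

X = sm.add_constant(df[["x"]])
model = sm.GEE(df["y"], X, groups=df["family"],
               family=sm.families.Gaussian(),
               cov_struct=sm.cov_struct.Exchangeable())
res = model.fit()
print(res.params)                 # main effects
print(res.cov_struct.dep_params)  # estimated intrafamilial correlation
```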

3.
The present article deals with informative missing (IM) exposure data in matched case-control studies. When the missingness mechanism depends on the unobserved exposure values, modeling the missing data mechanism is inevitable. We therefore propose a full likelihood-based approach for handling IM data by positing a model for the selection probability and a parametric model for the partially missing exposure variable in the control population, along with a disease risk model. We develop an EM algorithm to estimate the model parameters. Three special cases are discussed in detail: (a) a binary exposure variable, (b) a normally distributed exposure variable, and (c) a lognormally distributed exposure variable. The method is illustrated by analyzing a real matched case-control dataset with a missing exposure variable. The performance of the proposed method is evaluated through simulation studies, and its robustness to violations of different types of model assumptions is also assessed.

4.
In individually matched case-control studies, when some covariates are incomplete, an analysis based on the complete data alone may result in a large loss of information in both the missing and the completely observed variables, usually producing bias and a loss of efficiency. In this article, we propose a new method for handling missing covariate data based on a missing-data-induced intensity approach when the missingness mechanism does not depend on case-control status, and we show that this leads to a generalization of the missing indicator method. We derive the asymptotic properties of the resulting estimates and, using an extensive simulation study, assess the finite-sample performance in terms of bias, efficiency, and 95% confidence coverage under several missing data scenarios. We also make comparisons with complete-case analysis (CCA) and with several previously proposed missing data methods. Our results indicate that, under the assumption of predictable missingness, the suggested method provides valid estimation of parameters, is more efficient than CCA, and is competitive with other, more complex methods of analysis. A case-control study of multiple myeloma risk and a polymorphism in the interleukin-6 receptor (IL-6-α) is used to illustrate our findings.
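
The classical missing indicator method that the paper generalizes can be sketched in a few lines: zero-fill the incomplete covariate and add a missingness flag to the risk model. The sketch below uses an unconditional logistic fit as a stand-in for the matched (conditional) analysis, with entirely hypothetical data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical case-control data with a partially missing covariate x2.
rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df.loc[rng.random(n) < 0.3, "x2"] = np.nan          # ~30% missing
df["case"] = rng.binomial(1, 0.5, size=n)

# Missing indicator method: R flags missingness, x2 is zero-filled,
# and both enter the risk model.
df["R"] = df["x2"].isna().astype(int)
df["x2_filled"] = df["x2"].fillna(0.0)

X = sm.add_constant(df[["x1", "x2_filled", "R"]])
res = sm.Logit(df["case"], X).fit(disp=0)           # stand-in for the matched analysis
print(res.params)
```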

5.
In the development of structural equation models (SEMs), observed variables are usually assumed to be normally distributed. However, this assumption is likely to be violated in many practical studies. Because non-normality of the observed variables in an SEM can arise from non-normal latent variables, non-normal residuals, or both, semiparametric modeling with an unknown distribution for the latent variables or for the residuals is needed. In this article, we find that an SEM becomes nonidentifiable when both the latent variable distribution and the residual distribution are unknown. Hence, it is impossible to estimate both distributions reliably without a parametric assumption on one or the other. We also find that the residuals in the measurement equation are more sensitive to the normality assumption than the latent variables, and that the negative impact of non-normal residuals on the estimation of parameters and distributions is more serious. Therefore, when there is no prior knowledge about parametric distributions for either the latent variables or the residuals, we recommend making a parametric assumption on the latent variables and modeling the residuals nonparametrically. We propose a semiparametric Bayesian approach using a truncated Dirichlet process with a stick-breaking prior to tackle the non-normality of residuals in the measurement equation. Simulation studies and a real data analysis demonstrate our findings and reveal the empirical performance of the proposed methodology. Free WinBUGS code to perform the analysis is available in the Supporting Information.
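
The truncated stick-breaking construction at the heart of the residual model is easy to state: draw v_k ~ Beta(1, α) and set w_k = v_k ∏_{j<k}(1 - v_j), forcing the last weight so the weights sum to one. A minimal numpy sketch with an illustrative truncation level and base measure:

```python
import numpy as np

def truncated_stick_breaking(alpha=1.0, K=20, rng=None):
    """Weights of a Dirichlet process truncated at K atoms:
    v_k ~ Beta(1, alpha), w_k = v_k * prod_{j<k} (1 - v_j)."""
    if rng is None:
        rng = np.random.default_rng()
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                       # force the weights to sum to one
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return w

rng = np.random.default_rng(3)
w = truncated_stick_breaking(alpha=2.0, K=20, rng=rng)
atoms = rng.normal(0.0, 2.0, size=20)    # base-measure draws for residual atoms
print(w.sum())                           # 1.0 by construction
residual_draw = rng.choice(atoms, p=w)   # one draw from the truncated DP mixture
```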

6.
Errors-in-variables models in high-dimensional settings pose two challenges in application. First, the number of observed covariates is larger than the sample size, while only a small number of covariates are true predictors under an assumption of model sparsity. Second, the presence of measurement error can result in severely biased parameter estimates and also affects the ability of penalized methods such as the lasso to recover the true sparsity pattern. A new estimation procedure called SIMulation-SELection-EXtrapolation (SIMSELEX) is proposed. This procedure makes double use of lasso methodology: the lasso is used to estimate sparse solutions in the simulation step, after which a group lasso is implemented to perform variable selection. The SIMSELEX estimator is shown to perform well in variable selection and has significantly lower estimation error than naive estimators that ignore measurement error. SIMSELEX can be applied in a variety of errors-in-variables settings, including linear models, generalized linear models, and Cox survival models. The Supporting Information furthermore shows how SIMSELEX can be applied to spline-based regression models. A simulation study is conducted to compare the SIMSELEX estimators with existing methods in the linear and logistic model settings, and to evaluate performance relative to naive methods in the Cox and spline models. Finally, the method is used to analyze a microarray dataset that contains gene expression measurements of favorable histology Wilms tumors.
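
The simulation and extrapolation steps can be sketched as follows, assuming the measurement-error variance is known; this simplified version fits only the lasso at each noise level and omits SIMSELEX's group-lasso selection step:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, sigma_u = 200, 50, 0.5            # sigma_u: known measurement-error SD
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [1.5, -1.0, 0.8]
y = X @ beta + rng.normal(size=n)
W = X + rng.normal(scale=sigma_u, size=(n, p))   # error-contaminated covariates

# Simulation step: add extra noise at levels lambda and refit the lasso,
# averaging over B replicates at each level.
lambdas, B = np.array([0.0, 0.5, 1.0, 1.5, 2.0]), 20
coefs = np.zeros((len(lambdas), p))
for i, lam in enumerate(lambdas):
    for _ in range(B):
        Wb = W + rng.normal(scale=sigma_u * np.sqrt(lam), size=W.shape)
        coefs[i] += Lasso(alpha=0.1).fit(Wb, y).coef_
    coefs[i] /= B

# Extrapolation step: fit a quadratic in lambda per coefficient and
# evaluate it at lambda = -1, the error-free limit.
beta_simex = np.array([np.polyval(np.polyfit(lambdas, coefs[:, j], 2), -1.0)
                       for j in range(p)])
print(beta_simex[:5])
```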

7.
Microarrays are widely used for examining differential gene expression, identifying single nucleotide polymorphisms, and detecting methylation loci. Multiple testing methods in microarray data analysis aim at controlling both Type I and Type II error rates; however, real microarray data do not always fit their distributional assumptions. Smyth's ubiquitous parametric method, for example, inadequately accommodates violations of normality assumptions, resulting in inflated Type I error rates. The Significance Analysis of Microarrays, another widely used method, is based on a permutation test and is robust to non-normally distributed data; however, its fold-change criteria are problematic and can critically alter the conclusions of a study as a result of compositional changes of the control data set in the analysis. We propose a novel approach combining resampling with empirical Bayes methods: the Resampling-based empirical Bayes Methods. This approach not only reduces false discovery rates for non-normally distributed microarray data, but is also impervious to the fold-change threshold, since no control data set selection is needed. Through simulation studies, sensitivities, specificities, total rejections, and false discovery rates are compared across Smyth's parametric method, the Significance Analysis of Microarrays, and the Resampling-based empirical Bayes Methods. Differences in false discovery rate control between the approaches are illustrated through a preterm delivery methylation study. The results show that the Resampling-based empirical Bayes Methods offer significantly higher specificity and lower false discovery rates than Smyth's parametric method when data are not normally distributed. They also offer higher statistical power than the Significance Analysis of Microarrays when the proportion of significantly differentially expressed genes is large, for both normally and non-normally distributed data. Finally, the Resampling-based empirical Bayes Methods generalize to next-generation sequencing RNA-seq data analysis.
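
A generic resampling ingredient of such procedures is a permutation null for the gene-wise statistics together with an empirical FDR estimate. The sketch below shows that generic ingredient only, not the authors' full empirical Bayes method; all settings are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n1 = n2 = 10; genes = 1000
grp1 = rng.standard_t(df=3, size=(genes, n1))       # deliberately non-normal
grp2 = rng.standard_t(df=3, size=(genes, n2))
grp2[:50] += 2.0                                    # 50 truly shifted genes

t_obs = stats.ttest_ind(grp1, grp2, axis=1).statistic

# Permutation null: shuffle sample labels B times, recompute all t statistics.
data = np.hstack([grp1, grp2]); B = 200
t_null = np.empty((B, genes))
for b in range(B):
    idx = rng.permutation(n1 + n2)
    t_null[b] = stats.ttest_ind(data[:, idx[:n1]], data[:, idx[n1:]], axis=1).statistic

cut = 3.0
called = np.sum(np.abs(t_obs) > cut)
expected_false = np.mean(np.sum(np.abs(t_null) > cut, axis=1))
print(f"called={called}, estimated FDR={expected_false / max(called, 1):.3f}")
```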

8.
A flexible method is proposed for group sequential clinical trials that allows an adaptive, data-driven sample size reassessment at each stage. By also adaptively assigning different weights to the stages, the total number of study parts can be steered toward an intended early or late end of the trial, depending on all information available prior to a stage. Although the null hypothesis is tested for rejection at each stage, the full level-α test is preserved at the end of the study. The proposed method is not restricted to normally distributed responses. The adaptive designs discussed are a useful tool when a priori information about the parameters involved in the trial is unavailable or subject to uncertainty. The presented learning algorithm enables the complete self-designing of a study.
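
One standard way to combine adaptively weighted stages while preserving the overall level is the weighted inverse-normal combination test, in which the weights are fixed before each stage even though the stage sample sizes are data-driven. A minimal sketch with illustrative weights and p-values, not the paper's specific self-designing rule:

```python
import numpy as np
from scipy.stats import norm

def weighted_inverse_normal(p_values, weights, alpha=0.025):
    """Combine stagewise one-sided p-values with prespecified weights;
    the overall level alpha is preserved regardless of the (data-driven)
    stage sample sizes, because the weights are fixed in advance."""
    z = norm.isf(np.asarray(p_values))               # stagewise z-scores
    w = np.asarray(weights, dtype=float)
    z_comb = (w @ z) / np.sqrt(np.sum(w ** 2))
    return z_comb, z_comb > norm.isf(alpha)

# Two stages: a promising interim led to a heavier weight on stage 2.
z, reject = weighted_inverse_normal([0.08, 0.02], weights=[1.0, 2.0])
print(f"combined z = {z:.3f}, reject = {reject}")
```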

9.
Many approaches for variable selection with multiply imputed (MI) data in the development of a prognostic model have been proposed, but no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or the lasso, fitting the final model on the most frequently selected variables over all MI and bootstrap data sets; and (II) model selection on the original MI data, using the lasso. In class II, the final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in 50% of the MI data sets; (iii) performing the lasso on the stacked MI data; or (iv) as in (iii) but using individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We also considered recalibrating the models to correct for overshrinkage due to the suboptimal penalty, by refitting either the linear predictor or all individual variables. We applied the methods to a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty shows the best performance in both approaches I and II. Stacking the MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
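
Approach (iii) with the 1-se rule can be sketched with scikit-learn, here with a continuous outcome for simplicity (the study used logistic regression) and simulated stand-ins for the completed data sets:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(6)
n, p, m = 150, 20, 5                   # m multiply imputed data sets
beta = np.zeros(p); beta[:3] = [1.0, -1.0, 0.5]
# Hypothetical stand-ins for m completed (imputed) data sets.
stacks = []
for _ in range(m):
    X = rng.normal(size=(n, p))
    stacks.append((X, X @ beta + rng.normal(size=n)))
X_stacked = np.vstack([X for X, _ in stacks])      # approach (iii): stack MI data
y_stacked = np.concatenate([y for _, y in stacks])

cv = LassoCV(cv=10).fit(X_stacked, y_stacked)
mean_mse = cv.mse_path_.mean(axis=1)
se = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])
# 1-se rule: largest penalty whose CV error is within one SE of the minimum.
within = mean_mse <= mean_mse.min() + se[mean_mse.argmin()]
alpha_1se = cv.alphas_[within].max()
selected = np.flatnonzero(Lasso(alpha=alpha_1se).fit(X_stacked, y_stacked).coef_)
print(alpha_1se, selected)
```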

10.
Clinicians are often interested in the effect of covariates on survival probabilities at prespecified study times. Because different factors can be associated with the risk of short- and long-term failure, a flexible modeling strategy is pursued. Given a set of multiple candidate working models, an objective methodology is proposed for constructing consistent and asymptotically normal estimators of the regression coefficients and the average prediction error of each working model, free of the nuisance censoring variable. It requires the conditional distribution of censoring given the covariates to be modeled. The model selection strategy uses step-up or step-down multiple hypothesis testing procedures that control either the proportion of false positives or the generalized familywise error rate when comparing models based on estimates of average prediction error. The problem can in fact be cast as a missing data problem, where augmented inverse probability weighted complete-case estimators of regression coefficients and prediction error can be used (Tsiatis, 2006, Semiparametric Theory and Missing Data). A simulation study and an analysis of a recent AIDS trial are provided.

11.
An issue in class-imbalanced learning is which assessment metric should be employed. So far, the precision-recall curve (PRC) is rarely used in practice compared with its alternative, the receiver operating characteristic (ROC) curve. This study investigates the performance of the PRC as the evaluation criterion for class-imbalanced data, focusing on a comparison of the PRC with the ROC. The advantages of the PRC over the ROC in assessing class-imbalanced data are also investigated and tested on our proposed algorithm by tuning all model parameters in simulation studies and real data examples. The results show that the PRC is competitive with the ROC as a performance measure for handling class-imbalanced data when tuning model parameters. The PRC can be considered an effective alternative criterion for preprocessing (such as variable selection on) skewed data and for building a classifier in class-imbalanced learning.
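
The contrast between the two metrics is easy to reproduce on a skewed toy problem with scikit-learn; the setup below is illustrative and unrelated to the study's algorithm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Skewed toy data: roughly 5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC summarizes TPR vs FPR; the PRC summarizes precision vs recall and
# is far more sensitive to performance on the minority class.
print("ROC-AUC          :", roc_auc_score(y_te, score))
print("Average precision:", average_precision_score(y_te, score))
prec, rec, _ = precision_recall_curve(y_te, score)
fpr, tpr, _ = roc_curve(y_te, score)
```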

12.
The high costs associated with many fermentation processes in an increasingly competitive industry make the prompt application of modern control techniques to industrial bioprocesses very desirable. This is often hampered, however, by the lack of adequate mathematical models on the one hand and by the absence of continuous, on-line measurement of the most relevant process variables on the other. This paper addresses these problems and offers a new strategy for controlling continuous bioprocesses using a hierarchical structure, such that neither structured process models nor continuous measurement of all relevant variables need to be available. The control system consists of two layers. The lower layer provides dynamic adaptive follow-up control of a continuously measured output, in our case the dissolved oxygen concentration. This variable is assumed to be strongly correlated with the key output variable, in our case the cellular concentration, which is not continuously available for measurement. The higher layer is then designed to maintain a desired profile of the key process output using a set-point optimising control technique. The Integrated System Optimisation and Parameter Estimation method used operates on an appropriately chosen steady-state performance criterion. A prerequisite for successful application of the proposed approach is an approximate steady-state model describing the relationship between the measured output and the key process output variable. Furthermore, occasional in situ, off-line, or laboratory measurements of the key output variable are needed. Promising simulation results for controlling the biomass concentration by manipulating the air flow rate in a continuous baker's yeast culture are presented.
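
The two-layer idea can be caricatured in a few lines: an inner PI loop tracks the dissolved-oxygen setpoint, while an outer layer occasionally adjusts that setpoint so a crude steady-state biomass estimate moves toward its target. All dynamics, gains, and the steady-state map below are invented for illustration:

```python
import numpy as np

# Invented first-order plant: DO responds to air flow u; biomass is
# inferred from DO via a crude steady-state stand-in model.
def plant_step(do, u, dt=0.1):
    return do + dt * (-0.5 * do + 0.4 * u)

def biomass_from_do(do):                 # stand-in steady-state model
    return 2.0 + 1.5 * do

kp, ki, integ = 2.0, 0.5, 0.0
do, do_sp, xb_target = 2.0, 3.0, 8.0
for t in range(500):
    # Lower layer: PI follow-up control of the measured DO.
    err = do_sp - do
    integ += 0.1 * err
    u = np.clip(kp * err + ki * integ, 0.0, 10.0)
    do = plant_step(do, u)
    # Higher layer: every 100 steps, adjust the DO setpoint so the
    # steady-state biomass estimate moves toward its target.
    if t % 100 == 99:
        do_sp += 0.2 * (xb_target - biomass_from_do(do))
print(f"final DO={do:.2f}, setpoint={do_sp:.2f}, est. biomass={biomass_from_do(do):.2f}")
```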

13.
The two-stage case-control design has been widely used in epidemiologic studies for its cost-effectiveness and improved study efficiency (White, 1982, American Journal of Epidemiology 115, 119-128; Breslow and Cain, 1988, Biometrika 75, 11-20). The evolution of modern biomedical studies has called for cost-effective designs with continuous outcome and exposure variables. In this article, we propose a new two-stage outcome-dependent sampling (ODS) scheme with a continuous outcome variable, where both the first-stage and the second-stage data come from ODS schemes. We develop a semiparametric empirical likelihood estimator for inference about the regression parameters in the proposed design. Simulation studies were conducted to investigate the small-sample behavior of the proposed estimator. We demonstrate that, for a given statistical power, the proposed design requires a substantially smaller sample size than the alternative designs. The proposed method is illustrated with an environmental health study conducted at the National Institutes of Health.

14.
The fence method (Jiang and others, 2008, Fence methods for mixed model selection, Annals of Statistics 36, 1669-1692) is a recently proposed strategy for model selection. It was motivated by the limitations of the traditional information criteria in selecting parsimonious models in some nonconventional situations, such as mixed model selection. Jiang and others (2009, A simplified adaptive fence procedure, Statistics & Probability Letters 79, 625-629) simplified the adaptive fence method of Jiang and others (2008) to make it more suitable and convenient to use in a wide variety of problems. Still, the current modification encounters computational difficulties when applied to high-dimensional and complex problems. To address this concern, we propose a restricted fence procedure that combines the idea of the fence with that of restricted maximum likelihood. Furthermore, we propose using the wild bootstrap to adaptively choose the tuning parameter used in the restricted fence. We focus on longitudinal studies and demonstrate the performance of the new procedure, and its comparison with other variable selection procedures including information criteria and shrinkage methods, in simulation studies. The method is further illustrated by a real-data example.
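
The basic fence principle is simple to sketch for fixed-effects linear models: measure the lack of fit Q(M) of each candidate model, keep models with Q(M) ≤ Q(M_full) + c, and pick the most parsimonious survivor. The sketch below uses RSS as Q and a fixed, illustrative cutoff c, whereas the paper's restricted fence chooses the cutoff adaptively via the wild bootstrap:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def rss(cols):
    """Lack-of-fit Q(M): residual sum of squares of the OLS fit on cols."""
    Xm = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    return np.sum((y - Xm @ beta) ** 2)

q_full = rss(tuple(range(p)))
c = 10.0  # fixed fence width; the adaptive fence would tune this
# Fence: keep models with Q(M) <= Q(M_full) + c, then pick the most
# parsimonious model inside the fence.
in_fence = [m for k in range(p + 1) for m in combinations(range(p), k)
            if rss(m) <= q_full + c]
best = min(in_fence, key=len)
print("selected covariates:", best)
```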

15.
This paper presents a new approach to confidence interval estimation for the between-study variance in meta-analysis with normally distributed responses, based on the concept of generalized variables. A simulation study shows that the coverage probabilities of the proposed confidence intervals are generally satisfactory. Moreover, the proposed approach can easily provide P-values for hypothesis testing. For meta-analyses of controlled clinical trials or epidemiological studies in which the responses are normally distributed, the proposed approach is an ideal candidate for making inference about the between-study variance.

16.
Colorectal cancer is the second leading cause of cancer-related deaths in the United States, with more than 130,000 new cases diagnosed each year. Clinical studies have shown that genetic alterations lead to different responses to the same treatment, despite the morphologic similarities of tumors. A molecular test prior to treatment could help determine an optimal treatment for a patient with regard to both toxicity and efficacy. This article introduces a statistical method for predicting and comparing multiple endpoints given different treatment options and the molecular profile of an individual. A latent variable-based multivariate regression model with a structured variance-covariance matrix is considered. The latent variables account for the correlated nature of multiple endpoints and accommodate the fact that some clinical endpoints are categorical variables while others are censored. The mixture normal hierarchical structure admits a natural variable selection rule. Inference is conducted by sampling from the posterior distribution via Markov chain Monte Carlo. We analyzed the finite-sample properties of the proposed method using simulation studies. The application to an advanced colorectal cancer study revealed associations between multiple endpoints and particular biomarkers, demonstrating the potential of individualizing treatment based on genetic profiles.

17.
We develop a new method for variable selection in a nonlinear additive function-on-scalar regression (FOSR) model. Existing methods for variable selection in FOSR have focused on linear effects of the scalar predictors, which can be a restrictive assumption in the presence of multiple continuously measured covariates. We propose a computationally efficient approach for variable selection in existing linear FOSR using functional principal component scores of the functional response, and we extend this framework to a nonlinear additive function-on-scalar model. The proposed method provides a unified and flexible framework for variable selection in FOSR, allowing nonlinear covariate effects. Numerical analysis using simulation studies illustrates the advantages of the proposed method over existing variable selection methods in FOSR, even when the underlying covariate effects are all linear. The proposed procedure is demonstrated on accelerometer data from the 2003-2004 cohorts of the National Health and Nutrition Examination Survey (NHANES) to understand the association between diurnal patterns of physical activity and the demographic, lifestyle, and health characteristics of the participants.
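
The linear-FOSR building block can be sketched as: project the response curves onto a few principal components, run a lasso of each score on the scalar covariates, and keep the union of selected predictors. Plain PCA stands in for FPCA, the data are hypothetical, and the nonlinear extension would replace the lasso with an additive-model selector:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
n, p, T = 200, 10, 50                      # subjects, scalar covariates, grid size
Z = rng.normal(size=(n, p))
tgrid = np.linspace(0, 1, T)
# Hypothetical functional responses: two covariates drive smooth curves.
Y = (np.outer(Z[:, 0], np.sin(2 * np.pi * tgrid))
     + np.outer(Z[:, 1], tgrid) + rng.normal(scale=0.3, size=(n, T)))

# FPCA step (plain PCA on the discretized curves as a stand-in).
K = 4
scores = PCA(n_components=K).fit_transform(Y)

# Variable selection: lasso of each FPC score on the scalar covariates;
# a predictor is kept if it is selected for any score.
selected = set()
for k in range(K):
    coef = LassoCV(cv=5).fit(Z, scores[:, k]).coef_
    selected |= set(np.flatnonzero(coef))
print("selected scalar covariates:", sorted(selected))
```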

18.
Automated variable selection procedures, such as backward elimination, are commonly employed for model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap-based approach: the variable selection procedure is applied to a large number of bootstrap samples successively, and the resulting models are examined, for instance, in terms of the inclusion of specific predictor variables. In this paper, we investigate a particularly important problem affecting this method in the case of categorical predictor variables with different numbers of categories and give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test, using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables, even if none of them have any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest using subsamples instead of bootstrap samples to bypass these drawbacks.
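
The resampling workflow, with subsamples drawn without replacement as the paper recommends, looks as follows; the lasso is used here as a convenient stand-in for likelihood-ratio-based backward elimination, and the inclusion frequencies summarize stability:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p = 300, 8
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

B, m = 200, int(0.632 * n)        # subsample size mimicking the bootstrap
freq = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=m, replace=False)   # subsample WITHOUT replacement
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    freq += coef != 0                            # tally inclusion per variable
print("inclusion frequencies:", np.round(freq / B, 2))
```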

19.
In health services and outcomes research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the maximum likelihood function plus a penalty, including the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously; it consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties, including standard error formulae, are provided. A simulation study shows that the new algorithm not only gives more accurate, or at least comparable, estimates but is also more robust than traditional stepwise variable selection. The proposed methods are applied to analyze health care demand in Germany using the open-source R package mpath.
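
The penalized EM itself is implemented in the R package mpath; as a Python point of reference, statsmodels can fit the unpenalized ZINB model that those penalties act on. The simulated data and settings below are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(10)
n = 1000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.7 * x)                       # NB mean depends on x
disp = 2.0                                       # NB dispersion parameter
counts = rng.negative_binomial(disp, disp / (disp + mu))
zero = rng.random(n) < 0.3                       # 30% structural zeros
y = np.where(zero, 0, counts)

X = sm.add_constant(x)
# Unpenalized ZINB fit; the paper adds LASSO/SCAD/MCP penalties via EM.
res = ZeroInflatedNegativeBinomialP(y, X, exog_infl=np.ones((n, 1)), p=2).fit(
    method="bfgs", maxiter=500, disp=0)
print(res.params)
```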
