Similar Documents (20 results)
1.
Multiple imputation (MI) is increasingly popular for handling multivariate missing data. Two general approaches are available in standard computer packages: MI based on the posterior distribution of the incomplete variables under a multivariate (joint) model, and fully conditional specification (FCS), which imputes missing values using a univariate conditional distribution for each incomplete variable given all the others, cycling iteratively through the univariate imputation models. In the context of longitudinal or clustered data, it is not clear whether these approaches yield consistent estimates of regression coefficients and variance components when the analysis model of interest is a linear mixed effects model (LMM) that includes both random intercepts and slopes, and when either the covariates alone, or both the covariates and the outcome, contain missing information. In the current paper, we compare the performance of seven MI methods for handling missing values in longitudinal and clustered data in the context of fitting LMMs with both random intercepts and slopes. We study the theoretical compatibility between the specific imputation models fitted under each of these approaches and the LMM, and conduct simulation studies in both the longitudinal and clustered data settings. The simulations were motivated by analyses of the association between body mass index (BMI) and quality of life (QoL) in the Longitudinal Study of Australian Children (LSAC). Our findings show that the relative performance of the MI methods varies according to whether the incomplete covariate has fixed or random effects and whether there is missingness in the outcome variable. Via simulation, we show that compatible imputation and analysis models yield consistent estimation of both regression parameters and variance components. We illustrate our findings with the analysis of the LSAC data.
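As a concrete illustration of the FCS approach described above, here is a minimal sketch using scikit-learn's IterativeImputer, a single-level chained-equations implementation (the data are synthetic; note that, per the abstract, such single-level imputation models may be incompatible with a random-slope LMM):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # toy data with three variables
X[rng.random(X.shape) < 0.2] = np.nan  # make roughly 20% of entries missing

# FCS / chained equations: each incomplete variable is regressed on all the
# others, cycling iteratively; sample_posterior=True draws imputations from
# the predictive distribution rather than filling in conditional means.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_completed = imputer.fit_transform(X)
```

For proper MI one would repeat this m times with different seeds, analyze each completed dataset, and pool the results with Rubin's rules.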

2.
Kidney and cardiovascular disease are widespread among populations with a high prevalence of diabetes, such as the American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989–1991), 2 (1993–1995), and 3 (1998–1999) was used to examine the performance of five methods of handling missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing-at-random models and one not-missing-at-random model were used to compare the performance of these techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the method that will provide the most accurate results.
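The three simplest strategies in the list above take only a few lines of pandas; a minimal sketch on hypothetical wide-format exam data (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format renal data: one row per participant, one column
# per exam.
df = pd.DataFrame({"exam1": [1.0, 0.9, np.nan],
                   "exam2": [1.1, np.nan, 1.4],
                   "exam3": [np.nan, 1.0, 1.5]})

listwise = df.dropna()                                        # listwise deletion
serial_mean = df.apply(lambda r: r.fillna(r.mean()), axis=1)  # mean of serial measures
adjacent = df.ffill(axis=1)                                   # adjacent value (carry previous exam forward)
```

Multiple imputation and pattern-mixture models require explicit modeling of the missingness mechanism and are not shown here.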

3.
Liu M, Taylor JM, Belin TR. Biometrics 2000;56(4):1157–1163.
This paper outlines a multiple imputation method for handling missing data in designed longitudinal studies. A random coefficients model is developed to accommodate incomplete multivariate continuous longitudinal data. Multivariate repeated measures are jointly modeled; specifically, an i.i.d. normal model is assumed for time-independent variables, and a hierarchical random coefficients model is assumed for time-dependent variables in a regression conditional on the time-independent variables and time, with heterogeneous error variances across variables and time points. Gibbs sampling is used to draw model parameters and to impute the missing observations. An application to data from a study of startle reactions illustrates the model. A simulation study compares the multiple imputation procedure to the weighting approach of Robins, Rotnitzky, and Zhao (1995, Journal of the American Statistical Association 90, 106-121), which can be used to address similar data structures.
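The alternation at the heart of such a sampler — draw parameters given the completed data, then redraw the missing values given the parameters — can be shown in heavily simplified form. The sketch below runs data augmentation for a single i.i.d. normal variable under a flat prior, not the paper's hierarchical random coefficients model:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([1.2, 0.8, np.nan, 1.5, np.nan, 1.1, 0.9, 1.3])
miss = np.isnan(y)
y[miss] = np.nanmean(y)                 # crude starting values
n = len(y)

draws = []
for _ in range(2000):
    # (1) Draw sigma^2 and mu from their posterior given the completed data
    #     (standard results for a normal model with a flat prior).
    ybar, s2 = y.mean(), y.var(ddof=1)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # (2) Data augmentation: redraw the missing values from N(mu, sigma^2).
    y[miss] = rng.normal(mu, np.sqrt(sigma2), size=miss.sum())
    draws.append((mu, sigma2))
```

Retaining the completed y vectors from several well-separated iterations yields the multiply imputed datasets.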

4.
In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller intraclass correlations (ICCs) lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random, and cases in which data are missing at random are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared.
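For reference, the MI variance estimator in question is the standard Rubin's-rules combination of within- and between-imputation variability; a minimal sketch:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m point estimates and their within-imputation variances (Rubin's rules)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()              # pooled point estimate
    w = u.mean()                 # within-imputation variance
    b = q.var(ddof=1)            # between-imputation variance
    t = w + (1 + 1 / m) * b      # total MI variance
    return qbar, t

est, total_var = rubin_pool([1.02, 0.97, 1.05], [0.040, 0.050, 0.045])
```

The abstract's bias result concerns t as computed here when the imputation model includes fixed cluster effects.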

5.
Systolic blood pressure (SBP) is an age-dependent complex trait for which both environmental and genetic factors may play a role in explaining variability among individuals. We performed a genome-wide scan of the rate of change in SBP over time on the Framingham Heart Study data and one randomly selected replicate of the simulated data from Genetic Analysis Workshop 13. We used a variance-component model to carry out linkage analysis and a Markov chain Monte Carlo-based multiple imputation approach to recover missing information. Furthermore, we adopted two selection strategies alongside the multiple imputation to deal with subjects taking antihypertensive treatment. The simulated data were used to compare these two strategies, to explore the effectiveness of the multiple imputation in recovering varying degrees of missing information, and to assess its impact on the linkage analysis results. For the Framingham data, the marker with the highest LOD score for SBP slope was found on chromosome 7. Interestingly, we found that SBP slopes were not heritable in males but were in females; the marker with the highest LOD score was found on chromosome 18. Using the simulated data, we found that handling treated subjects with multiple imputation improved the linkage results. We conclude that multiple imputation is a promising approach for recovering missing information in longitudinal genetic studies and hence for improving subsequent linkage analyses.

6.
7.

Introduction

A common problem in metabolomics data analysis is the existence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely used solutions to this problem is imputation of missing values using a k-nearest neighbors (kNN) algorithm to estimate missing metabolite abundances (a baseline kNN sketch follows this abstract). kNN implicitly assumes that missing values are uniformly distributed at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they fall below the limit of detection of the analytical instrumentation.

Objectives

Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.

Methods

We compare the imputation errors of the original kNN algorithm (under two distance metrics), NS-kNN, and KNN-TN, a recently developed algorithm, when applied to multiple experimental datasets with different types and levels of missing data.

Results

Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.

Conclusion

Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.
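The baseline these comparisons build on is standard kNN imputation, which is available off the shelf; a minimal sketch with scikit-learn (NS-kNN itself is the paper's contribution and is not reproduced here):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [1.1, np.nan, 3.0],
              [0.9, 2.1, 2.9],
              [1.2, 1.9, 3.1]])

# Each missing value is estimated from the k most similar samples, with
# similarity computed over the features both samples have observed.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)
```

This baseline treats every missing value identically; the abstract's point is that values missing because they fall below the limit of detection violate that assumption.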

8.
Genetic mutations may interact to increase the risk of human complex diseases. Mapping of multiple interacting disease loci in the human genome has recently shown promise in detecting genes with little main effect. The power of interaction association mapping, however, can be greatly influenced by the set of single nucleotide polymorphisms (SNPs) genotyped in a case-control study. Previous imputation methods focus only on imputation of individual SNPs, without considering the joint distribution of possible interactions. We present a new method that simultaneously detects multilocus interaction associations and imputes missing SNPs from a full Bayesian model. Our method treats both the case-control sample and the reference data as random observations. The output of our method is the posterior probabilities of SNPs for their marginal and interacting associations with the disease. Using simulations, we show that the method produces accurate and robust imputation with little overfitting. We further show that, with the type I error rate maintained at a common level, SNP imputation can consistently and sometimes substantially improve the power of detecting disease interaction associations. We use a data set of inflammatory bowel disease to demonstrate the application of our method.

9.
We focus on the problem of generalizing a causal effect estimated in a randomized controlled trial (RCT) to a target population described by a set of covariates from observational data. Available methods, such as inverse propensity sampling weighting, are not designed to handle missing values, which are nevertheless common in both data sources. Beyond coupling the assumptions for causal effect identifiability with those for the missing values mechanism and defining appropriate estimation strategies, one difficulty is to account for the specific structure of the data: two sources, with treatment and outcome available only in the RCT. We propose three multiple imputation strategies to handle missing values when generalizing treatment effects, each handling the multisource structure of the problem differently (separate imputation, joint imputation with a fixed effect, joint imputation ignoring source information). As an alternative to multiple imputation, we also propose a direct estimation approach that treats incomplete covariates as semidiscrete variables. The multiple imputation strategies and this alternative rely on different sets of assumptions concerning the impact of missing values on identifiability. We discuss these assumptions and assess the methods through an extensive simulation study. This work is motivated by the analysis of a large registry of over 20,000 major trauma patients and an RCT studying the effect of tranexamic acid administration on mortality in major trauma patients admitted to intensive care units. The analysis illustrates how the handling of missing values can affect the conclusion about the effect generalized from the RCT to the target population.
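For orientation, one common form of inverse propensity sampling weighting can be sketched as follows, assuming (unlike the paper's setting) fully observed covariates; all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipsw_ate(X_rct, a, y, X_obs):
    """Reweight the RCT toward the target population described by X_obs."""
    X = np.vstack([X_rct, X_obs])
    s = np.r_[np.ones(len(X_rct)), np.zeros(len(X_obs))]  # 1 = sampled into the RCT
    pi = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X_rct)[:, 1]
    w = (1 - pi) / pi                 # odds-based weights toward the population
    w1, w0 = w * a, w * (1 - a)       # a is the 0/1 treatment indicator in the RCT
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```

The paper's contribution is precisely what this sketch assumes away: strategies for estimating such quantities when the covariates themselves are incomplete.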

10.
In some occupational health studies, observations occur for both exposed and unexposed individuals. If the levels of all exposed individuals have been detected, a two-part zero-inflated log-normal model is usually recommended, which assumes that the data have a probability mass at zero for unexposed individuals and a continuous response at values greater than zero for exposed individuals. However, many quantitative exposure measurements are subject to left censoring due to values falling below assay detection limits. A zero-inflated log-normal mixture model is suggested in this situation, since unexposed zeros are not distinguishable from exposed values below the detection limit. In the context of this mixture distribution, the information contributed by values falling below a fixed detection limit is used only to estimate the probability of being unexposed. We consider sample size and statistical power calculation when comparing the median of the exposed measurements to a regulatory limit. We calculate the required sample size for the data presented in a recent paper comparing benzene TWA exposure data to a regulatory occupational exposure limit. A simulation study is conducted to investigate the performance of the proposed sample size calculation methods.
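The mixture likelihood described above can be written down directly; a minimal sketch (variable names illustrative, with censored entries recorded at the detection limit):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def negloglik(theta, x, below, lod):
    """Zero-inflated log-normal likelihood with left censoring at `lod`.

    theta = (p, mu, log_sigma): p is the probability of being unexposed;
    mu and sigma describe log-exposure among the exposed. `below` flags
    measurements under the detection limit.
    """
    p, mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    z = (np.log(lod) - mu) / sigma
    # A value below the limit is either an unexposed zero or an exposed
    # value that fell under the limit: it contributes p + (1 - p) * Phi(z).
    ll_cens = np.log(p + (1 - p) * norm.cdf(z))
    # A detected value is exposed: log-normal log-density.
    ll_obs = np.log(1 - p) + norm.logpdf((np.log(x) - mu) / sigma) - np.log(sigma * x)
    return -np.sum(np.where(below, ll_cens, ll_obs))

# res = minimize(negloglik, x0=(0.3, 0.0, 0.0), args=(x, below, lod),
#                method="L-BFGS-B",
#                bounds=[(1e-6, 1 - 1e-6), (None, None), (None, None)])
```

Note how the censored entries enter the likelihood only through the mixture term, consistent with the abstract's remark.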

11.

Background

Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods; however, studies have suggested that simple methods for filling in missing data can be just as accurate as complex ones. The objective of this study was to implement a number of simple and more complex imputation methods and assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation.

Methods

Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Model discrimination was assessed and compared using the C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment (a sketch of this adjustment follows the abstract).

Results

The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4%) had missing data. Family history had the highest proportion of missing data (25%). Imputation of the outcome, undiagnosed diabetes, was most frequent under stochastic regression imputation (163 individuals). Overall, deletion resulted in the lowest model performance, while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model, and Rotterdam Predictive model. Multiple imputation yielded the highest C-statistic only for the Rotterdam Predictive model, and this was matched by simpler imputation methods.

Conclusions

Deletion was confirmed as a poor technique for handling missing data. However, despite the oft-emphasized disadvantages of simpler imputation methods, this study showed that implementing them yields predictive utility for undiagnosed diabetes similar to that obtained with multiple imputation.
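The simple intercept adjustment used for recalibration amounts to refitting only the intercept while holding the published coefficients fixed via an offset; a minimal sketch with statsmodels (names illustrative):

```python
import numpy as np
import statsmodels.api as sm

def recalibrate_intercept(lp, y):
    """Refit the intercept only, keeping the published model's linear
    predictor lp fixed as an offset; y is the observed 0/1 outcome."""
    X = np.ones((len(y), 1))
    fit = sm.GLM(y, X, family=sm.families.Binomial(), offset=lp).fit()
    return float(fit.params[0])   # correction to add to the linear predictor
```

Because the C-statistic depends only on the ranking of predictions, this recalibration changes calibration but leaves discrimination untouched.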

12.
Missing outcomes or irregularly timed multivariate longitudinal data frequently occur in clinical trials or biomedical studies. The multivariate t linear mixed model (MtLMM) has been shown to be a robust approach to modeling multi-outcome continuous repeated measures in the presence of outliers or heavy-tailed noise. This paper presents a framework for fitting the MtLMM with an arbitrary missing data pattern embodied within multiple outcome variables recorded at irregular occasions. To address the serial correlation among the within-subject errors, a damped exponential correlation structure is considered in the model. Under the missing at random mechanism, an efficient alternating expectation-conditional maximization (AECM) algorithm is used to carry out estimation of parameters and imputation of missing values. Techniques for the estimation of random effects and the prediction of future responses are also investigated. Applications to an HIV-AIDS study and a pregnancy study involving multivariate longitudinal data with missing outcomes, as well as a simulation study, highlight the superiority of the MtLMM in providing more adequate estimation, imputation, and prediction.

13.
Multiple imputation has become a widely accepted technique to deal with the problem of incomplete data. Typically, imputation of missing values and the statistical analysis are performed separately. Therefore, the imputation model has to be consistent with the analysis model. If the data are analyzed with a mixture model, the parameter estimates are usually obtained iteratively. Thus, if the data are missing not at random, parameter estimation and treatment of missingness should be combined. We solve both problems by simultaneously imputing values using the data augmentation method and estimating parameters using the EM algorithm. This iterative procedure ensures that the missing values are properly imputed given the current parameter estimates. Properties of the parameter estimates were investigated in a simulation study. The results are illustrated using data from the National Health and Nutrition Examination Survey.

14.
Longitudinal data often exhibit missingness with monotone and/or intermittent missing patterns. Multiple imputation (MI) is popularly employed for the analysis of missing longitudinal data. In particular, the MI-GEE method has been proposed for inference with generalized estimating equations (GEE) when missing data are imputed via MI. However, little is known about how to perform model selection with multiply imputed longitudinal data. In this work, we extend the existing GEE model selection criteria, including the "quasi-likelihood under the independence model criterion" (QIC) and the "missing longitudinal information criterion" (MLIC), to accommodate multiply imputed datasets for selection of the MI-GEE mean model. Based on real data analyses from a schizophrenia study and an AIDS study, as well as simulations under nonmonotone missingness with a moderate proportion of missing observations, we conclude that: (i) more than a few imputed datasets are required for stable and reliable model selection in MI-GEE analysis; (ii) the MI-based GEE model selection methods with a suitable number of imputations generally perform well, while naively applying existing model selection methods and simply ignoring missing observations may perform very poorly; (iii) the model selection criteria based on improper (frequentist) multiple imputation generally perform better than their analogues based on proper (Bayesian) multiple imputation.
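As background, MI-GEE fits the same marginal model to each completed dataset and combines the results; a minimal sketch with statsmodels (columns y, x, and id are illustrative; the model-selection criteria themselves are not shown):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def mi_gee(imputed_datasets):
    """Fit one GEE per completed dataset and pool the coefficient of x."""
    fits = [smf.gee("y ~ x", groups="id", data=d,
                    cov_struct=sm.cov_struct.Exchangeable(),
                    family=sm.families.Gaussian()).fit()
            for d in imputed_datasets]
    q = np.array([f.params["x"] for f in fits])
    u = np.array([f.bse["x"] ** 2 for f in fits])
    m = len(q)
    # Rubin's rules: within- plus inflated between-imputation variance.
    return q.mean(), u.mean() + (1 + 1 / m) * q.var(ddof=1)
```

The paper's question is how to carry criteria such as QIC and MLIC, computed per fit, across the m fits; conclusion (i) above suggests m should not be too small.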

15.
Multiple imputation (MI) is used to handle missing at random (MAR) data. Despite warnings from statisticians, continuous variables are often recoded into binary variables. With MI it is important that the imputation and analysis models are compatible; variables should be imputed in the form in which they appear in the analysis model. For a binary variable recoded from a continuous one, more accurate imputations may be obtained by imputing the underlying continuous variable. We conducted a simulation study to explore how best to impute a binary variable that was created from an underlying continuous variable. We generated a completely observed continuous outcome associated with an incomplete binary covariate that is a categorized version of an underlying continuous covariate, and an auxiliary variable associated with the underlying continuous covariate. We simulated data with several sample sizes, and set 25% and 50% of the data in the covariate to MAR dependent on the outcome and the auxiliary variable. We compared the performance of five imputation methods: (a) imputation of the binary variable using logistic regression; (b) imputation of the continuous variable using linear regression, then categorizing into the binary variable; (c, d) imputation of both the continuous and binary variables using fully conditional specification (FCS) and multivariate normal imputation; (e) substantive-model compatible (SMC) FCS. Bias and standard errors were large when only the continuous variable was imputed. The other methods performed adequately. Imputation of both the binary and continuous variables using FCS often encountered mathematical difficulties. We recommend the SMC-FCS method, as it performed best in our simulation studies.

16.
In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated after multiple imputation of the missing values are typically lower than those estimated after discarding the missing values. Accounting for missing values through multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.
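For reference, the Hardy-Weinberg check applied to (complete or imputed) genotype counts is a one-degree-of-freedom chi-square test, and the inbreeding coefficient measures the heterozygote deficit; a minimal sketch:

```python
import numpy as np
from scipy.stats import chi2

def hwe_test(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)                    # frequency of allele A
    expected = np.array([n * p**2, 2 * n * p * (1 - p), n * (1 - p)**2])
    observed = np.array([n_aa, n_ab, n_bb])
    stat = np.sum((observed - expected) ** 2 / expected)
    f_hat = 1 - n_ab / (2 * n * p * (1 - p))           # inbreeding coefficient
    return stat, chi2.sf(stat, df=1), f_hat

stat, pval, f_hat = hwe_test(n_aa=300, n_ab=500, n_bb=200)
```

A lack of heterozygotes, as reported in the abstract, appears as a positive f_hat; under multiple imputation the test would be run on each completed dataset and the results combined.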

17.
MOTIVATION: Missing values are problematic for the analysis of microarray data. Imputation methods have been compared in terms of the similarity between imputed and true values in simulation experiments, not in terms of their influence on the final analysis. The focus has been on data missing at random, although entries may also be missing not at random. RESULTS: We investigate the influence of imputation on the detection of differentially expressed genes from cDNA microarray data. We apply ANOVA for microarrays and SAM, and look at the differentially expressed genes that are lost because of imputation. We show that this new measure provides useful information that the traditional root mean squared error cannot capture. We also show that the type of missingness matters: imputing 5% missing not at random has the same effect as imputing 10-30% missing at random. We propose a new method for imputation (LinImp), fitting a simple linear model for each channel separately, and compare it with the widely used KNNimpute method. For 10% missing at random, KNNimpute leads to twice as many lost differentially expressed genes as LinImp. AVAILABILITY: The R package for LinImp is available at http://folk.uio.no/idasch/imp.

18.
Chen HY, Xie H, Qian Y. Biometrics 2011;67(3):799–809.
Multiple imputation is a practically useful approach to handling incompletely observed data in statistical analysis. Parameter estimation and inference based on the imputed full data have been made easy by Rubin's rule for combining results. However, creating proper imputations that accommodate the flexible models used in statistical analysis can be very challenging in practice. We propose an imputation framework that uses conditional semiparametric odds ratio models to impute the missing values. The proposed framework is more flexible and robust than imputation based on the normal model, and, in contrast to the approach based on fully conditionally specified models, it is compatible. The proposed Markov chain Monte Carlo sampling algorithms for multiple imputation are straightforward to carry out. Simulation studies demonstrate that the proposed approach performs better than existing, commonly used imputation approaches. The approach is applied to imputing missing values in bone fracture data.

19.
We develop a nonparametric imputation technique to test for treatment effects in a nonparametric two-factor mixed model with incomplete data. Within each block, an arbitrary covariance structure of the repeated measurements is assumed without explicit parametrization of the joint multivariate distribution. The number of repeated measurements is uniformly bounded, whereas the number of blocks tends to infinity. The essential idea of the nonparametric imputation is to replace the unknown indicator functions of pairwise comparisons by the corresponding empirical distribution functions. The proposed method is valid under the missing completely at random (MCAR) mechanism. We apply the nonparametric imputation to Brunner and Dette's method for the nonparametric two-factor mixed model, and this extension results in a weighted partial rank transform statistic. The asymptotic relative efficiency of the nonparametric imputation method with complete versus incomplete data is derived to quantify the efficiency loss due to missing data. Monte Carlo simulation studies demonstrate the validity and power of the proposed method in comparison with other existing methods. A migraine severity score data set is analyzed to demonstrate the application of the proposed method to missing data.
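The essential device — substituting an empirical distribution function where a pairwise indicator cannot be evaluated — can be illustrated in a few lines (a bare illustration of the idea, not the full weighted rank statistic):

```python
import numpy as np

def ecdf(sample):
    """Empirical distribution function of the observed values."""
    s = np.sort(np.asarray(sample, dtype=float))
    return lambda t: np.searchsorted(s, t, side="right") / s.size

# Where the indicator 1{x <= y} cannot be evaluated because x is missing,
# use the ECDF of the observed x's evaluated at y instead.
x_observed = [2.1, 3.4, 1.8, 2.9]
F_hat = ecdf(x_observed)
contribution = F_hat(2.5)   # estimates P(X <= 2.5) in place of the indicator
```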

20.
Background

Population-based net survival by tumour stage at diagnosis is a key measure in cancer surveillance. Unfortunately, data on tumour stage are often missing for a non-negligible proportion of patients, and the mechanism giving rise to the missingness is usually anything but completely at random. In this setting, restricting analysis to the subset of complete records typically gives biased results. Multiple imputation is a promising practical approach to the issues raised by the missing data, but its use in conjunction with the Pohar-Perme method for estimating net survival has not been formally evaluated.

Methods

We performed a resampling study using colorectal cancer population-based registry data to evaluate the ability of multiple imputation, used along with the Pohar-Perme method, to deliver unbiased estimates of stage-specific net survival and recover missing stage information. We created 1000 independent data sets, each containing 5000 patients. Stage data were then made missing at random under two scenarios (30% and 50% missingness).

Results

Complete records analysis showed substantial bias and poor confidence interval coverage. Across both scenarios our multiple imputation strategy virtually eliminated the bias and greatly improved confidence interval coverage.

Conclusions

In the presence of missing stage data, complete records analysis often gives severely biased results. We showed that combining multiple imputation with the Pohar-Perme estimator provides a valid practical approach to the estimation of stage-specific colorectal cancer net survival. As usual, when the percentage of missing data is high the results should be interpreted cautiously, and sensitivity analyses are recommended.
