Similar Articles
A total of 20 similar articles were retrieved (search time: 31 ms).
1.
Missing data is a common issue in research using observational studies to investigate the effect of treatments on health outcomes. When missingness occurs only in the covariates, a simple approach is to use missing indicators to handle the partially observed covariates. The missing indicator approach has been criticized for giving biased results in outcome regression. However, recent papers have suggested that the missing indicator approach can provide unbiased results in propensity score analysis under certain assumptions. We consider assumptions under which the missing indicator approach can provide valid inferences, namely: (1) no unmeasured confounding within missingness patterns; either (2a) covariate values of patients with missing data were conditionally independent of treatment or (2b) these values were conditionally independent of outcome; and (3) the outcome model is correctly specified: specifically, the true outcome model does not include interactions between missing indicators and fully observed covariates. We prove that, under these assumptions, the missing indicator approach with outcome regression can provide unbiased estimates of the average treatment effect. We use a simulation study to investigate the extent of bias in estimates of the treatment effect when the assumptions are violated, and we illustrate our findings using data from electronic health records. In conclusion, the missing indicator approach can provide valid inferences for outcome regression, but the plausibility of its assumptions must first be considered carefully.
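A minimal simulation sketch (hypothetical values, not the paper's data) of the setting analyzed above: assumption (2a) is made to hold by construction, treatment is confounded by x only when x is observed, and outcome regression on treatment, the zero-filled covariate, and the missing indicator recovers the average treatment effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                          # partially observed confounder
m = rng.random(n) < 0.3                         # missing indicator for x

# Treatment is confounded by x only when x is observed; when x is missing,
# treatment is assigned independently of x, so assumption (2a) holds.
p_treat = np.where(m, 0.5, 1 / (1 + np.exp(-x)))
a = rng.binomial(1, p_treat)
y = 1.0 * a + 2.0 * x + rng.normal(size=n)      # true average treatment effect = 1.0

# Outcome regression on treatment, zero-filled covariate, missing indicator
design = sm.add_constant(np.column_stack([a, np.where(m, 0.0, x), m]))
print(sm.OLS(y, design).fit().params[1])        # close to the true effect 1.0
```

Letting treatment depend on the unobserved covariate values (violating (2a) and (2b)) reintroduces bias, which is the scenario the paper's simulation study quantifies.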

2.
3.
Hairu Wang, Zhiping Lu, Yukun Liu. Biometrics, 2023, 79(2): 1268–1279
Missing data are frequently encountered in various disciplines and can be divided into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Valid statistical approaches to missing data depend crucially on correct identification of the underlying missingness mechanism. Although the problem of testing whether this mechanism is MCAR or MAR has been extensively studied, there has been very little research on testing MAR versus MNAR. A critical challenge that is faced when dealing with this problem is the issue of model identification under MNAR. In this paper, under a logistic model for the missing probability, we develop two score tests for the problem of whether the missingness mechanism is MAR or MNAR under a parametric model and a semiparametric location model on the regression function. The implementation of the score tests circumvents the identification issue as it requires only parameter estimation under the null MAR assumption. Our simulations and analysis of human immunodeficiency virus data show that the score tests have well-controlled type I errors and desirable powers.
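A sketch of the logistic missingness model that frames the MAR-versus-MNAR question (all coefficients are hypothetical): the response is missing with probability expit(α + βx + γy), so γ = 0 is the MAR null and γ ≠ 0 is MNAR. The snippet shows what is at stake, not the score tests themselves.

```python
import numpy as np

def simulate(gamma, n=100_000, seed=1):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 0.5 * x + rng.normal(size=n)          # true regression slope 0.5
    p_miss = 1 / (1 + np.exp(-(-1.0 + 0.8 * x + gamma * y)))
    y[rng.random(n) < p_miss] = np.nan              # gamma != 0: MNAR
    return x, y

for label, gamma in [("MAR ", 0.0), ("MNAR", 1.5)]:
    x, y = simulate(gamma)
    keep = ~np.isnan(y)
    print(label, "complete-case slope:", round(np.polyfit(x[keep], y[keep], 1)[0], 3))
```

Under MAR the complete-case slope stays near 0.5; under MNAR it is biased, which is why a test that requires estimation only under the null is attractive.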

4.
Yuan Y, Little RJ. Biometrics, 2009, 65(2): 487–496
Consider a meta-analysis of studies with varying proportions of patient-level missing data, and assume that each primary study has made certain missing data adjustments so that the reported estimates of treatment effect size and variance are valid. These estimates of treatment effects can be combined across studies by standard meta-analytic methods, employing a random-effects model to account for heterogeneity across studies. However, we note that a meta-analysis based on the standard random-effects model will lead to biased estimates when the attrition rates of primary studies depend on the size of the underlying study-level treatment effect. Though possibly ignorable within each study, such missing data are in fact not ignorable in a meta-analysis. We propose three methods to correct the bias resulting from such missing data in a meta-analysis: reweighting the DerSimonian–Laird estimate by the completion rate; incorporating the completion rate into a Bayesian random-effects model; and inference based on a Bayesian shared-parameter model that includes the completion rate. We illustrate these methods through a meta-analysis of 16 published randomized trials that examined combined pharmacotherapy and psychological treatment for depression.
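A compact sketch of the first correction, reweighting the DerSimonian–Laird estimate by the completion rate; the study values below are invented, and the exact reweighting scheme in the paper may differ.

```python
import numpy as np

def dersimonian_laird(est, var, w_extra=None):
    w = 1.0 / var                                   # fixed-effect weights
    q = np.sum(w * (est - np.sum(w * est) / w.sum()) ** 2)
    c = w.sum() - np.sum(w ** 2) / w.sum()
    tau2 = max(0.0, (q - (len(est) - 1)) / c)       # method-of-moments tau^2
    w_re = 1.0 / (var + tau2)                       # random-effects weights
    if w_extra is not None:
        w_re = w_re * w_extra                       # e.g., completion rates
    return np.sum(w_re * est) / w_re.sum()

effects = np.array([0.42, 0.30, 0.55, 0.18, 0.61])     # study effect sizes
variances = np.array([0.020, 0.030, 0.015, 0.050, 0.040])
completion = np.array([0.95, 0.70, 0.88, 0.60, 0.92])  # 1 - attrition rate

print(dersimonian_laird(effects, variances))               # standard DL
print(dersimonian_laird(effects, variances, completion))   # completion-reweighted
```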

5.
Responses of the halotolerant yeast Rhodotorula mucilaginosa YRH2 to salt stress were studied. Strain YRH2 was isolated from chemical industry park wastewater evaporation ponds that are characterized by large fluctuations in salinity and pH. Upon shift to high-salt medium, there is a shutdown of protein synthesis. Radiolabeling and separation of proteins from salt-stressed and non-stressed cells, followed by N-terminal sequencing and Western blotting, identified the down-regulated heat shock 70 proteins Ssb1/2p. Ssb's role in salt stress was examined in both R. mucilaginosa and S. cerevisiae, and we show that its response to salt stress and amino acid limitation is similar. Other proteins, such as the heat shock 70 protein Kar2p/BiP and protein disulfide isomerase, were strongly induced in response to a shift to high salt in R. mucilaginosa and reacted in a manner similar to the effect of tunicamycin, a known unfolded protein response inducer. In addition, by assaying carboxypeptidase Y, we showed that high-salt medium reduces the specific activity of the enzyme in R. mucilaginosa. It is suggested that the changes in the expression of the heat shock 70 proteins are part of a mechanism that alleviates the damaging effects of high salt on protein folding in the yeast Rhodotorula mucilaginosa.

6.
Imputation of high-density genotypes from low- or medium-density platforms is a promising way to enhance the efficiency of whole-genome selection programs at low cost. In this study, we compared the efficiency of three widely used imputation algorithms (fastPHASE, BEAGLE and findhap) using Chinese Holstein cattle with Illumina BovineSNP50 genotypes. A total of 2108 cattle were randomly divided into a reference population and a test population to evaluate the influence of the reference population size. Three bovine chromosomes, BTA1, 16 and 28, were used to represent large, medium and small chromosome sizes, respectively. We simulated different scenarios by randomly masking 20%, 40%, 80% and 95% of the single-nucleotide polymorphisms (SNPs) on each chromosome in the test population to mimic different SNP density panels. Illumina Bovine3K and Illumina BovineLD (6909 SNPs) information was also used. We found that the three methods showed comparable accuracy when the proportion of masked SNPs was low. However, the difference became larger when more SNPs were masked. BEAGLE performed the best and was the most robust, with imputation accuracies >90% in almost all situations. fastPHASE was affected by the proportion of masked SNPs, especially when the masked SNP rate was high. findhap ran the fastest, whereas its accuracies were lower than those of BEAGLE but higher than those of fastPHASE. In addition, enlarging the reference population improved the imputation accuracy for BEAGLE and findhap, but did not affect fastPHASE. Considering imputation accuracy and computational requirements, BEAGLE was found to be the most reliable for imputing genotypes from low- to high-density genotyping platforms.
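The masking-and-concordance bookkeeping used in such comparisons can be sketched as follows; a trivial per-SNP modal-genotype imputer stands in for fastPHASE/BEAGLE/findhap, so the numbers only illustrate the evaluation design, not the programs' accuracy.

```python
import numpy as np

rng = np.random.default_rng(2)
n_ind, n_snp = 500, 1_000
freq = rng.uniform(0.05, 0.5, size=n_snp)            # per-SNP allele frequencies
geno = rng.binomial(2, freq, size=(n_ind, n_snp))    # genotypes coded 0/1/2

mask_rate = 0.40                                     # mimic a lower-density panel
mask = rng.random(geno.shape) < mask_rate
masked = np.where(mask, -1, geno)                    # -1 marks a masked call

imputed = masked.copy()
for j in range(n_snp):                               # impute with per-SNP mode
    obs = masked[:, j][masked[:, j] >= 0]
    imputed[masked[:, j] < 0, j] = np.bincount(obs, minlength=3).argmax()

accuracy = (imputed[mask] == geno[mask]).mean()      # concordance at masked sites
print(f"imputation accuracy at {mask_rate:.0%} masking: {accuracy:.3f}")
```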

7.
For regression with covariates missing not at random where the missingness depends on the missing covariate values, complete-case (CC) analysis leads to consistent estimation when the missingness is independent of the response given all covariates, but it may not have the desired level of efficiency. We propose a general empirical likelihood framework to improve estimation efficiency over the CC analysis. We expand on methods in Bartlett et al. (2014, Biostatistics 15, 719–730) and Xie and Zhang (2017, Int J Biostat 13, 1–20) that improve efficiency by modeling the missingness probability conditional on the response and fully observed covariates by allowing the possibility of modeling other data distribution-related quantities. We also give guidelines on what quantities to model and demonstrate that our proposal has the potential to yield smaller biases than existing methods when the missingness probability model is incorrect. Simulation studies are presented, as well as an application to data collected from the US National Health and Nutrition Examination Survey.
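A quick sketch of the complete-case consistency condition described above: the covariate is missing not at random (missingness depends on the covariate itself), yet because missingness is independent of the response given covariates, the complete-case slope remains consistent. Values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)            # true slope 1.5
miss = rng.random(n) < 1 / (1 + np.exp(-x))       # P(x missing) rises with x: MNAR

keep = ~miss                                      # complete cases only
print("complete-case slope:", round(np.polyfit(x[keep], y[keep], 1)[0], 3))
```

The complete-case estimate is consistent here but discards a large share of the sample, which is the efficiency loss the empirical likelihood framework targets.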

8.
Unlike zero‐inflated Poisson regression, marginalized zero‐inflated Poisson (MZIP) models for counts with excess zeros provide estimates with direct interpretations for the overall effects of covariates on the marginal mean. In the presence of missing covariates, MZIP and many other count data models are ordinarily fitted using complete case analysis methods due to lack of appropriate statistical methods and software. This article presents an estimation method for MZIP models with missing covariates. The method, which is applicable to other missing data problems, is illustrated and compared with complete case analysis by using simulations and dental data on the caries preventive effects of a school‐based fluoride mouthrinse program.
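A sketch of the complete-case baseline the article improves upon, using the ordinary zero-inflated Poisson in statsmodels as a stand-in (a marginalized ZIP is not available there), with simulated data.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
structural_zero = rng.random(n) < 0.25                 # excess zeros
counts = np.where(structural_zero, 0, rng.poisson(np.exp(0.5 + 0.4 * x)))

x_missing = rng.random(n) < 0.2                        # covariate missingness
cc = ~x_missing                                        # complete cases only

exog = sm.add_constant(x[cc])
fit = ZeroInflatedPoisson(counts[cc], exog,
                          exog_infl=np.ones((cc.sum(), 1))).fit(disp=0)
print(fit.params)      # intercept-only inflation part plus the count model
```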

9.
Sample size calculations based on two-sample comparisons of slopes in repeated measurements have been reported by many investigators. In contrast, the literature has paid relatively little attention to the design and analysis of K-sample trials in repeated measurements studies where K is 3 or greater. Jung and Ahn (2003) derived a closed-form sample size formula for two-sample comparisons of slopes that takes into account the impact of missing data. We extend their method to compare K-sample slopes in repeated measurement studies using the generalized estimating equation (GEE) approach based on an independence working correlation structure. Because the sample size formula rests on asymptotic theory, we investigate its finite-sample performance. The proposed sample size formula is illustrated using a clinical trial example.
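One way to probe an asymptotic sample-size formula is the kind of simulation sketched below: empirical power for comparing K = 3 slopes with GEE under an independence working correlation and monotone dropout. All rates and effect sizes are hypothetical, and this is the simulation check, not the closed-form formula itself.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
K, n_per, times = 3, 60, np.arange(5.0)
slopes = np.array([0.0, 0.0, 0.3])                   # only group 3 differs

def one_trial():
    rows = []
    for g in range(K):
        for i in range(n_per):
            drop = rng.integers(2, len(times) + 1)   # monotone dropout time
            for t in times[:drop]:
                rows.append((g * n_per + i, g, t,
                             1.0 + slopes[g] * t + rng.normal()))
    arr = np.array(rows)
    ids, grp, t, y = arr[:, 0], arr[:, 1].astype(int), arr[:, 2], arr[:, 3]
    # intercept, time, and time-by-group contrasts (group 0 as reference)
    X = np.column_stack([np.ones_like(t), t] + [t * (grp == g) for g in range(1, K)])
    res = sm.GEE(y, X, groups=ids, cov_struct=sm.cov_struct.Independence()).fit()
    L = np.zeros((K - 1, X.shape[1]))                # Wald test: all slopes equal
    L[0, 2] = L[1, 3] = 1.0
    return res.wald_test(L, scalar=True).pvalue < 0.05

print("empirical power:", np.mean([one_trial() for _ in range(200)]))
```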

10.
Phylogenetic imputation has recently emerged as a potentially powerful tool for predicting missing data in functional trait datasets. As such, understanding the limitations of phylogenetic modelling in predicting trait values is critical if we are to use them in subsequent analyses. Previous studies have focused on the relationship between phylogenetic signal and clade-level prediction accuracy, yet variability in prediction accuracy among individual tips of phylogenies remains largely unexplored. Here, we used simulations of trait evolution along the branches of phylogenetic trees to show how the accuracy of phylogenetic imputations is influenced by the combined effects of 1) the amount of phylogenetic signal in the traits and 2) the branch length of the tips to be imputed. Specifically, we conducted cross-validation trials to estimate the variability in prediction accuracy among individual tips on the phylogenies (hereafter 'tip-level accuracy'). We found that under a Brownian motion model of evolution (BM, Pagel's λ = 1), tip-level accuracy rapidly decreased with increasing tip branch lengths, and only tips of approximately 10% or less of the total height of the trees showed consistently accurate predictions (i.e. cross-validation R-squared > 0.75). When phylogenetic signal was weak, the effect of tip branch length was reduced, becoming negligible for traits simulated with λ < 0.7, where accuracy was in any case low. Our study shows that variability in prediction accuracy among individual tips of the phylogeny should be considered when evaluating the reliability of phylogenetically imputed trait values. To address this challenge, we describe a Monte Carlo-based method that allows one to estimate the expected tip-level accuracy of phylogenetic predictions for continuous traits. Our approach identifies gaps in functional trait datasets for which phylogenetic imputation performs poorly, and will help ecologists to design more efficient trait collection campaigns by focusing resources on lineages whose trait values are more uncertain.
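The branch-length effect under Brownian motion can be seen in a toy two-tip calculation (this is not the paper's Monte Carlo procedure): for sister tips on a tree of height 1 that diverge at depth 1 − b, the trait covariance between sisters is 1 − b, and predicting one sister from the other degrades as the terminal branch length b grows.

```python
import numpy as np

rng = np.random.default_rng(6)
n_pairs = 20_000

for b in (0.05, 0.10, 0.30, 0.70):       # terminal branch length, tree height 1
    cov = np.array([[1.0, 1.0 - b],
                    [1.0 - b, 1.0]])     # BM covariance of a sister pair
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n_pairs)
    pred = (1.0 - b) * x[:, 1]           # conditional mean of a tip given its sister
    r2 = 1 - np.mean((x[:, 0] - pred) ** 2) / np.var(x[:, 0])
    print(f"tip branch length {b:.2f}: cross-validation R^2 = {r2:.2f}")
```

The expected R² in this toy setup is (1 − b)², consistent with the qualitative pattern reported above: only short terminal branches give consistently accurate imputations.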

11.
It is not uncommon for biological anthropologists to analyze incomplete bioarcheological or forensic skeletal specimens. As many quantitative multivariate analyses cannot handle incomplete data, missing data imputation or estimation is a common preprocessing practice for such data. Using William W. Howells' Craniometric Data Set and the Goldman Osteometric Data Set, we evaluated the performance of multiple popular statistical methods for imputing missing metric measurements. Results indicated that multiple imputation methods outperformed single imputation methods such as Bayesian principal component analysis (BPCA). Multiple imputation with Bayesian linear regression implemented in the R package norm2, the Expectation-Maximization (EM) with Bootstrapping algorithm implemented in Amelia, and the Predictive Mean Matching (PMM) method and several of its derivative linear regression models implemented in mice all perform well with regard to accuracy, robustness, and speed. Based on the findings of this study, we suggest a practical procedure for choosing appropriate imputation methods.
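A sketch of the mask-and-score comparison on simulated correlated measurements (not the Howells or Goldman data): hide a fraction of entries, impute with a single mean imputer versus sklearn's IterativeImputer (a MICE-style chained-regression method), and compare RMSE on the hidden entries.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(7)
n, p = 400, 8
latent = rng.normal(size=(n, 1))
X = latent + 0.5 * rng.normal(size=(n, p))       # correlated "measurements"

mask = rng.random(X.shape) < 0.15                # hide 15% of entries
X_missing = np.where(mask, np.nan, X)

for imputer in (SimpleImputer(strategy="mean"), IterativeImputer(random_state=0)):
    X_hat = imputer.fit_transform(X_missing)
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(type(imputer).__name__, "RMSE on masked entries:", round(rmse, 3))
```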

12.
Edgecombe, G.D. 2010. Palaeomorphology: fossils and the inference of cladistic relationships. Acta Zoologica (Stockholm) 91: 72–80.
Twenty years have passed since it was empirically demonstrated that the inclusion of extinct taxa could overturn a phylogenetic hypothesis formulated upon extant taxa alone, challenging Colin Patterson's bold conjecture that this phenomenon 'may be non-existent'. Suppositions and misconceptions about missing data, often couched in terms of 'wildcard taxa' and 'the missing data problem', continue to cloud the literature on the topic of fossils and phylogenetics. Comparisons of real data sets show that no a priori (or indeed a posteriori) decisions can be made about amounts of missing data and most properties of cladograms, and both simulated and real data sets demonstrate that even highly incomplete taxa can have an impact on relationships. The exclusion of fossils from phylogenetic analyses is neither theoretically nor empirically defensible.

13.
Multiple imputation has become a widely accepted technique to deal with the problem of incomplete data. Typically, imputation of missing values and the statistical analysis are performed separately. Therefore, the imputation model has to be consistent with the analysis model. If the data are analyzed with a mixture model, the parameter estimates are usually obtained iteratively. Thus, if the data are missing not at random, parameter estimation and treatment of missingness should be combined. We solve both problems by simultaneously imputing values using the data augmentation method and estimating parameters using the EM algorithm. This iterative procedure ensures that the missing values are properly imputed given the current parameter estimates. Properties of the parameter estimates were investigated in a simulation study. The results are illustrated using data from the National Health and Nutrition Examination Survey.
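A toy sketch of alternating data augmentation with an EM step, for a two-component bivariate Gaussian mixture whose second coordinate is sometimes missing. For brevity the component variances are fixed at 1 and the missingness is ignorable here; the paper's procedure additionally handles nonignorable missingness.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n = 2_000
z = rng.random(n) < 0.4                              # true mixing weight 0.4
mu_true = np.array([[0.0, 0.0], [3.0, 3.0]])
X = mu_true[z.astype(int)] + rng.normal(size=(n, 2))
miss = rng.random(n) < 0.3                           # x2 missing for these rows

pi, mu = 0.5, np.array([[-1.0, -1.0], [1.0, 1.0]])   # crude starting values
Xc = X.copy()
for _ in range(200):
    # data augmentation: redraw missing x2 from the current mixture given x1
    p1 = pi * norm.pdf(X[miss, 0], mu[1, 0], 1.0)
    p0 = (1 - pi) * norm.pdf(X[miss, 0], mu[0, 0], 1.0)
    k = rng.random(miss.sum()) < p1 / (p0 + p1)      # sampled component label
    Xc[miss, 1] = mu[k.astype(int), 1] + rng.normal(size=miss.sum())
    # EM step on the completed data (unit variances assumed known)
    d0 = np.exp(-0.5 * ((Xc - mu[0]) ** 2).sum(axis=1))
    d1 = np.exp(-0.5 * ((Xc - mu[1]) ** 2).sum(axis=1))
    r = pi * d1 / (pi * d1 + (1 - pi) * d0)          # responsibilities
    pi = r.mean()
    mu = np.array([(1 - r) @ Xc / (1 - r).sum(), r @ Xc / r.sum()])

print("mixing weight:", round(pi, 3), "component means:", mu.round(2))
```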

14.
15.
Green PE, Park T. Biometrics, 2003, 59(4): 886–896
Log-linear models have been shown to be useful for smoothing contingency tables when categorical outcomes are subject to nonignorable nonresponse. A log-linear model can be fit to an augmented data table that includes an indicator variable designating whether subjects are respondents or nonrespondents. Maximum likelihood estimates calculated from the augmented data table are known to suffer from instability due to boundary solutions. Park and Brown (1994, Journal of the American Statistical Association 89, 44-52) and Park (1998, Biometrics 54, 1579-1590) developed empirical Bayes models that tend to smooth estimates away from the boundary. In those approaches, estimates for nonrespondents were calculated using an EM algorithm by maximizing a posterior distribution. As an extension of their earlier work, we develop a Bayesian hierarchical model that incorporates a log-linear model in the prior specification. In addition, due to uncertainty in the variable selection process associated with just one log-linear model, we simultaneously consider a finite number of models using a stochastic search variable selection (SSVS) procedure due to George and McCulloch (1997, Statistica Sinica 7, 339-373). The integration of the SSVS procedure into a Markov chain Monte Carlo (MCMC) sampler is straightforward, and leads to estimates of cell frequencies for the nonrespondents that are averages resulting from several log-linear models. The methods are demonstrated with a data example involving serum creatinine levels of patients who survived renal transplants. A simulation study is conducted to investigate properties of the model.
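A sketch of the EM-on-an-augmented-table idea with invented counts: nonrespondents contribute only their margin by x, and an ignorable log-linear model (x, y, r, x·y) is refit on EM-completed counts. The Bayesian hierarchy and SSVS model averaging of the article go well beyond this snippet.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # augmented table cells: x, y, r
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
r = np.array([1, 0, 1, 0, 1, 0, 1, 0])          # r=0: nonrespondent (y unknown)
X = np.column_stack([np.ones(8), x, y, r, x * y])

resp_counts = {(0, 0): 52, (0, 1): 38, (1, 0): 27, (1, 1): 61}  # r=1 cells
nonresp_margin = {0: 14, 1: 22}                                 # r=0, margins by x

counts = np.array([resp_counts[(xi, yi)] if ri else nonresp_margin[xi] / 2
                   for xi, yi, ri in zip(x, y, r)], dtype=float)

for _ in range(50):
    m = sm.GLM(counts, X, family=sm.families.Poisson()).fit().fittedvalues
    for xi in (0, 1):            # E-step: reallocate the nonrespondent margin over y
        idx = (x == xi) & (r == 0)
        counts[idx] = nonresp_margin[xi] * m[idx] / m[idx].sum()

print(np.round(counts[r == 0], 1))   # EM-completed nonrespondent cells
```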

16.
Lei Xu, Jun Shao. Biometrics, 2009, 65(4): 1175–1183
In studies with longitudinal or panel data, missing responses often depend on values of responses through a subject-level unobserved random effect. Besides the likelihood approach based on parametric models, there exists a semiparametric method, the approximate conditional model (ACM) approach, which relies on the availability of a summary statistic and a linear or polynomial approximation to some random effects. However, two important issues must be addressed in applying ACM. The first is how to find a summary statistic, and the second is how to estimate the parameters in the original model using estimates of parameters in ACM. This study addresses both issues. For the first, we derive summary statistics under various situations. For the second, we propose a grouping method instead of a linear or polynomial approximation to the random effects. Because the grouping method is a moment-based approach, the conditions we assume in deriving summary statistics are weaker than the existing ones in the literature. When the derived summary statistic is continuous, we propose using a classification tree method to obtain an approximate summary statistic for grouping. Some simulation results are presented to study the finite-sample performance of the proposed method. An application is illustrated using data from the Modification of Diet in Renal Disease study.
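One simple way to realize the grouping step is sketched below with invented data: compute a continuous subject-level summary statistic (here the mean of a subject's observed responses, a stand-in for the statistics derived in the paper) and let the leaves of a shallow regression tree define the groups.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(9)
n_subj, n_times = 500, 6
b = rng.normal(size=n_subj)                          # subject-level random effect
y = 1.0 + b[:, None] + rng.normal(size=(n_subj, n_times))
p_obs = 1 / (1 + np.exp(-(0.5 + b[:, None])))        # response prob. depends on b
obs = rng.random((n_subj, n_times)) < p_obs

any_obs = obs.any(axis=1)                            # subjects with >= 1 response
s = (np.where(obs, y, 0.0).sum(axis=1)[any_obs]
     / obs.sum(axis=1)[any_obs])                     # continuous summary statistic

tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(s.reshape(-1, 1), s)
groups = tree.apply(s.reshape(-1, 1))                # leaf id = group label
print(np.unique(groups, return_counts=True))
```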

17.
We focus on the problem of generalizing a causal effect estimated on a randomized controlled trial (RCT) to a target population described by a set of covariates from observational data. Available methods, such as inverse propensity sampling weighting, are not designed to handle missing values, which are nevertheless common in both data sources. Beyond coupling the assumptions for causal effect identifiability with those for the missing values mechanism and defining appropriate estimation strategies, one difficulty is the specific structure of the data: two sources, with treatment and outcome available only in the RCT. We propose three multiple imputation strategies to handle missing values when generalizing treatment effects, each handling the multisource structure of the problem differently (separate imputation, joint imputation with fixed effect, joint imputation ignoring source information). As an alternative to multiple imputation, we also propose a direct estimation approach that treats incomplete covariates as semidiscrete variables. The multiple imputation strategies and the latter alternative rely on different sets of assumptions concerning the impact of missing values on identifiability. We discuss these assumptions and assess the methods through an extensive simulation study. This work is motivated by the analysis of a large registry of over 20,000 major trauma patients and an RCT studying the effect of tranexamic acid administration on mortality in major trauma patients admitted to intensive care units. The analysis illustrates how the handling of missing values can affect the conclusion about the effect generalized from the RCT to the target population.
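A sketch of the complete-data version of inverse propensity of sampling weighting (IPSW), against which the imputation strategies are needed: fit P(in RCT | X) on the stacked data, weight RCT subjects by the odds of belonging to the observational sample, and take a weighted difference of arm means. All values are hypothetical, and covariates are fully observed here, which is exactly what the paper does not assume.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
n_rct, n_obs = 2_000, 10_000
x_rct = rng.normal(loc=1.0, size=n_rct)      # RCT enrolls higher-x patients
x_obs = rng.normal(loc=0.0, size=n_obs)      # target population covariates
a = rng.binomial(1, 0.5, n_rct)              # randomized treatment
y = (1.0 + 0.8 * x_rct) * a + x_rct + rng.normal(size=n_rct)  # effect varies with x

S = np.r_[np.ones(n_rct), np.zeros(n_obs)]   # source indicator: 1 = RCT
X = np.r_[x_rct, x_obs].reshape(-1, 1)
p = LogisticRegression().fit(X, S).predict_proba(X[:n_rct])[:, 1]
w = (1 - p) / p                              # retarget RCT to the population

tau = (np.sum(w * a * y) / np.sum(w * a)
       - np.sum(w * (1 - a) * y) / np.sum(w * (1 - a)))
print("IPSW estimate:", round(tau, 2), "| true target effect: 1.0")
```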

18.
19.
He Cuitong, Zhang Yao, Jiang Ying, Xu Ping. Chinese Journal of Biotechnology, 2018, 34(11): 1860–1869
Small proteins (SPs) are polypeptides of fewer than 100 amino acids encoded by short open reading frames (sORFs). Small proteins have been found to participate in important biological processes such as the regulation of gene expression, cell signal transduction, and metabolism. However, most annotated small proteins still lack experimental evidence of their existence at the protein level and are therefore termed missing proteins (MPs). Efficient identification of small proteins is a prerequisite for studying their functions and also helps to uncover missing proteins. Using a small-protein enrichment strategy, we identified 72 yeast small proteins and verified 9 missing proteins. We found that low molecular weight, high hydrophobicity, membrane association, weak codon usage bias, and protein instability are the main reasons why proteins go undetected, which provides guidance for further technical optimization.

20.
A "gold" standard test, providing definitive verification of disease status, may be quite invasive or expensive. Current technological advances provide less invasive, or less expensive, diagnostic tests. Ideally, a diagnostic test is evaluated by comparing it with a definitive gold standard test. However, the decision to perform the gold standard test to establish the presence or absence of disease is often influenced by the results of the diagnostic test, along with other measured, or unmeasured, risk factors. If only data from patients who received the gold standard test are used to assess test performance, the commonly used measures of diagnostic test performance, sensitivity and specificity, are likely to be biased: sensitivity would often be higher, and specificity lower, than the true values. This bias is called verification bias. Without adjustment for verification bias, one may introduce into medical practice a diagnostic test with apparently, but not truly, high sensitivity. In this article, verification bias is treated as a missing covariate problem. We propose a flexible modeling and computational framework for evaluating the performance of a diagnostic test, with adjustment for nonignorable verification bias. The presented computational method can be utilized with any software that can repetitively use a logistic regression module. The approach is likelihood-based and allows the use of categorical or continuous covariates. An explicit formula for the observed information matrix is presented, so that one can easily compute standard errors of estimated parameters. The methodology is illustrated with a cardiology data example. We perform a sensitivity analysis of the dependency of the verification selection process on disease.
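The mechanics of verification bias, and the classic correction when verification is ignorable given the test result (in the spirit of Begg and Greenes; the article's framework goes further and handles nonignorable verification), can be sketched with simulated data:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200_000
d = rng.random(n) < 0.10                         # true disease status
t = rng.random(n) < np.where(d, 0.85, 0.10)      # test: sens 0.85, spec 0.90
v = rng.random(n) < np.where(t, 0.90, 0.15)      # positives verified far more often

naive_sens = (t & v & d).sum() / (v & d).sum()   # verified patients only: inflated
p_d_pos = (d & v & t).sum() / (v & t).sum()      # P(D+ | T+), valid among verified
p_d_neg = (d & v & ~t).sum() / (v & ~t).sum()    # P(D+ | T-)
p_t = t.mean()                                   # marginal P(T+), uses everyone
corrected = p_d_pos * p_t / (p_d_pos * p_t + p_d_neg * (1 - p_t))
print(f"naive sensitivity: {naive_sens:.3f}, corrected: {corrected:.3f}, truth: 0.85")
```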

