Similar Articles (20 results)
1.
This paper deals with a Cox proportional hazards regression model, where some covariates of interest are randomly right‐censored. While methods for censored outcomes have become ubiquitous in the literature, methods for censored covariates have thus far received little attention and, for the most part, dealt with the issue of limit‐of‐detection. For randomly censored covariates, an often‐used method is the inefficient complete‐case analysis (CCA), which consists of deleting censored observations from the data analysis. When censoring is not completely independent, the CCA leads to biased and spurious results. Methods for missing covariate data, including type I and type II covariate censoring as well as limit‐of‐detection, do not readily apply due to the fundamentally different nature of randomly censored covariates. We develop a novel method for censored covariates, using conditional mean imputation based on either Kaplan–Meier estimates or a Cox proportional hazards model, to estimate the effects of these covariates on a time‐to‐event outcome. We evaluate the performance of the proposed method through simulation studies and show that it provides good bias reduction and statistical efficiency. Finally, we illustrate the method using data from the Framingham Heart Study to assess the relationship between offspring and parental age of onset of cardiovascular events.
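As a rough sketch of the conditional mean idea (not the authors' implementation; the function names are hypothetical), a censored covariate value c can be replaced by E[X | X > c] computed from a hand-rolled Kaplan–Meier estimate:

```python
def kaplan_meier(values, observed):
    """Return event times and Kaplan-Meier survival estimates S(t).

    values: observed covariate values (events or censoring times)
    observed: 1 if the value was observed (event), 0 if right-censored
    """
    pairs = sorted(zip(values, observed))
    at_risk = len(pairs)
    surv = 1.0
    times, s = [], []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        d = c = 0                      # events and censorings tied at time t
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                d += 1
            else:
                c += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk  # KM product-limit step
            times.append(t)
            s.append(surv)
        at_risk -= d + c
    return times, s

def conditional_mean_impute(c, values, observed):
    """E[X | X > c] under the KM estimate; falls back to c if no mass beyond c."""
    times, s = kaplan_meier(values, observed)
    mass, prev = [], 1.0
    for sj in s:                       # discrete KM mass at each event time
        mass.append(prev - sj)
        prev = sj
    num = sum(t * m for t, m in zip(times, mass) if t > c)
    den = sum(m for t, m in zip(times, mass) if t > c)
    return num / den if den > 0 else c
```

With no censoring the KM estimate reduces to the empirical distribution, so imputing at c = 2.5 from the sample {1, ..., 5} gives the mean of {3, 4, 5}.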

2.
Data with missing covariate values but fully observed binary outcomes are an important subset of the missing data challenge. Common approaches are complete case analysis (CCA) and multiple imputation (MI). While CCA relies on missing completely at random (MCAR), MI usually relies on a missing at random (MAR) assumption to produce unbiased results. For MI involving logistic regression models, it is also important to consider several missing not at random (MNAR) conditions under which CCA is asymptotically unbiased and, as we show, MI is also valid in some cases. We use a data application and simulation study to compare the performance of several machine learning and parametric MI methods under a fully conditional specification framework (MI-FCS). Our simulation includes five scenarios involving MCAR, MAR, and MNAR under predictable and nonpredictable conditions, where “predictable” indicates missingness is not associated with the outcome. We build on previous results in the literature to show MI and CCA can both produce unbiased results under more conditions than some analysts may realize. When both approaches were valid, we found that MI-FCS was at least as good as CCA in terms of estimated bias and coverage, and was superior when missingness involved a categorical covariate. We also demonstrate how MNAR sensitivity analysis can build confidence that unbiased results were obtained, including under MNAR-predictable, when CCA and MI are both valid. Since the missingness mechanism cannot be identified from observed data, investigators should compare results from MI and CCA when both are plausibly valid, followed by MNAR sensitivity analysis.
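Why CCA needs something close to MCAR can be seen in a toy simulation, unrelated to the paper's actual study: under MCAR the complete cases stay representative of the full sample, while value-dependent (MNAR-style) missingness shifts them.

```python
import random
from statistics import mean

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(20000)]   # true mean 0

# MCAR: each value missing with a fixed 30% probability, independent of x
cc_mcar = [xi for xi in x if random.random() > 0.3]

# MNAR: large values go missing more often, so the complete cases skew low
cc_mnar = [xi for xi in x if random.random() > (0.7 if xi > 0 else 0.1)]

print(mean(cc_mcar))   # near 0: complete cases still representative
print(mean(cc_mnar))   # clearly negative: complete-case mean is biased
```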

3.
Unlike zero‐inflated Poisson regression, marginalized zero‐inflated Poisson (MZIP) models for counts with excess zeros provide estimates with direct interpretations for the overall effects of covariates on the marginal mean. In the presence of missing covariates, MZIP and many other count data models are ordinarily fitted using complete case analysis methods due to lack of appropriate statistical methods and software. This article presents an estimation method for MZIP models with missing covariates. The method, which is applicable to other missing data problems, is illustrated and compared with complete case analysis by using simulations and dental data on the caries preventive effects of a school‐based fluoride mouthrinse program.
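The "marginalized" in MZIP refers to the overall mean E[Y] = (1 - psi) * mu of a zero-inflated Poisson, rather than the mean of the Poisson component alone. A quick simulation check with assumed parameter values:

```python
import math
import random

random.seed(1)

def rpois(lam):
    """Knuth's algorithm for a Poisson draw (fine for small lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

psi, mu = 0.3, 2.0   # structural-zero probability and Poisson-component mean
draws = [0 if random.random() < psi else rpois(mu) for _ in range(50000)]

# The marginal mean targeted by MZIP covariate effects: (1 - psi) * mu = 1.4
marginal_mean = sum(draws) / len(draws)
```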

4.
Lu Chen, Li Hsu, Kathleen Malone. Biometrics, 2009, 65(4): 1105–1114
The population‐based case–control study design is perhaps the most commonly used design for investigating the genetic and environmental contributions to disease risk in epidemiological studies. Ages at onset and disease status of family members are routinely and systematically collected from the participants in this design. Considering age at onset in relatives as an outcome, this article focuses on using the family history information to obtain the hazard function, i.e., the age‐dependent penetrance function, of candidate genes from case–control studies. A frailty‐model‐based approach is proposed to accommodate the shared risk among family members that is not accounted for by observed risk factors. This approach is further extended to accommodate missing genotypes in family members and a two‐phase case–control sampling design. Simulation results show that the proposed method performs well in realistic settings. Finally, a population‐based two‐phase case–control breast cancer study of the BRCA1 gene is used to illustrate the method.

5.
Combining data collected from different sources can potentially enhance statistical efficiency in estimating effects of environmental or genetic factors or gene–environment interactions. However, combining data across studies becomes complicated when data are collected under different study designs, such as family‐based and unrelated individual‐based case–control designs. In this article, we describe likelihood‐based approaches that permit the joint estimation of covariate effects on disease risk under study designs that include cases, relatives of cases, and unrelated individuals. Our methods accommodate familial residual correlation and a variety of ascertainment schemes. Extensive simulation experiments demonstrate that the proposed methods for estimation and inference perform well in realistic settings. Efficiencies of different designs are contrasted in the simulation. We applied the methods to data from the Colorectal Cancer Family Registry.

6.
7.
Zhang N, Little RJ. Biometrics, 2012, 68(3): 933–942
We consider the linear regression of outcome Y on regressors W and Z with some values of W missing, when our main interest is the effect of Z on Y, controlling for W. Three common approaches to regression with missing covariates are (i) complete‐case analysis (CC), which discards the incomplete cases; (ii) ignorable likelihood methods, which base inference on the likelihood of the observed data, assuming the missing data are missing at random (Rubin, 1976b); and (iii) nonignorable modeling, which posits a joint distribution of the variables and missing data indicators. Another simple practical approach that has not received much theoretical attention is to drop the regressor variables containing missing values from the regression modeling (DV, for drop variables). DV does not lead to bias when either (i) the regression coefficient of W is zero or (ii) W and Z are uncorrelated. We propose a pseudo‐Bayesian approach for regression with missing covariates that compromises between the CC and DV estimates, exploiting information in the incomplete cases when the data support DV assumptions. We illustrate favorable properties of the method by simulation, and apply the proposed method to a liver cancer study. Extension of the method to more than one missing covariate is also discussed.
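Condition (ii), that dropping W is harmless when W and Z are uncorrelated, is easy to verify numerically. The sketch below is illustrative only and is not the authors' pseudo-Bayesian method:

```python
import random

random.seed(2)
n = 20000
w = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]   # drawn independently of w
y = [1.5 * wi + 2.0 * zi + random.gauss(0, 1) for wi, zi in zip(w, z)]

# "Drop variables" fit: simple regression of y on z alone, discarding w
zbar = sum(z) / n
ybar = sum(y) / n
slope = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / \
        sum((zi - zbar) ** 2 for zi in z)
# With w independent of z, the slope still targets the true coefficient 2.0;
# the dropped term 1.5 * w simply inflates the residual variance.
```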

8.
With advances in modern medicine and clinical diagnosis, case–control data with characterization of finer subtypes of cases are often available. In matched case–control studies, missingness in exposure values often leads to deletion of the entire stratum, and thus entails a significant loss of information. When subtypes of cases are treated as categorical outcomes, the data are further stratified and deletion of observations becomes even more expensive in terms of precision of the category‐specific odds‐ratio parameters, especially under the multinomial logit model. The stereotype regression model for categorical responses lies intermediate between the proportional odds model and the multinomial or baseline category logit model. The use of this class of models has been limited, as the structure of the model implies certain inferential challenges with nonidentifiability and nonlinearity in the parameters. We illustrate how to handle missing data in matched case–control studies with finer disease subclassification within the cases under a stereotype regression model. We present both a Monte Carlo‐based full Bayesian approach and an expectation/conditional maximization algorithm for the estimation of model parameters in the presence of a completely general missingness mechanism. We illustrate our methods by using data from an ongoing matched case–control study of colorectal cancer. Simulation results are presented under various missing data mechanisms and departures from modeling assumptions.

9.
Open‐circuit voltages of lead‐halide perovskite solar cells are improving rapidly and are approaching the thermodynamic limit. Since many different perovskite compositions with different bandgap energies are actively being investigated, it is not straightforward to compare the open‐circuit voltages between these devices as long as a consistent method of referencing is missing. For the purpose of comparing open‐circuit voltages and identifying outstanding values, it is imperative to use a unique, generally accepted way of calculating the thermodynamic limit, which is currently not the case. Here a meta‐analysis of methods to determine the bandgap and a radiative limit for open‐circuit voltage is presented. The differences between the methods are analyzed, and an easily applicable approach based on the solar cell quantum efficiency as a general reference is proposed.
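As a hedged illustration of such a radiative-limit reference, the radiative open-circuit voltage for an idealized step-function quantum efficiency at an assumed bandgap can be computed by detailed balance against blackbody photon fluxes. The dilution factor, temperatures, and bandgap below are assumptions for the sketch, not values from the article:

```python
import math

# Assumed constants (not from the article)
k = 8.617e-5              # Boltzmann constant, eV/K
T_sun, T_cell = 5778.0, 300.0
f_sun = 2.165e-5          # solid-angle dilution factor of the sun
Eg = 1.6                  # example bandgap in eV, typical of wide-gap perovskites

def photon_flux_above(Eg, T):
    """Blackbody photon flux above Eg at temperature T (common prefactors cancel)."""
    s, dE, E = 0.0, 1e-3, Eg
    while E < Eg + 2.0:   # the tail more than ~2 eV above the gap is negligible
        s += E * E / math.expm1(E / (k * T)) * dE
        E += dE
    return s

J_sc = f_sun * photon_flux_above(Eg, T_sun)   # absorbed flux for a step EQE at Eg
J0_rad = photon_flux_above(Eg, T_cell)        # radiative recombination flux
V_oc_rad = k * T_cell * math.log(J_sc / J0_rad + 1.0)   # radiative-limit Voc, volts
```

For a 1.6 eV gap this lands a few hundred millivolts below the bandgap, which is the kind of reference value measured devices are compared against.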

10.
Ecological data sets often record the abundance of species, together with a set of explanatory variables. Multivariate statistical methods are optimal to analyze such data and are thus frequently used in ecology for exploration, visualization, and inference. Most approaches are based on pairwise distance matrices instead of the sites‐by‐species matrix, which stands in stark contrast to univariate statistics, where data models, assuming specific distributions, are the norm. However, through advances in statistical theory and computational power, models for multivariate data have gained traction. Systematic simulation‐based performance evaluations of these methods are important as guides for practitioners but still lacking. Here, we compare two model‐based methods, multivariate generalized linear models (MvGLMs) and constrained quadratic ordination (CQO), with two distance‐based methods, distance‐based redundancy analysis (dbRDA) and canonical correspondence analysis (CCA). We studied the performance of the methods to discriminate between causal variables and noise variables for 190 simulated data sets covering different sample sizes and data distributions. MvGLM and dbRDA differentiated accurately between causal and noise variables. The former had the lowest false‐positive rate (0.008), while the latter had the lowest false‐negative rate (0.027). CQO and CCA had the highest false‐negative rate (0.291) and false‐positive rate (0.256), respectively, where these error rates were typically high for data sets with linear responses. Our study shows that both model‐ and distance‐based methods have their place in the ecologist's statistical toolbox. MvGLM and dbRDA are reliable for analyzing species–environment relations, whereas both CQO and CCA exhibited considerable flaws, especially with linear environmental gradients.

11.
Preprocessing high‐dimensional censored datasets, such as microarray data, is generally considered an important technique for gaining stability by reducing potential noise in the data. When variable selection, including inference, is carried out with high‐dimensional censored data, the objective is to obtain a smaller subset of variables and then perform the inferential analysis using model estimates based on the selected subset of variables. This two‐stage inferential analysis is prone to circularity bias because of the noise that might still remain in the dataset. In this work, I propose an adaptive preprocessing technique that uses the sure independence screening (SIS) idea to accomplish variable selection and reduce the circularity bias; it is combined with several well‐known refined high‐dimensional methods, such as the elastic net, adaptive elastic net, weighted elastic net, elastic net‐AFT, and two greedy variable selection methods known as TCS and PC‐simple, all implemented with accelerated lifetime models. The proposed technique addresses several features, including the issue of collinearity between important and some unimportant covariates, which is often the case in the high‐dimensional setting under a variable selection framework, and different levels of censoring. Simulation studies, along with an empirical analysis of a real microarray dataset on mantle cell lymphoma, are carried out to demonstrate the performance of the adaptive preprocessing technique.
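At its core, sure independence screening is a marginal-correlation ranking: keep the d covariates most correlated with the outcome before any refined fitting. A self-contained sketch (function names hypothetical, and much simpler than the paper's adaptive, censoring-aware procedure):

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def sis_screen(X, y, d):
    """Keep the d columns of X with the largest |marginal correlation| with y."""
    scores = [(abs(pearson(col, y)), j) for j, col in enumerate(X)]
    return sorted(j for _, j in sorted(scores, reverse=True)[:d])

random.seed(3)
n, p = 500, 20
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]   # p columns
# Only columns 0 and 1 carry signal; the rest are noise
y = [2 * X[0][i] - 3 * X[1][i] + random.gauss(0, 0.5) for i in range(n)]
```

With strong signals the screen reliably retains the two active columns, after which a refined method (elastic net, etc.) would be fitted on the reduced set.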

12.
We develop a nonparametric imputation technique to test for treatment effects in a nonparametric two-factor mixed model with incomplete data. Within each block, an arbitrary covariance structure of the repeated measurements is assumed without explicit parametrization of the joint multivariate distribution. The number of repeated measurements is uniformly bounded, whereas the number of blocks tends to infinity. The essential idea of the nonparametric imputation is to replace the unknown indicator functions of pairwise comparisons by the corresponding empirical distribution functions. The proposed nonparametric imputation method is valid under the missing completely at random (MCAR) mechanism. We apply the nonparametric imputation to Brunner and Dette's method for the nonparametric two-factor mixed model, and this extension results in a weighted partial rank transform statistic. The asymptotic relative efficiency of the nonparametric imputation method with complete versus incomplete data is derived to quantify the efficiency loss due to the missing data. Monte Carlo simulation studies are conducted to demonstrate the validity and power of the proposed method in comparison with other existing methods. A migraine severity score data set is analyzed to demonstrate the application of the proposed method in the analysis of missing data.
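The imputation device, replacing pairwise indicator functions with empirical distribution functions, builds on the relative effect p = P(X < Y) + 0.5 P(X = Y). A minimal complete-data estimator of that quantity (illustrative only, not the weighted partial rank transform statistic itself):

```python
def relative_effect(x, y):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) by averaging pairwise indicators,
    which equals averaging the mid-rank EDF of x evaluated at the y values."""
    total = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                total += 1.0
            elif xi == yj:
                total += 0.5   # ties get half weight (normalized EDF)
    return total / (len(x) * len(y))
```

Values near 0.5 indicate no tendency for one group to exceed the other; with missing data, the paper's method substitutes EDFs estimated from the observed comparisons.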

13.
Diagonal discriminant rules have been successfully used for high‐dimensional classification problems, but suffer from the serious drawback of biased discriminant scores. In this article, we propose improved diagonal discriminant rules with bias‐corrected discriminant scores for high‐dimensional classification. We show that the proposed discriminant scores dominate the standard ones under the quadratic loss function. Analytical results on why the bias‐corrected rules can potentially improve the prediction accuracy are also provided. Finally, we demonstrate the improvement of the proposed rules over the original ones through extensive simulation studies and real case studies.
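For context, a plain diagonal discriminant rule, the baseline the bias-corrected scores improve upon, can be sketched as follows (made-up data, not the article's corrected rules):

```python
def dlda_score(x, mean_k, var):
    """Diagonal LDA discriminant score: squared distance to the class mean,
    scaled by a diagonal (feature-wise) variance; smaller is better."""
    return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean_k, var))

def classify(x, class_means, var):
    """Assign x to the class with the smallest diagonal discriminant score."""
    scores = [dlda_score(x, m, var) for m in class_means]
    return scores.index(min(scores))

means = [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]]   # two classes, three features
var = [1.0, 1.0, 1.0]
label = classify([2.8, 3.2, 2.9], means, var)   # point near the second class
```

In high dimensions the plug-in means and variances make these scores biased, which is exactly what the proposed correction targets.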

14.
In estimation of the ROC curve, when the true disease status is subject to nonignorable missingness, the observed likelihood involves the missing mechanism given by a selection model. In this article, we proposed a likelihood‐based approach to estimate the ROC curve and the area under the ROC curve when the verification bias is nonignorable. We specified a parametric disease model in order to make the nonignorable selection model identifiable. With the estimated verification and disease probabilities, we constructed four types of empirical estimates of the ROC curve and its area based on imputation and reweighting methods. In practice, a reasonably large sample size is required to estimate the nonignorable selection model in our settings. Simulation studies showed that all four estimators of ROC area performed well, and imputation estimators were generally more efficient than the other estimators proposed. We applied the proposed method to a data set from research in Alzheimer's disease.
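The imputation and reweighting estimators build on the empirical AUC, a (possibly weighted) proportion of concordant diseased versus non-diseased score pairs. In a verification-bias setting the weights would be estimated inverse verification probabilities; the sketch below just shows the weighted Mann-Whitney form:

```python
def weighted_auc(scores_d, w_d, scores_n, w_n):
    """Weighted Mann-Whitney estimate of the area under the ROC curve.

    scores_d, w_d: test scores and weights for diseased subjects
    scores_n, w_n: test scores and weights for non-diseased subjects
    """
    num = 0.0
    for sd, wd in zip(scores_d, w_d):
        for sn, wn in zip(scores_n, w_n):
            if sd > sn:
                num += wd * wn          # concordant pair
            elif sd == sn:
                num += 0.5 * wd * wn    # tie counts half
    return num / (sum(w_d) * sum(w_n))
```

With unit weights this is the usual empirical AUC; plugging in inverse verification probabilities for verified subjects gives the reweighting flavor of estimator.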

15.
Diagnostic or screening tests are widely used in medical fields to classify patients according to their disease status. Several statistical models for meta‐analysis of diagnostic test accuracy studies have been developed to synthesize test sensitivity and specificity of a diagnostic test of interest. Because of the correlation between test sensitivity and specificity, modeling the two measures using a bivariate model is recommended. In this paper, we extend the current standard bivariate linear mixed model (LMM) by proposing two variance‐stabilizing transformations: the arcsine square root and the Freeman–Tukey double arcsine transformation. We compared the performance of the proposed methods with the standard method through simulations using several performance measures. The simulation results showed that our proposed methods performed better than the standard LMM in terms of bias, root mean square error, and coverage probability in most of the scenarios, even when data were generated assuming the standard LMM. We also illustrated the methods using two real data sets.
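For reference, the Freeman–Tukey double arcsine transform of x events out of n, with its usual approximate variance (a sketch of the transformation only, not the authors' mixed-model code):

```python
import math

def ft_double_arcsine(x, n):
    """Freeman-Tukey double arcsine transform of x events out of n.

    Returns the transformed value and its approximate variance 1 / (n + 0.5).
    The transform stabilizes the variance of a binomial proportion.
    """
    t = math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))
    return t, 1.0 / (n + 0.5)
```

The transform is monotone in x, symmetric about n/2 (values for x and n - x sum to pi), and its variance depends only on n, which is what makes it attractive inside a linear mixed model.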

16.
Species distribution modelling (SDM) has become an essential method in ecology and conservation. In the absence of survey data, the majority of SDMs are calibrated with opportunistic presence‐only data, incurring substantial sampling bias. We address the challenge of correcting for sampling bias in data‐sparse situations. We modelled the relative intensity of bat records across their entire range using three modelling algorithms under the point‐process modelling framework (GLMs with subset selection, GLMs fitted with an elastic‐net penalty, and Maxent). To correct for sampling bias, we applied model‐based bias correction by incorporating spatial information on site accessibility or sampling effort. We evaluated the effect of bias correction on the models' predictive performance (AUC and TSS), calculated with spatial‐block cross‐validation and a holdout data set. When evaluated with independent, but also sampling‐biased, test data, correction for sampling bias led to improved predictions. The predictive performance of the three modelling algorithms was very similar. Elastic‐net models had intermediate performance, with a slight advantage for GLMs on cross‐validation and Maxent on holdout evaluation. Model‐based bias correction is very useful in data‐sparse situations, where detailed data are not available to apply other bias correction methods. However, bias correction success depends on how well the selected bias variables describe the sources of bias. In this study, accessibility covariates described bias in our data better than the effort covariate, and their use led to larger changes in predictive performance. Objectively evaluating bias correction requires bias‐free presence–absence test data, and without them the real improvement in describing a species' environmental niche cannot be assessed.
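The article corrects bias model-wise, by including accessibility covariates in the fitted point-process model. The underlying logic can be seen in a simpler inverse-weighting analogue, under the strong illustrative assumption that recording probability equals a known accessibility score:

```python
import random

random.seed(4)
# True presences spread evenly along an environmental gradient env in (0, 1)
true_env = [random.random() for _ in range(100000)]

records = []
for e in true_env:
    access = 0.1 + 0.9 * e          # assumed accessibility, higher at high env
    if random.random() < access:    # opportunistic recording probability
        records.append((e, access))

# Naive summary of recorded presences is pulled toward accessible (high-env) sites
naive = sum(e for e, _ in records) / len(records)

# Inverse-accessibility (Horvitz-Thompson-style) weighting undoes the bias
wsum = sum(1.0 / a for _, a in records)
corrected = sum(e / a for e, a in records) / wsum   # recovers the true mean ~0.5
```

In practice accessibility is never known exactly, which is why the paper's success hinges on how well the chosen bias covariates describe the true sampling process.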

17.
Imperfect detection can bias estimates of site occupancy in ecological surveys but can be corrected by estimating detection probability. Time‐to‐first‐detection (TTD) occupancy models have been proposed as a cost‐effective survey method that allows detection probability to be estimated from single site visits. Nevertheless, few studies have validated the performance of occupancy‐detection models by creating a situation where occupancy is known and model outputs can be compared with the truth. We tested the performance of TTD occupancy models in the face of detection heterogeneity using an experiment based on standard survey methods to monitor koala Phascolarctos cinereus populations in Australia. Known numbers of koala faecal pellets were placed under trees, and observers, uninformed as to which trees had pellets under them, carried out a TTD survey. We fitted five TTD occupancy models to the survey data, each making different assumptions about detectability, to evaluate how well each estimated the true occupancy status. Relative to the truth, all five models produced strongly biased estimates, overestimating detection probability and underestimating the number of occupied trees. Despite this, goodness‐of‐fit tests indicated that some models fitted the data well, with no evidence of model misfit. Hence, TTD occupancy models that appear to perform well with respect to the available data may be performing poorly. The reason for poor model performance was unaccounted‐for heterogeneity in detection probability, which is known to bias occupancy‐detection models. This poses a problem because unaccounted‐for heterogeneity could not be detected using goodness‐of‐fit tests and was only revealed because we knew the experimentally determined outcome. A challenge for occupancy‐detection models is to find ways to identify and mitigate the impacts of unobserved heterogeneity, which could unknowingly bias many models.
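The detection half of a TTD occupancy model is typically an exponential time-to-first-detection censored at the survey length; with occupancy fixed at one, the rate MLE is simply detections divided by total search time. A toy version with assumed parameter values (homogeneous detection, so unlike the koala experiment it is unbiased by construction):

```python
import random

random.seed(5)
rate, t_max = 1.0, 3.0        # true detection rate and survey cutoff (assumed)
times, detected = [], []
for _ in range(5000):
    t = random.expovariate(rate)   # time to first detection at an occupied site
    if t < t_max:
        times.append(t)
        detected.append(True)
    else:
        times.append(t_max)        # censored: not detected within the survey
        detected.append(False)

# Censored-exponential MLE: number of detections over total time at risk
rate_hat = sum(detected) / sum(times)
```

When the true rate varies across sites, this homogeneous-rate estimator is pulled toward the easily detected sites, which is the heterogeneity bias the experiment exposed.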

18.
The two‐stage case–control design has been widely used in epidemiology studies for its cost‐effectiveness and improvement of the study efficiency (White, 1982, American Journal of Epidemiology 115, 119–128; Breslow and Cain, 1988, Biometrika 75, 11–20). The evolution of modern biomedical studies has called for cost‐effective designs with a continuous outcome and exposure variables. In this article, we propose a new two‐stage outcome‐dependent sampling (ODS) scheme with a continuous outcome variable, where both the first‐stage data and the second‐stage data are from ODS schemes. We develop a semiparametric empirical likelihood estimation for inference about the regression parameters in the proposed design. Simulation studies were conducted to investigate the small‐sample behavior of the proposed estimator. We demonstrate that, for a given statistical power, the proposed design will require a substantially smaller sample size than the alternative designs. The proposed method is illustrated with an environmental health study conducted at the National Institutes of Health.

19.
Missing data occur in genetic association studies for several reasons, including missing family members and uncertain haplotype phase. Maximum likelihood is a commonly used approach to accommodate missing data, but it can be difficult to apply to family-based association studies because of possible loss of robustness to confounding by population stratification. Here a novel likelihood for nuclear families is proposed, in which distinct sets of association parameters are used to model the parental genotypes and the offspring genotypes. This approach is robust to population structure when the data are complete, and has only minor loss of robustness when there are missing data. It also allows a novel conditioning step that gives valid analysis for multiple offspring in the presence of linkage. Unrelated subjects are included by regarding them as the children of two missing parents. Simulations and theory indicate similar operating characteristics to TRANSMIT, but with no bias with missing data in the presence of linkage. In comparison with FBAT and PCPH, the proposed model is slightly less robust to population structure but has greater power to detect strong effects. In comparison to APL and MITDT, the model is more robust to stratification and can accommodate sibships of any size. The methods are implemented for binary and continuous traits in the UNPHASED software, available from the author.

20.
Aim Studying relationships between species and their physical environment requires species distribution data, ideally based on presence–absence (P–A) data derived from surveys. Such data are limited in their spatial extent. Presence‐only (P‐O) data are considered inappropriate for such analyses. Our aim was to evaluate whether such data may be used when considering a multitude of species over a large spatial extent, in order to analyse the relationships between environmental factors and species composition. Location The study was conducted in virtual space; however, the geographic origin of the data used is the contiguous USA. Methods We created distribution maps for 50 virtual species based on actual environmental conditions in the study area. Sampling locations were based on true observations from the Global Biodiversity Information Facility. We produced P–A data by selecting ∼1000 random locations and recording the presence/absence of all species. We produced two P‐O data sets: the full P‐O set was produced by sampling the species at locations of true occurrences, and the partial P‐O set was a subset of the full P‐O data set matching the size of the P–A data set. For each data set, we recorded the environmental variables at the same locations. We used CCA to evaluate the amount of variance in species composition explained by each variable. We evaluated the bias in the data sets by calculating the deviation of the average values of the environmental variables at sampled locations from those over the entire area. Results P–A and P‐O data sets were similar in terms of the amount of variance explained by the different environmental variables. We found sizable environmental and spatial bias in the P‐O data set, compared to the entire study area.
Main conclusions Our results suggest that although P‐O data from collections contain bias, the multitude of species, and thus the relatively large amount of information in the data, allows the use of P‐O data for analysing environmental determinants of species composition.

