Related Articles
20 related articles found.
1.
Zhang N, Little RJ. Biometrics 2012, 68(3):933-942.
We consider the linear regression of outcome Y on regressors W and Z with some values of W missing, when our main interest is the effect of Z on Y, controlling for W. Three common approaches to regression with missing covariates are (i) complete-case analysis (CC), which discards the incomplete cases; (ii) ignorable likelihood methods, which base inference on the likelihood of the observed data, assuming the missing data are missing at random (Rubin, 1976b); and (iii) nonignorable modeling, which posits a joint distribution of the variables and the missing-data indicators. Another simple practical approach that has not received much theoretical attention is to drop the regressor variables containing missing values from the regression model (DV, for drop variables). DV does not lead to bias when either (i) the regression coefficient of W is zero or (ii) W and Z are uncorrelated. We propose a pseudo-Bayesian approach for regression with missing covariates that compromises between the CC and DV estimates, exploiting information in the incomplete cases when the data support the DV assumptions. We illustrate favorable properties of the method by simulation and apply it to a liver cancer study. Extension of the method to more than one missing covariate is also discussed.
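A minimal simulation sketch in Python (illustrative names throughout; this is not the authors' pseudo-Bayesian estimator) of why the CC and DV estimates can disagree: DV is unbiased for the coefficient of Z only when W and Z are uncorrelated or the coefficient of W is zero.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
rho = 0.5                                         # correlation of W and Z; try rho = 0
W = rng.normal(size=n)
Z = rho * W + np.sqrt(1 - rho**2) * rng.normal(size=n)
Y = 1.0 + 2.0 * W + 3.0 * Z + rng.normal(size=n)  # true coefficient of Z is 3.0
miss = rng.random(n) < 0.4                        # 40% of W missing completely at random

def ols_coef_last(cols, y):
    # OLS fit with intercept; return the coefficient of the last column
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][-1]

cc = ols_coef_last([W[~miss], Z[~miss]], Y[~miss])  # complete-case analysis
dv = ols_coef_last([Z], Y)                          # drop W from the model entirely
print(f"CC: {cc:.2f}  DV: {dv:.2f}")  # CC near 3.0; DV near 3.0 + 2.0*rho = 4.0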

2.
Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.
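A hedged sketch of the propensity-score idea under simplified assumptions (one covariate, quintile strata via pandas.qcut; column names are illustrative and the paper's exact grouping rule may differ): fit the verification propensity separately by test result, stratify on it, estimate P(disease | test, stratum) from the verified subjects, and reweight to the full sample.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def corrected_sens_spec(df):
    # df columns: test (0/1), x (covariate), verified (0/1), disease (0/1, NaN if unverified)
    df = df.copy()
    df["ps"] = np.nan
    for t in (0, 1):                       # propensity of verification, fit per test result
        sub = df["test"] == t
        lr = LogisticRegression().fit(df.loc[sub, ["x"]], df.loc[sub, "verified"])
        df.loc[sub, "ps"] = lr.predict_proba(df.loc[sub, ["x"]])[:, 1]
    df["stratum"] = pd.qcut(df["ps"], 5, labels=False, duplicates="drop")
    # P(D=1 | test, stratum) from the verified, applied to every subject in the cell
    p = df[df["verified"] == 1].groupby(["test", "stratum"])["disease"].mean()
    df = df.merge(p.rename("p_d").reset_index(), on=["test", "stratum"], how="left")
    sens = df.loc[df["test"] == 1, "p_d"].sum() / df["p_d"].sum()
    spec = (1 - df.loc[df["test"] == 0, "p_d"]).sum() / (1 - df["p_d"]).sum()
    return sens, spec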

3.
Missing data occur in genetic association studies for several reasons, including missing family members and uncertain haplotype phase. Maximum likelihood is a commonly used approach to accommodating missing data, but it can be difficult to apply to family-based association studies because of possible loss of robustness to confounding by population stratification. Here a novel likelihood for nuclear families is proposed, in which distinct sets of association parameters are used to model the parental genotypes and the offspring genotypes. This approach is robust to population structure when the data are complete, and has only minor loss of robustness when there are missing data. It also allows a novel conditioning step that gives valid analysis for multiple offspring in the presence of linkage. Unrelated subjects are included by regarding them as the children of two missing parents. Simulations and theory indicate operating characteristics similar to TRANSMIT, but with no bias when data are missing in the presence of linkage. In comparison with FBAT and PCPH, the proposed model is slightly less robust to population structure but has greater power to detect strong effects. In comparison to APL and MITDT, the model is more robust to stratification and can accommodate sibships of any size. The methods are implemented for binary and continuous traits in the software package UNPHASED, available from the author.

4.
Dupuis JA, Schwarz CJ. Biometrics 2007, 63(4):1015-1022.
This article considers a Bayesian approach to the multistate extension of the Jolly-Seber model commonly used to estimate population abundance in capture-recapture studies. It extends the work of George and Robert (1992, Biometrika 79, 677-683), which dealt with the Bayesian estimation of a closed population with only a single state for all animals. A super-population is introduced to model new entrants in the population. Bayesian estimates of abundance are obtained by implementing a Gibbs sampling algorithm based on data augmentation of the missing data in the capture histories when the state of the animal is unknown. Moreover, a partitioning of the missing data is adopted to ensure the convergence of the Gibbs sampling algorithm even in the presence of impossible transitions between some states. Lastly, we apply our methodology to a population of fish to estimate abundance and movement.
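A minimal sketch of the data-augmentation step in isolation (generic, not the authors' full multistate Jolly-Seber sampler): within a Gibbs iteration, an unknown state at an uncaptured time is drawn from its full conditional given the neighboring states, a transition matrix P, and the per-state probability q of escaping capture. A zero-mass conditional is exactly the impossible-transition situation that motivates the paper's partitioning of the missing data.

import numpy as np

rng = np.random.default_rng(1)

def sample_hidden_state(prev_s, next_s, P, q):
    # unnormalized full conditional: enter state s from prev_s, avoid capture in s, leave to next_s
    w = P[prev_s, :] * q * P[:, next_s]
    if w.sum() == 0:     # impossible transitions: handled in the paper by partitioning
        raise ValueError("no admissible state for this capture history")
    return rng.choice(len(w), p=w / w.sum())

# example: 3 states with a forbidden 2 -> 0 transition
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
q = np.array([0.4, 0.5, 0.6])    # probability of not being captured, by state
print(sample_hidden_state(0, 2, P, q))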

5.
Shih JH, Chatterjee N. Biometrics 2002, 58(3):502-509.
In case-control family studies with a survival endpoint, age of onset of disease can be used to assess the familial aggregation of the disease and the relationship between the disease and genetic or environmental risk factors. Because of the retrospective nature of the case-control study, methods for analyzing prospectively collected correlated failure time data do not apply directly. In this article, we propose a semiparametric quasi-partial-likelihood approach that simultaneously estimates the effect of covariates on the age of onset and the association of ages of onset among family members, without requiring specification of the baseline marginal distribution. We conducted a simulation study to evaluate the performance of the proposed approach and compare it with existing semiparametric approaches. Simulation results demonstrate that the proposed approach performs better in terms of consistency and efficiency. We illustrate the methodology using a subset of data from the Washington Ashkenazi Study.

6.
In some large clinical studies, it may be impractical to perform a physical examination on every subject at his or her last monitoring time in order to diagnose the occurrence of the event of interest. This gives rise to survival data with missing censoring indicators, where the probability of missingness may depend on the time of last monitoring and on covariates. We present a fully Bayesian semiparametric method for such survival data to estimate the regression parameters of the Cox proportional hazards model. Theoretical investigation and simulation studies show that our method performs better than competing methods. We apply the proposed method to analyze survival data with missing censoring indicators from the Orofacial Pain: Prospective Evaluation and Risk Assessment study.

7.
Exposure to air pollution is associated with increased morbidity and mortality. Recent technological advancements permit the collection of time-resolved personal exposure data. Such data are often incomplete, with missing observations and exposures below the limit of detection, which limits their use in health effects studies. In this paper, we develop an infinite hidden Markov model for multiple asynchronous multivariate time series with missing data. Our model is designed to include covariates that can inform transitions among hidden states. We implement beam sampling, a combination of slice sampling and dynamic programming, to sample the hidden states, and a Bayesian multiple imputation algorithm to impute missing data. In simulation studies, our model excels in estimating hidden states and state-specific means and in imputing observations that are missing at random or below the limit of detection. We validate our imputation approach on data from the Fort Collins Commuter Study, showing that the estimated hidden states improve imputations for data that are missing at random compared with existing approaches. In a case study of the same cohort, we describe the inferential gains obtained from our model, including improved imputation of missing data and the ability to identify shared patterns in activity and exposure among repeated sampling days for individuals and among distinct individuals.
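A hedged sketch of the imputation step only (the infinite HMM and the beam sampler are omitted; the function name and the Gaussian emission assumption are mine): given a sampled hidden-state path and state-specific means and standard deviations, missing values are drawn from the state's emission distribution, and below-LOD values from that distribution truncated above at the LOD.

import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(2)

def impute_series(y, states, mu, sigma, lod, below_lod):
    # y: exposures with np.nan where unobserved; states: sampled hidden state per time point;
    # below_lod[t] is True when the value was recorded only as "< LOD"
    y = y.copy()
    for t in np.where(np.isnan(y))[0]:
        m, s = mu[states[t]], sigma[states[t]]
        if below_lod[t]:  # truncated draw on (-inf, lod]
            y[t] = truncnorm.rvs(-np.inf, (lod - m) / s, loc=m, scale=s, random_state=rng)
        else:             # missing at random: unconstrained draw from the state's emission
            y[t] = rng.normal(m, s)
    return y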

8.
Auxiliary covariate data are often collected in biomedical studies when the primary exposure variable is assessed on only a subset of the study subjects. In this study, we investigate a semiparametric estimated-likelihood method for generalized linear mixed models (GLMMs) in the presence of a continuous auxiliary variable, using a kernel smoother to handle the continuous auxiliary data. The method can be used for missing or mismeasured covariate problems in a variety of applications when an auxiliary variable is available and cluster sizes are not too small. Simulation results show that the proposed method performs better than an approach that ignores the random effects in the GLMM and one that uses only the data in the validation set. We illustrate the proposed method with a real data set from a recent environmental epidemiology study relating the maternal serum level of 1,1-dichloro-2,2-bis(p-chlorophenyl)ethylene to preterm birth.
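A minimal sketch of the kernel-smoothing ingredient (a Nadaraya-Watson regression with a Gaussian kernel; the paper's estimated likelihood builds on this kind of smoother, and the names here are illustrative): the validation set, where both the exposure and the auxiliary variable are observed, is used to smooth the exposure over the auxiliary variable.

import numpy as np

def nw_smooth(a_query, a_valid, x_valid, h):
    # Nadaraya-Watson estimate of E[X | A = a] with Gaussian kernel bandwidth h,
    # using validation subjects (exposure x_valid observed at auxiliary values a_valid)
    w = np.exp(-0.5 * ((np.asarray(a_query)[:, None] - a_valid[None, :]) / h) ** 2)
    return (w * x_valid).sum(axis=1) / w.sum(axis=1)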

9.
Kidney and cardiovascular disease are widespread among populations with a high prevalence of diabetes, such as the American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989–1991), 2 (1993–1995), and 3 (1998–1999) was used to examine the performance of five methods for handling missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing-at-random models and one not-missing-at-random model were used to compare the performance of these techniques on randomly and nonrandomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the method that will provide the most accurate results.
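A hedged sketch of the three simplest of those techniques on a toy wide-format table (one column per exam, illustrative values; pattern-mixture and multiple imputation require full model specifications and are omitted):

import numpy as np
import pandas as pd

exams = pd.DataFrame({"exam1": [1.2, 0.9, np.nan, 1.5],
                      "exam2": [1.3, np.nan, 1.1, np.nan],
                      "exam3": [np.nan, 1.0, 1.2, 1.6]})

listwise = exams.dropna()                                        # drop subjects with any gap
serial_mean = exams.apply(lambda r: r.fillna(r.mean()), axis=1)  # mean of subject's own serial measures
adjacent = exams.ffill(axis=1).bfill(axis=1)                     # carry an adjacent exam's value over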

10.
Yip PS, Lin HZ, Xi L. Biometrics 2005, 61(4):1085-1092.
A semiparametric estimation procedure is proposed for modeling capture-recapture data, with the aim of estimating the population size of a closed population. Individuals' covariates are possibly time-dependent, are missing at noncaptured times, and may be measured with error. A set of estimating equations (EEs) based on the covariate process and the capture-recapture data is constructed to estimate the relevant parameters and the population size. These EEs can be solved by an algorithm similar to an EM algorithm. Simulation results show that the proposed procedures work better than the naive estimate, and in some cases even better than "ideal" estimates for which the true covariate values are available for all captured subjects over the entire experimental period. We apply the method to a capture-recapture experiment on the bird species Prinia flaviventris in Hong Kong.

11.
Satten GA, Carroll RJ. Biometrics 2000, 56(2):384-388.
We consider methods for analyzing categorical regression models when some covariates (Z) are completely observed but other covariates (X) are missing for some subjects. When data on X are missing at random (i.e., when the probability that X is observed does not depend on the value of X itself), we present a likelihood approach for the observed data that allows the same nuisance parameters to be eliminated in a conditional analysis as when data are complete. An example of a matched case-control study is used to demonstrate our approach.

12.
In individually matched case-control studies, when some covariates are incomplete, an analysis based only on the complete cases may entail a large loss of information in both the missing and the completely observed variables, usually resulting in bias and a loss of efficiency. In this article, we propose a new method for handling missing covariate data based on a missing-data-induced intensity approach for the case where the missingness mechanism does not depend on case-control status, and we show that this leads to a generalization of the missing-indicator method. We derive the asymptotic properties of the estimates from the proposed method and, in an extensive simulation study, assess the finite-sample performance in terms of bias, efficiency, and 95% confidence coverage under several missing-data scenarios. We also make comparisons with complete-case analysis (CCA) and with some previously proposed missing-data methods. Our results indicate that, under the assumption of predictable missingness, the suggested method provides valid estimation of parameters, is more efficient than CCA, and is competitive with other, more complex methods of analysis. A case-control study of multiple myeloma risk and a polymorphism in the interleukin-6 receptor (IL-6-α) is used to illustrate our findings.
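A minimal sketch of the classical missing-indicator construction that the proposed approach generalizes (illustrative, single-covariate version): missing values are set to a constant and a 0/1 missingness indicator joins the design matrix.

import numpy as np

def missing_indicator_design(x):
    # x: 1-D covariate with np.nan marking missing values
    r = np.isnan(x).astype(float)              # missingness indicator
    x_fill = np.where(np.isnan(x), 0.0, x)     # any constant fill works alongside r
    return np.column_stack([x_fill, r])

print(missing_indicator_design(np.array([0.3, np.nan, 1.7])))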

13.
Systolic blood pressure (SBP) is an age-dependent complex trait for which both environmental and genetic factors may play a role in explaining variability among individuals. We performed a genome-wide scan of the rate of change in SBP over time on the Framingham Heart Study data and on one randomly selected replicate of the simulated data from Genetic Analysis Workshop 13. We used a variance-component model to carry out linkage analysis and a Markov chain Monte Carlo-based multiple imputation approach to recover missing information. Furthermore, we adopted two selection strategies, along with the multiple imputation, to deal with subjects taking antihypertensive treatment. The simulated data were used to compare these two strategies, to explore the effectiveness of the multiple imputation in recovering varying degrees of missing information, and to assess its impact on linkage analysis results. For the Framingham data, the marker with the highest LOD score for the SBP slope was found on chromosome 7. Interestingly, we found that SBP slopes were not heritable in males but were in females; for females, the marker with the highest LOD score was found on chromosome 18. Using the simulated data, we found that handling treated subjects with the multiple imputation improved the linkage results. We conclude that multiple imputation is a promising approach for recovering missing information in longitudinal genetic studies and hence for improving subsequent linkage analyses.

14.
Informative missingness of parental genotype data occurs when the genotype of a parent influences the probability of the parent's genotype data being observed. Informative missingness can occur in a number of plausible ways and can affect both the validity and power of procedures that assume the data are missing at random (MAR). We propose a bootstrap calibration of MAR procedures to account for informative missingness and apply our methodology to refine the approach implemented in the TRANSMIT program. We illustrate this approach by applying it to data on hypertensive probands and their parents who participated in the Framingham Heart Study.
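A heavily hedged sketch of the calibration idea in a generic, non-genetic setting (the paper's procedure is tailored to TRANSMIT and parental genotypes; everything below, including the function names, is illustrative): resample the observed data, impose a posited informative-missingness mechanism on each resample, and use the average shift in the MAR-based statistic to correct the observed value.

import numpy as np

rng = np.random.default_rng(3)

def bootstrap_calibrated(stat_mar, data, impose_missingness, n_boot=500):
    # data: numpy array of records; stat_mar: estimator that assumes MAR;
    # impose_missingness: posited informative mechanism applied to complete data
    observed = stat_mar(data)
    shifts = []
    for _ in range(n_boot):
        boot = data[rng.integers(0, len(data), size=len(data))]  # nonparametric bootstrap
        shifts.append(stat_mar(impose_missingness(boot)) - stat_mar(boot))
    return observed - np.mean(shifts)   # subtract the estimated informative-missingness bias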

15.
The selection of an appropriate control sample for use in association mapping requires serious deliberation. Unrelated controls are generally easy to collect, but the resulting analyses are susceptible to spurious association arising from population stratification. Parental controls are popular, since triads comprising a case and two parents can be used in analyses that are robust to this stratification. However, parental controls are often expensive and difficult to collect. In some situations, studies may have both parental and unrelated controls available for analysis. For example, a candidate-gene study may analyze triads but may have an additional sample of unrelated controls for examination of background linkage disequilibrium in genomic regions. Also, studies may collect a sample of triads to confirm results initially found using a traditional case-control study. Initial association studies may also collect both types of control, each providing insurance against the weaknesses of the other. In these situations, the resulting samples will consist of some triads, some unrelated controls, and, possibly, some unrelated cases. Rather than analyze the triads and unrelated subjects separately, we present a likelihood-based approach for combining their information in a single association analysis. Our approach allows for joint analysis of data from both triad and case-control study designs. Simulations indicate that our proposed approach is more powerful than association tests based on each separate sample. Our approach also allows for flexible modeling and estimation of allele effects, as well as for missing parental data. We illustrate the usefulness of our approach using SNP data from a candidate-gene study of psoriasis.

16.
As genome-wide association studies (GWAS) become more popular, two approaches, among others, could be considered to improve statistical power for identifying genes that contribute subtle to moderate effects to human diseases. The first is to increase the sample size, which can be achieved by combining unrelated and familial subjects. The second is to jointly analyze multiple correlated traits. In this study, by extending generalized estimating equations (GEEs), we propose a simple approach for performing univariate or multivariate association tests on combined data from unrelated subjects and nuclear families. In particular, we correct for population stratification by integrating principal component analysis and transmission disequilibrium test strategies. The proposed method allows for multiple siblings as well as missing parental information. Simulation studies show that, by analyzing the combined data while correcting for population stratification, the proposed test has improved power compared with two popular methods, EIGENSTRAT and FBAT. In addition, joint analysis of bivariate traits improves power over univariate analysis when pleiotropic effects are present. Application to the Genetic Analysis Workshop 16 (GAW16) data sets attests to the feasibility and applicability of the proposed method.
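A hedged sketch of the combined-sample fit (not the authors' exact estimator; the transmission-disequilibrium component is omitted and all column names are illustrative): top genotype principal components enter as stratification adjustments, and a GEE treats each nuclear family, or singleton, as a cluster.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def gee_association(df, genotypes, n_pcs=10):
    # df: columns trait, snp, family_id; genotypes: subjects x SNPs dosage matrix
    G = genotypes - genotypes.mean(axis=0)                       # center genotype dosages
    pcs = np.linalg.svd(G, full_matrices=False)[0][:, :n_pcs]    # sample PCs (up to scale)
    X = sm.add_constant(np.column_stack([df["snp"], pcs]))
    model = sm.GEE(df["trait"], X, groups=df["family_id"],
                   family=sm.families.Gaussian(),
                   cov_struct=sm.cov_struct.Exchangeable())
    return model.fit()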

17.
Dominici F. Biometrics 2000, 56(2):546-553.
We propose a methodology for estimating the cell probabilities in a multiway contingency table by combining partial information from a number of studies when not all of the variables are recorded in all studies. We jointly model the full set of categorical variables recorded in at least one of the studies, and we treat the variables that are not reported as missing dimensions of the study-specific contingency table. For example, we might be interested in combining several cohort studies in which the incidence in the exposed and nonexposed groups is not reported for all risk factors in all studies, while the overall number of cases and the cohort size are always available. To account for study-to-study variability, we adopt a Bayesian hierarchical model. At the first stage of the model, the observation stage, data are modeled by a multinomial distribution with a fixed total number of observations. At the second stage, we use the logistic normal (LN) distribution to model variability in the study-specific cell probabilities. Using this model and data augmentation techniques, we reconstruct the contingency table for each study regardless of which dimensions are missing, and we estimate population parameters of interest. Our hierarchical procedure borrows strength from all the studies and accounts for correlations among the cell probabilities. The main difficulty in combining studies recording different variables is in maintaining a consistent interpretation of parameters across studies. The approach proposed here overcomes this difficulty and at the same time addresses the uncertainty arising from the missing dimensions. We apply our modeling strategy to analyze data on air pollution and mortality from 1987 to 1994 for six U.S. cities by combining six cross-classifications of low, medium, and high levels of mortality counts, particulate matter, ozone, and carbon monoxide, with the complication that four of the six cities do not report all the air pollution variables. Our goals are to investigate the association between air pollution and mortality by reconstructing the tables with missing dimensions, to determine the most harmful pollutant combinations, and to make predictions about these key issues for a city other than the six sampled. We find that, for high levels of ozone and carbon monoxide, the number of cases with a high number of deaths increases as the level of particulate matter (PM10) increases, and that the most harmful combinations correspond to high levels of PM10, confirming prior findings that PM10 levels above the NAAQS standard are harmful.

18.
In epidemics of infectious diseases such as influenza, an individual may have one of four possible final states: prior immune, escaped from infection, infected with symptoms, or infected asymptomatically. The exact state is often not observed, and the unobserved transmission times of asymptomatic infections further complicate the analysis. Under the assumption of missing at random, data-augmentation techniques can be used to integrate out such uncertainties. We adapt an importance-sampling-based Monte Carlo expectation-maximization (MCEM) algorithm to the setting of an infectious disease transmitted in close-contact groups. Assuming independence between close-contact groups, we propose a hybrid EM-MCEM algorithm that applies the MCEM or the traditional EM algorithm to each close-contact group depending on the dimension of the missing data in that group, and we discuss variance estimation for this practice. In addition, we propose a bootstrap approach to assess the total Monte Carlo error and factor that error into the variance estimation. The proposed methods are evaluated using simulation studies. We use the hybrid EM-MCEM algorithm to analyze two influenza epidemics in the late 1970s to assess the effects of age and preseason antibody levels on the transmissibility and pathogenicity of the viruses.
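A heavily hedged sketch of the hybrid dispatch logic only (the transmission model itself is out of scope; the group methods enumerate_configs, sample_configs, log_proposal, and loglik are placeholders for model-specific pieces): each close-contact group receives an exact E-step when its missing-data configurations can be enumerated cheaply, and a weighted Monte Carlo E-step otherwise.

import numpy as np

def e_step(group, theta, max_exact=4096, n_mc=2000, rng=np.random.default_rng(4)):
    if group.n_configs <= max_exact:               # traditional EM: exact expectation
        configs = group.enumerate_configs()
        logw = np.array([group.loglik(c, theta) for c in configs])
    else:                                          # MCEM: importance-sampled expectation
        configs = group.sample_configs(n_mc, rng)  # draws from a proposal distribution
        logw = np.array([group.loglik(c, theta) - group.log_proposal(c)
                         for c in configs])
    w = np.exp(logw - logw.max())                  # stabilized importance weights
    return configs, w / w.sum()                    # weighted configurations for the M-step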

19.
In case-control studies of inherited diseases, participating subjects (probands) are often interviewed to collect detailed data on disease history and age at onset in their family members. Genotype data are typically collected from the probands but not from their relatives. In this article, we introduce an approach that combines case-control analysis of data on the probands with kin-cohort analysis of disease history data on their relatives. Assuming a marginally specified multivariate survival model for the joint risk of disease among family members, we describe methods for estimating relative risk, cumulative risk, and residual familial aggregation. We also describe a variation of the methodology that can be used for kin-cohort analysis of family history data from a sample of genotyped cases only. We perform simulation studies to assess the performance of the proposed methodologies under correctly specified and misspecified models for familial aggregation. We illustrate the proposed methodologies by estimating the risk of breast cancer from BRCA1/2 mutations using data from the Washington Ashkenazi Study.

20.
Estimating haplotype frequencies is becoming increasingly important in the mapping of complex disease genes, as millions of single nucleotide polymorphisms (SNPs) are being identified and genotyped. When genotypes at multiple SNP loci are gathered from unrelated individuals, haplotype frequencies can be accurately estimated using expectation-maximization (EM) algorithms (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995), with standard errors estimated by the bootstrap. However, because the number of possible haplotypes increases exponentially with the number of SNPs, handling data with many SNPs poses a computational challenge for the EM methods and for other haplotype inference methods. To solve this problem, Niu and colleagues, in their Bayesian haplotype inference paper (Niu et al., 2002), introduced a computational algorithm called progressive ligation (PL), but their Bayesian method is limited in the number of subjects it can handle (no more than 100 in the current implementation). In this paper, we propose a new method that uses the same likelihood formulation as Excoffier and Slatkin's EM algorithm and applies the estimating-equation idea and the PL computational algorithm with some modifications. Our proposed method can handle data sets with large numbers of SNPs as well as large numbers of subjects, and it estimates standard errors efficiently, using the sandwich estimator from the estimating equations rather than the bootstrap. Additionally, our method admits missing data and produces valid estimates of parameters and their standard errors under the assumption that the missing genotypes are missing at random in the sense defined by Rubin (1976).
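A minimal sketch of the underlying EM iteration for just two biallelic SNPs (the only phase-ambiguous genotype is the double heterozygote; progressive ligation, the estimating-equation machinery, and missing genotypes are all omitted):

import itertools
import numpy as np

HAPS = list(itertools.product((0, 1), repeat=2))   # haplotype = (allele at SNP1, allele at SNP2)

def em_haplotype_freqs(counts, n_iter=200):
    # counts[g1, g2]: number of subjects with allele dosages g1, g2 in {0, 1, 2}
    f = np.full(4, 0.25)                           # start from uniform haplotype frequencies
    n = counts.sum()
    for _ in range(n_iter):
        hap_count = np.zeros(4)
        for (g1, g2), c in np.ndenumerate(counts):
            if c == 0:
                continue
            # ordered haplotype pairs consistent with this genotype
            pairs = [(i, j) for i, hi in enumerate(HAPS) for j, hj in enumerate(HAPS)
                     if hi[0] + hj[0] == g1 and hi[1] + hj[1] == g2]
            p = np.array([f[i] * f[j] for i, j in pairs])
            p /= p.sum()
            for (i, j), pij in zip(pairs, p):      # E-step: expected haplotype counts
                hap_count[i] += c * pij
                hap_count[j] += c * pij
        f = hap_count / (2 * n)                    # M-step: renormalize to frequencies
    return dict(zip(HAPS, f))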
