Similar Articles
20 similar articles found (search time: 15 ms)
1.
With the advent of high-throughput genetic data, there have been attempts to estimate heritability from genome-wide SNP data on a cohort of distantly related individuals using a linear mixed model (LMM). Fitting such an LMM in a large-scale cohort study, however, is tremendously challenging due to its high-dimensional linear algebraic operations. In this paper, we propose a new method named PredLMM that approximates the aforementioned LMM, motivated by the concepts of genetic coalescence and the Gaussian predictive process. PredLMM has substantially better computational complexity than most existing LMM-based methods and thus provides a fast alternative for estimating heritability in large-scale cohort studies. Theoretically, we show that under a model of genetic coalescence, the limiting form of our approximation is the celebrated predictive process approximation of large Gaussian process likelihoods, which has well-established accuracy standards. We illustrate our approach with extensive simulation studies and use it to estimate the heritability of multiple quantitative traits from the UK Biobank cohort.
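As a back-of-the-envelope illustration of the variance-component machinery this abstract builds on (a generic LMM heritability sketch, not the PredLMM algorithm; all sizes and the simulation setup are invented), one can simulate a phenotype from a GRM-based covariance and maximize the profile likelihood over the heritability h2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h2_true = 500, 1000, 0.5

# Standardized genotype matrix and GRM (genetic relationship matrix)
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)
X = (X - X.mean(0)) / X.std(0)
K = X @ X.T / m

# Draw a phenotype from N(0, h2*K + (1 - h2)*I)
V = h2_true * K + (1 - h2_true) * np.eye(n)
y = np.linalg.cholesky(V) @ rng.standard_normal(n)

# Rotate by the eigenvectors of K: the likelihood then factorizes
s, U = np.linalg.eigh(K)
yt = U.T @ y

def profile_loglik(h2):
    d = h2 * s + (1 - h2)          # eigenvalues of h2*K + (1 - h2)*I
    sp2 = np.mean(yt**2 / d)       # total variance, profiled out in closed form
    return -0.5 * (np.sum(np.log(sp2 * d)) + n)

grid = np.linspace(0.01, 0.99, 99)
h2_hat = grid[np.argmax([profile_loglik(h) for h in grid])]
print(round(h2_hat, 2))
```

The O(n^3) eigendecomposition is exactly the bottleneck that approximations like the one proposed here aim to avoid at biobank scale.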

2.
Estimating haplotype frequencies becomes increasingly important in the mapping of complex disease genes, as millions of single nucleotide polymorphisms (SNPs) are being identified and genotyped. When genotypes at multiple SNP loci are gathered from unrelated individuals, haplotype frequencies can be accurately estimated using expectation-maximization (EM) algorithms (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995), with standard errors estimated using bootstraps. However, because the number of possible haplotypes increases exponentially with the number of SNPs, handling data with many SNPs poses a computational challenge for the EM methods and for other haplotype-inference methods. To solve this problem, Niu and colleagues, in their Bayesian haplotype inference paper (Niu et al., 2002), introduced a computational algorithm called progressive ligation (PL). Their Bayesian method, however, is limited in the number of subjects it can handle (no more than 100 in the current implementation). In this paper, we propose a new method that uses the same likelihood formulation as Excoffier and Slatkin's EM algorithm and applies the estimating-equation idea together with the PL computational algorithm, with some modifications. The proposed method can handle data sets with large numbers of SNPs as well as large numbers of subjects. At the same time, it estimates standard errors efficiently, using the sandwich estimate from the estimating equation rather than the bootstrap. Additionally, the method admits missing data and produces valid estimates of parameters and their standard errors under the assumption that the missing genotypes are missing at random in the sense defined by Rubin (1976).
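The EM scheme this abstract refers to is easy to sketch for two SNPs, where only the double heterozygote has ambiguous phase (an illustrative toy, not the proposed estimating-equation/PL method; the haplotype frequencies and sample size are invented):

```python
import numpy as np
from itertools import product
from collections import Counter

rng = np.random.default_rng(1)
haps = [(0, 0), (0, 1), (1, 0), (1, 1)]       # the four two-SNP haplotypes
p_true = np.array([0.4, 0.3, 0.2, 0.1])

# Simulate unphased genotypes at two SNPs from pairs of haplotypes
N = 2000
h1, h2 = rng.choice(4, N, p=p_true), rng.choice(4, N, p=p_true)
cnt = Counter((haps[a][0] + haps[b][0], haps[a][1] + haps[b][1])
              for a, b in zip(h1, h2))

# EM: only the double heterozygote (1, 1) has ambiguous phase
p = np.full(4, 0.25)
for _ in range(100):
    counts = np.zeros(4)
    for g, ng in cnt.items():
        # ordered haplotype pairs compatible with this genotype
        pairs = [(i, j) for i, j in product(range(4), repeat=2)
                 if (haps[i][0] + haps[j][0], haps[i][1] + haps[j][1]) == g]
        w = np.array([p[i] * p[j] for i, j in pairs])
        w /= w.sum()                          # E-step: posterior phase weights
        for (i, j), wij in zip(pairs, w):
            counts[i] += ng * wij
            counts[j] += ng * wij
    p = counts / counts.sum()                 # M-step: renormalize counts
print(np.round(p, 2))
```

The exponential blow-up the abstract mentions is visible here: with L SNPs the `haps` list has 2^L entries, which is what progressive ligation works around.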

3.
Geometric morphometric methods constitute a powerful and precise tool for the quantification of morphological differences. The use of geometric morphometrics in palaeontology is very often limited by missing data. Landmark-based shape analysis methods are highly sensitive to missing data but until now have not been adapted to this kind of dataset. To analyze the prospective utility of this method for fossil taxa, we propose a model based on prosimian cranial morphology in which we test two methods of missing-data reconstruction. These consist of deleting data from a complete dataset (in increments of five percent) and estimating the missing values using two multivariate methods. The estimates were found to constitute a useful tool for the analysis of partial datasets, at least to a certain extent. These results are promising for future studies of morphological variation in fossil taxa.

4.
Chen B, Zhou XH. Biometrics 2011, 67(3):830-842
Longitudinal studies often feature incomplete response and covariate data. Likelihood-based methods such as the expectation-maximization algorithm give consistent estimators for model parameters when data are missing at random (MAR), provided that the response model and the missing-covariate model are correctly specified; the missing-data mechanism, however, need not be specified. An alternative is the weighted estimating equation approach, which gives consistent estimators if the missing-data and response models are correctly specified; here, the distribution of the covariates with missing values need not be specified. In this article, we develop a doubly robust estimation method for longitudinal data with missing responses and missing covariates when data are MAR. This method is appealing in that it provides consistent estimators if either the missing-data model or the missing-covariate model is correctly specified. Simulation studies demonstrate that the method performs well in a variety of situations.
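The doubly robust idea can be sketched with the classic augmented inverse-probability-weighted (AIPW) estimator of a mean under a MAR response (a generic cross-sectional illustration, not the authors' longitudinal estimator; the deliberately misspecified propensity model demonstrates the "either model correct" property):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x = rng.standard_normal(n)
y = 1 + 2 * x + rng.standard_normal(n)        # true E[Y] = 1

# MAR missingness: response observed with probability expit(0.5 + x)
pi_true = 1 / (1 + np.exp(-(0.5 + x)))
r = rng.binomial(1, pi_true)

# Complete-case mean is biased because observed subjects have larger x
cc_mean = y[r == 1].mean()

# Outcome model fit on complete cases (valid under MAR given x)
X1 = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X1[r == 1], y[r == 1], rcond=None)[0]
m_x = X1 @ beta

# Deliberately misspecified propensity model: a constant
pi_wrong = np.full(n, r.mean())

# Doubly robust (AIPW) estimator: consistent if either model is right
dr = np.mean(r * y / pi_wrong - (r - pi_wrong) / pi_wrong * m_x)
print(round(cc_mean, 2), round(dr, 2))
```

Swapping the roles (correct propensity, intercept-only outcome model) gives the same protection, which is the double robustness the abstract describes.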

5.
The novel two-step serologic sensitive/less-sensitive testing algorithm for detecting recent HIV seroconversion (STARHS) provides a simple and practical method to estimate HIV-1 incidence from cross-sectional HIV seroprevalence data. STARHS has been used increasingly in epidemiologic studies. However, the uncertainty of incidence estimates obtained with this algorithm has not been well described, especially for high-risk groups or when missing data are present because a fraction of sensitive enzyme immunoassay (EIA) positive specimens are not tested by the less sensitive EIA. Ad hoc methods used in practice provide incorrect confidence limits and thus may jeopardize statistical inference. In this report, we propose maximum likelihood and Bayesian methods for correctly estimating the uncertainty in incidence estimates obtained from prevalence data with a fraction missing, and extend the methods to regression settings. Using a study of injection drug users participating in a drug detoxification program in New York City as an example, we demonstrate the impact of underestimating the uncertainty in incidence estimates with ad hoc methods. Our methods can be applied to estimate the incidence of other diseases from prevalence data obtained with similar testing algorithms when missing data are present.

6.
In medical research, receiver operating characteristic (ROC) curves can be used to evaluate the performance of biomarkers for diagnosing diseases or predicting the risk of developing a disease in the future. The area under the ROC curve (ROC AUC), as a summary measure of ROC curves, is widely utilized, especially when comparing multiple ROC curves. In observational studies, the estimation of the AUC is often complicated by missing biomarker values, which means that existing estimators of the AUC are potentially biased. In this article, we develop robust statistical methods for estimating the ROC AUC; the proposed methods use information from auxiliary variables that are potentially predictive of the missingness of the biomarkers or of the missing biomarker values. We are particularly interested in auxiliary variables that are predictive of the missing biomarker values. In the case of missing at random (MAR), that is, when missingness of biomarker values depends only on the observed data, our estimators have the attractive feature of being consistent if one correctly specifies, conditional on auxiliary variables and disease status, either the model for the probabilities of being missing or the model for the biomarker values. In the case of missing not at random (MNAR), that is, when missingness may depend on the unobserved biomarker values, we propose a sensitivity analysis to assess the impact of MNAR on the estimation of the ROC AUC. The asymptotic properties of the proposed estimators are studied and their finite-sample behavior is evaluated in simulation studies. The methods are further illustrated using data from a study of maternal depression during pregnancy.
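A minimal sketch of correcting an AUC for MAR missing biomarker values with inverse-probability weights (an illustrative special case with known verification probabilities and a weighted Mann-Whitney statistic, not the authors' full doubly robust estimator; the simulation setup is invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
d = rng.binomial(1, 0.5, n)                   # disease status, always observed
b = rng.standard_normal(n) + d                # biomarker; true AUC = Phi(1/sqrt 2) ~ 0.76
a = b + rng.standard_normal(n)                # auxiliary variable predictive of b

# MAR verification: controls with high auxiliary values are tested more often
expit = lambda z: 1 / (1 + np.exp(-z))
pi = np.where(d == 1, 0.9, np.clip(expit(-1 + 2 * a), 0.1, 1.0))
r = rng.binomial(1, pi)                       # r = 1 if the biomarker was measured

def wmw_auc(bc, bn, wc, wn):
    """Weighted Mann-Whitney estimate of P(case biomarker > control biomarker)."""
    gt = (bc[:, None] > bn[None, :]) + 0.5 * (bc[:, None] == bn[None, :])
    ww = np.outer(wc, wn)
    return np.sum(ww * gt) / ww.sum()

case, ctrl = (r == 1) & (d == 1), (r == 1) & (d == 0)
naive = wmw_auc(b[case], b[ctrl], np.ones(case.sum()), np.ones(ctrl.sum()))
# Weighting each subject by 1/pi undoes the selective testing
# (pi is known here; in practice it is estimated from auxiliary variables)
ipw = wmw_auc(b[case], b[ctrl], 1 / pi[case], 1 / pi[ctrl])
print(round(naive, 3), round(ipw, 3))
```

The naive complete-case AUC is pulled down because the verified controls are biomarker-enriched; the weighted version recovers the population AUC.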

7.
Haplotype-based risk models can lead to powerful methods for detecting the association of a disease with a genomic region of interest. In population-based studies of unrelated individuals, however, the haplotype status of some subjects may not be discernible without ambiguity from the available locus-specific genotype data. A score test for detecting haplotype-based association using genotype data has been developed in the context of generalized linear models for the analysis of data from cross-sectional and retrospective studies. In this article, we develop a test for association using genotype data from cohort and nested case-control studies, in which subjects are prospectively followed until disease incidence or censoring (end of follow-up) occurs. Assuming a proportional hazards model for the haplotype effects, we derive an induced hazard function of the disease given the genotype data, and hence propose a test statistic based on the associated partial likelihood. The proposed test procedure can account for differential follow-up of subjects, can adjust for possibly time-dependent environmental cofactors, and can make efficient use of the valuable age-at-onset information that is available on cases. We provide an algorithm for computing the test statistic using readily available statistical software. Using simulated data in the context of two genomic regions, GPX1 and GPX3, we evaluate the validity of the proposed test for small sample sizes and study its power in the presence and absence of missing genotype data.

8.
Wang CY, Huang WT. Biometrics 2000, 56(1):98-105
We consider estimation in logistic regression where some covariates may be missing at random. Satten and Kupper (1993, Journal of the American Statistical Association 88, 200-208) proposed estimating odds ratio parameters using methods based on the probability of exposure. By approximating a partial likelihood, we extend their idea and propose a method that estimates the cumulant-generating function of the missing covariate given observed covariates and surrogates in the controls. Our proposed method first estimates some lower-order cumulants of the conditional distribution of the unobserved data and then solves the resulting estimating equation for the logistic regression parameter. A simple version of the proposed method replaces a missing covariate by the sum of its conditional mean and conditional variance given the observed data in the controls. We note one important property of the proposed method: when validation is available only on controls, a class of inverse-selection-probability-weighted semiparametric estimators cannot be applied, because the selection probabilities on cases are zero. The proposed estimator performs well unless the relative risk parameters are large, even though it is technically inconsistent. Small-sample simulations are conducted. We illustrate the method with a real data analysis.

9.
Case-cohort analysis with accelerated failure time model
Kong L, Cai J. Biometrics 2009, 65(1):135-142
In a case–cohort design, covariates are assembled only for a subcohort that is randomly selected from the entire cohort and for any additional cases outside the subcohort. This design is appealing for large cohort studies of rare diseases, especially when the exposures of interest are expensive to ascertain for all subjects. We propose statistical methods for analyzing case–cohort data with a semiparametric accelerated failure time model, which interprets covariate effects as accelerating or decelerating the time to failure. Asymptotic properties of the proposed estimators are developed. The finite-sample properties of the case–cohort estimator and its efficiency relative to the full-cohort estimator are assessed via simulation studies. A real example from a study of cardiovascular disease is provided to illustrate the estimating procedure.

10.
Qin J, Shen Y. Biometrics 2010, 66(2):382-392
Length-biased time-to-event data are commonly encountered in applications ranging from epidemiological cohort studies and cancer prevention trials to studies in labor economics. A longstanding statistical problem is how to assess the association of risk factors with survival in the target population given observed length-biased data. In this article, we demonstrate how to estimate these effects under the semiparametric Cox proportional hazards model. The structure of the Cox model is changed under length-biased sampling in general. Although the existing partial likelihood approach for left-truncated data can be used to estimate covariate effects, it may not be efficient for analyzing length-biased data. We propose two estimating equation approaches for estimating the covariate coefficients under the Cox model. We use modern stochastic process and martingale theory to develop the asymptotic properties of the estimators. We evaluate the empirical performance and efficiency of the two methods through extensive simulation studies. We use data from a dementia study to illustrate the proposed methodology, and demonstrate the computational algorithms for point estimates, which can be linked directly to existing functions in S-PLUS or R.

11.
Most methods for testing association in the presence of linkage using family-based studies have been developed for continuous traits. FBAT (family-based association tests) is one of the few methods appropriate for discrete outcomes. In this article, we describe a new test of association in the presence of linkage for binary traits. We use a gamma random effects model in which association and linkage are modelled as fixed effects and random effects, respectively. We compare the gamma random effects model to an FBAT and to a generalized estimating equation-based alternative, using two regions in the Genetic Analysis Workshop 14 simulated data. One of these regions contained haplotypes associated with disease, and the other did not.

12.
Cui Y, Kim DY, Zhu J. Genetics 2006, 174(4):2159-2172
Statistical methods for mapping quantitative trait loci (QTL) have been extensively studied. While most existing methods assume a normal distribution for the phenotype, the normality assumption can easily be violated when phenotypes are measured in counts. One natural choice for count traits is the classical Poisson regression model. However, conditional on covariates, the Poisson assumption of mean-variance equality may not be valid when data are under- or overdispersed. In this article, we propose an interval-mapping approach for phenotypes measured in counts. We model the effects of QTL through a generalized Poisson regression model and develop efficient likelihood-based inference procedures. This approach, implemented with the EM algorithm, allows for a genome-wide scan for QTL. The performance of the proposed method is evaluated through extensive simulation studies, along with comparisons with existing approaches such as Poisson regression and the generalized estimating equation approach. An application to a rice tiller number data set is given. Our approach provides a standard procedure for mapping QTL involved in the genetic control of complex traits measured in counts.
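The dispersion issue the abstract describes can be illustrated by fitting a generalized Poisson distribution, whose extra parameter separates the mean and the variance (a no-covariate, moment-fit sketch, not the proposed interval-mapping procedure; the data here are simulated negative binomial counts):

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(4)
# Overdispersed counts: negative binomial with mean 3 and variance 6
y = rng.negative_binomial(3, 0.5, size=5000)
logfact = np.array([lgamma(v + 1) for v in y])

# Poisson MLE: lambda = sample mean, which forces variance = mean
lam = y.mean()
ll_pois = np.sum(y * np.log(lam) - lam - logfact)

# Generalized Poisson: mean theta/(1-delta), variance theta/(1-delta)^3,
# so var/mean = 1/(1-delta)^2 yields a simple moment estimator of delta
delta = 1 - np.sqrt(y.mean() / y.var())
theta = y.mean() * (1 - delta)
ll_gp = np.sum(np.log(theta) + (y - 1) * np.log(theta + delta * y)
               - theta - delta * y - logfact)
print(round(delta, 2), ll_gp > ll_pois)
```

A positive dispersion estimate (delta > 0) flags overdispersion, and the generalized Poisson log-likelihood exceeds the equidispersed Poisson fit on such data.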

13.
To test for association between a disease and a set of linked markers, or to estimate relative risks of disease, several different methods have been developed. Many methods for family data require that individuals be genotyped at the full set of markers and that phase can be reconstructed. Individuals with missing data are excluded from the analysis. This can result in a substantial decrease in sample size and a loss of information. A possible solution to this problem is to use missing-data likelihood methods. We propose an alternative approach, namely multiple imputation. Briefly, this method consists of estimating from the available data all possible phased genotypes and their respective posterior probabilities. These posterior probabilities are then used to generate replicate imputed data sets via a data augmentation algorithm. We performed simulations to test the efficiency of this approach for case/parent trio data and found that the multiple imputation procedure generally gave unbiased parameter estimates with correct type I error and confidence interval coverage. Multiple imputation has some advantages over missing-data likelihood methods with regard to ease of use and model flexibility. Multiple imputation methods represent promising tools in the search for disease susceptibility variants.
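The combining step behind any multiple-imputation analysis follows Rubin's rules; a minimal sketch for a simple MAR mean (illustrative only, not the genotype-phase procedure; a fully "proper" MI would also draw the regression coefficients from their posterior rather than reuse the point estimates):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 2000, 20
z = rng.standard_normal(n)
x = 2 + z + rng.standard_normal(n)            # target: E[x] = 2
obs = rng.random(n) < 1 / (1 + np.exp(-z))    # MAR: x seen more often when z is large

# Regression of x on z in the observed cases (valid under MAR)
A = np.column_stack([np.ones(n), z])
beta = np.linalg.lstsq(A[obs], x[obs], rcond=None)[0]
resid_sd = np.std(x[obs] - A[obs] @ beta)

# m stochastic imputations, then Rubin's combining rules
est, var = [], []
for _ in range(m):
    xi = x.copy()
    xi[~obs] = A[~obs] @ beta + resid_sd * rng.standard_normal((~obs).sum())
    est.append(xi.mean())
    var.append(xi.var(ddof=1) / n)
qbar = np.mean(est)                           # pooled point estimate
W, B = np.mean(var), np.var(est, ddof=1)      # within- and between-imputation variance
T = W + (1 + 1 / m) * B                       # Rubin's total variance
print(round(qbar, 2), round(np.sqrt(T), 3))
```

The between-imputation component B is what propagates phase (or, here, value) uncertainty into the standard error, which single imputation would understate.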

14.
We introduce a method of parameter estimation for a random effects cure rate model. We also propose a methodology that allows us to account for nonignorable missing covariates in this class of models. The proposed method corrects for possible bias introduced by complete case analysis when missing data are not missing completely at random and is motivated by data from a pair of melanoma studies conducted by the Eastern Cooperative Oncology Group in which clustering by cohort or time of study entry was suspected. In addition, these models allow estimation of cure rates, which is desirable when we do not wish to assume that all subjects remain at risk of death or relapse from disease after sufficient follow-up. We develop an EM algorithm for the model and provide an efficient Gibbs sampling scheme for carrying out the E-step of the algorithm.

15.
We consider two-stage sampling designs, including so-called nested case-control studies, in which one takes a random sample from a target population and completes measurements on each subject in the first stage. The second stage involves drawing a subsample from the original sample and collecting additional data on the subsample. This data structure can be viewed as a missing-data structure on the full-data structure collected in the second stage of the study. Methods for analyzing two-stage designs include parametric maximum likelihood estimation and estimating-equation methodology. We propose an inverse probability of censoring weighted targeted maximum likelihood estimator (IPCW-TMLE) for two-stage sampling designs and present simulation studies featuring this estimator.
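The inverse-probability-weighting component of such two-stage analyses can be sketched with a Horvitz-Thompson (here, Hajek) estimator under known second-stage sampling probabilities (a toy illustration of the weighting idea, not the IPCW-TMLE itself; all quantities are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000
v = rng.binomial(1, 0.3, n)                   # cheap first-stage variable on everyone
x = 1 + 2 * v + rng.standard_normal(n)        # expensive measurement; E[x] = 1.6

# Second stage: oversample the rare v = 1 stratum (known design probabilities)
p2 = np.where(v == 1, 0.9, 0.2)
s = rng.binomial(1, p2)                       # s = 1 if selected into the subsample

naive = x[s == 1].mean()                      # biased: v = 1 subjects overrepresented
ht = np.sum(s * x / p2) / np.sum(s / p2)      # inverse-probability (Hajek) estimate
print(round(naive, 2), round(ht, 2))
```

Weighting each second-stage subject by the inverse of its known selection probability restores representativeness of the full first-stage sample; TMLE then adds a targeted model-based update on top of this.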

16.
We consider studies of cohorts of individuals after a critical event, such as an injury, with the following characteristics. First, the studies are designed to measure "input" variables, which describe the period before the critical event, and to characterize the distribution of the input variables in the cohort. Second, the studies are designed to measure "output" variables, primarily mortality after the critical event, and to characterize the predictive (conditional) distribution of mortality given the input variables in the cohort. Such studies often possess the complication that the input data are missing for those who die shortly after the critical event, because data collection takes place after the event. Standard methods of dealing with the missing inputs, such as imputation or weighting methods based on an assumption of ignorable missingness, are known to be generally invalid when the missingness of inputs is nonignorable, that is, when the distribution of the inputs differs between those who die and those who live. To address this issue, we propose a novel design that obtains and uses information on an additional key variable: a treatment or externally controlled variable that, if set at its "effective" level, could have prevented the death of those who died. We show that the new design can be used to draw valid inferences for the marginal distribution of inputs in the entire cohort, and for the conditional distribution of mortality given the inputs, also in the entire cohort, even under nonignorable missingness. The crucial framework that we use is principal stratification based on the potential outcomes, here mortality under both levels of treatment. We also show, using illustrative preliminary injury data, that our approach can reveal results that are more reasonable than those of standard methods, in relatively dramatic ways. Thus, our approach suggests that the routine collection of data on variables that could be used as possible treatments in such studies of inputs and mortality should become common.

17.
We propose a semiparametric mean residual life mixture cure model for right-censored survival data with a cured fraction. The model employs the proportional mean residual life model to describe the effects of covariates on the mean residual time of uncured subjects, and the logistic regression model to describe the effects of covariates on the cure rate. We develop estimating equations to estimate the proposed cure model for right-censored data with and without length-biased sampling; the latter is often found in prevalent cohort studies. In particular, we propose two estimating equations for the effects of covariates on the cure rate and a method to combine them to improve estimation efficiency. The consistency and asymptotic normality of the proposed estimators are established. The finite-sample performance of the estimators is confirmed with simulations. The proposed estimation methods are applied to a clinical trial study on melanoma and a prevalent cohort study on early-onset type 2 diabetes mellitus.

18.
Maps depicting cancer incidence rates have become useful tools in public health research, giving valuable information about the spatial variation in rates of disease. Typically, these maps are generated using count data aggregated over areas such as counties or census blocks. However, with the proliferation of geographic information systems and related databases, it is becoming easier to obtain exact spatial locations for the cancer cases and suitable control subjects. The use of such point data allows us to adjust for individual-level covariates, such as age and smoking status, when estimating the spatial variation in disease risk. Unfortunately, such covariate information is often subject to missingness. We propose a method for mapping cancer risk when covariates are not completely observed. We model these data using a logistic generalized additive model. Estimates of the linear and non-linear effects are obtained using a mixed effects model representation. We develop an EM algorithm to account for missing data and the random effects. Since the expectation step involves an intractable integral, we estimate the E-step with a Laplace approximation. This framework provides a general method for handling missing covariate values when fitting generalized additive models. We illustrate our method through an analysis of cancer incidence data from Cape Cod, Massachusetts. These analyses demonstrate that standard complete-case methods can yield biased estimates of the spatial variation of cancer risk.

19.
Chen J, Chatterjee N. Biometrics 2006, 62(1):28-35
Genetic epidemiologic studies often collect genotype data at multiple loci within a genomic region of interest from a sample of unrelated individuals. One popular method for analyzing such data is to assess whether haplotypes, i.e., the arrangements of alleles along individual chromosomes, are associated with the disease phenotype or not. For many study subjects, however, the exact haplotype configuration on the pair of homologous chromosomes cannot be derived with certainty from the available locus-specific genotype data (phase ambiguity). In this article, we consider estimating haplotype-specific association parameters in the Cox proportional hazards model, using genotype, environmental exposure, and the disease endpoint data collected from cohort or nested case-control studies. We study alternative expectation-maximization algorithms for estimating haplotype frequencies from cohort and nested case-control studies. Based on a hazard function of the disease derived from the observed genotype data, we then propose a semiparametric method for joint estimation of relative-risk parameters and the cumulative baseline hazard function. The method is greatly simplified under a rare disease assumption, for which an asymptotic variance estimator is also proposed. The performance of the proposed estimators is assessed via simulation studies. An application of the proposed method is presented, using data from the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study.

20.
In this paper, we propose a multivariate extension of family-based association tests based on generalized estimating equations. The test can be applied to multiple phenotypes and to phenotypic data obtained in longitudinal studies without making any distributional assumptions about the phenotypic observations. Methods for handling missing phenotypic information are discussed. Further, we compare the power of the multivariate test with permutation tests and with separate tests for each outcome adjusted for multiple testing. Application of the proposed test to an asthma study illustrates the power of the approach.
