Similar Articles
20 similar articles retrieved.
1.
Albert PS. Biometrics 2007, 63(2):593-602
Estimating diagnostic accuracy without a gold standard is an important problem in medical testing. Although there is a fairly large literature on this problem for the case of repeated binary tests, there is substantially less work for the case of ordinal tests. A noted exception is the work by Zhou, Castelluccio, and Zhou (2005, Biometrics 61, 600-609), which proposed a methodology for estimating receiver operating characteristic (ROC) curves without a gold standard from multiple ordinal tests. A key assumption in their work was that the test results are independent conditional on the true test result. I propose random effects modeling approaches that incorporate dependence between the ordinal tests, and I show through asymptotic results and simulations the importance of correctly accounting for the dependence between tests. These modeling approaches, along with the importance of accounting for the dependence between tests, are illustrated by analyzing the uterine cancer pathology data analyzed by Zhou et al. (2005).

2.
In diagnostic medicine, estimating the diagnostic accuracy of a group of raters or medical tests relative to the gold standard is often the primary goal. When a gold standard is absent, latent class models where the unknown gold standard test is treated as a latent variable are often used. However, these models have been criticized in the literature from both a conceptual and a robustness perspective. As an alternative, we propose an approach where we exploit an imperfect reference standard with unknown diagnostic accuracy and conduct sensitivity analysis by varying this accuracy over scientifically reasonable ranges. In this article, a latent class model with crossed random effects is proposed for estimating the diagnostic accuracy of regional obstetrics and gynaecology (OB/GYN) physicians in diagnosing endometriosis. To avoid the pitfalls of models without a gold standard, we exploit the diagnostic results of a group of OB/GYN physicians with an international reputation for the diagnosis of endometriosis. We construct an ordinal reference standard based on the discordance among these international experts and propose a mechanism for conducting sensitivity analysis relative to the unknown diagnostic accuracy among them. A Monte Carlo EM algorithm is proposed for parameter estimation and a BIC-type model selection procedure is presented. Through simulations and data analysis we show that this new approach provides a useful alternative to traditional latent class modeling approaches used in this setting.

3.
Albert PS, Dodd LE. Biometrics 2004, 60(2):427-435
Modeling diagnostic error without a gold standard has been an active area of biostatistical research. In a majority of the approaches, model-based estimates of sensitivity, specificity, and prevalence are derived from a latent class model in which the latent variable represents an individual's true unobserved disease status. For simplicity, initial approaches assumed that the diagnostic test results on the same subject were independent given the true disease status (i.e., the conditional independence assumption). More recently, various authors have proposed approaches for modeling the dependence structure between test results given true disease status. This note discusses a potential problem with these approaches. Namely, we show that when the conditional dependence between tests is misspecified, estimators of sensitivity, specificity, and prevalence can be biased. Importantly, we demonstrate that with small numbers of tests, likelihood comparisons and other model diagnostics may not be able to distinguish between models with different dependence structures. We present asymptotic results that show the generality of the problem. Further, data analysis and simulations demonstrate the practical implications of model misspecification. Finally, we present some guidelines about the use of these models for practitioners.

4.
Exposure to air pollution is associated with increased morbidity and mortality. Recent technological advancements permit the collection of time-resolved personal exposure data. Such data are often incomplete with missing observations and exposures below the limit of detection, which limit their use in health effects studies. In this paper, we develop an infinite hidden Markov model for multiple asynchronous multivariate time series with missing data. Our model is designed to include covariates that can inform transitions among hidden states. We implement beam sampling, a combination of slice sampling and dynamic programming, to sample the hidden states, and a Bayesian multiple imputation algorithm to impute missing data. In simulation studies, our model excels in estimating hidden states and state-specific means and imputing observations that are missing at random or below the limit of detection. We validate our imputation approach on data from the Fort Collins Commuter Study. We show that the estimated hidden states improve imputations for data that are missing at random compared to existing approaches. In a case study of the Fort Collins Commuter Study, we describe the inferential gains obtained from our model, including improved imputation of missing data and the ability to identify shared patterns in activity and exposure among repeated sampling days for individuals and among distinct individuals.

5.
Cook RJ, Zeng L, Yi GY. Biometrics 2004, 60(3):820-828
In recent years there has been considerable research devoted to the development of methods for the analysis of incomplete data in longitudinal studies. Despite these advances, the methods used in practice have changed relatively little, particularly in the reporting of pharmaceutical trials. In this setting, perhaps the most widely adopted strategy for dealing with incomplete longitudinal data is imputation by the "last observation carried forward" (LOCF) approach, in which values for missing responses are imputed using observations from the most recently completed assessment. We examine the asymptotic and empirical bias, the empirical type I error rate, and the empirical coverage probability associated with estimators and tests of treatment effect based on the LOCF imputation strategy. We consider a setting involving longitudinal binary data with longitudinal analyses based on generalized estimating equations, and an analysis based simply on the response at the end of the scheduled follow-up. We find that for both of these approaches, imputation by LOCF can lead to substantial biases in estimators of treatment effects, the type I error rates of associated tests can be greatly inflated, and the coverage probability can be far from the nominal level. Alternative analyses based on all available data lead to estimators with comparatively small bias, and inverse probability weighted analyses yield consistent estimators subject to correct specification of the missing data process. We illustrate the differences between various methods of dealing with drop-outs using data from a study of smoking behavior.
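A minimal sketch of the LOCF mechanics described above, assuming a simulated long-format table with hypothetical columns id, visit, and y (not the paper's data); it contrasts the LOCF-imputed end-of-study mean with an available-data mean.

```python
# LOCF sketch for longitudinal binary data with monotone drop-out.
# All names and the data-generating process are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_subj, n_visits = 200, 5
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_subj), n_visits),
    "visit": np.tile(np.arange(n_visits), n_subj),
    "y": rng.binomial(1, 0.4, n_subj * n_visits).astype(float),
})
# Monotone drop-out: each subject's responses are missing from a random visit on.
first_missing = rng.integers(2, n_visits + 1, n_subj)   # n_visits means "completer"
df.loc[df["visit"].to_numpy() >= first_missing[df["id"].to_numpy()], "y"] = np.nan

# LOCF: within each subject, carry the last observed response forward.
df["y_locf"] = df.groupby("id")["y"].ffill()

# End-of-study comparison: LOCF analyses the imputed final visit, while an
# available-data analysis uses only subjects actually observed at that visit.
last = df[df["visit"] == n_visits - 1]
print("LOCF mean at final visit         :", round(last["y_locf"].mean(), 3))
print("Observed-only mean at final visit:", round(last["y"].mean(), 3))
```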

6.

Background

Although substance use disorders (SUDs) are heritable, few genetic risk factors for them have been identified, in part due to the small sample sizes of study populations. To address this limitation, researchers have aggregated subjects from multiple existing genetic studies, but these subjects can have missing phenotypic information, including diagnostic criteria for certain substances that were not originally a focus of study. Recent advances in addiction neurobiology have shown that comorbid SUDs (e.g., the abuse of multiple substances) have similar genetic determinants, which makes it possible to infer missing SUD diagnostic criteria using criteria from another SUD and patient genotypes through statistical modeling.

Results

We propose a new approach based on matrix completion techniques to integrate features of comorbid health conditions and individuals' genotypes to infer unreported diagnostic criteria for a disorder. This approach optimizes a bi-linear model that uses the interactions between known disease correlations and candidate genes to impute missing criteria. An efficient stochastic and parallel algorithm was developed to optimize the model, running 20 times faster than the classic sequential algorithm. It was tested on 3441 subjects who had both cocaine and opioid use disorders and successfully inferred missing diagnostic criteria with consistently better accuracy than other recent statistical methods.

Conclusions

The proposed matrix completion imputation method is a promising tool to impute unreported or unobserved symptoms or criteria for disease diagnosis. Integrating data at multiple scales or from heterogeneous sources may help improve the accuracy of phenotype imputation.
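The bi-linear model above is specific to the paper, but the underlying matrix-completion idea can be illustrated with a generic low-rank factorization fitted by stochastic gradient descent. The data, rank, learning rate, and regularization below are illustrative assumptions, and the genotype and disease-correlation side information used by the actual method is omitted.

```python
# Generic matrix-completion sketch: fill in missing entries of a
# subjects x criteria matrix via a low-rank factorization M ~ U V^T,
# fitted by SGD on the observed entries only.
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_crit, rank = 300, 20, 4
U_true = rng.normal(size=(n_subj, rank))
V_true = rng.normal(size=(n_crit, rank))
M = U_true @ V_true.T                           # complete "criterion" matrix
observed = rng.random(M.shape) < 0.7            # 70% of entries observed

U = 0.1 * rng.normal(size=(n_subj, rank))       # factors to be learned
V = 0.1 * rng.normal(size=(n_crit, rank))
lr, lam = 0.01, 0.1
rows, cols = np.where(observed)
for epoch in range(30):
    order = rng.permutation(len(rows))
    for i, j in zip(rows[order], cols[order]):
        ui = U[i].copy()
        err = M[i, j] - ui @ V[j]
        U[i] += lr * (err * V[j] - lam * ui)
        V[j] += lr * (err * ui - lam * V[j])

M_hat = U @ V.T                                 # imputed matrix
rmse = np.sqrt(np.mean((M_hat[~observed] - M[~observed]) ** 2))
print(f"RMSE on held-out entries: {rmse:.3f}")
```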

7.
Huiping Xu, Bruce A. Craig. Biometrics 2009, 65(4):1145-1155
Traditional latent class modeling has been widely applied to assess the accuracy of dichotomous diagnostic tests. These models, however, assume that the tests are independent conditional on the true disease status, which is rarely valid in practice. Alternative models using probit analysis have been proposed to incorporate dependence among tests, but these models consider restricted correlation structures. In this article, we propose a probit latent class (PLC) model that allows a general correlation structure. When combined with some helpful diagnostics, this model provides a more flexible framework from which to evaluate the correlation structure and model fit. Our model encompasses several other PLC models but uses a parameter‐expanded Monte Carlo EM algorithm to obtain the maximum-likelihood estimates. The parameter-expanded EM algorithm was designed to accelerate the convergence rate of the EM algorithm by expanding the complete-data model to include a larger set of parameters, and it ensures a simple solution in fitting the PLC model. We demonstrate our estimation and model selection methods using a simulation study and two published medical studies.

8.
This paper considers statistical inference for the receiver operating characteristic (ROC) curve in the presence of missing biomarker values by utilizing estimating equations (EEs) together with smoothed empirical likelihood (SEL). Three approaches are developed to estimate the ROC curve and construct its SEL-based confidence intervals, based on kernel-assisted EE imputation, multiple imputation, and a hybrid imputation combining inverse probability weighted imputation and multiple imputation. Under some regularity conditions, we show asymptotic properties of the proposed maximum SEL estimators for the ROC curve. Simulation studies are conducted to investigate the performance of the proposed SEL approaches. A real-data example illustrates the proposed methodologies. Empirical results show that the hybrid imputation method behaves better than the kernel-assisted and multiple imputation methods, and that the three proposed SEL methods outperform the existing nonparametric method.

9.
  1. Reliable estimates of abundance are critical in effectively managing threatened species, but the feasibility of integrating data from wildlife surveys completed using advanced technologies such as remotely piloted aircraft systems (RPAS) and machine learning into abundance estimation methods such as N‐mixture modeling is largely unknown due to the unique sources of detection errors associated with these technologies.
  2. We evaluated two modeling approaches for estimating the abundance of koalas detected automatically in RPAS imagery: (a) a generalized N‐mixture model and (b) a modified Horvitz–Thompson (H‐T) estimator method combining generalized linear models and generalized additive models for overall probability of detection, false detection, and duplicate detection (a minimal sketch of this style of estimator follows this list). The final estimates from each model were compared to the true number of koalas present as determined by telemetry‐assisted ground surveys.
  3. The modified H‐T estimator approach performed best, with the true count of koalas captured within the 95% confidence intervals around the abundance estimates in all 4 surveys in the testing dataset (n = 138 detected objects), a particularly strong result given the difficulty in attaining accuracy found with previous methods.
  4. The results suggested that N‐mixture models in their current form may not be the most appropriate approach to estimating the abundance of wildlife detected in RPAS surveys with automated detection, and accurate estimates could be made with approaches that account for spurious detections.
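A hedged sketch of an H‐T-style correction of automated detections, assuming per-detection estimates of detection probability, false-detection probability, and duplicate counts are already available. The functional form and the simulated inputs are illustrative only and do not reproduce the paper's GLM/GAM-based estimator.

```python
# Horvitz-Thompson-style abundance sketch with corrections for missed,
# spurious, and duplicate detections (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(7)
n_det = 138                                    # detected objects in a survey
p_det = rng.uniform(0.6, 0.9, n_det)           # estimated detection probability
p_false = rng.uniform(0.0, 0.2, n_det)         # estimated prob. detection is spurious
n_dup = rng.integers(1, 3, n_det)              # estimated detections per animal

# Each detection contributes (1 - p_false) / (p_det * n_dup) animals, so
# spurious and duplicate detections are discounted and missed animals inflated.
weights = (1.0 - p_false) / (p_det * n_dup)
N_hat = weights.sum()

# Rough interval via a nonparametric bootstrap over detections.
boot = [weights[rng.integers(0, n_det, n_det)].sum() for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimated abundance: {N_hat:.1f} (95% bootstrap CI {lo:.1f}-{hi:.1f})")
```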

10.
Genotype imputation is an indispensable step in human genetic studies. Large reference panels with deeply sequenced genomes now allow interrogating variants with minor allele frequency < 1% without sequencing. Although it is critical to consider limits of this approach, imputation methods for rare variants have only done so empirically; the theoretical basis of their imputation accuracy has not been explored. To provide theoretical consideration of imputation accuracy under the current imputation framework, we develop a coalescent model of imputing rare variants, leveraging the joint genealogy of the sample to be imputed and reference individuals. We show that broadly used imputation algorithms include model misspecifications about this joint genealogy that limit the ability to correctly impute rare variants. We develop closed-form solutions for the probability distribution of this joint genealogy and quantify the inevitable error rate resulting from the model misspecification across a range of allele frequencies and reference sample sizes. We show that the probability of a falsely imputed minor allele decreases with reference sample size, but the proportion of falsely imputed minor alleles mostly depends on the allele count in the reference sample. We summarize the impact of this error on association tests by calculating the r2 between imputed and true genotypes and show that, even when other sources of error are modeled, the model misspecification still has a substantial impact on the r2 of rare variants. To evaluate these predictions in practice, we compare the imputation of the same dataset across imputation panels of different sizes. Although this empirical imputation accuracy is substantially lower than our theoretical prediction, model misspecification seems to further decrease imputation accuracy for variants with low allele counts in the reference. These results provide a framework for developing new imputation algorithms and for interpreting rare variant association analyses.
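A minimal sketch of the accuracy measure referred to above: the squared correlation (r2) between imputed allele dosages and true genotypes at a rare variant. The simulated dosages below are a purely illustrative stand-in for the output of an imputation model.

```python
# Imputation r2 at a rare variant (illustrative simulation, not the paper's model).
import numpy as np

rng = np.random.default_rng(3)
maf = 0.005                                   # rare variant
n = 20_000
true_geno = rng.binomial(2, maf, n)           # 0/1/2 minor-allele counts
# Imputed dosage = noisy, shrunken version of the truth (purely illustrative).
dosage = np.clip(0.7 * true_geno + rng.normal(0, 0.05, n), 0, 2)

r2 = np.corrcoef(true_geno, dosage)[0, 1] ** 2
print(f"imputation r2 at MAF={maf}: {r2:.3f}")
```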

11.

Background

We explored the imputation performance of the program IMPUTE in an admixed sample from Mexico City. The following issues were evaluated: (a) the impact of different reference panels (HapMap vs. 1000 Genomes) on imputation; (b) potential differences in imputation performance between single-step vs. two-step (phasing and imputation) approaches; (c) the effect of different INFO score thresholds on imputation performance and (d) imputation performance in common vs. rare markers.

Methods

The sample from Mexico City comprised 1,310 individuals genotyped with the Affymetrix 5.0 array. We randomly masked 5% of the markers directly genotyped on chromosome 12 (n = 1,046) and compared the imputed genotypes with the microarray genotype calls. Imputation was carried out with the program IMPUTE. The concordance rates between the imputed and observed genotypes were used as a measure of imputation accuracy and the proportion of non-missing genotypes as a measure of imputation efficacy.
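A sketch of the evaluation design described in these Methods: mask a random 5% of genotype calls, impute them, and report concordance (imputed call equals the array call) and efficacy (masked call receives a genotype). A trivial per-SNP most-common-genotype rule stands in for IMPUTE, and the genotype matrix is simulated.

```python
# Mask-and-compare evaluation of imputation accuracy and efficacy.
import numpy as np

rng = np.random.default_rng(42)
n_ind, n_snp = 1310, 1046
geno = rng.integers(0, 3, size=(n_ind, n_snp))          # fake 0/1/2 calls

mask = rng.random(geno.shape) < 0.05                     # hide 5% of calls
observed = np.where(mask, -1, geno)                      # -1 marks masked calls

# Per-SNP most-common-genotype imputation (placeholder for a real program).
imputed = observed.copy()
for j in range(n_snp):
    obs = observed[:, j][observed[:, j] >= 0]
    fill = np.bincount(obs, minlength=3).argmax()
    imputed[:, j] = np.where(observed[:, j] < 0, fill, observed[:, j])

concordance = (imputed[mask] == geno[mask]).mean()
efficacy = (imputed[mask] >= 0).mean()   # always 1 here; real programs may leave gaps
print(f"concordance on masked calls: {concordance:.3f}, efficacy: {efficacy:.3f}")
```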

Results

The single-step imputation approach produced slightly higher concordance rates than the two-step strategy (99.1% vs. 98.4% when using the HapMap phase II combined panel), but at the expense of a lower proportion of non-missing genotypes (85.5% vs. 90.1%). The 1000 Genomes reference sample produced concordance rates similar to the HapMap phase II panel (98.4% for both datasets, using the two-step strategy). However, the 1000 Genomes reference sample substantially increased the proportion of non-missing genotypes (94.7% vs. 90.1%). Rare variants (<1%) had lower imputation accuracy and efficacy than common markers.

Conclusions

The program IMPUTE had an excellent imputation performance for common alleles in an admixed sample from Mexico City, which has primarily Native American (62%) and European (33%) contributions. Genotype concordances were higher than 98.4% for all the imputation strategies, in spite of the fact that no Native American samples are present in the HapMap and 1000 Genomes reference panels. The best balance of imputation accuracy and efficacy was obtained with the 1000 Genomes panel. Rare variants were not captured effectively by any of the available panels, emphasizing the need for caution in the interpretation of association results for imputed rare variants.

12.
  1. Obtaining accurate estimates of disease prevalence is crucial for the monitoring and management of wildlife populations but can be difficult if different diagnostic tests yield conflicting results and if the accuracy of each diagnostic test is unknown. Bayesian latent class analysis (BLCA) modeling offers a potential solution, providing estimates of prevalence levels and diagnostic test accuracy under the realistic assumption that no diagnostic test is perfect.
  2. In typical applications of this approach, the specificity of one test is fixed at or close to 100%, allowing the model to simultaneously estimate the sensitivity and specificity of all other tests, in addition to infection prevalence. In wildlife systems, a test with near‐perfect specificity is not always available, so we simulated data to investigate how decreasing this fixed specificity value affects the accuracy of model estimates.
  3. We used simulations to explore how the trade‐off between diagnostic test specificity and sensitivity impacts prevalence estimates and found that directional biases depend on pathogen prevalence. Both the precision and accuracy of results depend on the sample size, the diagnostic tests used, and the true infection prevalence, so these factors should be considered when applying BLCA to estimate disease prevalence and diagnostic test accuracy in wildlife systems. A wildlife disease case study, focusing on leptospirosis in California sea lions, demonstrated the potential for Bayesian latent class methods to provide reliable estimates under real‐world conditions.
  4. We delineate conditions under which BLCA improves upon the results from a single diagnostic across a range of prevalence levels and sample sizes, demonstrating when this method is preferable for disease ecologists working in a wide variety of pathogen systems.

13.
A recurring methodological problem in the evaluation of the predictive validity of selection methods is that the values of the criterion variable are available for selected applicants only. This so-called range restriction problem causes biased population estimates. Correction methods for direct and indirect range restriction scenarios have been widely studied for continuous criterion variables but not for dichotomous ones. The few existing approaches are inapplicable because they do not consider the unknown base rate of success. Hence, there is a lack of scientific research on suitable correction methods and the systematic analysis of their accuracies in the cases of a naturally or artificially dichotomous criterion. We aim to overcome this deficiency by viewing the range restriction problem as a missing data mechanism. We used multiple imputation by chained equations to generate complete criterion data before estimating the predictive validity and the base rate of success. Monte Carlo simulations were conducted to investigate the accuracy of the proposed correction as a function of selection ratio, predictive validity, and base rate of success in an experimental design. In addition, we compared our proposed missing data approach with Thorndike's well-known correction formulas, which have so far been used only for continuous criterion variables. The results show that the missing data approach is more accurate in estimating the predictive validity than Thorndike's correction formulas. The accuracy of our proposed correction increases as the selection ratio and the correlation between predictor and criterion increase. Furthermore, the missing data approach provides a valid estimate of the unknown base rate of success. On the basis of our findings, we argue for the use of multiple imputation by chained equations in the evaluation of the predictive validity of selection methods when the criterion is dichotomous.
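A hedged sketch of the missing-data view of range restriction described above: under direct selection on the predictor, the dichotomous criterion is missing for rejected applicants, is imputed, and the predictive validity and base rate are then estimated on the completed data. scikit-learn's IterativeImputer is used here as a convenient stand-in for multiple imputation by chained equations; a faithful analysis would use a logistic imputation model for the binary criterion and pool estimates over several imputations.

```python
# Range restriction treated as a missing-data problem (illustrative sketch).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n, rho, selection_ratio = 5000, 0.5, 0.3
x = rng.normal(size=n)                                   # predictor (test score)
latent = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
y = (latent > 0).astype(float)                           # dichotomous criterion

selected = x >= np.quantile(x, 1 - selection_ratio)      # direct range restriction
y_obs = np.where(selected, y, np.nan)                    # criterion missing if rejected

data = np.column_stack([x, y_obs])
imp = IterativeImputer(random_state=0, max_iter=20)
y_imp = np.clip(imp.fit_transform(data)[:, 1], 0, 1)     # single imputation for brevity

print(f"restricted validity: {np.corrcoef(x[selected], y[selected])[0, 1]:.3f}")
print(f"imputed validity   : {np.corrcoef(x, y_imp)[0, 1]:.3f}")
print(f"imputed base rate  : {y_imp.mean():.3f}")
print(f"true validity      : {np.corrcoef(x, y)[0, 1]:.3f}")
```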

14.
The potential for imputed genotypes to enhance an analysis of genetic data depends largely on the accuracy of imputation, which in turn depends on properties of the reference panel of template haplotypes used to perform the imputation. To provide a basis for exploring how properties of the reference panel affect imputation accuracy theoretically rather than with computationally intensive imputation experiments, we introduce a coalescent model that considers imputation accuracy in terms of population-genetic parameters. Our model allows us to investigate sampling designs in the frequently occurring scenario in which imputation targets and templates are sampled from different populations. In particular, we derive expressions for expected imputation accuracy as a function of reference panel size and divergence time between the reference and target populations. We find that a modestly sized "internal" reference panel from the same population as a target haplotype yields, on average, greater imputation accuracy than a larger "external" panel from a different population, even if the divergence time between the two populations is small. The improvement in accuracy for the internal panel increases with increasing divergence time between the target and reference populations. Thus, in humans, our model predicts that imputation accuracy can be improved by generating small population-specific custom reference panels to augment existing collections such as those of the HapMap or 1000 Genomes Projects. Our approach can be extended to understand additional factors that affect imputation accuracy in complex population-genetic settings, and the results can ultimately facilitate improvements in imputation study designs.

15.

Background

The main goal of selection is to achieve genetic gain for a population by choosing the best breeders among a set of selection candidates. Since 2013, the use of a high density genotyping chip (600K Affymetrix® Axiom® HD genotyping array) for chicken has enabled the implementation of genomic selection in layer and broiler breeding, but the genotyping costs remain high for routine use on a large number of selection candidates. It was therefore of interest to develop a low density genotyping chip that would lower these costs. With this in mind, various simulation studies have been conducted to find the best way to select a set of SNPs for low density genotyping of two laying hen lines.

Results

To design low density SNP chips, two methodologies, based on equidistance (EQ) or on linkage disequilibrium (LD), were compared. Imputation accuracy was assessed as the mean correlation between true and imputed genotypes. The results showed that correlations were more sensitive to the false imputation of SNPs with a low minor allele frequency (MAF) when the EQ methodology was used. An increase in imputation accuracy was obtained when SNP density was increased, either through an increase in the number of selected windows on a chromosome or through a rise in the LD threshold. Moreover, the results varied depending on the type of chromosome (macro- or micro-chromosome). The LD methodology made it possible to optimize the number of SNPs by reducing the SNP density on macro-chromosomes and increasing it on micro-chromosomes. Imputation accuracy also increased when the size of the reference population was increased. Conversely, imputation accuracy decreased when the degree of kinship between the reference and candidate populations was reduced. Finally, adding the selection candidates' dams to the reference population, in addition to their sires, gave better imputation results.

Conclusions

Whatever the SNP chip, methodology, and scenario studied, highly accurate imputations were obtained, with mean correlations higher than 0.83. The key point for achieving good imputation results is to take the chicken lines' LD into account when designing a low density SNP chip, and to include the candidates' direct parents in the reference population.
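A sketch of the two selection strategies compared above: equidistant (EQ) selection by physical position versus a greedy LD-based pruning that keeps a SNP only while its r2 with recently kept SNPs stays below a threshold. Positions, genotypes, window size, and thresholds are illustrative assumptions, not the designs used in the study.

```python
# EQ vs. LD-based selection of a low-density SNP panel (illustrative sketch).
import numpy as np

rng = np.random.default_rng(5)
n_ind, n_snp = 500, 2000
pos = np.sort(rng.integers(0, 50_000_000, n_snp))         # bp positions
geno = rng.binomial(2, rng.uniform(0.05, 0.5, n_snp), size=(n_ind, n_snp))

def select_equidistant(pos, n_keep):
    """Keep the SNP closest to each of n_keep evenly spaced positions."""
    targets = np.linspace(pos[0], pos[-1], n_keep)
    return np.unique([np.abs(pos - t).argmin() for t in targets])

def select_ld(geno, r2_max=0.2, window=50):
    """Greedy pruning: keep a SNP only if r2 with recently kept SNPs stays low."""
    kept = []
    for j in range(geno.shape[1]):
        recent = kept[-window:]
        r2 = [np.corrcoef(geno[:, j], geno[:, k])[0, 1] ** 2 for k in recent]
        if not r2 or max(r2) < r2_max:
            kept.append(j)
    return np.array(kept)

eq_idx = select_equidistant(pos, 300)
ld_idx = select_ld(geno)
print(f"EQ panel: {eq_idx.size} SNPs, LD panel: {ld_idx.size} SNPs")
```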

16.
Human-disease etiology can be better understood with phase information about diploid sequences. We present a method for estimating haplotypes, using genotype data from unrelated samples or small nuclear families, that leads to improved accuracy and speed compared to several widely used methods. The method, segmented haplotype estimation and imputation tool (SHAPEIT), scales linearly with the number of haplotypes used in each iteration and can be run efficiently on whole chromosomes.

17.
The use of bird counts as indices has come under increasing scrutiny because assumptions concerning detection probabilities may not be met, but there also seems to be some resistance to the use of model-based approaches to estimating abundance. We used data from the United States Forest Service, Southern Region bird monitoring program to compare several common approaches for estimating annual abundance or indices and population trends from point-count data. We compared indices of abundance estimated as annual means of counts and from a mixed-Poisson model to abundance estimates from a count-removal model with 3 time intervals and a distance model with 3 distance bands. We compared trend estimates calculated from an autoregressive, exponential model fit to annual abundance estimates from the above methods and also by estimating trend directly by treating year as a continuous covariate in the mixed-Poisson model. We produced estimates for 6 forest songbirds based on an average of 621 and 459 points in 2 physiographic areas from 1997 to 2004. There was strong evidence that detection probabilities varied among species and years. Nevertheless, there was good overall agreement across trend estimates from the 5 methods for 9 of 12 comparisons. In 3 of 12 comparisons, however, patterns in detection probabilities potentially confounded interpretation of uncorrected counts. Estimates of detection probabilities differed greatly between removal and distance models, likely because the methods estimated different components of detection probability and the data collection was not optimally designed for either method. Given that detection probabilities often vary among species, years, and observers, investigators should address detection probability in their surveys, whether by estimating the probability of detection and abundance, estimating the effects of key covariates when modeling count as an index of abundance, or through design-based methods to standardize these effects.

18.
Modeling individual heterogeneity in capture probabilities has been one of the most challenging tasks in capture–recapture studies. Heterogeneity in capture probabilities can be modeled as a function of individual covariates, but the correlation structure among capture occasions should be taken into account. Generalized estimating equation (GEE) and generalized linear mixed model (GLMM) approaches are proposed to estimate capture probabilities and population size for closed-population capture–recapture models. An example is used as an illustrative application and for comparison with currently used methodology. A simulation study is also conducted to show the performance of the estimation procedures. Our simulation results show that the proposed quasi‐likelihood GEE approach provides lower standard errors than partial-likelihood approaches based on either generalized linear models (GLM) or GLMMs for estimating population size in a closed capture–recapture experiment. Estimator performance is good if a large proportion of individuals are captured. For cases where only a small proportion of individuals are captured, the estimates become unstable, but the GEE approach outperforms the other methods.
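A hedged sketch of a Horvitz–Thompson-type population-size estimate for a closed population with covariate-dependent capture probabilities. A plain logistic regression stands in for the GEE/GLMM machinery of the abstract, the conditional-likelihood correction used by Huggins-type estimators is omitted for brevity, and all data are simulated.

```python
# Closed-population capture-recapture: covariate-modeled capture probabilities
# feed a Horvitz-Thompson-style estimate of population size (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
N_true, T = 400, 5                                    # animals, capture occasions
body_size = rng.normal(size=N_true)                   # individual covariate
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * body_size)))       # per-occasion capture prob.
hist = rng.binomial(1, p[:, None], size=(N_true, T))  # capture histories
caught = hist.sum(axis=1) > 0                         # ever-captured animals

# Fit per-occasion capture probability on the captured animals only
# (a simplification; a conditional likelihood would correct for this).
X = np.repeat(body_size[caught], T).reshape(-1, 1)
y = hist[caught].ravel()
p_hat = LogisticRegression().fit(X, y).predict_proba(
    body_size[caught].reshape(-1, 1))[:, 1]

# Weight each captured animal by 1 / P(captured at least once in T occasions).
p_ever = 1 - (1 - p_hat) ** T
N_hat = (1 / p_ever).sum()
print(f"true N = {N_true}, captured = {caught.sum()}, estimated N = {N_hat:.0f}")
```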

19.

Key message

Imputing genotypes from the 90K SNP chip to exome sequence in wheat was moderately accurate. We investigated the factors that affect imputation and propose several strategies to improve accuracy.

Abstract

Imputing genetic marker genotypes from low to high density has been proposed as a cost-effective strategy to increase the power of downstream analyses (e.g. genome-wide association studies and genomic prediction) for a given budget. However, imputation is often imperfect and its accuracy depends on several factors. Here, we investigate the effects of reference population selection algorithms, marker density and imputation algorithms (Beagle4 and FImpute) on the accuracy of imputation from low SNP density (9K array) to the Infinium 90K single-nucleotide polymorphism (SNP) array for a collection of 837 hexaploid wheat Watkins landrace accessions. Based on these results, we then used the best performing reference selection and imputation algorithms to investigate imputation from 90K to exome sequence for a collection of 246 globally diverse wheat accessions. Accession-to-nearest-entry and genomic relationship-based methods were the best-performing selection algorithms, and FImpute resulted in higher accuracy and was more efficient than Beagle4. The accuracy of imputing exome capture SNPs was comparable to that of imputing from 9K to 90K, at approximately 0.71. This relatively low imputation accuracy is in part due to inconsistency between the 90K and exome sequence formats. We also found that the accuracy of imputation could be substantially improved, to 0.82, when choosing an equivalent number of exome SNPs, instead of the 90K SNPs on the existing array, as the lower density set. We present a number of recommendations to increase the accuracy of exome imputation.

20.
Browning SR. Human Genetics 2008, 124(5):439-450
Imputation of missing data and the use of haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype inference and missing data imputation, and discuss their application to GWAS. I discuss common features of the best algorithms for haplotype phase inference and missing data imputation in large-scale data sets, as well as some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance.
