Similar literature
20 similar articles found.
1.

Background

Imputation of genotypes for ungenotyped individuals could enable the use of valuable phenotypes created before the genomic era in analyses that require genotypes. The objective of this study was to investigate the accuracy of imputation of non-genotyped individuals using genotype information from relatives.

Methods

Genotypes were simulated for all individuals in the pedigree of a real (historical) dataset of phenotyped dairy cows, in which only part of the pedigree was genotyped. The software AlphaImpute was used for imputation in its standard settings, but also without phasing, i.e. using basic inheritance rules and segregation analysis only. Three scenarios were evaluated: (1) the real data scenario, (2) addition of genotypes of sires and maternal grandsires of the ungenotyped individuals, and (3) addition of one, two, or four genotyped offspring of the ungenotyped individuals to the reference population.

Results

The imputation accuracy using AlphaImpute in its standard settings was lower than without phasing. Including genotypes of sires and maternal grandsires in the reference population improved imputation accuracy, i.e. the correlation of the true genotypes with the imputed genotype dosages, corrected for mean gene content, across all animals increased from 0.47 (real situation) to 0.60. Including one, two and four genotyped offspring increased the accuracy of imputation across all animals from 0.57 (no offspring) to 0.73, 0.82, and 0.92, respectively.

Conclusions

At present, the use of basic inheritance rules and segregation analysis appears to be the best imputation method for ungenotyped individuals. Comparison of our empirical animal-specific imputation accuracies to predictions based on selection index theory suggested that not correcting for mean gene content considerably overestimates the true accuracy. Imputation of ungenotyped individuals can help to include valuable phenotypes for genome-wide association studies or for genomic prediction, especially when the ungenotyped individuals have genotyped offspring.
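
As an illustration of the "basic inheritance rules" mentioned above, the sketch below (hypothetical code, not AlphaImpute itself) shows how the genotype of an ungenotyped parent at one biallelic SNP is constrained by its genotyped offspring and their other parents: each offspring restricts which alleles the missing parent could have transmitted, and intersecting these constraints narrows down, or fixes, the parental genotype.

```python
# Minimal sketch of Mendelian-rule imputation for one biallelic SNP.
# Genotypes are coded 0/1/2 (count of the alternative allele); None = missing.
# Illustration only: AlphaImpute combines such rules with full segregation
# analysis over the whole pedigree.

def transmissible(genotype):
    """Alleles (0 or 1) a parent with this genotype can pass to an offspring."""
    if genotype is None:
        return {0, 1}            # unknown parent: either allele is possible
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

def possible_parent_genotypes(offspring, mates):
    """Genotypes of an ungenotyped parent consistent with its offspring.

    offspring : list of offspring genotypes (0/1/2)
    mates     : list of the other parent's genotype for each offspring
    """
    compatible = set()
    for g in (0, 1, 2):
        ok = all(
            any(a + b == o for a in transmissible(g) for b in transmissible(m))
            for o, m in zip(offspring, mates)
        )
        if ok:
            compatible.add(g)
    return compatible

def naive_dosage(offspring, mates):
    """Average over compatible genotypes (flat prior) as a crude dosage."""
    g = possible_parent_genotypes(offspring, mates)
    return sum(g) / len(g) if g else None

# Example: offspring genotyped 0 and 2, with mates genotyped 1 and 2.
# Offspring 0 forces transmission of allele 0; offspring 2 (with a 2 mate)
# forces transmission of allele 1, so the parent must be heterozygous.
print(possible_parent_genotypes([0, 2], [1, 2]))   # {1}
print(naive_dosage([0, 2], [1, 2]))                # 1.0
```

Real segregation analysis replaces the flat prior used in naive_dosage with genotype probabilities propagated through the whole pedigree.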

2.
Although genomic selection offers the prospect of improving the rate of genetic gain in meat, wool and dairy sheep breeding programs, the key constraint is likely to be the cost of genotyping. Potentially, this constraint can be overcome by genotyping selection candidates for a low density (low cost) panel of SNPs with sparse genotype coverage, imputing a much higher density of SNP genotypes using a densely genotyped reference population. These imputed genotypes would then be used with a prediction equation to produce genomic estimated breeding values. In the future, it may also be desirable to impute very dense marker genotypes or even whole genome re-sequence data from moderate density SNP panels. Such a strategy could lead to an accurate prediction of genomic estimated breeding values across breeds, for example. We used genotypes from 48 640 (50K) SNPs genotyped in four sheep breeds to investigate both the accuracy of imputation of the 50K SNPs from low density SNP panels, as well as prospects for imputing very dense or whole genome re-sequence data from the 50K SNPs (by leaving out a small number of the 50K SNPs at random). Accuracy of imputation was low if the sparse panel had less than 5000 (5K) markers. Across breeds, it was clear that the accuracy of imputing from sparse marker panels to 50K was higher if the genetic diversity within a breed was lower, such that relationships among animals in that breed were higher. The accuracy of imputation from sparse genotypes to 50K genotypes was higher when the imputation was performed within breed rather than when pooling all the data, despite the fact that the pooled reference set was much larger. For Border Leicesters, Poll Dorsets and White Suffolks, 5K sparse genotypes were sufficient to impute 50K with 80% accuracy. For Merinos, the accuracy of imputing 50K from 5K was lower at 71%, despite a large number of animals with full genotypes (2215) being used as a reference. For all breeds, the relationship of individuals to the reference explained up to 64% of the variation in accuracy of imputation, demonstrating that accuracy of imputation can be increased if sires and other ancestors of the individuals to be imputed are included in the reference population. The accuracy of imputation could also be increased if pedigree information was available and was used in tracking inheritance of large chromosome segments within families. In our study, we only considered methods of imputation based on population-wide linkage disequilibrium (largely because the pedigree for some of the populations was incomplete). Finally, in the scenarios designed to mimic imputation of high density or whole genome re-sequence data from the 50K panel, the accuracy of imputation was much higher (86-96%). This is promising, suggesting that in silico genome re-sequencing is possible in sheep if a suitable pool of key ancestors is sequenced for each breed.

3.
Li B, Leal SM. Human Heredity 2008;65(4):199-208
Missing genotype data can increase false-positive evidence for linkage when either parametric or nonparametric analysis is carried out ignoring intermarker linkage disequilibrium (LD). Huang et al. [1] previously demonstrated that no bias occurs in this situation for affected sib-pairs with unrelated parents when either both parents are genotyped or genotype data are available for two additional unaffected siblings when parental genotypes are missing. However, this is not the case for autosomal recessive consanguineous pedigrees, where missing genotype data for any pedigree member within a consanguinity loop can increase false-positive evidence of linkage. False-positive evidence for linkage is further increased when cryptic consanguinity is present. The amount of false-positive evidence for linkage, and which family members aid in its reduction, is highly dependent on which family members are genotyped. When parental genotype data are available, the false-positive evidence for linkage is usually not as strong as when parental genotype data are unavailable. For a pedigree with an affected proband whose first-cousin parents have been genotyped, further reduction in the false-positive evidence of linkage can be obtained by including genotype data from additional affected siblings of the proband or genotype data from the proband's sibling-grandparents. When parental genotypes are unavailable, false-positive evidence for linkage can be reduced by including genotype data from either unaffected siblings of the proband or the proband's married-in-grandparents in the analysis.

4.

Background

Genomic evaluations in Holstein dairy cattle have quickly become more reliable over the last two years in many countries as more animals have been genotyped for 50,000 markers. Evaluations can also include animals genotyped with more or fewer markers using new tools such as the 777,000 or 2,900 marker chips recently introduced for cattle. Gains from more markers can be predicted using simulation, whereas strategies to use fewer markers have been compared using subsets of actual genotypes. The overall cost of selection is reduced by genotyping most animals at less than the highest density and imputing their missing genotypes using haplotypes. Algorithms to combine different densities need to be efficient because numbers of genotyped animals and markers may continue to grow quickly.

Methods

Genotypes for 500,000 markers were simulated for the 33,414 Holsteins that had 50,000 marker genotypes in the North American database. Another 86,465 non-genotyped ancestors were included in the pedigree file, and linkage disequilibrium was generated directly in the base population. Mixed density datasets were created by keeping 50,000 (every tenth) of the markers for most animals. Missing genotypes were imputed using a combination of population haplotyping and pedigree haplotyping. Reliabilities of genomic evaluations using linear and nonlinear methods were compared.

Results

Differing marker sets for a large population were combined with just a few hours of computation. About 95% of paternal alleles were determined correctly, and > 95% of missing genotypes were called correctly. Reliability of breeding values was already high (84.4%) with 50,000 simulated markers. The gain in reliability from increasing the number of markers to 500,000 was only 1.6%, but more than half of that gain resulted from genotyping just 1,406 young bulls at higher density. Linear genomic evaluations had reliabilities 1.5% lower than the nonlinear evaluations with 50,000 markers and 1.6% lower with 500,000 markers.

Conclusions

Methods to impute genotypes and compute genomic evaluations were affordable with many more markers. Reliabilities for individual animals can be modified to reflect the success of imputation. Breeders can improve reliability at lower cost by combining marker densities to increase both the numbers of markers and animals included in genomic evaluation. Larger gains are expected from increasing the number of animals than from increasing the number of markers.

5.

Background

Genotyping accounts for a substantial part of the cost of genomic selection (GS). Using both dense and sparse SNP chips, together with imputation of missing genotypes, can reduce these costs. The aim of this study was to identify the set of candidates that are most important for dense genotyping when they are used to impute the genotypes of sparsely genotyped animals. In a real pig pedigree, the 2500 most recently born pigs of the last generation, i.e. the target animals, were used for sparse genotyping. Their missing genotypes were imputed using either Beagle or LDMIP from T densely genotyped candidates chosen from the whole pedigree. A new optimization method was derived to identify the best animals for dense genotyping, which minimized the conditional genetic variance of the target animals, using either the pedigree-based relationship matrix (MCA) or a genotypic relationship matrix based on sparse marker genotypes (MCG). These and five other methods for selecting the T animals were compared for T = 100 or 200 animals, with SNP genotypes obtained assuming Ne = 100 or 200 and MAF thresholds set to D = 0.01, 0.05 or 0.10. The performances of the methods were compared using the following criteria: call rate of true genotypes, accuracy of genotype prediction, and accuracy of genomic evaluations using the imputed genotypes.

Results

For all criteria, MCA and MCG performed better than other selection methods, significantly so for all methods other than selection of sires with the largest numbers of offspring. Methods that choose animals that have the closest average relationship or contribution to the target population gave the lowest accuracy of imputation, in some cases worse than random selection, and should be avoided in practice.

Conclusion

Minimization of the conditional variance of the genotypes in target animals provided an effective optimization procedure for prioritizing animals for genotyping or sequencing.

Electronic supplementary material

The online version of this article (doi:10.1186/1297-9686-46-46) contains supplementary material, which is available to authorized users.
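
A minimal sketch of the optimisation idea described in this abstract, under the assumption that it amounts to minimising the conditional (co)variance A_TT - A_TS A_SS^{-1} A_ST of the target animals' genotypes given the selected set; the function names and the greedy search below are illustrative, not the authors' implementation.

```python
import numpy as np

def conditional_trace(A, targets, selected):
    """Trace of Var(targets | selected) under a multivariate-normal analogy:
    A_TT - A_TS A_SS^{-1} A_ST, with A a pedigree (MCA) or sparse-genomic (MCG)
    relationship matrix."""
    T = np.ix_(targets, targets)
    if not selected:
        return np.trace(A[T])
    S = list(selected)
    A_TS = A[np.ix_(targets, S)]
    A_SS = A[np.ix_(S, S)]
    return np.trace(A[T] - A_TS @ np.linalg.solve(A_SS, A_TS.T))

def greedy_select(A, targets, candidates, n_select):
    """Greedily add the candidate whose genotyping most reduces the conditional
    variance of the target animals (a simple stand-in for the MCA/MCG search)."""
    selected = []
    for _ in range(n_select):
        best = min(
            (c for c in candidates if c not in selected),
            key=lambda c: conditional_trace(A, targets, selected + [c]),
        )
        selected.append(best)
    return selected
```

Because A_SS grows with every animal added, a practical implementation would update its inverse incrementally rather than re-solving the system at each greedy step, as this sketch does.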

6.
Theo Meuwissen, Mike Goddard. Genetics 2010;185(4):1441-1449
A novel method, called linkage disequilibrium multilocus iterative peeling (LDMIP), for the imputation of phase and missing genotypes is developed. LDMIP performs an iterative peeling step for every locus, which accounts for the family data, and uses a forward-backward algorithm to accumulate information across loci. Marker similarity between haplotype pairs is used to impute possible missing genotypes and phases, which relies on the linkage disequilibrium between closely linked markers. After this imputation step, the combined iterative peeling/forward-backward algorithm is applied again, until convergence. The calculations per iteration scale linearly with the number of markers and the number of individuals in the pedigree, which makes LDMIP well suited to large numbers of markers and/or large numbers of individuals. Per-iteration calculations scale quadratically with the number of alleles, which implies biallelic markers are preferred. In a situation with up to 15% randomly missing genotypes, the error rate of the imputed genotypes was <1% and ~99% of the missing genotypes were imputed. In another example, LDMIP was used to impute whole-genome sequence data consisting of 17,321 SNPs on a chromosome. Imputation of the sequence was based on the information of 20 (re)sequenced founder individuals and genotyping their descendants for a panel of 3000 SNPs. The error rate of the imputed SNP genotypes was 10%. However, if the parents of these 20 founders are also sequenced, >99% of missing genotypes are imputed correctly.

High-density SNP arrays are currently available for an increasing number of species. QTL mapping, marker-assisted selection (MAS), and other genetic analyses often require or benefit greatly from imputing missing genotypes and from knowing the phase of the SNP genotypes. Although many statistical methods have been developed for phasing in the literature, new methods are needed because of the spectacular improvements in the efficiency of high-throughput genotyping. The older phasing methods often use linkage information (e.g., Sobel and Lange 1996). However, due to the use of increasingly dense marker maps, the use of linkage disequilibrium (LD) information has become increasingly attractive. Moreover, the linkage analysis methods became computationally intractable as the number of SNPs and/or the number of individuals increased. Phasing methods that rely solely on LD, such as Fastphase (Scheet and Stephens 2006), tend to mistakenly introduce recombinations when applied to genotypes covering long genetic distances (Kong et al. 2008). Linkage information and use of LD are not fundamentally different: use of LD may be thought of as linkage analysis based on common ancestors that occur before the known pedigree. Kong et al. suggested a new approach called "long-range phasing (LRP)" that relies on detecting identical-by-descent (IBD) haplotypes in different individuals and can phase large numbers of SNPs. For situations where the pedigree back to the common ancestor is available, this may be considered linkage analysis, but it can also be used without a pedigree and would use LD information. Long-range phasing may be seen as a set of sensible, heuristic rules to determine the phase using linkage analysis information, without attempting to extract all information in an optimal way. The latter is typical for modern linkage-based phasing methods: they are less concerned about optimally using all information, since there is a surplus of information, and are more concerned with handling high SNP densities over large genetic distances and for many genotyped individuals. The surplus of information is especially large if whole-genome sequence data are used. We define here whole-genome (re)sequence data as all the SNPs in the genome, which ignores information from copy number variation and other non-SNP genetic polymorphisms.

Here we describe a new phasing method, called linkage disequilibrium multilocus iterative peeling (LDMIP), which combines linkage and linkage disequilibrium information and can handle tens of thousands of SNPs per chromosome and thousands of individuals. It was initially developed for the common situation where many individuals at the top of the pedigree are ungenotyped, but the method is general and can be applied in other situations. For instance, we apply it here to the situation where a few individuals have very dense genetic information (e.g., from genome sequencing) while most individuals have sparser or no genotype data. To make optimum use of the known pedigree, family information is used quite extensively in an iterative peeling approach (Elston and Stewart 1971; Van Arendonk et al. 1989; Janss et al. 1995). The use of LD information crudely follows the approach of Meuwissen and Goddard (2001, 2007).
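
The forward-backward step that LDMIP uses to accumulate information across loci is standard hidden-Markov machinery. The generic sketch below shows that step only, with the hidden states left abstract (for LDMIP they would be phased genotype configurations) and the per-locus emission terms assumed to come from the single-locus iterative peeling step; it is not the LDMIP code itself.

```python
import numpy as np

def forward_backward(emission, transition, prior):
    """Generic forward-backward pass along a chromosome.

    emission   : (n_loci, n_states) per-locus state likelihoods
                 (in LDMIP these would come from single-locus iterative peeling)
    transition : (n_states, n_states) probability of moving between hidden
                 states at adjacent loci (a function of recombination rate)
    prior      : (n_states,) state distribution at the first locus
    Returns per-locus posterior state probabilities.
    """
    n_loci, n_states = emission.shape
    fwd = np.zeros((n_loci, n_states))
    bwd = np.zeros((n_loci, n_states))

    # Forward recursion: accumulate evidence from the left end of the chromosome.
    fwd[0] = prior * emission[0]
    fwd[0] /= fwd[0].sum()
    for t in range(1, n_loci):
        fwd[t] = (fwd[t - 1] @ transition) * emission[t]
        fwd[t] /= fwd[t].sum()

    # Backward recursion: accumulate evidence from the right end.
    bwd[-1] = 1.0
    for t in range(n_loci - 2, -1, -1):
        bwd[t] = transition @ (emission[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()

    posterior = fwd * bwd
    return posterior / posterior.sum(axis=1, keepdims=True)
```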

7.
Family-based association tests for genomewide association scans
With millions of single-nucleotide polymorphisms (SNPs) identified and characterized, genomewide association studies have begun to identify susceptibility genes for complex traits and diseases. These studies involve the characterization and analysis of very-high-resolution SNP genotype data for hundreds or thousands of individuals. We describe a computationally efficient approach to testing association between SNPs and quantitative phenotypes, which can be applied to whole-genome association scans. In addition to observed genotypes, our approach allows estimation of missing genotypes, resulting in substantial increases in power when genotyping resources are limited. We estimate missing genotypes probabilistically using the Lander-Green or Elston-Stewart algorithms and combine high-resolution SNP genotypes for a subset of individuals in each pedigree with sparser marker data for the remaining individuals. We show that power is increased whenever phenotype information for ungenotyped individuals is included in analyses and that high-density genotyping of just three carefully selected individuals in a nuclear family can recover >90% of the information available if every individual were genotyped, for a fraction of the cost and experimental effort. To aid in study design, we evaluate the power of strategies that genotype different subsets of individuals in each pedigree and make recommendations about which individuals should be genotyped at a high density. To illustrate our method, we performed genomewide association analysis for 27 gene-expression phenotypes in 3-generation families (Centre d'Etude du Polymorphisme Humain pedigrees), in which genotypes for ~860,000 SNPs in 90 grandparents and parents are complemented by genotypes for ~6,700 SNPs in a total of 168 individuals. In addition to increasing the evidence of association at 15 previously identified cis-acting associated alleles, our genotype-inference algorithm allowed us to identify associated alleles at 4 cis-acting loci that were missed when analysis was restricted to individuals with the high-density SNP data. Our genotype-inference algorithm and the proposed association tests are implemented in software that is available for free.

8.
The dog is a valuable model species for the genetic analysis of complex traits, and the use of genotype imputation in dogs will be an important tool for future studies. It is of particular interest to analyse the effect of factors like single nucleotide polymorphism (SNP) density of genotyping arrays and relatedness between dogs on imputation accuracy due to the acknowledged genetic and pedigree structure of dog breeds. In this study, we simulated different genotyping strategies based on data from 1179 Labrador Retriever dogs. The study involved 5826 SNPs on chromosome 1 representing the high density (HighD) array; the low-density (LowD) array was simulated by masking different proportions of SNPs on the HighD array. The correlations between true and imputed genotypes for a realistic masking level of 87.5% ranged from 0.92 to 0.97, depending on the scenario used. A correlation of 0.92 was found for a likely scenario (10% of dogs genotyped using HighD, 87.5% of HighD SNPs masked in the LowD array), which indicates that genotype imputation in Labrador Retrievers can be a valuable tool to reduce experimental costs while increasing sample size. Furthermore, we show that genotype imputation can be performed successfully even without pedigree information and with low relatedness between dogs in the reference and validation sets. Based on these results, the impact of genotype imputation was evaluated in a genome-wide association analysis and genomic prediction in Labrador Retrievers.

9.

Background

Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation.

Methods

An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation, and segregation analysis.

Results

Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored.

Conclusions

The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations.

10.
The objective of this study was to quantify the accuracy achievable from imputing genotypes from a commercially available low-density marker panel (2730 single nucleotide polymorphisms (SNPs) following edits) to a commercially available higher density marker panel (51 602 SNPs following edits) in Holstein-Friesian cattle using Beagle, a freely available software package. A population of 764 Holstein-Friesian animals born since 2006 was used as the test group to quantify the accuracy of imputation, all of which had genotypes for the high-density panel; only SNPs on the low-density panel were retained, with the remaining SNPs to be imputed. The reference population for imputation consisted of 4732 animals born before 2006, also with genotypes on the higher density marker panel. The concordance between the actual and imputed genotypes in the test group of animals did not vary across chromosomes and was on average 95%; the concordance between actual and imputed alleles was, on average, 97% across all SNPs. Genomic predictions were undertaken across a range of production and functional traits for the 764 test group animals using either their real or imputed genotypes. Little or no mean difference in the genomic predictions was evident when comparing direct genomic values (DGVs) using real or imputed genotypes. The average correlation between the DGVs estimated using the real or imputed genotypes for the 15 traits included in the Irish total merit index was 0.97 (range of 0.92 to 0.99), indicating good concordance between proofs from real or imputed genotypes. Results show that a commercially available high-density marker panel can be imputed from a commercially available lower density marker panel, which will also have a lower cost, thereby facilitating a reduction in the cost of genomic selection. Increased available numbers of genotyped and phenotyped animals also have implications for increasing the accuracy of genomic prediction in the entire population and thus genetic gain using genomic selection.
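
The distinction drawn above between genotype concordance (95%) and allele concordance (97%) can be made concrete with a small sketch (illustrative code assuming 0/1/2-coded genotypes, not the software used in the study): a heterozygote called as a homozygote is one wrong genotype but only one wrong allele out of two.

```python
import numpy as np

def genotype_concordance(true_g, imputed_g):
    """Fraction of genotypes (0/1/2 coded) imputed exactly right."""
    true_g, imputed_g = np.asarray(true_g), np.asarray(imputed_g)
    return np.mean(true_g == imputed_g)

def allele_concordance(true_g, imputed_g):
    """Fraction of alleles imputed right: a 1-vs-2 miscall gets half credit
    because only one of the two alleles at that genotype is wrong."""
    true_g, imputed_g = np.asarray(true_g), np.asarray(imputed_g)
    wrong_alleles = np.abs(true_g - imputed_g)      # 0, 1 or 2 per genotype
    return 1.0 - wrong_alleles.mean() / 2.0

true_g    = np.array([0, 1, 2, 2, 1, 0])
imputed_g = np.array([0, 1, 1, 2, 1, 2])
print(genotype_concordance(true_g, imputed_g))  # 4/6 = 0.67
print(allele_concordance(true_g, imputed_g))    # 1 - (1 + 2)/(2*6) = 0.75
```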

11.
Imputation of missing genotypes, in particular from low density to high density, is an important issue in genomic selection and genome-wide association studies. Given the marker densities, the most important factors affecting imputation accuracy are the size of the reference population and the relationship between individuals in the reference (genotyped with the high-density panel) and study (genotyped with the low-density panel) populations. In this study, we investigated the imputation accuracies when the reference population (genotyped with the Illumina BovineSNP50 SNP panel) contained sires, halfsibs, or both sires and halfsibs of the individuals in the study population (genotyped with the Illumina BovineLD SNP panel), using three imputation programs (FImpute v2.2, findhap v2 and Beagle v3.3.2). Two criteria, correlation between true and imputed genotypes and missing rate after imputation, were used to evaluate the performance of the three programs in different scenarios. Our results showed that FImpute performed the best in all cases, with correlations from 0.921 to 0.978 when imputing from sires to their daughters or between halfsibs. In general, the accuracies of imputing between halfsibs or from sires to their daughters were higher than those of imputing between non-halfsibs or from sires to non-daughters. Including both sires and halfsibs in the reference population did not improve imputation performance compared with including only halfsibs in the reference population, for all three programs.

12.
The aim of this study was to evaluate the impact of genotype imputation on the performance of the GBLUP and Bayesian methods for genomic prediction. A total of 10,309 Holstein bulls were genotyped on the BovineSNP50 BeadChip (50 k). Five low-density single nucleotide polymorphism (SNP) panels, containing 6,177, 2,480, 1,536, 768 and 384 SNPs, were simulated from the 50 k panel. A fraction of 0%, 33% or 66% of the animals was randomly selected from the training sets to have low-density genotypes, which were then imputed to 50 k genotypes. A GBLUP and a Bayesian method were used to predict direct genomic values (DGV) for validation animals using either imputed or actual 50 k genotypes. Traits studied included milk yield, fat percentage, protein percentage and somatic cell score (SCS). Results showed that the performance of both GBLUP and Bayesian methods was influenced by imputation errors. For traits affected by a few large QTL, the Bayesian method resulted in greater reductions of accuracy due to imputation errors than GBLUP. Including SNPs with the largest effects in the low-density panel substantially improved the accuracy of genomic prediction for the Bayesian method. Including genotypes imputed from the 6 k panel achieved almost the same accuracy of genomic prediction as using the 50 k panel, even when 66% of the training population was genotyped on the 6 k panel. These results justified the application of the 6 k panel for genomic prediction. Imputation from lower density panels was more prone to errors and resulted in lower accuracy of genomic prediction. However, for animals that have a close relationship to the reference set, genotype imputation may still achieve relatively high accuracy.
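
For readers unfamiliar with GBLUP, the sketch below shows one common formulation (VanRaden's first genomic relationship matrix and a single overall mean); the heritability value, treatment of the mean and function names are assumptions for illustration, not taken from this study. Imputed allele dosages can be supplied in place of observed 0/1/2 genotypes.

```python
import numpy as np

def vanraden_grm(M):
    """Genomic relationship matrix from an (n_animals x n_snps) 0/1/2 matrix,
    following VanRaden's first method: centre by 2p, scale by 2*sum(p(1-p))."""
    p = M.mean(axis=0) / 2.0
    Z = M - 2.0 * p
    return (Z @ Z.T) / (2.0 * np.sum(p * (1.0 - p)))

def gblup_dgv(M_train, y_train, M_valid, h2=0.3):
    """Direct genomic values for validation animals from training phenotypes.
    Imputed dosages (non-integer values) are valid input for M. The simple
    phenotype mean stands in for the fixed-effect solution in this sketch."""
    n_train = M_train.shape[0]
    G = vanraden_grm(np.vstack([M_train, M_valid]))
    G_tt = G[:n_train, :n_train]
    G_vt = G[n_train:, :n_train]
    lam = (1.0 - h2) / h2                 # residual-to-genetic variance ratio
    y_c = y_train - y_train.mean()
    return G_vt @ np.linalg.solve(G_tt + lam * np.eye(n_train), y_c)
```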

13.
Joint genomic prediction (GP) is an attractive method to improve the accuracy of GP by combining information from multiple populations. However, many factors can negatively influence the accuracy of joint GP, such as differences in linkage disequilibrium phasing between single nucleotide polymorphisms (SNPs) and causal variants, minor allele frequencies and causal variants' effect sizes across different populations. The objective of this study was to investigate whether imputed high-density genotype data can improve the accuracy of joint GP using genomic best linear unbiased prediction (GBLUP), single-step GBLUP (ssGBLUP), multi-trait GBLUP (MT-GBLUP) and GBLUP based on a genomic relationship matrix that accounts for heterogeneous minor allele frequencies across populations (wGBLUP). Three traits, i.e. days taken to reach slaughter weight, backfat thickness and loin muscle area, were measured on 67 276 Large White pigs from two different populations, of which 3334 were genotyped by SNP array. The results showed that a combined population could substantially improve the accuracy of GP compared with single-population GP, especially for the population with a smaller size. The imputed SNP data had no effect for single-population GP but yielded higher accuracy than the medium-density array data for joint GP. Of the four methods, ssGBLUP performed the best, but the advantage of ssGBLUP decreased as more individuals were genotyped. In some cases, MT-GBLUP and wGBLUP performed better than GBLUP. In conclusion, our results confirmed that joint GP can benefit from imputed high-density genotype data, and the wGBLUP and MT-GBLUP methods are promising for joint GP in pig breeding.
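
Single-step GBLUP (ssGBLUP) combines pedigree and genomic relationships in one matrix. The textbook construction of its inverse is sketched below (a generic formula, not the software used in this study): the genomic information simply modifies the block of the pedigree inverse that corresponds to the genotyped animals.

```python
import numpy as np

def h_inverse(A_inv, A22_inv, G_inv, genotyped_idx):
    """Combined pedigree-genomic relationship inverse used in single-step GBLUP:
    H^{-1} = A^{-1} with (G^{-1} - A22^{-1}) added to the genotyped block.

    A_inv        : inverse of the full pedigree relationship matrix (n x n)
    A22_inv      : inverse of the pedigree relationships among genotyped animals
    G_inv        : inverse of the genomic relationship matrix
                   (same animals, same order as A22_inv)
    genotyped_idx: positions of the genotyped animals within the full matrix
    """
    H_inv = A_inv.copy()
    block = np.ix_(genotyped_idx, genotyped_idx)
    H_inv[block] += G_inv - A22_inv
    return H_inv
```

In practice G is usually blended with a small proportion of A22 (e.g. 0.95*G + 0.05*A22) before inversion to keep it well conditioned.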

14.
Lee SH, Van der Werf JH, Tier B. Genetics 2005;171(4):2063-2072
A linkage analysis for finding inheritance states and haplotype configurations is an essential process for linkage and association mapping. The linkage analysis is routinely based upon observed pedigree information and marker genotypes for individuals in the pedigree. It is not feasible for exact methods to use all such information for a large complex pedigree, especially when many genotypes are missing. Proposed Markov chain Monte Carlo approaches such as a single-site Gibbs sampler or the meiosis Gibbs sampler are able to handle a complex pedigree with sparse genotypic data; however, they often have reducibility problems, causing biased estimates. We present a combined method that applies the random walk approach to the reducible sites in the meiosis sampler. Therefore, one can efficiently obtain reliable estimates such as identity-by-descent coefficients between individuals based on inheritance states or haplotype configurations, and a wider range of data can be used for mapping of quantitative trait loci within a reasonable time.

15.
Druet T, Farnir FP. Genetics 2011;188(2):409-419
Identity-by-descent probabilities are important for many applications in genetics. Here we propose a method for modeling the transmission of the haplotypes from the closest genotyped relatives along an entire chromosome. The method relies on a hidden Markov model where hidden states correspond to the set of all possible origins of a haplotype within a given pedigree. Initial state probabilities are estimated from the average genetic contribution of each origin to the modeled haplotype, while transition probabilities are computed from recombination probabilities and pedigree relationships between the modeled haplotype and the various possible origins. The method was tested on three simulated scenarios based on real data sets from dairy cattle, Arabidopsis thaliana, and maize. The mean identity-by-descent probabilities estimated for the truly inherited parental chromosome ranged from 0.94 to 0.98 according to the design and the marker density. The lowest values were observed in regions close to crossing-over events or where the method was not able to discriminate between several origins due to their similarity. It is shown that the estimated probabilities were correctly calibrated. For marker imputation (or QTL allele prediction for fine mapping or genomic selection), the method was efficient, with 3.75% allelic imputation error rates on a dairy cattle data set with a low marker density map (1 SNP/Mb). The method should prove useful for situations now encountered in experimental designs and in plant and animal breeding, where founders are genotyped at relatively high marker densities and the last generation(s) are genotyped with a lower-density panel.
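
The transition probabilities "computed from recombination probabilities" can be illustrated with a toy parameterisation (my simplification: switches between origins are taken as equally likely, whereas the paper also weights them by pedigree relationships). The recombination fraction between adjacent markers is derived from their genetic distance via Haldane's map function.

```python
import numpy as np

def haldane_recombination(distance_morgan):
    """Recombination fraction between two markers from their genetic distance."""
    return 0.5 * (1.0 - np.exp(-2.0 * distance_morgan))

def origin_transition_matrix(n_origins, distance_morgan):
    """Transition matrix between possible ancestral origins of a haplotype:
    with probability (1 - r) the origin is kept, with probability r it switches
    to one of the other origins (uniformly here, as a simplification)."""
    r = haldane_recombination(distance_morgan)
    T = np.full((n_origins, n_origins), r / (n_origins - 1))
    np.fill_diagonal(T, 1.0 - r)
    return T

# e.g. 4 possible grandparental origins, markers 1 cM apart (0.01 Morgan)
print(origin_transition_matrix(4, 0.01))
```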

16.

Background

Genotype imputation can help reduce genotyping costs particularly for implementation of genomic selection. In applications entailing large populations, recovering the genotypes of untyped loci using information from reference individuals that were genotyped with a higher density panel is computationally challenging. Popular imputation methods are based upon the Hidden Markov model and have computational constraints due to an intensive sampling process. A fast, deterministic approach, which makes use of both family and population information, is presented here. All individuals are related and, therefore, share haplotypes which may differ in length and frequency based on their relationships. The method starts with family imputation if pedigree information is available, and then exploits close relationships by searching for long haplotype matches in the reference group using overlapping sliding windows. The search continues as the window size is shrunk in each chromosome sweep in order to capture more distant relationships.

Results

The proposed method gave imputation accuracy that was higher than or similar to that of Beagle and Impute2 in cattle data sets when all available information was used. When close relatives of target individuals were present in the reference group, the method resulted in higher accuracy compared to the other two methods even when the pedigree was not used. Rare variants were also imputed with higher accuracy. Finally, computing requirements were considerably lower than those of Beagle and Impute2. The presented method took 28 minutes to impute from 6 k to 50 k genotypes for 2,000 individuals with a reference size of 64,429 individuals.

Conclusions

The proposed method efficiently makes use of information from close and distant relatives for accurate genotype imputation. In addition to its high imputation accuracy, the method is fast, owing to its deterministic nature and, therefore, it can easily be used in large data sets where the use of other methods is impractical.
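
A toy version of the overlapping sliding-window haplotype search described in the Background above (an illustration of the idea only, not the published algorithm): within each window the target's typed alleles are matched against reference haplotypes, the first fully consistent haplotype donates the missing alleles, and later sweeps would repeat the search with smaller windows to capture shorter segments shared with more distant relatives.

```python
import numpy as np

MISSING = -1

def fill_from_reference(target, ref_haps, window, step=None):
    """One sweep of a sliding-window haplotype match (toy illustration).

    target   : 1D array of alleles (0/1) with MISSING at untyped positions
    ref_haps : (n_ref, n_loci) fully typed reference haplotypes
    window   : window length in markers; later sweeps would use smaller windows
    """
    target = target.copy()
    n = len(target)
    step = step or window // 2                 # overlapping windows
    for start in range(0, n, step):
        end = min(start + window, n)
        seg = target[start:end]                # view into target
        typed = seg != MISSING
        if not typed.any():
            continue
        # reference haplotypes that agree with every typed allele in the window
        agree = (ref_haps[:, start:end][:, typed] == seg[typed]).all(axis=1)
        if agree.any():
            donor = ref_haps[np.flatnonzero(agree)[0], start:end]
            seg[~typed] = donor[~typed]        # fill the untyped positions
    return target
```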

17.
Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.

18.
In livestock, many studies have reported the results of imputation to 50k single nucleotide polymorphism (SNP) genotypes for animals that are genotyped with low-density SNP panels. The objective of this paper is to review different measures of correctness of imputation, and to evaluate their utility depending on the purpose of the imputed genotypes. Across studies, imputation accuracy, computed as the correlation between true and imputed genotypes, and the imputation error rate, which counts the number of incorrectly imputed alleles, are commonly used measures of imputation correctness. Based on the nature of both measures and results reported in the literature, imputation accuracy appears to be a more useful measure of the correctness of imputation than the imputation error rate, because imputation accuracy does not depend on minor allele frequency (MAF), whereas the imputation error rate does. Therefore, imputation accuracy can be better compared across loci with different MAF. Imputation accuracy depends on the ability to identify the correct haplotype of a SNP, but many other factors have been identified as well, including the number of genotyped immediate ancestors, the number of animals genotyped with the high-density panel, the SNP density on the low- and high-density panels, the MAF of the imputed SNP and whether imputed SNPs are located at the end of a chromosome or not. Some of these factors directly contribute to the linkage disequilibrium between imputed SNPs and SNPs on the low-density panel. When imputation accuracy is assessed as a predictor for the accuracy of subsequent genomic prediction, we recommend that: (1) individual-specific imputation accuracies should be used that are computed after centring and scaling both true and imputed genotypes; and (2) imputation of gene dosage is preferred over imputation of the most likely genotype, as this increases accuracy and reduces bias of the imputed genotypes and the subsequent genomic predictions.
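
Recommendation (1) above can be written out directly; the sketch below is one reading of it (centre each SNP by twice its allele frequency, scale by its expected standard deviation, then correlate true and imputed dosages within each animal) and assumes all SNPs are polymorphic.

```python
import numpy as np

def animal_specific_accuracy(true_geno, imputed_dosage, allele_freq=None):
    """Per-animal imputation accuracy after centring and scaling per SNP.

    true_geno      : (n_animals, n_snps) 0/1/2 genotypes
    imputed_dosage : (n_animals, n_snps) imputed allele dosages (0..2)
    allele_freq    : per-SNP allele frequencies; estimated from true_geno if None.
    Assumes every SNP is polymorphic (p strictly between 0 and 1).
    """
    true_geno = np.asarray(true_geno, dtype=float)
    imputed_dosage = np.asarray(imputed_dosage, dtype=float)
    p = allele_freq if allele_freq is not None else true_geno.mean(axis=0) / 2.0
    scale = np.sqrt(2.0 * p * (1.0 - p))
    t = (true_geno - 2.0 * p) / scale       # standardised true genotypes
    d = (imputed_dosage - 2.0 * p) / scale  # standardised imputed dosages
    # correlation computed within each animal, across its imputed loci
    return np.array([np.corrcoef(tr, dr)[0, 1] for tr, dr in zip(t, d)])
```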

19.
In studies of complex diseases, a common paradigm is to conduct association analysis at markers in regions identified by linkage analysis, to attempt to narrow the region of interest. Family-based tests for association based on parental transmissions to affected offspring are often used in fine-mapping studies. However, for diseases with late onset, parental genotypes are often missing. Without parental genotypes, family-based tests either compare allele frequencies in affected individuals with those in their unaffected siblings or use siblings to infer missing parental genotypes. An example of the latter approach is the score test implemented in the computer program TRANSMIT. The inference of missing parental genotypes in TRANSMIT assumes that transmissions from parents to affected siblings are independent, which is appropriate when there is no linkage. However, using computer simulations, we show that, when the marker and disease locus are linked and the data set consists of families with multiple affected siblings, this assumption leads to a bias in the score statistic under the null hypothesis of no association between the marker and disease alleles. This bias leads to an inflated type I error rate for the score test in regions of linkage. We present a novel test for association in the presence of linkage (APL) that correctly infers missing parental genotypes in regions of linkage by estimating identity-by-descent parameters, to adjust for correlation between parental transmissions to affected siblings. In simulated data, we demonstrate the validity of the APL test under the null hypothesis of no association and show that the test can be more powerful than the pedigree disequilibrium test and family-based association test. As an example, we compare the performance of the tests in a candidate-gene study in families with Parkinson disease.

20.
Genomics 2021;113(2):655-668
Genotyping-by-sequencing (GBS) provides the marker density required for genomic predictions (GP). However, GBS gives a high proportion of missing SNP data which, for species without a chromosome-level genome assembly, must be imputed without knowing the SNP physical positions. Here, we compared GP accuracy across seven map-independent and two map-dependent imputation approaches, and when using all SNPs versus only the subset of genetically mapped SNPs. We used two rubber tree (Hevea brasiliensis) datasets with three traits. The results showed that the best imputation approaches were LinkImputeR, Beagle and FImpute. Using the genetically mapped SNPs increased GP accuracy by 4.3%. Using LinkImputeR on all the markers made it possible to avoid genetic mapping, with only a slight decrease in GP accuracy. LinkImputeR gave the highest proportion of correctly imputed genotypes, and its performance was further improved by its ability to define a subset of optimally imputed SNPs. These results will contribute to the efficient implementation of genomic selection with GBS. For Hevea, GBS is promising for rubber yield improvement, with GP accuracies reaching 0.52.
