首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 326 毫秒
1.

Background

In recent years, capabilities for genotyping large sets of single nucleotide polymorphisms (SNPs) has increased considerably with the ability to genotype over 1 million SNP markers across the genome. This advancement in technology has led to an increase in the number of genome-wide association studies (GWAS) for various complex traits. These GWAS have resulted in the implication of over 1500 SNPs associated with disease traits. However, the SNPs identified from these GWAS are not necessarily the functional variants. Therefore, the next phase in GWAS will involve the refining of these putative loci.

Methodology

A next step for GWAS would be to catalog all variants, especially rarer variants, within the detected loci, followed by the association analysis of the detected variants with the disease trait. However, sequencing a locus in a large number of subjects is still relatively expensive. A more cost effective approach would be to sequence a portion of the individuals, followed by the application of genotype imputation methods for imputing markers in the remaining individuals. A potentially attractive alternative option would be to impute based on the 1000 Genomes Project; however, this has the drawbacks of using a reference population that does not necessarily match the disease status and LD pattern of the study population. We explored a variety of approaches for carrying out the imputation using a reference panel consisting of sequence data for a fraction of the study participants using data from both a candidate gene sequencing study and the 1000 Genomes Project.

Conclusions

Imputation of genetic variation based on a proportion of sequenced samples is feasible. Our results indicate the following sequencing study design guidelines which take advantage of the recent advances in genotype imputation methodology: Select the largest and most diverse reference panel for sequencing and genotype as many “anchor” markers as possible.  相似文献   

2.
Genotype-Imputation Accuracy across Worldwide Human Populations   总被引:2,自引:0,他引:2  
A current approach to mapping complex-disease-susceptibility loci in genome-wide association (GWA) studies involves leveraging the information in a reference database of dense genotype data. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and tested for disease association. This imputation strategy has been successful for GWA studies in populations well represented by existing reference panels. We used genotypes at 513,008 autosomal single-nucleotide polymorphism (SNP) loci in 443 unrelated individuals from 29 worldwide populations to evaluate the “portability” of the HapMap reference panels for imputation in studies of diverse populations. When a single HapMap panel was leveraged for imputation of randomly masked genotypes, European populations had the highest imputation accuracy, followed by populations from East Asia, Central and South Asia, the Americas, Oceania, the Middle East, and Africa. For each population, we identified “optimal” mixtures of reference panels that maximized imputation accuracy, and we found that in most populations, mixtures including individuals from at least two HapMap panels produced the highest imputation accuracy. From a separate survey of additional SNPs typed in the same samples, we evaluated imputation accuracy in the scenario in which all genotypes at a given SNP position were unobserved and were imputed on the basis of data from a commercial “SNP chip,” again finding that most populations benefited from the use of combinations of two or more HapMap reference panels. Our results can serve as a guide for selecting appropriate reference panels for imputation-based GWA analysis in diverse populations.  相似文献   

3.
Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel—minimizing the average distance to the closest leaf (ADCL)—and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.  相似文献   

4.
Using whole-genome sequence (WGS) data are supposed to be optimal for genome-wide association studies and genomic predictions. However, sequencing thousands of individuals of interest is expensive. Imputation from single nucleotide polymorphisms panels to WGS data is an attractive approach to obtain highly reliable WGS data at low cost. Here, we conducted a genotype imputation study with a combined reference panel in yellow-feather dwarf broiler population. The combined reference panel was assembled by sequencing 24 key individuals of a yellow-feather dwarf broiler population (internal reference panel) and WGS data from 311 chickens in public databases (external reference panel). Three scenarios were investigated to determine how different factors affect the accuracy of imputation from 600 K array data to WGS data, including: genotype imputation with internal, external and combined reference panels; the number of internal reference individuals in the combined reference panel; and different reference sizes and selection strategies of an external reference panel. Results showed that imputation accuracy from 600 K to WGS data were 0.834±0.012, 0.920±0.007 and 0.982±0.003 for the internal, external and combined reference panels, respectively. Increasing the reference size from 50 to 250 improved the accuracy of genotype imputation from 0.848 to 0.974 for the combined reference panel and from 0.647 to 0.917 for the external reference panel. The selection strategies for the external reference panel had no impact on the accuracy of imputation using the combined reference panel. However, if only an external reference panel with reference size >50 was used, the selection strategy of minimizing the average distance to the closest leaf had the greatest imputation accuracy compared with other methods. Generally, using a combined reference panel provided greater imputation accuracy, especially for low-frequency variants. In conclusion, the optimal imputation strategy with a combined reference panel should comprehensively consider genetic diversity of the study population, availability and properties of external reference panels, sequencing and computing costs, and frequency of imputed variants. This work sheds light on how to design and execute genotype imputation with a combined external reference panel in a livestock population.  相似文献   

5.
Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most of the imputations have been run using HapMap samples as reference, imputation of low frequency and rare variants (minor allele frequency (MAF) < 5%) are not systemically assessed. With the emergence of next-generation sequencing, large reference panels (such as the 1000 Genomes panel) are available to facilitate imputation of these variants. Therefore, in order to estimate the performance of low frequency and rare variants imputation, we imputed 153 individuals, each of whom had 3 different genotype array data including 317k, 610k and 1 million SNPs, to three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1) by using IMPUTE version 2. The differences between these three releases of the 1000 Genomes data are the sample size, ancestry diversity, number of variants and their frequency spectrum. We found that both reference panel and GWAS chip density affect the imputation of low frequency and rare variants. 1KGphase1 outperformed the other 2 panels, at higher concordance rate, higher proportion of well-imputed variants (info>0.4) and higher mean info score in each MAF bin. Similarly, 1M chip array outperformed 610K and 317K. However for very rare variants (MAF≤0.3%), only 0–1% of the variants were well imputed. We conclude that the imputation of low frequency and rare variants improves with larger reference panels and higher density of genome-wide genotyping arrays. Yet, despite a large reference panel size and dense genotyping density, very rare variants remain difficult to impute.  相似文献   

6.
Genotype imputation is an indispensable step in human genetic studies. Large reference panels with deeply sequenced genomes now allow interrogating variants with minor allele frequency < 1% without sequencing. Although it is critical to consider limits of this approach, imputation methods for rare variants have only done so empirically; the theoretical basis of their imputation accuracy has not been explored. To provide theoretical consideration of imputation accuracy under the current imputation framework, we develop a coalescent model of imputing rare variants, leveraging the joint genealogy of the sample to be imputed and reference individuals. We show that broadly used imputation algorithms include model misspecifications about this joint genealogy that limit the ability to correctly impute rare variants. We develop closed-form solutions for the probability distribution of this joint genealogy and quantify the inevitable error rate resulting from the model misspecification across a range of allele frequencies and reference sample sizes. We show that the probability of a falsely imputed minor allele decreases with reference sample size, but the proportion of falsely imputed minor alleles mostly depends on the allele count in the reference sample. We summarize the impact of this error on genotype imputation on association tests by calculating the r2 between imputed and true genotype and show that even when modeling other sources of error, the impact of the model misspecification has a significant impact on the r2 of rare variants. To evaluate these predictions in practice, we compare the imputation of the same dataset across imputation panels of different sizes. Although this empirical imputation accuracy is substantially lower than our theoretical prediction, modeling misspecification seems to further decrease imputation accuracy for variants with low allele counts in the reference. These results provide a framework for developing new imputation algorithms and for interpreting rare variant association analyses.  相似文献   

7.
Common variants explain little of the variance of most common disease,prompting large-scale sequencing studies to understand the contribution of rare variants to these diseases.Imputation of rare variants from genome-wide genotypic arrays offers a cost-efficient strategy to achieve necessary sample sizes required for adequate statistical power.To estimate the performance of imputation of rare variants,we imputed 153 individuals,each of whom was genotyped on 3 different genotype arrays including 317k,610k and 1 million single nucleotide polymorphisms(SNPs),to two different reference panels:HapMap2 and 1000 Genomes pilot March 2010 release (lKGpilot) by using IMPUTE version 2.We found that more than 94%and 84%of all SNPs yield acceptable accuracy(info > 0.4) in HapMap2 and lKGpilot-based imputation,respectively.For rare variants(minor allele frequency(MAF) <5%),the proportion of wellimputed SNPs increased as the MAF increased from 0.3%to 5%across all 3 genome-wide association study(GWAS) datasets.The proportion of well-imputed SNPs was 69%,60%and 49%for SNPs with a MAF from 0.3%to 5%for 1M,610k and 317k,respectively. None of the very rare variants(MAF < 0.3%) were well imputed.We conclude that the imputation accuracy of rare variants increases with higher density of genome-wide genotyping arrays when the size of the reference panel is small.Variants with lower MAF are more difficult to impute.These findings have important implications in the design and replication of large-scale sequencing studies.  相似文献   

8.

Background

The advent of low cost next generation sequencing has made it possible to sequence a large number of dairy and beef bulls which can be used as a reference for imputation of whole genome sequence data. The aim of this study was to investigate the accuracy and speed of imputation from a high density SNP marker panel to whole genome sequence level. Data contained 132 Holstein, 42 Jersey, 52 Nordic Red and 16 Brown Swiss bulls with whole genome sequence data; 16 Holstein, 27 Jersey and 29 Nordic Reds had previously been typed with the bovine high density SNP panel and were used for validation. We investigated the effect of enlarging the reference population by combining data across breeds on the accuracy of imputation, and the accuracy and speed of both IMPUTE2 and BEAGLE using either genotype probability reference data or pre-phased reference data. All analyses were done on Bovine autosome 29 using 387,436 bi-allelic variants and 13,612 SNP markers from the bovine HD panel.

Results

A combined breed reference population led to higher imputation accuracies than did a single breed reference. The highest accuracy of imputation for all three test breeds was achieved when using BEAGLE with un-phased reference data (mean genotype correlations of 0.90, 0.89 and 0.87 for Holstein, Jersey and Nordic Red respectively) but IMPUTE2 with un-phased reference data gave similar accuracies for Holsteins and Nordic Red. Pre-phasing the reference data only lead to a minor decrease in the imputation accuracy, but gave a large improvement in computation time. Pre-phasing with BEAGLE was substantially faster than pre-phasing with SHAPEIT2 (2.5 hours vs. 52 hours for 242 individuals), and imputation with pre-phased data was faster in IMPUTE2 than in BEAGLE (5 minutes vs. 50 minutes per individual).

Conclusion

Combining reference populations across breeds is a good option to increase the size of the reference data and in turn the accuracy of imputation when only few animals are available. Pre-phasing the reference data only slightly decreases the accuracy but gives substantial improvements in speed. Using BEAGLE for pre-phasing and IMPUTE2 for imputation is a fast and accurate strategy.  相似文献   

9.
We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle’s throughput was more than 100× greater than Impute2’s throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs.  相似文献   

10.
Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity for comprehensively characterizing the polymorphic variants in the population. Although the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage has compromised the ability to confidently detect rare and low-frequency variants. In addition, the composition of populations in the 1KGP is not complete, despite the fact that the study design has been extended to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) aims to perform deep whole-genome sequencing of 100 healthy Malays. By sequencing at a minimum of 30× coverage, we have illustrated the higher sensitivity at detecting low-frequency and rare variants and the ability to investigate the presence of hotspots of functional mutations. Compared to the low-pass sequencing in the 1KGP, the deeper coverage allows more functional variants to be identified for each person. A comparison of the fidelity of genotype imputation of Malays indicated that a population-specific reference panel, such as the SSMP, outperforms a cosmopolitan panel with larger number of individuals for common SNPs. For lower-frequency (<5%) markers, a larger number of individuals might have to be whole-genome sequenced so that the accuracy currently afforded by the 1KGP can be achieved. The SSMP data are expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population-genetics studies.  相似文献   

11.
We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R2, and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package.  相似文献   

12.
Application of imputation methods to accurately predict a dense array of SNP genotypes in the dog could provide an important supplement to current analyses of array-based genotyping data. Here, we developed a reference panel of 4,885,283 SNPs in 83 dogs across 15 breeds using whole genome sequencing. We used this panel to predict the genotypes of 268 dogs across three breeds with 84,193 SNP array-derived genotypes as inputs. We then (1) performed breed clustering of the actual and imputed data; (2) evaluated several reference panel breed combinations to determine an optimal reference panel composition; and (3) compared the accuracy of two commonly used software algorithms (Beagle and IMPUTE2). Breed clustering was well preserved in the imputation process across eigenvalues representing 75 % of the variation in the imputed data. Using Beagle with a target panel from a single breed, genotype concordance was highest using a multi-breed reference panel (92.4 %) compared to a breed-specific reference panel (87.0 %) or a reference panel containing no breeds overlapping with the target panel (74.9 %). This finding was confirmed using target panels derived from two other breeds. Additionally, using the multi-breed reference panel, genotype concordance was slightly higher with IMPUTE2 (94.1 %) compared to Beagle; Pearson correlation coefficients were slightly higher for both software packages (0.946 for Beagle, 0.961 for IMPUTE2). Our findings demonstrate that genotype imputation from SNP array-derived data to whole genome-level genotypes is both feasible and accurate in the dog with appropriate breed overlap between the target and reference panels.  相似文献   

13.

Background

High density genotyping data are indispensable for genomic analyses of complex traits in animal and crop species. Maize is one of the most important crop plants worldwide, however a high density SNP genotyping array for analysis of its large and highly dynamic genome was not available so far.

Results

We developed a high density maize SNP array composed of 616,201 variants (SNPs and small indels). Initially, 57 M variants were discovered by sequencing 30 representative temperate maize lines and then stringently filtered for sequence quality scores and predicted conversion performance on the array resulting in the selection of 1.2 M polymorphic variants assayed on two screening arrays. To identify high-confidence variants, 285 DNA samples from a broad genetic diversity panel of worldwide maize lines including the samples used for sequencing, important founder lines for European maize breeding, hybrids, and proprietary samples with European, US, semi-tropical, and tropical origin were used for experimental validation. We selected 616 k variants according to their performance during validation, support of genotype calls through sequencing data, and physical distribution for further analysis and for the design of the commercially available Affymetrix® Axiom® Maize Genotyping Array. This array is composed of 609,442 SNPs and 6,759 indels. Among these are 116,224 variants in coding regions and 45,655 SNPs of the Illumina® MaizeSNP50 BeadChip for study comparison. In a subset of 45,974 variants, apart from the target SNP additional off-target variants are detected, which show only a minor bias towards intermediate allele frequencies. We performed principal coordinate and admixture analyses to determine the ability of the array to detect and resolve population structure and investigated the extent of LD within a worldwide validation panel.

Conclusions

The high density Affymetrix® Axiom® Maize Genotyping Array is optimized for European and American temperate maize and was developed based on a diverse sample panel by applying stringent quality filter criteria to ensure its suitability for a broad range of applications. With 600 k variants it is the largest currently publically available genotyping array in crop species.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-823) contains supplementary material, which is available to authorized users.  相似文献   

14.
Genotype imputations based on 1000 Genomes (1KG) Project data have the advantage of imputing many more SNPs than imputations based on HapMap data. It also provides an opportunity to discover associations with relatively rare variants. Recent investigations are increasingly using 1KG data for genotype imputations, but only limited evaluations of the performance of this approach are available. In this paper, we empirically evaluated imputation performance using 1KG data by comparing imputation results to those using the HapMap Phase II data that have been widely used. We used three reference panels: the CEU panel consisting of 120 haplotypes from HapMap II and 1KG data (June 2010 release) and the EUR panel consisting of 566 haplotypes also from 1KG data (August 2010 release). We used Illumina 324,607 autosomal SNPs genotyped in 501 individuals of European ancestry. Our most important finding was that both 1KG reference panels provided much higher imputation yield than the HapMap II panel. There were more than twice as many successfully imputed SNPs as there were using the HapMap II panel (6.7 million vs. 2.5 million). Our second most important finding was that accuracy using both 1KG panels was high and almost identical to accuracy using the HapMap II panel. Furthermore, after removing SNPs with MACH Rsq <0.3, accuracy for both rare and low frequency SNPs was very high and almost identical to accuracy for common SNPs. We found that imputation using the 1KG-EUR panel had advantages in successfully imputing rare, low frequency and common variants. Our findings suggest that 1KG-based imputation can increase the opportunity to discover significant associations for SNPs across the allele frequency spectrum. Because the 1KG Project is still underway, we expect that later versions will provide even better imputation performance.  相似文献   

15.
Genotype imputation has the potential to assess human genetic variation at a lower cost than assaying the variants using laboratory techniques. The performance of imputation for rare variants has not been comprehensively studied. We utilized 8865 human samples with high depth resequencing data for the exons and flanking regions of 202 genes and Genome-Wide Association Study (GWAS) data to characterize the performance of genotype imputation for rare variants. We evaluated reference sets ranging from 100 to 3713 subjects for imputing into samples typed for the Affymetrix (500K and 6.0) and Illumina 550K GWAS panels. The proportion of variants that could be well imputed (true r2>0.7) with a reference panel of 3713 individuals was: 31% (Illumina 550K) or 25% (Affymetrix 500K) with MAF (Minor Allele Frequency) less than or equal 0.001, 48% or 35% with 0.0010.05. The performance for common SNPs (MAF>0.05) within exons and flanking regions is comparable to imputation of more uniformly distributed SNPs. The performance for rare SNPs (0.01相似文献   

16.

Background

We explored the imputation performance of the program IMPUTE in an admixed sample from Mexico City. The following issues were evaluated: (a) the impact of different reference panels (HapMap vs. 1000 Genomes) on imputation; (b) potential differences in imputation performance between single-step vs. two-step (phasing and imputation) approaches; (c) the effect of different INFO score thresholds on imputation performance and (d) imputation performance in common vs. rare markers.

Methods

The sample from Mexico City comprised 1,310 individuals genotyped with the Affymetrix 5.0 array. We randomly masked 5% of the markers directly genotyped on chromosome 12 (n?=?1,046) and compared the imputed genotypes with the microarray genotype calls. Imputation was carried out with the program IMPUTE. The concordance rates between the imputed and observed genotypes were used as a measure of imputation accuracy and the proportion of non-missing genotypes as a measure of imputation efficacy.

Results

The single-step imputation approach produced slightly higher concordance rates than the two-step strategy (99.1% vs. 98.4% when using the HapMap phase II combined panel), but at the expense of a lower proportion of non-missing genotypes (85.5% vs. 90.1%). The 1,000 Genomes reference sample produced similar concordance rates to the HapMap phase II panel (98.4% for both datasets, using the two-step strategy). However, the 1000 Genomes reference sample increased substantially the proportion of non-missing genotypes (94.7% vs. 90.1%). Rare variants (<1%) had lower imputation accuracy and efficacy than common markers.

Conclusions

The program IMPUTE had an excellent imputation performance for common alleles in an admixed sample from Mexico City, which has primarily Native American (62%) and European (33%) contributions. Genotype concordances were higher than 98.4% using all the imputation strategies, in spite of the fact that no Native American samples are present in the HapMap and 1000 Genomes reference panels. The best balance of imputation accuracy and efficiency was obtained with the 1,000 Genomes panel. Rare variants were not captured effectively by any of the available panels, emphasizing the need to be cautious in the interpretation of association results for imputed rare variants.  相似文献   

17.
Genotype imputation, used in genome-wide association studies to expand coverage of single nucleotide polymorphisms (SNPs), has performed poorly in African Americans compared to less admixed populations. Overall, imputation has typically relied on HapMap reference haplotype panels from Africans (YRI), European Americans (CEU), and Asians (CHB/JPT). The 1000 Genomes project offers a wider range of reference populations, such as African Americans (ASW), but their imputation performance has had limited evaluation. Using 595 African Americans genotyped on Illumina’s HumanHap550v3 BeadChip, we compared imputation results from four software programs (IMPUTE2, BEAGLE, MaCH, and MaCH-Admix) and three reference panels consisting of different combinations of 1000 Genomes populations (February 2012 release): (1) 3 specifically selected populations (YRI, CEU, and ASW); (2) 8 populations of diverse African (AFR) or European (AFR) descent; and (3) all 14 available populations (ALL). Based on chromosome 22, we calculated three performance metrics: (1) concordance (percentage of masked genotyped SNPs with imputed and true genotype agreement); (2) imputation quality score (IQS; concordance adjusted for chance agreement, which is particularly informative for low minor allele frequency [MAF] SNPs); and (3) average r2hat (estimated correlation between the imputed and true genotypes, for all imputed SNPs). Across the reference panels, IMPUTE2 and MaCH had the highest concordance (91%–93%), but IMPUTE2 had the highest IQS (81%–83%) and average r2hat (0.68 using YRI+ASW+CEU, 0.62 using AFR+EUR, and 0.55 using ALL). Imputation quality for most programs was reduced by the addition of more distantly related reference populations, due entirely to the introduction of low frequency SNPs (MAF≤2%) that are monomorphic in the more closely related panels. While imputation was optimized by using IMPUTE2 with reference to the ALL panel (average r2hat = 0.86 for SNPs with MAF>2%), use of the ALL panel for African American studies requires careful interpretation of the population specificity and imputation quality of low frequency SNPs.  相似文献   

18.
Researchers have successfully applied exome sequencing to discover causal variants in selected individuals with familial, highly penetrant disorders. We demonstrate the utility of exome sequencing followed by imputation for discovering low-frequency variants associated with complex quantitative traits. We performed exome sequencing in a reference panel of 761 African Americans and then imputed newly discovered variants into a larger sample of more than 13,000 African Americans for association testing with the blood cell traits hemoglobin, hematocrit, white blood count, and platelet count. First, we illustrate the feasibility of our approach by demonstrating genome-wide-significant associations for variants that are not covered by conventional genotyping arrays; for example, one such association is that between higher platelet count and an MPL c.117G>T (p.Lys39Asn) variant encoding a p.Lys39Asn amino acid substitution of the thrombpoietin receptor gene (p = 1.5 × 10−11). Second, we identified an association between missense variants of LCT and higher white blood count (p = 4 × 10−13). Third, we identified low-frequency coding variants that might account for allelic heterogeneity at several known blood cell-associated loci: MPL c.754T>C (p.Tyr252His) was associated with higher platelet count; CD36 c.975T>G (p.Tyr325) was associated with lower platelet count; and several missense variants at the α-globin gene locus were associated with lower hemoglobin. By identifying low-frequency missense variants associated with blood cell traits not previously reported by genome-wide association studies, we establish that exome sequencing followed by imputation is a powerful approach to dissecting complex, genetically heterogeneous traits in large population-based studies.  相似文献   

19.
Genome-wide association studies are revolutionizing the search for the genes underlying human complex diseases. The main decisions to be made at the design stage of these studies are the choice of the commercial genotyping chip to be used and the numbers of case and control samples to be genotyped. The most common method of comparing different chips is using a measure of coverage, but this fails to properly account for the effects of sample size, the genetic model of the disease, and linkage disequilibrium between SNPs. In this paper, we argue that the statistical power to detect a causative variant should be the major criterion in study design. Because of the complicated pattern of linkage disequilibrium (LD) in the human genome, power cannot be calculated analytically and must instead be assessed by simulation. We describe in detail a method of simulating case-control samples at a set of linked SNPs that replicates the patterns of LD in human populations, and we used it to assess power for a comprehensive set of available genotyping chips. Our results allow us to compare the performance of the chips to detect variants with different effect sizes and allele frequencies, look at how power changes with sample size in different populations or when using multi-marker tags and genotype imputation approaches, and how performance compares to a hypothetical chip that contains every SNP in HapMap. A main conclusion of this study is that marked differences in genome coverage may not translate into appreciable differences in power and that, when taking budgetary considerations into account, the most powerful design may not always correspond to the chip with the highest coverage. We also show that genotype imputation can be used to boost the power of many chips up to the level obtained from a hypothetical “complete” chip containing all the SNPs in HapMap. Our results have been encapsulated into an R software package that allows users to design future association studies and our methods provide a framework with which new chip sets can be evaluated.  相似文献   

20.
We present full-genome genotype imputations for 100 classical laboratory mouse strains, using a novel method. Using genotypes at 549,683 SNP loci obtained with the Mouse Diversity Array, we partitioned the genome of 100 mouse strains into 40,647 intervals that exhibit no evidence of historical recombination. For each of these intervals we inferred a local phylogenetic tree. We combined these data with 12 million loci with sequence variations recently discovered by whole-genome sequencing in a common subset of 12 classical laboratory strains. For each phylogenetic tree we identified strains sharing a leaf node with one or more of the sequenced strains. We then imputed high- and medium-confidence genotypes for each of 88 nonsequenced genomes. Among inbred strains, we imputed 92% of SNPs genome-wide, with 71% in high-confidence regions. Our method produced 977 million new genotypes with an estimated per-SNP error rate of 0.083% in high-confidence regions and 0.37% genome-wide. Our analysis identified which of the 88 nonsequenced strains would be the most informative for improving full-genome imputation, as well as which additional strain sequences will reveal more new genetic variants. Imputed sequences and quality scores can be downloaded and visualized online.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号