1.

Leveraging Prior Information to Detect Causal Variants via Multi-Variant Regression

Nanye Long Samuel P. Dickson Jessica M. Maia Hee Shin Kim Qianqian Zhu Andrew S. Allen 《PLoS computational biology》2013,9(6)

Although many methods are available to test sequence variants for association with complex diseases and traits, methods that specifically seek to identify causal variants are less developed. Here we develop and evaluate a Bayesian hierarchical regression method that incorporates prior information on the likelihood of variant causality through weighting of variant effects. By simulation studies using both simulated and real sequence variants, we compared a standard single variant test for analyzing variant-disease association with the proposed method using different weighting schemes. We found that by leveraging linkage disequilibrium of variants with known GWAS signals and sequence conservation (phastCons), the proposed method provides a powerful approach for detecting causal variants while controlling false positives.

2.

Inferring causative variants in microRNA target sites

Thomas LF Saito T Sætrom P 《Nucleic acids research》2011,39(16):e109

3.

Linkage disequilibrium clustering‐based approach for association mapping with tightly linked genomewide data

Zitong Li Petri Kemppainen Pasi Rastas Juha Merilä 《Molecular ecology resources》2018,18(4):809-824

Genomewide association studies (GWAS) aim to identify genetic markers strongly associated with quantitative traits by utilizing linkage disequilibrium (LD) between candidate genes and markers. However, because of LD between nearby genetic markers, the standard GWAS approaches typically detect a number of correlated SNPs covering long genomic regions, making corrections for multiple testing overly conservative. Additionally, the high dimensionality of modern GWAS data poses considerable challenges for GWAS procedures such as permutation tests, which are computationally intensive. We propose a cluster‐based GWAS approach that first divides the genome into many large nonoverlapping windows and uses linkage disequilibrium network analysis in combination with principal component (PC) analysis as dimensional reduction tools to summarize the SNP data to independent PCs within clusters of loci connected by high LD. We then introduce single‐ and multilocus models that can efficiently conduct the association tests on such high‐dimensional data. The methods can be adapted to different model structures and used to analyse samples collected from the wild or from biparental F₂ populations, which are commonly used in ecological genetics mapping studies. We demonstrate the performance of our approaches with two publicly available data sets from a plant (Arabidopsis thaliana) and a fish (Pungitius pungitius), as well as with simulated data.

4.

Gene-level association analysis methods

Luo Xuhong (罗旭红) Liu Zhifang (刘志芳) Dong Changzheng (董长征) 《Yi Chuan (遗传)》2013,35(9):1065-1071

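Genome-wide association studies (GWAS) have been widely applied in medical genetics research in China and abroad, but the rich information in GWAS data on the mechanisms of polygenic complex-trait diseases has not yet been mined in depth. In recent years, researchers have used bioinformatic and biostatistical approaches such as biological network analysis and biological pathway analysis to analyze GWAS data and explore potential disease mechanisms. Because network and pathway analyses are carried out mainly with the gene as the unit, the genetic association results for all or some of the individual single nucleotide polymorphisms (SNPs) in a gene must be combined before the analysis; this step is gene-level association analysis. Gene-level association analysis must account for several factors, including the genetic association of each individual SNP, the number of SNPs in the gene, and the linkage disequilibrium structure among the SNPs, so it is complex and challenging both as a genetic concept and as a statistical method. This article reviews the research progress, principles, and applications of gene-level association analysis.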

5.

Identifying causal variants by fine mapping across multiple studies

Nathan LaPierre Kodi Taraszka Helen Huang Rosemary He Farhad Hormozdiari Eleazar Eskin 《PLoS genetics》2021,17(9)

Increasingly large Genome-Wide Association Studies (GWAS) have yielded numerous variants associated with many complex traits, motivating the development of “fine mapping” methods to identify which of the associated variants are causal. Additionally, GWAS of the same trait for different populations are increasingly available, raising the possibility of refining fine mapping results further by leveraging different linkage disequilibrium (LD) structures across studies. Here, we introduce multiple study causal variants identification in associated regions (MsCAVIAR), a method that extends the popular CAVIAR fine mapping framework to a multiple study setting using a random effects model. MsCAVIAR only requires summary statistics and LD as input, accounts for uncertainty in association statistics using a multivariate normal model, allows for multiple causal variants at a locus, and explicitly models the possibility of different SNP effect sizes in different populations. We demonstrate the efficacy of MsCAVIAR in both a simulation study and a trans-ethnic, trans-biobank fine mapping analysis of High Density Lipoprotein (HDL).

6.

Performance of single nucleotide polymorphisms versus haplotypes for genome-wide association analysis in barley

Lorenz AJ Hamblin MT Jannink JL 《PloS one》2010,5(11):e14079

Genome-wide association studies (GWAS) may benefit from utilizing haplotype information for making marker-phenotype associations. Several rationales for grouping single nucleotide polymorphisms (SNPs) into haplotype blocks exist, but any advantage may depend on such factors as genetic architecture of traits, patterns of linkage disequilibrium in the study population, and marker density. The objective of this study was to explore the utility of haplotypes for GWAS in barley (Hordeum vulgare) to offer a first detailed look at this approach for identifying agronomically important genes in crops. To accomplish this, we used genotype and phenotype data from the Barley Coordinated Agricultural Project and constructed haplotypes using three different methods. Marker-trait associations were tested by the efficient mixed-model association algorithm (EMMA). When QTL were simulated using single SNPs dropped from the marker dataset, a simple sliding window performed as well as or better than single SNPs or the more sophisticated methods of blocking SNPs into haplotypes. Moreover, the haplotype analyses performed better 1) when QTL were simulated as polymorphisms that arose subsequent to marker variants, and 2) in analysis of empirical heading date data. These results demonstrate that the information content of haplotypes is dependent on the particular mutational and recombinational history of the QTL and nearby markers. Analysis of the empirical data also confirmed our intuition that the distribution of QTL alleles in nature is often unlike the distribution of marker variants, and hence utilizing haplotype information could capture associations that would elude single SNPs. We recommend routine use of both single SNP and haplotype markers for GWAS to take advantage of the full information content of the genotype data.

7.

Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies

Michael C. Wu Michael P. Epstein Stephen J. Chanock Xihong Lin 《American journal of human genetics》2010,86(6):929-942

GWAS have emerged as popular tools for identifying genetic variants that are associated with disease risk. Standard analysis of a case-control GWAS involves assessing the association between each individual genotyped SNP and disease risk. However, this approach suffers from limited reproducibility and difficulties in detecting multi-SNP and epistatic effects. As an alternative analytical strategy, we propose grouping SNPs together into SNP sets on the basis of proximity to genomic features such as genes or haplotype blocks, then testing the joint effect of each SNP set. Testing of each SNP set proceeds via the logistic kernel-machine-based test, which is based on a statistical framework that allows for flexible modeling of epistatic and nonlinear SNP effects. This flexibility and the ability to naturally adjust for covariate effects are important features of our test that make it appealing in comparison to individual SNP tests and existing multimarker tests. Using simulated data based on the International HapMap Project, we show that SNP-set testing can have improved power over standard individual-SNP analysis under a wide range of settings. In particular, we find that our approach has higher power than individual-SNP analysis when the median correlation between the disease-susceptibility variant and the genotyped SNPs is moderate to high. When the correlation is low, both individual-SNP analysis and the SNP-set analysis tend to have low power. We apply SNP-set analysis to analyze the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer GWAS discovery-phase data.

8.

GCTA: a tool for genome-wide complex trait analysis

Yang J Lee SH Goddard ME Visscher PM 《American journal of human genetics》2011,(1):76-82

For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which is based on a method we recently developed to address the "missing heritability" problem. GCTA estimates the variance explained by all the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the trait. We introduce GCTA's five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.

9.

Polymorphic L1 retrotransposons are frequently in strong linkage disequilibrium with neighboring SNPs

Saneyuki Higashino Tomoyuki Ohno Koichi Ishiguro Yasunori Aizawa 《Gene》2014

10.

Penalized Multimarker vs. Single-Marker Regression Methods for Genome-Wide Association Studies of Quantitative Traits

Hui Yi Patrick Breheny Netsanet Imam Yongmei Liu Ina Hoeschele 《Genetics》2015,199(1):205-222

The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single-marker association methods. As an alternative to single-marker analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of penalized regression (PR) as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by false discovery rate (FDR) control, and assess their performance in comparison with SMA. PR methods were compared with SMA, using realistically simulated GWAS data with a continuous phenotype and real data. Based on these comparisons our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini–Hochberg FDR control (SMA-BH). PR with FDR-based penalty parameter selection controlled the FDR somewhat conservatively while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on SNP selection with FDR control. Incorporating linkage disequilibrium into the penalization by adapting penalties developed for covariates measured on graphs can improve power but also generate more false positives or wider regions for follow-up. We recommend the elastic net with a mixing weight for the Lasso penalty near 0.5 as the best method.
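The Benjamini–Hochberg FDR control that the abstract above pairs with single-marker analysis (SMA-BH) can be illustrated with a minimal sketch. The function name `benjamini_hochberg` and the p-values are hypothetical and chosen only for illustration; this is the textbook step-up procedure, not the authors' implementation.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])


# Hypothetical single-marker p-values (illustrative only, not from the paper).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, alpha=0.05)
print(rejected)
```

With these made-up values the step-up rule rejects the two smallest p-values, whereas a Bonferroni threshold of 0.05/10 = 0.005 would reject only the first; this is the kind of power difference between FDR control and genome-wide type-I error control that the abstract refers to.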

11.

Contribution of an additive locus to genetic variance when inheritance is multi-factorial with implications on interpretation of GWAS

Daniel Gianola Frederic Hospital Etienne Verrier 《TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik》2013,126(6):1457-1472

Although the effects of linkage disequilibrium (LD) on partition of genetic variance have received attention in quantitative genetics, there has been little discussion on how this phenomenon affects attribution of variance to a given locus. This paper reinforces the point that standard metrics used for assessing the contribution of a locus to variance can be misleading when there is LD and that factors such as distribution of effects and of allelic frequencies over loci, or existence of frequency-dependent effects, play a role as well. An apparently new metric is proposed for measuring how much of the variability is contributed by a locus when LD exists. Effects of intervening factors, such as type and extent of LD, number of loci, distribution of effects, and of allelic frequencies over loci, as well as a model for generating frequency-dependent effects, are illustrated via hypothetical simulation scenarios. Implications on the interpretation of genome-wide association studies (GWAS), as typically carried out in human genetics, where single marker regression and the assumption of a sole quantitative trait locus (QTL) are common, are discussed. It is concluded that the standard attributions to variance contributed by a single QTL from a GWAS analysis may be misleading, conceptually and statistically, when a trait is complex and affected by sets of many genes in linkage disequilibrium. Yet another factor to consider in the “missing heritability” saga?

12.

Genetics of sputum gene expression in chronic obstructive pulmonary disease

Qiu W Cho MH Riley JH Anderson WH Singh D Bakke P Gulsvik A Litonjua AA Lomas DA Crapo JD Beaty TH Celli BR Rennard S Tal-Singer R Fox SM Silverman EK Hersh CP;ECLIPSE Investigators 《PloS one》2011,6(9):e24395

Previous expression quantitative trait loci (eQTL) studies have performed genetic association studies for gene expression, but most of these studies examined lymphoblastoid cell lines from non-diseased individuals. We examined the genetics of gene expression in a relevant disease tissue from chronic obstructive pulmonary disease (COPD) patients to identify functional effects of known susceptibility genes and to find novel disease genes. By combining gene expression profiling on induced sputum samples from 131 COPD cases from the ECLIPSE Study with genomewide single nucleotide polymorphism (SNP) data, we found 4315 significant cis-eQTL SNP-probe set associations (3309 unique SNPs). The 3309 SNPs were tested for association with COPD in a genomewide association study (GWAS) dataset, which included 2940 COPD cases and 1380 controls. Adjusting for 3309 tests (p<1.5e-5), the two SNPs which were significantly associated with COPD were located in two separate genes in a known COPD locus on chromosome 15: CHRNA5 and IREB2. Detailed analysis of chromosome 15 demonstrated additional eQTLs for IREB2 mapping to that gene. eQTL SNPs for CHRNA5 mapped to multiple linkage disequilibrium (LD) bins. The eQTLs for IREB2 and CHRNA5 were not in LD. Seventy-four additional eQTL SNPs were associated with COPD at p<0.01. These were genotyped in two COPD populations, finding replicated associations with a SNP in PSORS1C1, in the HLA-C region on chromosome 6. Integrative analysis of GWAS and gene expression data from relevant tissue from diseased subjects has located potential functional variants in two known COPD genes and has identified a novel COPD susceptibility locus.
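The threshold quoted in the abstract above (p<1.5e-5 after adjusting for 3309 tests) is consistent with a Bonferroni correction. The abstract does not name the correction or the family-wise level, so the 0.05 level below is an assumption; the short check only shows that it reproduces the reported figure.

```python
# Assumed family-wise error rate; the abstract does not state it explicitly.
alpha = 0.05
# Number of unique cis-eQTL SNPs tested for association with COPD (from the abstract).
n_tests = 3309

# Bonferroni-adjusted per-test significance threshold.
threshold = alpha / n_tests
print(f"{threshold:.2e}")
```

The result, about 1.51e-5, agrees with the threshold the abstract reports to two significant figures.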

13.

Association Study of Common Genetic Variants and HIV-1 Acquisition in 6,300 Infected Cases and 7,200 Controls

《PLoS pathogens》2013,9(7)

Multiple genome-wide association studies (GWAS) have been performed in HIV-1 infected individuals, identifying common genetic influences on viral control and disease course. Similarly, common genetic correlates of acquisition of HIV-1 after exposure have been interrogated using GWAS, although in generally small samples. Under the auspices of the International Collaboration for the Genomics of HIV, we have combined the genome-wide single nucleotide polymorphism (SNP) data collected by 25 cohorts, studies, or institutions on HIV-1 infected individuals and compared them to carefully matched population-level data sets (a list of all collaborators appears in Note S1 in Text S1). After imputation using the 1,000 Genomes Project reference panel, we tested approximately 8 million common DNA variants (SNPs and indels) for association with HIV-1 acquisition in 6,334 infected patients and 7,247 population samples of European ancestry. Initial association testing identified the SNP rs4418214, the C allele of which is known to tag the HLA-B*57:01 and B*27:05 alleles, as genome-wide significant (p = 3.6×10⁻¹¹). However, restricting analysis to individuals with a known date of seroconversion suggested that this association was due to the frailty bias in studies of lethal diseases. Further analyses including testing recessive genetic models, testing for bulk effects of non-genome-wide significant variants, stratifying by sexual or parenteral transmission risk and testing previously reported associations showed no evidence for genetic influence on HIV-1 acquisition (with the exception of CCR5Δ32 homozygosity). Thus, these data suggest that genetic influences on HIV acquisition are either rare or have smaller effects than can be detected by this sample size.

14.

Genome-Wide Association Studies Using Haplotypes and Individual SNPs in Simmental Cattle

Yang Wu Huizhong Fan Yanhui Wang Lupei Zhang Xue Gao Yan Chen Junya Li HongYan Ren Huijiang Gao 《PloS one》2014,9(10)

Recent advances in high-throughput genotyping technologies have provided the opportunity to map genes using associations between complex traits and markers. Genome-wide association studies (GWAS) based on either a single marker or haplotype have identified genetic variants and underlying genetic mechanisms of quantitative traits. Prompted by the achievements of studies examining economic traits in cattle and to verify the consistency of these two methods using real data, the current study was conducted to construct the haplotype structure in the bovine genome and to detect relevant genes genuinely affecting a carcass trait and a meat quality trait. Using the Illumina BovineHD BeadChip, 942 young bulls with genotyping data were introduced as a reference population to identify the genes in the beef cattle genome significantly associated with foreshank weight and triglyceride levels. In total, 92,553 haplotype blocks were detected in the genome. The regions of high linkage disequilibrium extended up to approximately 200 kb, and the size of haplotype blocks ranged from 22 bp to 199,266 bp. Additionally, the individual SNP analysis and the haplotype-based analysis detected similar regions and common SNPs for these two representative traits. A total of 12 and 7 SNPs in the bovine genome were significantly associated with foreshank weight and triglyceride levels, respectively. By comparison, 4 and 5 haplotype blocks containing the majority of significant SNPs were strongly associated with foreshank weight and triglyceride levels, respectively. In addition, 36 SNPs with high linkage disequilibrium were detected in the GNAQ gene, a potential hotspot that may play a crucial role for regulating carcass trait components.

15.

SimPed: a simulation program to generate haplotype and genotype data for pedigree structures

Leal SM Yan K Müller-Myhsok B 《Human heredity》2005,60(2):119-122

With the widespread availability of SNP genotype data, there is great interest in analyzing pedigree haplotype data. Intermarker linkage disequilibrium for microsatellite markers is usually low due to their physical distance; however, for dense maps of SNP markers, there can be strong linkage disequilibrium between marker loci. Linkage analysis (parametric and nonparametric) and family-based association studies are currently being carried out using dense maps of SNP marker loci. Monte Carlo methods are often used for both linkage and association studies; however, to date there are no programs available which can generate haplotype and/or genotype data consisting of a large number of loci for pedigree structures. SimPed is a program that quickly generates haplotype and/or genotype data for pedigrees of virtually any size and complexity. Marker data either in linkage disequilibrium or equilibrium can be generated for greater than 20,000 diallelic or multiallelic marker loci. Haplotypes and/or genotypes are generated for pedigree structures using specified genetic map distances and haplotype and/or allele frequencies. The simulated data generated by SimPed is useful for a variety of purposes, including evaluating methods that estimate haplotype frequencies for pedigree data, evaluating type I error due to intermarker linkage disequilibrium and estimating empirical p values for linkage and family-based association studies.

16.

Powerful testing via hierarchical linkage disequilibrium in haplotype association studies

Brunilda Balliu Jeanine J. Houwing‐Duistermaat Stefan Böhringer 《Biometrical journal. Biometrische Zeitschrift》2019,61(3):747-768

Marginal tests based on individual SNPs are routinely used in genetic association studies. Studies have shown that haplotype‐based methods may provide more power in disease mapping than methods based on single markers when, for example, multiple disease‐susceptibility variants occur within the same gene. A limitation of haplotype‐based methods is that the number of parameters increases exponentially with the number of SNPs, inducing a commensurate increase in the degrees of freedom and weakening the power to detect associations. To address this limitation, we introduce a hierarchical linkage disequilibrium model for disease mapping, based on a reparametrization of the multinomial haplotype distribution, where every parameter corresponds to the cumulant of each possible subset of a set of loci. This hierarchy present in the parameters enables us to employ flexible testing strategies over a range of parameter sets: from standard single SNP analyses through the full haplotype distribution tests, reducing degrees of freedom and increasing the power to detect associations. We show via extensive simulations that our approach maintains the type I error at nominal level and has increased power under many realistic scenarios, as compared to single SNP and standard haplotype‐based studies. To evaluate the performance of our proposed methodology in real data, we analyze genome‐wide data from the Wellcome Trust Case‐Control Consortium.

17.

Detecting low frequent loss-of-function alleles in genome wide association studies with red hair color as example

Liu F Struchalin MV Duijn Kv Hofman A Uitterlinden AG Duijn Cv Aulchenko YS Kayser M 《PloS one》2011,6(11):e28145

Multiple loss-of-function (LOF) alleles at the same gene may influence a phenotype not only in the homozygote state when alleles are considered individually, but also in the compound heterozygote (CH) state. Such LOF alleles typically have low frequencies and moderate to large effects. Detecting such variants is of interest to the genetics community, and relevant statistical methods for detecting and quantifying their effects are sorely needed. We present a collapsed double heterozygosity (CDH) test to detect the presence of multiple LOF alleles at a gene. When causal SNPs are available, which may be the case in next generation genome sequencing studies, this CDH test has overwhelmingly higher power than single SNP analysis. When causal SNPs are not directly available such as in current GWA settings, we show the CDH test has higher power than standard single SNP analysis if tagging SNPs are in linkage disequilibrium with the underlying causal SNPs to at least a moderate degree (r²>0.1). The test is implemented for genome-wide analysis in the publicly available software package GenABEL which is based on a sliding window approach. We provide the proof of principle by conducting a genome-wide CDH analysis of red hair color, a trait known to be influenced by multiple loss-of-function alleles, in a total of 7,732 Dutch individuals with hair color ascertained. The association signals at the MC1R gene locus from CDH were uniformly more significant than traditional GWA analyses (the most significant P for CDH = 3.11×10⁻¹⁴² vs. P for rs258322 = 1.33×10⁻⁶⁶). The CDH test will contribute towards finding rare LOF variants in GWAS and sequencing studies.
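The r² criterion mentioned in the abstract above (tagging SNPs in LD with causal SNPs at r²>0.1) can be illustrated with a small sketch. It assumes genotypes coded as allele counts (0, 1, 2) and uses the squared Pearson correlation of those counts, which is a common approximation to haplotype-based r² and not necessarily the estimator used in the paper. The function name `ld_r2` and the genotype vectors are hypothetical.

```python
def ld_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(g1)
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    # Sums of centered cross-products and centered squares.
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    return cov * cov / (v1 * v2)


# Hypothetical allele counts for eight individuals (illustrative only).
tag_snp = [0, 1, 2, 1, 0, 2, 1, 0]
causal_snp = [0, 1, 2, 1, 0, 2, 0, 0]

R2_THRESHOLD = 0.1  # the "moderate" LD cutoff quoted in the abstract
r2 = ld_r2(tag_snp, causal_snp)
print(r2 > R2_THRESHOLD)
```

The sketch assumes both SNPs are polymorphic in the sample; a monomorphic SNP has zero variance and would make the ratio undefined.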

18.

Efficiency of genome-wide association studies in random cross populations

José Marcelo Soriano Viana Gabriel Borges Mundim Hélcio Duarte Pereira Andréa Carla Bastos Andrade Fabyano Fonseca e Silva 《Molecular breeding : new strategies in plant improvement》2017,37(8):102

Genome-wide association studies (GWAS) with plant species have employed inbred line panels. We evaluated the efficiency of GWAS in non-inbred and inbred populations and assessed factors affecting GWAS. Fifty samples of 800 individuals from populations with linkage disequilibrium were simulated. Individuals were genotyped for 10,000 single nucleotide polymorphisms (SNPs) and phenotyped for traits controlled by ten quantitative trait loci (QTLs) and 90 minor genes, assuming different degrees of dominance and broad sense heritabilities of 40 and 80%. The average SNP density was 0.1 centiMorgan (cM) and the QTL heritabilities ranged from 3.2 to 11.8%. The results for random cross populations showed that to increase the QTL detection power, the additive-dominance model must be fitted for traits controlled by dominance effects but must not be fitted for traits showing no dominance. The power of detection was maximized by increasing the sample size to 400 and the false discovery rate (FDR) to 5%. The average power of detection for the low, intermediate, and high heritability QTLs achieved 52.4, 87.0, and 100.0%, respectively. Assuming sample sizes of 400 and 800, the observed FDR was equal to or lower than the specified level of significance. The association mapping was highly precise, since at least 97% of the declared QTLs were detected by the SNP inside it (average bias of 0.4 cM). Besides controlling the FDR, relatedness (and identity by state) efficiently controls the number of significant associations outside the QTL interval (not all false positive associations). The analysis of the inbred random cross population provided essentially the same results as the non-inbred populations.

19.

Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium

Pasaniuc B Zaitlen N Lettre G Chen GK Tandon A Kao WH Ruczinski I Fornage M Siscovick DS Zhu X Larkin E Lange LA Cupples LA Yang Q Akylbekova EL Musani SK Divers J Mychaleckyj J Li M Papanicolaou GJ Millikan RC Ambrosone CB John EM Bernstein L Zheng W Hu JJ Ziegler RG Nyante SJ Bandera EV Ingles SA Press MF Chanock SJ Deming SL Rodriguez-Gil JL Palmer CD Buxbaum S Ekunwe L Hirschhorn JN Henderson BE Myers S Haiman CA Reich D Patterson N Wilson JG Price AL 《PLoS genetics》2011,7(4):e1001371

While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.

20.

Cross-trait prediction accuracy of summary statistics in genome-wide association studies

Bingxin Zhao Fei Zou Hongtu Zhu 《Biometrics》2023,79(2):841-853

In the era of big data, univariate models have widely been used as a workhorse tool for quickly producing marginal estimators; and this is true even in a high-dimensional dense setting, in which many features are “true” but weak signals. Genome-wide association studies (GWAS) epitomize this type of setting. Although the GWAS marginal estimator is popular, it has long been criticized for ignoring the correlation structure of genetic variants (i.e., the linkage disequilibrium [LD] pattern). In this paper, we study the effects of LD pattern on the GWAS marginal estimator and investigate whether or not additionally accounting for the LD can improve the prediction accuracy of complex traits. We consider a general high-dimensional dense setting for GWAS and study a class of ridge-type estimators, including the popular marginal estimator and the best linear unbiased prediction (BLUP) estimator as two special cases. We show that the performance of GWAS marginal estimator depends on the LD pattern through the first three moments of its eigenvalue distribution. Furthermore, we uncover that the relative performance of GWAS marginal and BLUP estimators highly depends on the ratio of GWAS sample size over the number of genetic variants. Particularly, our finding reveals that the marginal estimator can easily become near-optimal within this class when the sample size is relatively small, even though it ignores the LD pattern. On the other hand, BLUP estimator has substantially better performance than the marginal estimator as the sample size increases toward the number of genetic variants, which is typically in millions. Therefore, adjusting for the LD (such as in the BLUP) is most needed when GWAS sample size is large. We illustrate the importance of our results by using the simulated data and real GWAS.
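One way to see how the marginal and BLUP estimators can both sit inside a single ridge-type class, as the abstract above states, is the standard ridge form below. The notation is an assumption and not taken from the paper: $X$ is an $n \times p$ standardized genotype matrix, $y$ the phenotype vector, and $\sigma_\beta^2$, $\sigma_\varepsilon^2$ the per-variant effect and residual variances under a random-effects model.

```latex
% Ridge-type estimator indexed by the penalty \lambda > 0 (assumed notation)
\hat{\beta}(\lambda) = \left( X^{\top} X + \lambda I_p \right)^{-1} X^{\top} y

% BLUP: posterior mean of \beta when \beta \sim N(0, \sigma_\beta^2 I_p)
% and \varepsilon \sim N(0, \sigma_\varepsilon^2 I_n)
\hat{\beta}_{\mathrm{BLUP}} = \hat{\beta}\!\left( \sigma_\varepsilon^2 / \sigma_\beta^2 \right)

% Marginal estimator: large-penalty limit, which drops the LD matrix X^{\top} X
\lim_{\lambda \to \infty} \lambda \, \hat{\beta}(\lambda) = X^{\top} y
  \;\propto\; \hat{\beta}_{\mathrm{marginal}} = \tfrac{1}{n} X^{\top} y
```

Read this way, a finite penalty adjusts for LD through $X^{\top} X$, while the marginal estimator corresponds to the limit in which that adjustment vanishes. The paper's exact parameterization of the class may differ from this sketch.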