Bayesian Quantitative Trait Locus Mapping Using Inferred Haplotypes |
| |
Authors: | Caroline Durrant, Richard Mott |
| |
Affiliation: | Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom |
| |
Abstract: | We describe a fast hierarchical Bayesian method for mapping quantitative trait loci (QTL) by haplotype-based association, applicable when haplotypes are not observed directly but are inferred from multiple marker genotypes. The method avoids Markov chain Monte Carlo (MCMC) sampling by employing priors for which the likelihood factorizes completely. It is parameterized by a single hyperparameter, the fraction of variance explained by the QTL, in contrast to the frequentist fixed-effects model, which requires a parameter for the phenotypic effect of each combination of haplotypes; nevertheless it still provides estimates of haplotype effects. We use simulation to show that the method matches the power of the frequentist regression model and, when the haplotypes are inferred, exceeds it for small QTL effect sizes. The Bayesian estimates of the haplotype effects are more accurate than the frequentist estimates, for both known and inferred haplotypes, which indicates that this advantage is independent of the effect of uncertainty in haplotype inference and will hold in comparison with frequentist methods in general. We apply the method to data from a panel of recombinant inbred lines of Arabidopsis thaliana, descended from 19 inbred founders.
As the power of haplotypic association has become better appreciated, studies using inferred multiallelic loci (i.e., haplotypes or pairs of haplotypes) are becoming more common. Single-nucleotide polymorphisms (SNPs), the most commonly used type of marker, can lose much of their power to detect QTL when allele frequencies differ between the SNP and the causative variant. Multiallelic markers contain more information and have greater power than SNPs for QTL mapping, but they are more costly and cumbersome. Consequently a major analytical advance has been the combination of multiple SNP marker information, either to infer haplotypes as in many human association studies or to infer the mosaic of ancestral founder haplotypes in synthetic populations descended from multiple founder strains. The latter scenario includes crosses between inbred strains of mice or rats or inbred accessions of plants.
However, there are two potential difficulties with haplotypic association. First, in a fixed-effects framework, a parameter is estimated for each haplotype, which is undesirable when the number of haplotypes is large. In a synthetic population descended from N inbred strains, up to N haplotypes may segregate; for the mouse collaborative cross (Threadgill et al. 2002) N = 8 and for the Arabidopsis thaliana multiparent advanced generation intercross (MAGIC) population of recombinant inbred lines, N = 19 (Kover et al. 2009). For complex traits, where many QTL are expected to segregate, multiple QTL mapping only exacerbates the problem of the number of parameters.
Second, one must account for uncertainty in the inference of haplotypes, which depends on the marker density and how well one can distinguish between all founders at a locus. At some loci the founders' haplotypes may be identical, for example, in crosses descended from inbred strains of mice.
These problems are well known in haplotype association mapping involving human populations, where in general fixed-effects regression modeling is used. Consequently methods have been developed to reduce the number of haplotype groups at a marker locus, using hierarchical clustering and Bayesian partitioning algorithms (Molitor et al. 2003; Durrant et al. 2004; Bardel et al. 2006; Morris 2006; Tzeng et al. 2006; Waldron et al. 2006; Igo et al. 2007; Liu et al. 2007; Tachmazidou et al. 2007; Knight et al. 2008).
Bayesian methods are increasingly the approach of choice for QTL mapping, particularly for multiple QTL mapping and the modeling of interactions (Yi and Shriner 2008). The hierarchical Bayesian framework can accommodate more complicated models with more parameters, even when there are many more parameters than observations (Meuwissen et al. 2001; Xu 2003). The Bayesian approach has an additional advantage when the inferred haplotypes are not all identifiable: reliable estimates of haplotype effects can still be obtained because the shrinkage effect of the prior distribution restricts the posterior. However, these methods must be fast if such complex analyses are to be practical.
In a hierarchical model the key problem is how to model the distribution of the variance attributable to a QTL and its prior. Meuwissen et al. (2001) consider a hierarchical Bayesian random-effects model (HBREM) for observed multiallelic marker loci. They choose normal priors centered at zero for the individual genotype effects, with different variances for each locus. The prior distributions for the variance parameters are scaled inverse chi-square, with parameters chosen to give the mean and variance preestimated from the data. However, this prior assigns only a tiny probability to a QTL effect of zero, whereas a zero effect is clearly very likely at most loci in a genome scan. Hence they also proposed an alternative prior, a mixture of a point mass at zero and a scaled inverse chi-square distribution, which gave better results.
Xu (2003) considers a noninformative Jeffreys' prior on the locus variance. The model fits all markers simultaneously and can detect large-effect QTL with little noise at other markers, despite the negligible probability of zero locus variance. However, the model is limited to markers with two or three possible genotypes. Wang et al. (2005) extend this approach to inferred genotypes, but still with only two or three possible genotypes per locus, and the method is very computationally intensive.
Yi and Xu (2008) argue that the noninformative Jeffreys' prior on the locus variance induces constant shrinkage on the haplotype effects and that it would be preferable to vary shrinkage according to the data. They compare exponential and scaled inverse chi-square priors on the locus variance, using hyperparameters with vague hyperpriors. They also consider a second prior on the haplotype effects (first proposed by Park and Casella 2008), a normal distribution with variance proportional to the residual error variance. The four models performed equally well when tested on populations with only two genotypes segregating at a locus.
There are several frequentist approaches to dealing with haplotype uncertainty in QTL mapping. One is to perform a fixed-effects multiple linear regression or generalized linear regression of the phenotype, treating the haplotype probabilities at the locus as the design matrix (Haley and Knott 1992; Mott et al. 2000). Another is to use multiple imputation to draw samples of haplotypes from the haplotype probabilities (Sen and Churchill 2001). A third is to use the expectation-maximization (EM) algorithm to estimate the haplotypes (Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long et al. 1995; Qin et al. 2002; Lin et al. 2005, 2008; Lin and Zeng 2006; Zeng et al. 2006). An alternative is data expansion, where instead of multiple imputation, the data set is expanded by drawing 10–20 replicate haplotype pairs for every individual from their inferred probability distribution, assigning the same value of the response variable to each, and analyzing the expanded data set. However, this may alter the characteristics of the data, such as the haplotype frequencies.
In a Bayesian setting, haplotype uncertainty can be accommodated either by including the predictor variables as unknowns in the updating procedure or by multiple imputation. In a fully Bayesian treatment, the unknown haplotype pair assignments are given priors and estimated along with the model parameters. However, MCMC is then needed to fit the model, updating the parameters on the basis of the haplotype pairs and then updating the haplotype pairs on the basis of the parameters. Updating the haplotype assignments by MCMC is slow and suffers from the label-switching problem, among others (Jasra et al. 2005), so an alternative approach would be preferable.
In this article, we present a new HBREM for QTL mapping applicable to observed or inferred haplotypes. It does not require costly MCMC techniques, since the joint posterior distribution factorizes. It parameterizes the variance terms in the model, focusing on the proportion of the variance due to the QTL. We compare its performance with that of the frequentist fixed-effects model for both observed and inferred multiallelic loci. We show first that the posterior mode of the proportion of variance due to a locus is a better outcome measure than two standard Bayesian test statistics and second that the Bayesian estimates of the individual haplotype effects are much more accurate than the corresponding frequentist estimates. Finally we analyze real data from A. thaliana recombinant inbred lines descended from 19 parental lines. |
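As a point of orientation for the frequentist comparator described above (fixed-effects regression with the haplotype probabilities as the design matrix, in the style of Haley and Knott 1992), the following is a minimal Python sketch on simulated data. It is not the authors' implementation: the function name `haplotype_probability_regression`, the simulated effect sizes, and the way haplotype uncertainty is mimicked (mixing the true assignment with a random Dirichlet draw) are all assumptions made purely for illustration.

```python
import numpy as np


def haplotype_probability_regression(y, P):
    """Regress phenotype y (n,) on haplotype probabilities P (n, K).

    Each row of P sums to 1, so the columns of P already span the
    intercept and no separate intercept column is added. Returns the
    least-squares haplotype effects and the F statistic comparing the
    fit against the mean-only (no QTL) model.
    """
    y = np.asarray(y, dtype=float)
    P = np.asarray(P, dtype=float)
    n = P.shape[0]

    # One fixed effect per haplotype: this is the parameter count that
    # grows with the number of founders.
    beta, _, rank, _ = np.linalg.lstsq(P, y, rcond=None)

    rss_full = float(np.sum((y - P @ beta) ** 2))
    rss_null = float(np.sum((y - y.mean()) ** 2))
    df_num = rank - 1   # haplotype effects beyond the overall mean
    df_den = n - rank   # residual degrees of freedom
    f_stat = ((rss_null - rss_full) / df_num) / (rss_full / df_den)
    return beta, f_stat


# --- Simulated illustration (all values below are hypothetical) ---
rng = np.random.default_rng(0)
n, K = 400, 4
true_effects = np.array([-1.0, 0.0, 0.5, 1.5])

# True haplotype carried by each (inbred) individual, as one-hot rows.
hap = rng.integers(0, K, size=n)
P_known = np.eye(K)[hap]

# Mimic inference uncertainty: blend the truth with a random
# probability vector, using an individual-specific weight.
w = rng.uniform(0.5, 1.0, size=n)[:, None]
P_inferred = w * P_known + (1.0 - w) * rng.dirichlet(np.ones(K), size=n)

y = true_effects[hap] + rng.normal(0.0, 1.0, size=n)

beta_known, f_known = haplotype_probability_regression(y, P_known)
beta_inferred, f_inferred = haplotype_probability_regression(y, P_inferred)

print("effects (known haplotypes):   ", np.round(beta_known, 2))
print("effects (inferred haplotypes):", np.round(beta_inferred, 2))
print("F statistics (known, inferred):", round(f_known, 1), round(f_inferred, 1))
```

The sketch estimates one free parameter per haplotype with no shrinkage, which is the property the text contrasts with the single-hyperparameter hierarchical model; with N = 19 founders the same code would fit 19 effects per locus.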