A Principled Approach to Deriving Approximate Conditional Sampling Distributions in Population Genetics Models with Recombination



Authors:	Joshua S. Paul, Yun S. Song

Institution:	Computer Science Division and Department of Statistics, University of California, Berkeley, California 94720

Abstract:	The multilocus conditional sampling distribution (CSD) describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. The CSD has a wide range of applications in both computational biology and population genomics analysis, including phasing genotype data into haplotype data, imputing missing data, estimating recombination rates, inferring local ancestry in admixed populations, and importance sampling of coalescent genealogies. Unfortunately, the true CSD under the coalescent with recombination is not known, so approximations, formulated as hidden Markov models, have been proposed in the past. These approximations have led to a number of useful statistical tools, but it is important to recognize that they were not derived from, though were certainly motivated by, principles underlying the coalescent process. The goal of this article is to develop a principled approach to derive improved CSDs directly from the underlying population genetics model. Our approach is based on the diffusion process approximation and the resulting mathematical expressions admit intuitive genealogical interpretations, which we utilize to introduce further approximations and make our method scalable in the number of loci. The general algorithm presented here applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model. Empirical results are provided to demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations.

The probability of observing a sample of DNA sequences under a given population genetics model, which is referred to as the sampling probability or likelihood, plays an important role in a wide range of problems in a genetic variation study. 
When recombination is involved, however, obtaining an analytic formula for the sampling probability has hitherto remained a challenging open problem (see Jenkins and Song 2009, 2010 for recent progress on this problem). As such, much research (Griffiths and Marjoram 1996; Kuhner et al. 2000; Nielsen 2000; Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008; Wang and Rannala 2008) has focused on developing Monte Carlo methods on the basis of the coalescent with recombination (Griffiths 1981; Kingman 1982a,b; Hudson 1983), a well-established mathematical framework that models the genealogical history of sample chromosomes. These Monte Carlo-based full-likelihood methods mark an important development in population genetics analysis, but a well-known obstacle to their utility is that they tend to be computationally intensive. For a whole-genome variation study, approximations are often unavoidable, and it is therefore important to think of ways to minimize the trade-off between scalability and accuracy.

A popular likelihood-based approximation method that has had a significant impact on population genetics analysis is the following approach introduced by Li and Stephens (2003): Given a set Φ of model parameters (e.g., mutation rate, recombination rate, etc.), the joint probability p(h₁, …, h_n | Φ) of observing a set {h₁, …, h_n} of haplotypes sampled from a population can be decomposed as a product of conditional sampling distributions (CSDs), denoted by π,

\[ p(h_1, \ldots, h_n \mid \Phi) = \pi(h_1 \mid \Phi) \, \pi(h_2 \mid h_1, \Phi) \cdots \pi(h_n \mid h_1, \ldots, h_{n-1}, \Phi), \tag{1} \]

where π(h_{k+1} | h₁, …, h_k, Φ) is the probability of an additionally sampled haplotype being of type h_{k+1}, given a set of already observed haplotypes h₁, …, h_k. 
In the presence of recombination, the true CSD π is unknown, so Li and Stephens proposed using an approximate CSD π̂ in place of π, thus obtaining the following approximation of the joint probability:

\[ p(h_1, \ldots, h_n \mid \Phi) \approx \hat{\pi}(h_1 \mid \Phi) \, \hat{\pi}(h_2 \mid h_1, \Phi) \cdots \hat{\pi}(h_n \mid h_1, \ldots, h_{n-1}, \Phi). \tag{2} \]

Li and Stephens referred to this approximation as the product of approximate conditionals (PAC) model. In general, the closer π̂ is to the true CSD π, the more accurate the PAC model becomes. Notable applications and extensions of this framework include estimating crossover rates (Li and Stephens 2003; Crawford et al. 2004) and gene conversion parameters (Gay et al. 2007; Yin et al. 2009), phasing genotype data into haplotype data (Stephens and Scheet 2005; Scheet and Stephens 2006), imputing missing data to improve power in association mapping (Stephens and Scheet 2005; Li and Abecasis 2006; Marchini et al. 2007; Howie et al. 2009), inferring local ancestry in admixed populations (Price et al. 2009), inferring human colonization history (Hellenthal et al. 2008), inferring demography (Davison et al. 2009), and so on.

Another problem in which the CSD plays a fundamental role is importance sampling of genealogies under the coalescent process (Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008). In this context, the optimal proposal distribution can be written in terms of the CSD π (Stephens and Donnelly 2000), and as in the PAC model, an approximate CSD π̂ may be used in place of π. The performance of an importance sampling scheme depends critically on the proposal distribution and therefore on the accuracy of the approximation π̂. Often in conjunction with composite-likelihood frameworks (Hudson 2001; Fearnhead and Donnelly 2002), importance sampling has been used in estimating fine-scale recombination rates (McVean et al. 
2004; Fearnhead and Smith 2005; Johnson and Slatkin 2009).

So far, a significant amount of intuition has gone into choosing the approximate CSDs used in these problems (Marjoram and Tavaré 2006). In the case of completely linked loci, Stephens and Donnelly (2000) suggested constructing an approximation by assuming that the additional haplotype h_{k+1} is an imperfect copy of one of the first k haplotypes, with copying errors corresponding to mutation. Fearnhead and Donnelly (2001) generalized this construction to include crossover recombination, assuming that the haplotype h_{k+1} is an imperfect mosaic of the first k haplotypes (i.e., h_{k+1} is obtained by copying segments from h₁, …, h_k, where crossover recombination can change the haplotype from which copying is performed). The associated CSD, which we denote by π̂_FD, can be interpreted as a hidden Markov model and so admits an efficient dynamic programming solution. Finally, Li and Stephens (2003) proposed a modification to Fearnhead and Donnelly's model that limits the hidden state space, thereby providing a computational simplification; we denote the corresponding approximate CSD by π̂_LS.

Although these approaches are computationally appealing, it is important to note that they are not derived from, though are certainly motivated by, principles underlying typical population genetics models, in particular the coalescent process (Griffiths 1981; Kingman 1982a,b; Hudson 1983). The main objective of this article is to develop a principled technique to derive an improved CSD directly from the underlying population genetics model. Rather than relying on intuition, we base our work on a mathematical foundation. The theoretical framework we employ is the diffusion process. De Iorio and Griffiths (2004a,b) first introduced the diffusion-generator approximation technique to obtain an approximate CSD in the case of a single locus (i.e., no recombination). Griffiths et al. 
(2008) later extended the approach to two loci to include crossover recombination, assuming a parent-independent mutation model at each locus. In this article, we extend the framework to develop a general algorithm that applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model.

Our work can be summarized as follows. Using the diffusion-generator approximation technique, we derive a recursion relation satisfied by an approximate CSD. This recursion can be used to construct a closed system of coupled linear equations, in which the conditional sampling probability of interest appears as one of the unknown variables. The system of equations can be solved using standard numerical analysis techniques. However, the size of the system grows superexponentially with the number of loci and, consequently, so does the running time. To remedy this drawback, we introduce additional approximations to make our approach scalable in the number of loci. Specifically, the recursion admits an intuitive genealogical interpretation, and, on the basis of this interpretation, we propose modifications to the recursion, which then can be easily solved using dynamic programming. The computational complexity of the modified algorithm is polynomial in the number of loci, and, importantly, the resulting CSD has little loss of accuracy compared to that following from the full recursion.

The accuracy of approximate CSDs has not been discussed much in the literature, except in the application-specific context for which they are being employed. In this article, we carry out an empirical study to explicitly test the accuracy of various CSDs and demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations. We also consider the PAC framework and show that our approximations also produce more accurate PAC-likelihood estimates. 
We note that for the maximum-likelihood estimation of recombination rates, the actual value of the likelihood may not be so important, as long as it is maximized near the true recombination rate. However, in many other applications (e.g., phasing genotype data into haplotype data, imputing missing data, importance sampling, and so on), the accuracy of the CSD and PAC-likelihood function over a wide range of parameter values may be important. Thus, we believe that the theoretical work presented here will have several practical implications; our method can be applied in a wide range of statistical tools that use CSDs, improving their accuracy.

The remainder of this article is organized as follows. To provide intuition for the ensuing mathematics, we first describe a genealogical process that gives rise to our CSD. Using our genealogical interpretation, we consider two additional approximations and relate these to previously proposed CSDs. Then, in the following section, we derive our CSD using the diffusion-generator approach and provide mathematical statements for the additional approximations; some interesting limiting behavior is also described there. This section is self-contained and may be skipped by the reader uninterested in mathematical details. Finally, in the subsequent section, we carry out a simulation study to compare the accuracy of various approximate CSDs and demonstrate that ours are generally the most accurate.
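
The mosaic copying construction and the PAC product described in the introduction can be illustrated with a minimal Python sketch. It is not the CSD derived in this article. It is a simplified copying hidden Markov model in the style of Li and Stephens (2003), assuming biallelic loci coded 0/1, a single recombination parameter `rho` shared by every pair of adjacent loci, and a mutation parameter `theta`. The function names, this parameterization, and the uniform probability assigned to the first haplotype are illustrative assumptions.

```python
import math


def copying_csd(new_hap, panel, rho, theta):
    """Approximate CSD of new_hap given the haplotypes in panel.

    Hidden state at each locus: index of the panel haplotype being copied.
    Computed with the HMM forward algorithm in O(k * L) time.
    """
    k = len(panel)
    n_loci = len(new_hap)
    if k == 0:
        # Assumption: with nothing to copy from, each allele is equally likely.
        return 0.5 ** n_loci

    # Probability that the copied allele is changed to the other allele.
    p_mut = 0.5 * theta / (k + theta)
    # Probability of no recombination between adjacent loci; on a
    # recombination the copied haplotype is redrawn uniformly from the panel.
    p_stay = math.exp(-rho / k)

    def emit(j, locus):
        return 1.0 - p_mut if panel[j][locus] == new_hap[locus] else p_mut

    # Initial distribution: copy from each panel haplotype with probability 1/k.
    forward = [emit(j, 0) / k for j in range(k)]
    for locus in range(1, n_loci):
        total = sum(forward)
        forward = [
            emit(j, locus) * (p_stay * forward[j] + (1.0 - p_stay) * total / k)
            for j in range(k)
        ]
    return sum(forward)


def pac_likelihood(haps, rho, theta):
    """Product of approximate conditionals, taking haps in the given order."""
    likelihood = 1.0
    for i, hap in enumerate(haps):
        likelihood *= copying_csd(hap, haps[:i], rho, theta)
    return likelihood


if __name__ == "__main__":
    panel = [(0, 0, 1), (0, 1, 1), (1, 1, 0)]
    print(copying_csd((0, 1, 0), panel, rho=1.0, theta=0.1))
    print(pac_likelihood(panel, rho=1.0, theta=0.1))
```

Because the transition and emission probabilities each sum to one, this sketch defines a proper distribution over all possible new haplotypes for a fixed panel, which is a useful sanity check when experimenting with other parameter choices.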

