Found 20 similar articles (search time: 0 ms)
1.
The multilocus conditional sampling distribution (CSD) describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. The CSD has a wide range of applications in both computational biology and population genomics analysis, including phasing genotype data into haplotype data, imputing missing data, estimating recombination rates, inferring local ancestry in admixed populations, and importance sampling of coalescent genealogies. Unfortunately, the true CSD under the coalescent with recombination is not known, so approximations, formulated as hidden Markov models, have been proposed in the past. These approximations have led to a number of useful statistical tools, but it is important to recognize that they were not derived from, though were certainly motivated by, principles underlying the coalescent process. The goal of this article is to develop a principled approach to derive improved CSDs directly from the underlying population genetics model. Our approach is based on the diffusion process approximation and the resulting mathematical expressions admit intuitive genealogical interpretations, which we utilize to introduce further approximations and make our method scalable in the number of loci. The general algorithm presented here applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model. Empirical results are provided to demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations.
The probability of observing a sample of DNA sequences under a given population genetics model, which is referred to as the sampling probability or likelihood, plays an important role in a wide range of problems in a genetic variation study.
When recombination is involved, however, obtaining an analytic formula for the sampling probability has hitherto remained a challenging open problem (see Jenkins and Song 2009, 2010 for recent progress on this problem). As such, much research (Griffiths and Marjoram 1996; Kuhner et al. 2000; Nielsen 2000; Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008; Wang and Rannala 2008) has focused on developing Monte Carlo methods on the basis of the coalescent with recombination (Griffiths 1981; Kingman 1982a,b; Hudson 1983), a well-established mathematical framework that models the genealogical history of sample chromosomes. These Monte Carlo-based full-likelihood methods mark an important development in population genetics analysis, but a well-known obstacle to their utility is that they tend to be computationally intensive. For a whole-genome variation study, approximations are often unavoidable, and it is therefore important to think of ways to minimize the trade-off between scalability and accuracy.
A popular likelihood-based approximation method that has had a significant impact on population genetics analysis is the following approach introduced by Li and Stephens (2003): Given a set Φ of model parameters (e.g., mutation rate, recombination rate, etc.), the joint probability p(h1, … , hn | Φ) of observing a set {h1, … , hn} of haplotypes sampled from a population can be decomposed as a product of conditional sampling distributions (CSDs), denoted by π,
$$p(h_1, \ldots, h_n \mid \Phi) = \pi(h_1 \mid \Phi)\, \pi(h_2 \mid h_1, \Phi) \cdots \pi(h_n \mid h_1, \ldots, h_{n-1}, \Phi), \tag{1}$$
where π(hk+1|h1, …, hk, Φ) is the probability of an additionally sampled haplotype being of type hk+1, given a set of already observed haplotypes h1, …, hk.
In the presence of recombination, the true CSD π is unknown, so Li and Stephens proposed using an approximate CSD π̂ in place of π, thus obtaining the following approximation of the joint probability:
$$p(h_1, \ldots, h_n \mid \Phi) \approx \hat{\pi}(h_1 \mid \Phi)\, \hat{\pi}(h_2 \mid h_1, \Phi) \cdots \hat{\pi}(h_n \mid h_1, \ldots, h_{n-1}, \Phi). \tag{2}$$
Li and Stephens referred to this approximation as the product of approximate conditionals (PAC) model. In general, the closer π̂ is to the true CSD π, the more accurate the PAC model becomes. Notable applications and extensions of this framework include estimating crossover rates (Li and Stephens 2003; Crawford et al. 2004) and gene conversion parameters (Gay et al. 2007; Yin et al. 2009), phasing genotype data into haplotype data (Stephens and Scheet 2005; Scheet and Stephens 2006), imputing missing data to improve power in association mapping (Stephens and Scheet 2005; Li and Abecasis 2006; Marchini et al. 2007; Howie et al. 2009), inferring local ancestry in admixed populations (Price et al. 2009), inferring human colonization history (Hellenthal et al. 2008), inferring demography (Davison et al. 2009), and so on.
Another problem in which the CSD plays a fundamental role is importance sampling of genealogies under the coalescent process (Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008). In this context, the optimal proposal distribution can be written in terms of the CSD π (Stephens and Donnelly 2000), and as in the PAC model, an approximate CSD π̂ may be used in place of π. The performance of an importance sampling scheme depends critically on the proposal distribution and therefore on the accuracy of the approximation π̂. Often in conjunction with composite-likelihood frameworks (Hudson 2001; Fearnhead and Donnelly 2002), importance sampling has been used in estimating fine-scale recombination rates (McVean et al.
2004; Fearnhead and Smith 2005; Johnson and Slatkin 2009).
So far, a significant scope of intuition has gone into choosing the approximate CSDs used in these problems (Marjoram and Tavaré 2006). In the case of completely linked loci, Stephens and Donnelly (2000) suggested constructing an approximation by assuming that the additional haplotype hk+1 is an imperfect copy of one of the first k haplotypes, with copying errors corresponding to mutation. Fearnhead and Donnelly (2001) generalized this construction to include crossover recombination, assuming that the haplotype hk+1 is an imperfect mosaic of the first k haplotypes (i.e., hk+1 is obtained by copying segments from h1, …, hk, where crossover recombination can change the haplotype from which copying is performed). The associated CSD, which we denote by π̂_FD, can be interpreted as a hidden Markov model and so admits an efficient dynamic programming solution. Finally, Li and Stephens (2003) proposed a modification to Fearnhead and Donnelly's model that limits the hidden state space, thereby providing a computational simplification; we denote the corresponding approximate CSD by π̂_LS.
Although these approaches are computationally appealing, it is important to note that they are not derived from, though are certainly motivated by, principles underlying typical population genetics models, in particular the coalescent process (Griffiths 1981; Kingman 1982a,b; Hudson 1983). The main objective of this article is to develop a principled technique to derive an improved CSD directly from the underlying population genetics model. Rather than relying on intuition, we base our work on a mathematical foundation. The theoretical framework we employ is the diffusion process. De Iorio and Griffiths (2004a,b) first introduced the diffusion-generator approximation technique to obtain an approximate CSD in the case of a single locus (i.e., no recombination). Griffiths et al.
(2008) later extended the approach to two loci to include crossover recombination, assuming a parent-independent mutation model at each locus. In this article, we extend the framework to develop a general algorithm that applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model.
Our work can be summarized as follows. Using the diffusion-generator approximation technique, we derive a recursion relation satisfied by an approximate CSD. This recursion can be used to construct a closed system of coupled linear equations, in which the conditional sampling probability of interest appears as one of the unknown variables. The system of equations can be solved using standard numerical analysis techniques. However, the size of the system grows superexponentially with the number of loci and, consequently, so does the running time. To remedy this drawback, we introduce additional approximations to make our approach scalable in the number of loci. Specifically, the recursion admits an intuitive genealogical interpretation, and, on the basis of this interpretation, we propose modifications to the recursion, which then can be easily solved using dynamic programming. The computational complexity of the modified algorithm is polynomial in the number of loci, and, importantly, the resulting CSD has little loss of accuracy compared to that following from the full recursion.
The accuracy of approximate CSDs has not been discussed much in the literature, except in the application-specific context for which they are being employed. In this article, we carry out an empirical study to explicitly test the accuracy of various CSDs and demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations. We also consider the PAC framework and show that our approximations also produce more accurate PAC-likelihood estimates.
We note that for the maximum-likelihood estimation of recombination rates, the actual value of the likelihood may not be so important, as long as it is maximized near the true recombination rate. However, in many other applications (e.g., phasing genotype data into haplotype data, imputing missing data, importance sampling, and so on), the accuracy of the CSD and PAC-likelihood function over a wide range of parameter values may be important. Thus, we believe that the theoretical work presented here will have several practical implications; our method can be applied in a wide range of statistical tools that use CSDs, improving their accuracy.
The remainder of this article is organized as follows. To provide intuition for the ensuing mathematics, we first describe a genealogical process that gives rise to our CSD. Using our genealogical interpretation, we consider two additional approximations and relate these to previously proposed CSDs. Then, in the following section, we derive our CSD using the diffusion-generator approach and provide mathematical statements for the additional approximations; some interesting limiting behavior is also described there. This section is self-contained and may be skipped by the reader uninterested in mathematical details. Finally, in the subsequent section, we carry out a simulation study to compare the accuracy of various approximate CSDs and demonstrate that ours are generally the most accurate.
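The copying-model idea and the PAC decomposition described in this entry can be illustrated with a minimal sketch. It is not the diffusion-based CSD that the article derives, and it does not reproduce the exact parameterization of Fearnhead and Donnelly or of Li and Stephens. The function names (copying_csd, pac_likelihood), the biallelic 0/1 encoding, the constant per-locus copy-error probability mu, the constant switch probability rho between adjacent loci, and the uniform distribution used for the first haplotype are all assumptions made for illustration.

```python
from math import prod


def copying_csd(new_hap, panel, mu=0.01, rho=0.05):
    """Approximate CSD of new_hap given panel under a simple copying HMM.

    Hidden state at each locus: which panel haplotype is being copied.
    mu and rho are illustrative constants, not estimates from any model.
    """
    n_loci = len(new_hap)
    k = len(panel)
    if k == 0:
        # Assumption: with nothing to copy from, each biallelic locus is uniform.
        return 0.5 ** n_loci

    def emit(observed, copied):
        # The copied allele is reproduced faithfully with probability 1 - mu.
        return 1.0 - mu if observed == copied else mu

    # Forward algorithm: start by copying a uniformly chosen panel haplotype.
    forward = [emit(new_hap[0], hap[0]) / k for hap in panel]
    for locus in range(1, n_loci):
        total = sum(forward)
        # Stay on the same haplotype with probability 1 - rho, otherwise
        # jump to a uniformly chosen haplotype (possibly the same one).
        forward = [
            emit(new_hap[locus], hap[locus]) * ((1.0 - rho) * f + rho * total / k)
            for hap, f in zip(panel, forward)
        ]
    return sum(forward)


def pac_likelihood(haplotypes, mu=0.01, rho=0.05):
    """Product of approximate conditionals over the haplotypes, in the given order."""
    return prod(
        copying_csd(hap, haplotypes[:i], mu, rho)
        for i, hap in enumerate(haplotypes)
    )
```

Because the emission and transition probabilities are each normalized, copying_csd sums to one over all possible new haplotypes for a fixed panel, which is a quick sanity check on any approximate CSD of this form. The forward pass costs O(k) per locus here because the switch term is shared across states.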
2.
Throughout history, the population size of modern humans has varied considerably due to changes in environment, culture, and technology. More accurate estimates of population size changes, and when they occurred, should provide a clearer picture of human colonization history and help remove confounding effects from natural selection inference. Demography influences the pattern of genetic variation in a population, and thus genomic data of multiple individuals sampled from one or more present-day populations contain valuable information about the past demographic history. Recently, Li and Durbin developed a coalescent-based hidden Markov model, called the pairwise sequentially Markovian coalescent (PSMC), for a pair of chromosomes (or one diploid individual) to estimate past population sizes. This is an efficient, useful approach, but its accuracy in the very recent past is hampered by the fact that, because of the small sample size, only a few coalescence events occur in that period. Multiple genomes from the same population contain more information about the recent past, but are also more computationally challenging to study jointly in a coalescent framework. Here, we present a new coalescent-based method that can efficiently infer population size changes from multiple genomes, providing access to a new store of information about the recent past. Our work generalizes the recently developed sequentially Markov conditional sampling distribution framework, which provides an accurate approximation of the probability of observing a newly sampled haplotype given a set of previously sampled haplotypes. Simulation results demonstrate that we can accurately reconstruct the true population histories, with a significant improvement over the PSMC in the recent past. We apply our method, called diCal, to the genomes of multiple human individuals of European and African ancestry to obtain a detailed population size change history during recent times.
3.
Wim J. van der Steen 《Acta biotheoretica》1998,46(4):369-377
Research in behaviour genetics uncovers causes of behaviour at the population level. For inferences about individuals we also need to know how genes and the environment affect phenotypes. Behaviour genetics fosters a biased view of individual behaviour since it identifies the environment with psychosocial factors and disregards ecology.
4.
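Plant chemical genetics: a brand-new research method for plant genetics  Total citations: 1 (self-citations: 0; citations by others: 1)
Chemical genetics (also called chemical genomics) uses bioactive small molecules to perturb protein interactions in order to study the associated biological phenomena; it supplements and extends conventional genetic approaches. Within just a few years, the application of chemical genetics to plant science, known as plant chemical genetics, has begun to resolve some long-standing research problems in plant molecular biology by virtue of its distinctive advantages as a new genetic approach (for example, it can overcome the genetic redundancy and mutant lethality problems of conventional genetics, and it can provide conditional genetic perturbation of specific strength and at specific time points). This paper reviews the general principles and methods of plant chemical genetics, as well as its advantages and characteristics as a new genetic research approach.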
5.
Frederick Hecht 《American journal of human genetics》1983,35(1):159-160
6.
7.
Dr. Satya N. Mishra Prof. Edward J. Dudewicz 《Biometrical journal. Biometrische Zeitschrift》1987,29(4):471-483
The problem of selecting a “best” (largest mean, or smallest mean) population from a collection of k independent populations was formulated and solved by Bechhofer (1954). Gupta (1965) solved another important problem, that of selecting a subset of populations containing the “best” population from the original collection of populations. Since then many variations of the problem have been considered. Tong (1969) and Lewis (1980) have investigated the problem of selecting extreme populations (populations with a largest, and populations with a smallest, mean) with respect to one and two standard populations, respectively. In this paper we study the selection of extreme populations in the absence of any standard population. We formulate subset-selection procedures when variances are known and equal, and also in the most general case when they are unknown and unequal. Nonexistence of a single-stage procedure is noted for this latter case (even if variances are equal). A two-stage procedure and some of its associated properties are discussed. Tables needed for application are provided, as is a worked example.
8.
Using a distribution-free approach, a modification of the usual procedure for selecting the better of two treatments is presented. Here the possibility of no selection when the treatments appear to be ‘equivalent’ is allowed. The sample size and the constant needed to implement the proposed procedure are determined by controlling the probabilities of a correct selection and a wrong selection when the two treatments are not equivalent.
9.
Data from the Greater Manchester Butterfly Atlas (UK) reveal a highly significant and substantial impact of the number of visits on both species richness and species incidence in squares. This effect has been demonstrated for three different zones mapped at different scales. The significant impact of number of visits persists when data are amalgamated for coarser scales. The findings demonstrate that it is essential for distribution mapping projects to record data on recording effort as well as on the target organisms. Suggestions are made as to how distribution mapping may be improved, including a geographically and environmentally representative structure of permanently monitored squares and closer links between distribution mapping and the Butterfly Monitoring Scheme (BMS), which primarily monitors changes in butterfly populations. The benefit to conservation will be data that can be better used to analyse the reasons for changes in ranges and distributions, fundamental for determining priorities and policy decisions.
10.
A deterministic two-loci model was developed to predict genetic response to marker-assisted selection (MAS) in one generation and in multiple generations. Formulas were derived to relate linkage disequilibrium in a population to the proportion of additive genetic variance used by MAS, and in turn to an extra improvement in genetic response over phenotypic selection. Predictions of the response were compared to those predicted by using an infinite-loci model and the factors affecting efficiency of MAS were examined. Theoretical analyses of the present study revealed the nonlinearity between the selection intensity and genetic response in MAS. In addition to the heritability of the trait and the proportion of the marker-associated genetic variance, the frequencies of the selectively favorable alleles at the two loci, one marker and one quantitative trait locus, were found to play an important role in determining both the short- and long-term efficiencies of MAS. The evolution of linkage disequilibrium and thus the genetic response over several generations were predicted theoretically and examined by simulation. MAS dissipated the disequilibrium more quickly than drift alone. In some cases studied, the rate of dissipation was as large as that to be expected in the circumstance where the true recombination fraction was increased by three times and selection was absent.
11.
12.
13.
14.
Michael P. McAssey Fetsje Bijma Bernadetta Tarigan Jaap van Pelt Arjen van Ooyen Mathisca de Gunst 《PloS one》2014,9(1)
Neuronal signal integration and information processing in cortical neuronal networks critically depend on the organization of synaptic connectivity. Because of the challenges involved in measuring a large number of neurons, synaptic connectivity is difficult to determine experimentally. Current computational methods for estimating connectivity typically rely on the juxtaposition of experimentally available neurons and applying mathematical techniques to compute estimates of neural connectivity. However, since the number of available neurons is very limited, these connectivity estimates may be subject to large uncertainties. We use a morpho-density field approach applied to a vast ensemble of model-generated neurons. A morpho-density field (MDF) describes the distribution of neural mass in the space around the neural soma. The estimated axonal and dendritic MDFs are derived from 100,000 model neurons that are generated by a stochastic phenomenological model of neurite outgrowth. These MDFs are then used to estimate the connectivity between pairs of neurons as a function of their inter-soma displacement. Compared with other density-field methods, our approach to estimating synaptic connectivity uses fewer restricting assumptions and produces connectivity estimates with a lower standard deviation. An important requirement is that the model-generated neurons reflect accurately the morphology and variation in morphology of the experimental neurons used for optimizing the model parameters. As such, the method remains subject to the uncertainties caused by the limited number of neurons in the experimental data set and by the quality of the model and the assumptions used in creating the MDFs and in estimating connectivity.
In summary, MDFs are a powerful tool for visualizing the spatial distribution of axonal and dendritic densities, for estimating the number of potential synapses between neurons with low standard deviation, and for obtaining a greater understanding of the relationship between neural morphology and network connectivity.
15.
16.
17.
Sampling along roadsides is convenient and is widely practiced in insect population research. Ecological conditions in road verges are very different from those prevailing in natural habitats, and they affect the annual growth of plants in semi-arid and arid regions. This in turn may improve the development, survival and abundance of insects feeding on plants growing along roadsides. These trends may bias the results of sampling. To verify this assertion, we quantified the effects of growing along roadsides on the annual growth of Pistacia atlantica trees and Pistacia palaestina shrubs and compared two demographic indexes of nine gall-inducing aphid species on trees growing along roads with those on trees in the open landscape, in Israel. The annual growth of the two host plants was significantly more vigorous along roadsides than away from roads. Tests of Combined Probabilities showed that P. atlantica and P. palaestina are more likely to be parasitized by more galls of Fordini species along roadsides than away from roads. Moreover, in the semi-dry regions of Israel, three aphid species on P. atlantica and five species on P. palaestina induced more galls in plants growing along roads than away from roads, while in the rainy Northern region, the difference between the two habitats was not significant. These results indicate a biased evaluation of population size in roadside habitat, which has to be accounted for in research on insect-plant relations.
18.
19.
20.