Similar Articles
20 similar articles found.
1.
The choice of summary statistics is a crucial step in approximate Bayesian computation (ABC). Since statistics are often not sufficient, this choice involves a trade-off between loss of information and reduction of dimensionality. The latter may increase the efficiency of ABC. Here, we propose an approach for choosing summary statistics based on boosting, a technique from the machine-learning literature. We consider different types of boosting and compare them to partial least-squares regression as an alternative. To mitigate the lack of sufficiency, we also propose an approach for choosing summary statistics locally, in the putative neighborhood of the true parameter value. We study a demographic model motivated by the reintroduction of Alpine ibex (Capra ibex) into the Swiss Alps. The parameters of interest are the mean and standard deviation across microsatellites of the scaled ancestral mutation rate (θ_anc = 4N_e u) and the proportion of males obtaining access to matings per breeding season (ω). By simulation, we assess the properties of the posterior distribution obtained with the various methods. According to our criteria, ABC with summary statistics chosen locally via boosting with the L2-loss performs best. Applying that method to the ibex data, we estimate θ_anc ≈ 1.288 and find that most of the variation across loci of the ancestral mutation rate u is between 7.7 × 10−4 and 3.5 × 10−3 per locus per generation. The proportion of males with access to matings is estimated as ω ≈ 0.21, which is in good agreement with recent independent estimates.
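To make the selection step concrete, here is a minimal sketch of statistic choice via L2-boosting: regress the parameter on a pool of candidate summaries using squared-error gradient boosting with depth-1 trees (componentwise stumps) and keep the statistics the booster weights most heavily. The simulator and the informative-statistic structure below are invented stand-ins, not the authors' ibex pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Stand-in for the coalescent simulator: parameter draws from the prior,
# each paired with a pool of candidate summary statistics.
n_sims, n_stats = 5000, 30
theta = rng.uniform(0.1, 10.0, n_sims)        # parameter of interest
stats = rng.normal(size=(n_sims, n_stats))
stats[:, 0] += np.log(theta)                  # make a few statistics informative
stats[:, 1] += 0.5 * theta

# Boosting with the L2 (squared-error) loss on depth-1 trees.
booster = GradientBoostingRegressor(loss="squared_error", n_estimators=200,
                                    max_depth=1, learning_rate=0.1)
booster.fit(stats, theta)

# Retain the candidate statistics the booster actually relies on.
ranking = np.argsort(booster.feature_importances_)[::-1]
print("top-ranked summary statistics:", ranking[:5])
```

Local selection in the sense of the abstract would repeat this fit using only the simulations whose statistics already fall near the observed ones.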

2.
In recent years approximate Bayesian computation (ABC) methods have become popular in population genetics as an alternative to full-likelihood methods to make inferences under complex demographic models. Most ABC methods rely on the choice of a set of summary statistics to extract information from the data. In this article we tested the use of the full allelic distribution directly in an ABC framework. Although the ABC techniques are becoming more widely used, there is still uncertainty over how they perform in comparison with full-likelihood methods. We thus conducted a simulation study and provide a detailed examination of ABC in comparison with full likelihood in the case of a model of admixture. This model assumes that two parental populations mixed at a certain time in the past, creating a hybrid population, and that the three populations then evolve under pure drift. Several aspects of ABC methodology were investigated, such as the effect of the distance metric chosen to measure the similarity between simulated and observed data sets. Results show that in general ABC provides good approximations to the posterior distributions obtained with the full-likelihood method. This suggests that it is possible to apply ABC using allele frequencies to make inferences in cases where it is difficult to select a set of suitable summary statistics and when the complexity of the model or the size of the data set makes it computationally prohibitive to use full-likelihood methods.
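For reference, the rejection step this abstract builds on fits in a few lines. In the sketch below the "summary" is simply the full allele-frequency vector; `simulate` and `prior_draw` stand in for the admixture simulator and prior, which are left abstract here, and the Euclidean default is only one of the distance metrics the article compares.

```python
import numpy as np

def abc_rejection(s_obs, simulate, prior_draw, n_sims=100_000, quantile=0.01,
                  dist=lambda a, b: np.linalg.norm(a - b)):
    """Plain ABC rejection: keep the parameter draws whose simulated
    allele-frequency vectors fall closest to the observed one."""
    draws, dists = [], []
    for _ in range(n_sims):
        theta = prior_draw()
        dists.append(dist(simulate(theta), s_obs))  # simulated vs. observed
        draws.append(theta)
    draws, dists = np.asarray(draws), np.asarray(dists)
    eps = np.quantile(dists, quantile)              # tolerance from a distance quantile
    return draws[dists <= eps]                      # approximate posterior sample
```

Swapping `dist` for an L1 or chi-square-style distance reproduces the kind of metric comparison described above.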

3.
A key priority in infectious disease research is to understand the ecological and evolutionary drivers of viral diseases from data on disease incidence as well as viral genetic and antigenic variation. We propose using a simulation-based, Bayesian method known as Approximate Bayesian Computation (ABC) to fit and assess phylodynamic models that simulate pathogen evolution and ecology against summaries of these data. We illustrate the versatility of the method by analyzing two spatial models describing the phylodynamics of interpandemic human influenza virus subtype A(H3N2). The first model captures antigenic drift phenomenologically with continuously waning immunity, and the second epochal evolution model describes the replacement of major, relatively long-lived antigenic clusters. Combining features of long-term surveillance data from the Netherlands with features of influenza A (H3N2) hemagglutinin gene sequences sampled in northern Europe, key phylodynamic parameters can be estimated with ABC. Goodness-of-fit analyses reveal that the irregularity in interannual incidence and H3N2's ladder-like hemagglutinin phylogeny are quantitatively reproduced only under the epochal evolution model within a spatial context. However, the concomitant incidence dynamics result in a very large reproductive number and are not consistent with empirical estimates of H3N2's population-level attack rate. These results demonstrate that the interactions between the evolutionary and ecological processes impose multiple quantitative constraints on the phylodynamic trajectories of influenza A(H3N2), so that sequence and surveillance data can be used synergistically. ABC, one of several data synthesis approaches, can easily interface a broad class of phylodynamic models with various types of data but requires careful calibration of the summaries and tolerance parameters.
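The goodness-of-fit analyses mentioned above can be phrased as posterior predictive checks: resimulate a chosen summary (incidence irregularity, a phylogeny shape index, and so on) under parameters drawn from the fitted posterior and ask where the observed value falls. A minimal, hypothetical sketch:

```python
import numpy as np

def posterior_predictive_pvalue(posterior_sample, simulate_summary, t_obs):
    """Simulate one summary statistic under each posterior parameter draw
    and locate the observed value t_obs in that reference distribution."""
    t_rep = np.array([simulate_summary(theta) for theta in posterior_sample])
    # two-sided posterior predictive p-value
    return 2 * min(np.mean(t_rep >= t_obs), np.mean(t_rep <= t_obs))
```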

4.
Comparison of demo-genetic models using Approximate Bayesian Computation (ABC) is an active research field. Although large numbers of populations and models (i.e. scenarios) can be analysed with ABC using molecular data obtained from various marker types, methodological and computational issues arise when these numbers become too large. Moreover, Robert et al. (Proceedings of the National Academy of Sciences of the United States of America, 2011, 108, 15112) have shown that the conclusions drawn from ABC model comparison cannot be trusted per se and require additional simulation analyses. Monte Carlo inferential techniques to empirically evaluate confidence in scenario choice are very time-consuming, however, when the numbers of summary statistics (Ss) and scenarios are large. We here describe a methodological innovation that makes ABC scenario probability computation efficient by applying linear discriminant analysis (LDA) to the Ss before computing the logistic regression. We used simulated pseudo-observed data sets (pods) to assess the main features of the method (precision and computation time) in comparison with traditional probability estimation using raw (i.e. not LDA-transformed) Ss. We also illustrate the method on real microsatellite data sets produced to make inferences about the invasion routes of the coccinelid Harmonia axyridis. We found that scenario probabilities computed from LDA-transformed and raw Ss were strongly correlated. Type I and II errors were similar for both methods. The faster probability computation that we observed (a speed gain of around a factor of 100 for LDA-transformed Ss) substantially increases the ability of ABC practitioners to analyse large numbers of pods and hence provides a manageable way to empirically evaluate the power available to discriminate among a large set of complex scenarios.
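A skeletal version of the LDA-then-regression pipeline is sketched below, assuming each simulated data set is labeled with the scenario that produced it. The real method (as implemented around DIYABC) uses a local, distance-weighted logistic regression around the observed statistics; this global, unweighted fit only illustrates the dimension-reduction idea.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

def scenario_probabilities(stats, scenario, s_obs):
    """Project raw summary statistics onto LDA axes (at most
    n_scenarios - 1 of them), fit a logistic regression on the
    low-dimensional scores, and evaluate it at the observed data."""
    lda = LinearDiscriminantAnalysis()
    scores = lda.fit_transform(stats, scenario)   # the cheap dimension reduction
    clf = LogisticRegression(max_iter=1000).fit(scores, scenario)
    return clf.predict_proba(lda.transform(np.atleast_2d(s_obs)))[0]
```

Regressing on a handful of LDA scores instead of hundreds of raw statistics is where the reported factor-of-100 speed gain comes from.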

5.
Insect pest phylogeography might be shaped both by biogeographic events and by human influence. Here, we conducted an approximate Bayesian computation (ABC) analysis to investigate the phylogeography of the New World screwworm fly, Cochliomyia hominivorax, with the aim of understanding its population history and its order and time of divergence. Our ABC analysis supports that populations spread from North to South in the Americas, in at least two distinct episodes. The first split occurred between the North/Central American and South American populations at the end of the Last Glacial Maximum (15,300-19,000 YBP). The second split occurred between the North and South Amazonian populations at the transition between the Pleistocene and the Holocene (9,100-11,000 YBP). The species also experienced population expansion. Phylogenetic analysis likewise suggests this north-to-south colonization, and Maxent models suggest an increase in the number of suitable areas in South America from the past to the present. We found that the phylogeographic patterns observed in C. hominivorax cannot be explained by climatic oscillations alone and can be connected to host population histories. Interestingly, these patterns coincide closely with general patterns of ancient human movements in the Americas, suggesting that humans might have played a crucial role in shaping the distribution and population structure of this insect pest. This work presents the first hypothesis test regarding the processes that shaped the current phylogeographic structure of C. hominivorax and offers an alternative perspective on investigating insect pest problems.

6.
Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles.
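Both statistic classes can be computed from unphased, unpolarized genotype data. A schematic version is sketched below (PopSizeABC's actual implementation is more careful about SNP filtering and binning; `genotypes` is an n_snps × n_individuals matrix of 0/1/2 diploid genotype codes):

```python
import numpy as np

def folded_afs(genotypes):
    """Folded allele frequency spectrum; no ancestral-allele
    polarization is needed."""
    n_chrom = 2 * genotypes.shape[1]
    counts = genotypes.sum(axis=1)
    minor = np.minimum(counts, n_chrom - counts)      # fold the spectrum
    return np.bincount(minor.astype(int), minlength=n_chrom // 2 + 1)

def zygotic_ld_by_distance(genotypes, positions, bins):
    """Average squared genotype correlation (zygotic r^2, phase-free)
    for SNP pairs whose physical distance falls in each bin."""
    ld = [[] for _ in bins]
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = positions[j] - positions[i]
            for k, (lo, hi) in enumerate(bins):
                if lo <= d < hi:
                    r = np.corrcoef(genotypes[i], genotypes[j])[0, 1]
                    ld[k].append(r * r)
                    break
    return [np.mean(v) if v else np.nan for v in ld]
```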

7.
Gibbons are believed to have diverged from the larger great apes ∼16.8 MYA and today reside in the rainforests of Southeast Asia. Based on their diploid chromosome number, the family Hylobatidae is divided into four genera, Nomascus, Symphalangus, Hoolock, and Hylobates. Genetic studies attempting to elucidate the phylogenetic relationships among gibbons using karyotypes, mitochondrial DNA (mtDNA), the Y chromosome, and short autosomal sequences have been inconclusive. To examine the relationships among gibbon genera in more depth, we performed second-generation whole-genome sequencing (WGS) to a mean of ∼15× coverage in two individuals from each genus. We developed a coalescent-based approximate Bayesian computation (ABC) method incorporating a model of sequencing error generated by high-coverage exome validation to infer the branching order, divergence times, and effective population sizes of gibbon taxa. Although Hoolock and Symphalangus are likely sister taxa, we could not confidently resolve a single bifurcating tree despite the large amount of data analyzed. Instead, our results support the hypothesis that all four gibbon genera diverged at approximately the same time. Assuming an autosomal mutation rate of 1 × 10−9 per site per year, this speciation process occurred ∼5 MYA, during a period in the Early Pliocene characterized by climatic shifts and fragmentation of the Sunda shelf forests. Whole-genome sequencing of additional individuals will be vital for inferring the extent of gene flow among species after the separation of the gibbon genera.

8.
Approximate Bayesian computation in population genetics
Beaumont MA, Zhang W, Balding DJ. Genetics 2002, 162(4):2025-2035.
We propose a new method for approximate Bayesian statistical inference on the basis of summary statistics. The method is suited to complex problems that arise in population genetics, extending ideas developed in this setting by earlier authors. Properties of the posterior distribution of a parameter, such as its mean or density curve, are approximated without explicit likelihood calculations. This is achieved by fitting a local-linear regression of simulated parameter values on simulated summary statistics, and then substituting the observed summary statistics into the regression equation. The method combines many of the advantages of Bayesian statistical inference with the computational efficiency of methods based on summary statistics. A key advantage of the method is that the nuisance parameters are automatically integrated out in the simulation step, so that the large numbers of nuisance parameters that arise in population genetics problems can be handled without difficulty. Simulation results indicate computational and statistical efficiency that compares favorably with those of alternative methods previously proposed in the literature. We also compare the relative efficiency of inferences obtained using methods based on summary statistics with those obtained directly from the data using MCMC.
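The regression adjustment described above is compact enough to sketch in full. Below, `theta` holds simulated parameter values (one-dimensional for simplicity) and `stats` the matching summary statistics; the Epanechnikov weighting follows the style of the method, and the adjusted draws, taken with their weights, approximate the posterior sample.

```python
import numpy as np

def abc_reg_adjust(theta, stats, s_obs, accept_frac=0.01):
    """Local-linear regression adjustment: accept the closest simulations,
    weight them with an Epanechnikov kernel, regress the parameter on the
    summaries, and project each accepted draw onto s_obs along the fit."""
    d = np.linalg.norm(stats - s_obs, axis=1)
    delta = np.quantile(d, accept_frac)               # acceptance tolerance
    keep = d <= delta
    w = 1.0 - (d[keep] / delta) ** 2                  # Epanechnikov weights
    X = np.column_stack([np.ones(keep.sum()), stats[keep] - s_obs])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ theta[keep])
    adjusted = theta[keep] - (stats[keep] - s_obs) @ beta[1:]
    return adjusted, w
```

The intercept `beta[0]` is the regression's point estimate of the posterior mean at `s_obs`; the adjustment is what lets the tolerance be larger than plain rejection would allow.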

9.
10.
Understanding the processes by which new diseases are introduced in previously healthy areas is of major interest in elaborating prevention and management policies, as well as in understanding the dynamics of pathogen diversity at large spatial scale. In this study, we aimed to decipher the dispersal processes that have led to the emergence of the plant pathogenic fungus Microcyclus ulei, which is responsible for the South American Leaf Blight (SALB). This fungus has devastated rubber tree plantations across Latin America since the beginning of the twentieth century. As only imprecise historical information is available, the study of population evolutionary history based on population genetics appeared most appropriate. The distribution of genetic diversity in a continental sampling of four countries (Brazil, Ecuador, Guatemala and French Guiana) was studied using a set of 16 microsatellite markers developed specifically for this purpose. A very strong genetic structure was found (F_ST = 0.70), demonstrating that there has been no regular gene flow between Latin American M. ulei populations. Strong bottlenecks probably occurred at the foundation of each population. The most likely scenario of colonization identified by the Approximate Bayesian Computation (ABC) method implemented in DIYABC suggested two independent sources from the Amazonian endemic area. The Brazilian, Ecuadorian and Guatemalan populations might stem from serial introductions through human-mediated movement of infected plant material from an unsampled source population, whereas the French Guiana population seems to have arisen from an independent colonization event through spore dispersal.

11.
Background: Nitrogen isotope analysis of bone collagen has been used to reconstruct the breastfeeding practices of archaeological human populations. However, weaning ages have been estimated subjectively because of a lack of both information on subadult bone collagen turnover rates and appropriate analytical models. Methodology: Temporal changes in human subadult bone collagen turnover rates were estimated from data on tissue-level bone metabolism reported in previous studies. A model for reconstructing precise weaning ages was then developed using a framework of approximate Bayesian computation and incorporating the estimated turnover rates. The model is presented as a new open-source R package, WARN (Weaning Age Reconstruction with Nitrogen isotope analysis), which computes the ages at the start and end of weaning, the ¹⁵N-enrichment from maternal to infant tissue, and the δ¹⁵N value of collagen synthesized entirely from weaning foods, together with their posterior probabilities. The model was applied to 39 previously reported Holocene skeletal populations from around the world, and the results were compared with weaning ages observed in ethnographic studies. Conclusions: There were no significant differences in the age at the end of weaning between the archaeological (2.80±1.32 years) and ethnographic populations. Comparison across archaeological populations suggests that weaning ages did not differ with the type of subsistence practiced (i.e., hunting–gathering or not). Most of the ¹⁵N-enrichment values (2.44±0.90‰) were consistent with biologically valid values. The nitrogen isotope ratios of subadults after the weaning process were lower than those of adults in most of the archaeological populations (−0.48±0.61‰), and this depletion was greater in non-hunter–gatherer populations. Our results suggest that the breastfeeding period in humans had already been shortened by the early Holocene compared with those in extant great apes.

12.
13.
Until recently, the use of Bayesian inference was limited to a few cases because for many realistic probability models the likelihood function cannot be calculated analytically. The situation changed with the advent of likelihood-free inference algorithms, often subsumed under the term approximate Bayesian computation (ABC). A key innovation was the use of a postsampling regression adjustment, allowing larger tolerance values and as such shifting computation time to realistic orders of magnitude. Here we propose a reformulation of the regression adjustment in terms of a general linear model (GLM). This allows the integration into the sound theoretical framework of Bayesian statistics and the use of its methods, including model selection via Bayes factors. We then apply the proposed methodology to the question of population subdivision among western chimpanzees, Pan troglodytes verus.

With the advent of ever more powerful computers and the refinement of algorithms like MCMC or Gibbs sampling, Bayesian statistics have become an important tool for scientific inference during the past two decades. Consider a model M creating data D (DNA sequence data, for example) determined by parameters θ from some (bounded) parameter space Π ⊂ R^m whose joint prior density we denote by π(θ). The quantity of interest is the posterior distribution of the parameters, which can be calculated by Bayes rule as

π(θ | D) = f_M(D | θ) π(θ) / c,

where f_M(D | θ) is the likelihood of the data and c = ∫_Π f_M(D | θ) π(θ) dθ is a normalizing constant. Direct use of this formula, however, is often prevented by the fact that the likelihood function cannot be calculated analytically for many realistic probability models. In these cases one is obliged to use stochastic simulation. Tavaré et al. (1997) propose a rejection sampling method for simulating a posterior random sample where the full data D are replaced by a summary statistic s (like the number of segregating sites in their setting). Even if the statistic does not capture the full information contained in the data D, rejection sampling allows for the simulation of approximate posterior distributions of the parameters in question (the scaled mutation rate in their model). This approach was extended to multiple-parameter models with multivariate summary statistics by Weiss and von Haeseler (1998). In their setting a candidate vector θ of parameters is simulated from a prior distribution and is accepted if its corresponding vector of summary statistics is sufficiently close to the observed summary statistics s_obs with respect to some metric in the space of s, i.e., if dist(s, s_obs) < ε for a fixed tolerance ε. We suppose that the likelihood f_M(s | θ) of the full model is continuous and nonzero around s_obs. In practice the summary statistics are often discrete but the range of values is large enough to be approximated by real numbers. The likelihood of the truncated model M_ε(s_obs) obtained by this acceptance–rejection process is given by

f_{M_ε}(s | θ) = f_M(s | θ) · Ind(s ∈ B_ε(s_obs)) / ∫_{B_ε(s_obs)} f_M(s′ | θ) ds′,   (1)

where B_ε(s_obs) is the ε-ball in the space of summary statistics and Ind(·) is the indicator function. Observe that M_ε(s_obs) degenerates to a (Dirac) point measure centered at s_obs as ε → 0. If the parameters are generated from a prior π(θ), then the distribution of the parameters retained after the rejection process outlined above is given by

π_ε(θ) = π(θ) ∫_{B_ε(s_obs)} f_M(s | θ) ds / ∫_Π π(θ′) ∫_{B_ε(s_obs)} f_M(s | θ′) ds dθ′.   (2)

We call this density the truncated prior. Combining (1) and (2), the normalizing integrals cancel and we get

f_{M_ε}(s_obs | θ) π_ε(θ) ∝ f_M(s_obs | θ) π(θ).   (3)

Thus the posterior distribution of the parameters under the model M for s = s_obs given the prior π(θ) is exactly equal to the posterior distribution under the truncated model M_ε(s_obs) given the truncated prior π_ε(θ).

If we can estimate the truncated prior π_ε(θ) and make an educated guess for a parametric statistical model of M_ε(s_obs), we arrive at a reasonable approximation of the posterior even if the likelihood of the full model M is unknown. It is to be expected that due to the localization process the truncated model will exhibit a simpler structure than the full model M and thus be easier to estimate.

Estimating π_ε(θ) is straightforward, at least when the summary statistics can be sampled from M in a reasonable amount of time: sample the parameters from the prior π(θ), create their respective statistics s from M, and save those parameters whose statistics lie in B_ε(s_obs) in a list P. The empirical distribution of these retained parameters yields an estimate of π_ε(θ). If the tolerance ε is small, then one can assume that f_M(s | θ) is close to some (unknown) constant over the whole range of B_ε(s_obs). Under that assumption, Equation 3 shows that π_ε(θ) ≈ π_M(θ | s_obs). However, when the dimension n of summary statistics is high (and for more complex models dimensions like n = 50 are not unusual), the "curse of dimensionality" implies that the tolerance must be chosen rather large or else the acceptance rate becomes prohibitively low. This, however, distorts the precision of the approximation of the posterior distribution by the truncated prior (see Wegmann et al. 2009). This situation can be partially alleviated by speeding up the sampling process; such methods are subsumed under the term approximate Bayesian computation (ABC). Marjoram et al. (2003) develop a variant of the classical Metropolis–Hastings algorithm (termed ABC–MCMC in Sisson et al. 2007), which allows them to sample directly from the truncated prior π_ε(θ). In Sisson et al. (2007) a sequential Monte Carlo sampler is proposed, requiring substantially fewer iterations than ABC–MCMC. But even when such methods are applied, the assumption that f_M(s | θ) is constant over the ε-ball is a very rough one, indeed.

To take into account the variation of f_M(s | θ) within the ε-ball, a postsampling regression adjustment (termed ABC–REG in the following) of the sample P of retained parameters is introduced in the important article by Beaumont et al. (2002). Basically, they postulate a (locally) linear dependence between the parameters θ and their associated summary statistics s. More precisely, the (local) model they implicitly assume is of the form θ = M s + m_0 + ε, where M is a matrix of regression coefficients, m_0 a constant vector, and ε a random vector of zero mean. Computer simulations suggest that for many population models ABC–REG yields posterior marginal densities that have narrower highest posterior density (HPD) regions and are more closely centered around the true parameter values than the empirical posterior densities directly produced by ABC samplers (Wegmann et al. 2009). An attractive feature of ABC–REG is that the posterior adjustment is performed directly on the simulated parameters, which makes estimation of the marginal posteriors of individual parameters particularly easy. The method can also be extended to more complex, nonlinear models as demonstrated, e.g., in Blum and François (2009). In extreme situations, however, ABC–REG may yield posteriors that are nonzero in parameter regions where the priors actually vanish (see Figure 1B for an illustration of this phenomenon). Moreover, it is not clear how ABC–REG could yield an estimate of the marginal density of model M at s_obs, information that is useful for model comparison.

[Figure 1. Comparison of rejection (A and D), ABC–REG (B and E), and ABC–GLM (C and F) posteriors with those obtained from analytical likelihood calculations. We estimated the population–mutation parameter θ = 4Nμ of a panmictic population for different observed numbers of segregating sites (see text). Shades indicate the L1 distance between the inferred and the analytically calculated posterior: white corresponds to an exact match (zero distance), darker gray shades indicate larger distances, and squares are marked in black if the inferred posterior differs from the analytical one more than the prior does. The top row (A–C) corresponds to cases with a uniform prior θ ∼ Unif([0.005, 10]) and the bottom row (D–F) to cases with a discontinuous prior with a "gap". The tolerance ε is given as the absolute distance in number of segregating sites. Shown are averages over 25 independent estimations. To have a fair comparison, we adjusted the smoothing parameters (bandwidths) to get the best results for all approaches.]

In contrast to ABC–REG we treat the parameters θ as exogenous and the summary statistics s as endogenous variables, and we stipulate for M_ε(s_obs) a general linear model (GLM in the literature; not to be confused with the generalized linear models that unfortunately share the same abbreviation). To be precise, we assume the summary statistics s created by the truncated model's likelihood to satisfy

s = C θ + c_0 + ε,   (4)

where C is an n × m matrix of constants, c_0 an n × 1 vector, and ε a random vector with a multivariate normal distribution of zero mean and covariance matrix Σ_s, i.e. ε ∼ N(0, Σ_s). A GLM has the advantage of taking into account not only the (local) linearity, but also the strong correlation normally present between the components of the summary statistics. Of course, the model assumption (4) can never represent the full truth since its statistics are in principle unbounded whereas the likelihood f_{M_ε} is supported on the ε-ball around s_obs. But since the multivariate Gaussians will fall off rapidly in practice and not reach far out of the boundary of B_ε(s_obs), this is a disadvantage we can live with. In particular, the ordinary least squares (OLS) estimate outlined below implies that for ε → 0 the constant c_0 tends to s_obs whereas the design matrix C and the covariance matrix Σ_s both vanish. This means that in the limit of zero tolerance ε = 0 our model assumption yields the true posterior distribution of M.
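Read as an algorithm, the ABC–GLM recipe is: retain simulations with statistics near s_obs, fit model (4) to the retained pairs by ordinary least squares, and evaluate the resulting Gaussian likelihood at s_obs against the truncated prior. The sketch below compresses that estimation step and leaves the truncated-prior density estimate (and the paper's model-selection machinery) aside; `theta_ret` is a k × m array of retained parameters, `stats_ret` the matching k × n statistics, and `theta_grid` a G × m grid of parameter values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def abc_glm_likelihood(theta_ret, stats_ret, s_obs, theta_grid):
    """OLS fit of s = C*theta + c0 + eps on the retained simulations,
    with Sigma_s estimated from the residuals.  Returns the GLM likelihood
    of s_obs over the grid; multiplying by a density estimate of the
    truncated prior gives the posterior up to normalization."""
    X = np.column_stack([np.ones(len(theta_ret)), theta_ret])
    B, *_ = np.linalg.lstsq(X, stats_ret, rcond=None)  # first row c0, rest C^T
    c0, C = B[0], B[1:].T
    Sigma = np.cov(stats_ret - X @ B, rowvar=False)    # residual covariance
    means = theta_grid @ C.T + c0                      # GLM mean per grid point
    return np.array([multivariate_normal.pdf(s_obs, m, Sigma, allow_singular=True)
                     for m in means])
```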

14.
15.
16.
Hansen’s disease (leprosy) elimination has proven difficult in several countries, including Brazil, and there is a need for a mathematical model that can predict control program efficacy. This study applied the Approximate Bayesian Computation algorithm to fit 6 different proposed models to each of the 5 regions of Brazil, then fitted hierarchical models based on the best-fit regional models to the entire country. The best-fitting model for most regions was a simple one. Posterior checks found that the model results were more similar to the observed incidence after fitting than before, and that parameters varied slightly by region. Current control programs were predicted to require additional measures to eliminate Hansen’s disease as a public health problem in Brazil.

17.
18.
19.
The principles by which networks of neurons compute, and how spike-timing-dependent plasticity (STDP) of synaptic weights generates and maintains their computational function, are unknown. Preceding work has shown that soft winner-take-all (WTA) circuits, where pyramidal neurons inhibit each other via interneurons, are a common motif of cortical microcircuits. We show through theoretical analysis and computer simulations that Bayesian computation is induced in these network motifs through STDP in combination with activity-dependent changes in the excitability of neurons. The fundamental components of this emergent Bayesian computation are priors that result from adaptation of neuronal excitability and implicit generative models for hidden causes that are created in the synaptic weights through STDP. In fact, a surprising result is that STDP is able to approximate a powerful principle for fitting such implicit generative models to high-dimensional spike inputs: Expectation Maximization. Our results suggest that the experimentally observed spontaneous activity and trial-to-trial variability of cortical neurons are essential features of their information processing capability, since their functional role is to represent probability distributions rather than static neural codes. Furthermore, this suggests networks of Bayesian computation modules as a new model for distributed information processing in the cortex.
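As a rough intuition for the mechanism (a rate-based caricature, not the paper's spiking model; every name below is invented): softmax competition stands in for lateral inhibition, a winner-moves-toward-input update stands in for STDP, and win counts stand in for adaptive excitability, so the count-derived bias plays the role of the prior and each update resembles an online Expectation Maximization step on a mixture model.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_wta_stdp(X, n_units=3, eta=0.05, n_epochs=20):
    """Toy soft winner-take-all circuit with an STDP-flavored learning rule."""
    W = rng.normal(0.0, 0.1, (n_units, X.shape[1]))  # implicit generative model
    counts = np.ones(n_units)                        # usage -> excitability
    for _ in range(n_epochs):
        for x in X:
            b = np.log(counts / counts.sum())        # prior from excitability
            u = W @ x + b
            p = np.exp(u - u.max()); p /= p.sum()    # soft WTA competition
            k = rng.choice(n_units, p=p)             # stochastic winner "spike"
            W[k] += eta * (x - W[k])                 # strengthen toward the input
            counts[k] += 1.0
    return W, counts
```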

20.
The molecular clock provides a powerful way to estimate species divergence times. If information on some species divergence times is available from the fossil or geological record, it can be used to calibrate a phylogeny and estimate divergence times for all nodes in the tree. The Bayesian method provides a natural framework to incorporate different sources of information concerning divergence times, such as information in the fossil and molecular data. Current models of sequence evolution are intractable in a Bayesian setting, and Markov chain Monte Carlo (MCMC) is used to generate the posterior distribution of divergence times and evolutionary rates. This method is computationally expensive, as it involves the repeated calculation of the likelihood function. Here, we explore the use of Taylor expansion to approximate the likelihood during MCMC iteration. The approximation is much faster than conventional likelihood calculation. However, the approximation is expected to be poor when the proposed parameters are far from the likelihood peak. We explore the use of parameter transforms (square root, logarithm, and arcsine) to improve the approximation to the likelihood curve. We found that the new methods, particularly the arcsine-based transform, provided very good approximations under relaxed clock models, and also under the global clock model when the global clock is not seriously violated. The approximation is poorer under the global clock when that clock is seriously wrong, and it should not be used in that case. The results suggest that the approximate method may be useful for Bayesian dating analysis using large data sets.
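The approximation itself is easy to state: expand the log-likelihood to second order around its peak, where the gradient vanishes, and evaluate the cheap quadratic inside the MCMC loop, optionally after transforming the parameters so the surface is closer to quadratic. A hypothetical sketch (the numerical Hessian and the (0, 1)-scaled arcsine transform are illustrative choices, not the paper's exact construction):

```python
import numpy as np

def num_hessian(f, x, h=1e-4):
    """Central finite-difference Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

def quadratic_loglik(loglik, t_hat):
    """Second-order Taylor surrogate around the peak t_hat, where the
    gradient is ~0: ell(t) ~ ell(t_hat) + 0.5 (t - t_hat)' H (t - t_hat)."""
    l0, H = loglik(t_hat), num_hessian(loglik, t_hat)
    return lambda t: l0 + 0.5 * (t - t_hat) @ H @ (t - t_hat)

# Arcsine-style transform for a parameter scaled into (0, 1); expanding in
# t = arcsin(sqrt(theta)) makes the likelihood closer to quadratic, so the
# surrogate stays accurate farther from the peak.
to_t = lambda theta: np.arcsin(np.sqrt(theta))
to_theta = lambda t: np.sin(t) ** 2
```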
