1.

A Principled Approach to Deriving Approximate Conditional Sampling Distributions in Population Genetics Models with Recombination

Joshua S. Paul Yun S. Song 《Genetics》2010,186(1):321-338

The multilocus conditional sampling distribution (CSD) describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. The CSD has a wide range of applications in both computational biology and population genomics analysis, including phasing genotype data into haplotype data, imputing missing data, estimating recombination rates, inferring local ancestry in admixed populations, and importance sampling of coalescent genealogies. Unfortunately, the true CSD under the coalescent with recombination is not known, so approximations, formulated as hidden Markov models, have been proposed in the past. These approximations have led to a number of useful statistical tools, but it is important to recognize that they were not derived from, though were certainly motivated by, principles underlying the coalescent process. The goal of this article is to develop a principled approach to derive improved CSDs directly from the underlying population genetics model. Our approach is based on the diffusion process approximation, and the resulting mathematical expressions admit intuitive genealogical interpretations, which we utilize to introduce further approximations and make our method scalable in the number of loci. The general algorithm presented here applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model. Empirical results are provided to demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations.

The probability of observing a sample of DNA sequences under a given population genetics model (the sampling probability, or likelihood) plays an important role in a wide range of problems in the study of genetic variation. 
When recombination is involved, however, obtaining an analytic formula for the sampling probability has remained a challenging open problem, although recent progress has been made. As such, much research (e.g., Stephens and Donnelly 2000; De Iorio and Griffiths 2004a,b) has focused on developing Monte Carlo methods on the basis of the coalescent with recombination (Griffiths 1981; Kingman 1982a,b), a well-established mathematical framework that models the genealogical history of sample chromosomes. These Monte Carlo-based full-likelihood methods mark an important development in population genetics analysis, but a well-known obstacle to their utility is that they tend to be computationally intensive. For a whole-genome variation study, approximations are often unavoidable, and it is therefore important to think of ways to balance the trade-off between scalability and accuracy.

A popular likelihood-based approximation method that has had a significant impact on population genetics analysis is the following approach introduced by Li and Stephens: Given a set $\Phi$ of model parameters (e.g., mutation rate, recombination rate, etc.), the joint probability $p(h_1, \ldots, h_n \mid \Phi)$ of observing a set $\{h_1, \ldots, h_n\}$ of haplotypes sampled from a population can be decomposed as a product of conditional sampling distributions (CSDs), denoted by $\pi$,

$$p(h_1, \ldots, h_n \mid \Phi) = \pi(h_1 \mid \Phi)\,\pi(h_2 \mid h_1, \Phi) \cdots \pi(h_n \mid h_1, \ldots, h_{n-1}, \Phi), \tag{1}$$

where $\pi(h_{k+1} \mid h_1, \ldots, h_k, \Phi)$ is the probability of an additionally sampled haplotype being of type $h_{k+1}$, given a set of already observed haplotypes $h_1, \ldots, h_k$. In the presence of recombination, the true CSD $\pi$ is unknown, so Li and Stephens proposed using an approximate CSD, written here as $\hat{\pi}$, in place of $\pi$, thus obtaining the following approximation of the joint probability:

$$p(h_1, \ldots, h_n \mid \Phi) \approx \hat{\pi}(h_1 \mid \Phi)\,\hat{\pi}(h_2 \mid h_1, \Phi) \cdots \hat{\pi}(h_n \mid h_1, \ldots, h_{n-1}, \Phi). \tag{2}$$

Li and Stephens referred to this approximation as the product of approximate conditionals (PAC) model. In general, the closer the approximate CSD is to the true CSD $\pi$, the more accurate the PAC model becomes. 
Notable applications and extensions of this framework include estimating crossover rates and gene conversion parameters, phasing genotype data into haplotype data, imputing missing data to improve power in association mapping (e.g., Li and Abecasis 2006), inferring local ancestry in admixed populations, inferring human colonization history, inferring demography, and so on.

Another problem in which the CSD plays a fundamental role is importance sampling of genealogies under the coalescent process (Stephens and Donnelly 2000; De Iorio and Griffiths 2004a,b). In this context, the optimal proposal distribution can be written in terms of the CSD $\pi$ (Stephens and Donnelly 2000), and as in the PAC model, an approximate CSD may be used in place of $\pi$. The performance of an importance sampling scheme depends critically on the proposal distribution and therefore on the accuracy of the approximate CSD. Often in conjunction with composite-likelihood frameworks (Fearnhead and Donnelly 2002), importance sampling has been used in estimating fine-scale recombination rates.

So far, a good deal of intuition has gone into choosing the approximate CSDs used in these problems. In the case of completely linked loci, Stephens and Donnelly (2000) suggested constructing an approximation by assuming that the additional haplotype $h_{k+1}$ is an imperfect copy of one of the first $k$ haplotypes, with copying errors corresponding to mutation. Fearnhead and Donnelly generalized this construction to include crossover recombination, assuming that the haplotype $h_{k+1}$ is an imperfect mosaic of the first $k$ haplotypes (i.e., $h_{k+1}$ is obtained by copying segments from $h_1, \ldots, h_k$, where crossover recombination can change the haplotype from which copying is performed). The associated CSD can be interpreted as a hidden Markov model and so admits an efficient dynamic programming solution. 
Finally, a modification to Fearnhead and Donnelly's model has been proposed that limits the hidden state space, thereby providing a computational simplification.

Although these approaches are computationally appealing, it is important to note that they are not derived from, though are certainly motivated by, principles underlying typical population genetics models, in particular the coalescent process (Griffiths 1981; Kingman 1982a,b). The main objective of this article is to develop a principled technique to derive an improved CSD directly from the underlying population genetics model. Rather than relying on intuition, we base our work on a mathematical foundation. The theoretical framework we employ is the diffusion process. De Iorio and Griffiths (2004a,b) first introduced the diffusion-generator approximation technique to obtain an approximate CSD in the case of a single locus (i.e., no recombination). The approach was later extended to two loci to include crossover recombination, assuming a parent-independent mutation model at each locus. In this article, we extend the framework to develop a general algorithm that applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model.

Our work can be summarized as follows. Using the diffusion-generator approximation technique, we derive a recursion relation satisfied by an approximate CSD. This recursion can be used to construct a closed system of coupled linear equations, in which the conditional sampling probability of interest appears as one of the unknown variables. The system of equations can be solved using standard numerical analysis techniques. However, the size of the system grows superexponentially with the number of loci and, consequently, so does the running time. To remedy this drawback, we introduce additional approximations to make our approach scalable in the number of loci. 
Specifically, the recursion admits an intuitive genealogical interpretation, and, on the basis of this interpretation, we propose modifications to the recursion, which then can be easily solved using dynamic programming. The computational complexity of the modified algorithm is polynomial in the number of loci, and, importantly, the resulting CSD has little loss of accuracy compared to that following from the full recursion.

The accuracy of approximate CSDs has not been discussed much in the literature, except in the application-specific context for which they are being employed. In this article, we carry out an empirical study to explicitly test the accuracy of various CSDs and demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations. We also consider the PAC framework and show that our approximations produce more accurate PAC-likelihood estimates as well. We note that for the maximum-likelihood estimation of recombination rates, the actual value of the likelihood may not be so important, as long as it is maximized near the true recombination rate. However, in many other applications (e.g., phasing genotype data into haplotype data, imputing missing data, and importance sampling), the accuracy of the CSD and PAC-likelihood function over a wide range of parameter values may be important. Thus, we believe that the theoretical work presented here will have several practical implications; our method can be applied in a wide range of statistical tools that use CSDs, improving their accuracy.

The remainder of this article is organized as follows. To provide intuition for the ensuing mathematics, we first describe a genealogical process that gives rise to our CSD. Using our genealogical interpretation, we consider two additional approximations and relate these to previously proposed CSDs. 
Then, in the following section, we derive our CSD using the diffusion-generator approach and provide mathematical statements for the additional approximations; some interesting limiting behavior is also described there. This section is self-contained and may be skipped by the reader uninterested in mathematical details. Finally, in the subsequent section, we carry out a simulation study to compare the accuracy of various approximate CSDs and demonstrate that ours are generally the most accurate.
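
The mosaic-copying idea and the PAC product described in this abstract can be illustrated with a small sketch. It is not the CSD derived in the paper: it is a deliberately simple copying hidden Markov model, assuming biallelic loci coded 0/1, a hypothetical per-locus switch probability `switch_prob` standing in for recombination, a hypothetical copying-error probability `error_prob` standing in for mutation, and a uniform distribution for the first haplotype.

```python
import math

def copying_csd(new_hap, panel, switch_prob, error_prob):
    """Approximate P(new_hap | panel) with the forward algorithm.

    Hidden state at each locus: index of the panel haplotype being copied.
    All parameters are illustrative assumptions, not values from the paper.
    """
    k = len(panel)
    if k == 0:
        # Assumed convention: with nothing to copy from, alleles are uniform on {0, 1}.
        return 0.5 ** len(new_hap)

    def emit(j, locus):
        # Copy the panel allele faithfully, or flip it with probability error_prob.
        same = panel[j][locus] == new_hap[locus]
        return 1.0 - error_prob if same else error_prob

    # Start by copying a uniformly chosen panel haplotype.
    forward = [emit(j, 0) / k for j in range(k)]
    for locus in range(1, len(new_hap)):
        total = sum(forward)
        # Either keep copying the same haplotype, or switch to a uniformly chosen one.
        forward = [
            emit(j, locus)
            * ((1.0 - switch_prob) * forward[j] + switch_prob * total / k)
            for j in range(k)
        ]
    return sum(forward)

def pac_log_likelihood(haps, switch_prob, error_prob):
    """Log of the product of approximate conditionals, in the given order."""
    return sum(
        math.log(copying_csd(haps[i], haps[:i], switch_prob, error_prob))
        for i in range(len(haps))
    )
```

Because the conditionals are only approximate, the value returned by `pac_log_likelihood` generally depends on the order in which the haplotypes are listed.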

2.

Estimating Variable Effective Population Sizes from Multiple Genomes: A Sequentially Markov Conditional Sampling Distribution Approach

Sara Sheehan Kelley Harris Yun S. Song 《Genetics》2013,194(3):647-662

Throughout history, the population size of modern humans has varied considerably due to changes in environment, culture, and technology. More accurate estimates of population size changes, and when they occurred, should provide a clearer picture of human colonization history and help remove confounding effects from natural selection inference. Demography influences the pattern of genetic variation in a population, and thus genomic data of multiple individuals sampled from one or more present-day populations contain valuable information about the past demographic history. Recently, Li and Durbin developed a coalescent-based hidden Markov model, called the pairwise sequentially Markovian coalescent (PSMC), for a pair of chromosomes (or one diploid individual) to estimate past population sizes. This is an efficient, useful approach, but its accuracy in the very recent past is hampered by the fact that, because of the small sample size, only a few coalescence events occur in that period. Multiple genomes from the same population contain more information about the recent past, but are also more computationally challenging to study jointly in a coalescent framework. Here, we present a new coalescent-based method that can efficiently infer population size changes from multiple genomes, providing access to a new store of information about the recent past. Our work generalizes the recently developed sequentially Markov conditional sampling distribution framework, which provides an accurate approximation of the probability of observing a newly sampled haplotype given a set of previously sampled haplotypes. Simulation results demonstrate that we can accurately reconstruct the true population histories, with a significant improvement over the PSMC in the recent past. We apply our method, called diCal, to the genomes of multiple human individuals of European and African ancestry to obtain a detailed population size change history during recent times.
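
The remark that a pair of chromosomes yields few coalescence events in the recent past can be made concrete with a short calculation. The sketch below assumes the standard Kingman coalescent with constant population size and time measured in coalescent units, under which n lineages experience their first coalescence at rate n(n-1)/2; the function name and the example numbers are illustrative and do not come from the paper.

```python
import math

def prob_coalescence_by(t, n_lineages):
    """Probability that at least one coalescence among n lineages occurs by time t.

    Assumes a constant-size Kingman coalescent with t in coalescent units.
    """
    pairs = n_lineages * (n_lineages - 1) / 2
    return 1.0 - math.exp(-pairs * t)

# In a recent window of length t = 0.01, compare a pair with ten lineages.
pair = prob_coalescence_by(0.01, 2)
ten = prob_coalescence_by(0.01, 10)
```

Under these assumptions the pair coalesces in that window with probability of about 1%, whereas ten lineages produce at least one coalescence with probability of about 36%, which is the sense in which multiple genomes are more informative about recent times.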

3.

Bias in Behaviour Genetics: An Ecological Perspective

Wim J. van der Steen 《Acta biotheoretica》1998,46(4):369-377

Research in behaviour genetics uncovers causes of behaviour at the population level. For inferences about individuals we also need to know how genes and the environment affect phenotypes. Behaviour genetics fosters a biased view of individual behaviour since it identifies the environment with psychosocial factors and disregards ecology.

4.

Plant Chemical Genetics: A Brand-New Approach to Plant Genetics Research (cited by 1; self-citations: 0; citations by others: 1)

赵扬《植物生理学通讯》2011,(1):1-8

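Chemical genetics (also called chemical genomics) uses bioactive small molecules to perturb protein interactions in order to study the associated biological phenomena; it complements and extends conventional genetic approaches. Within just a few years, plant chemical genetics, the application of chemical genetics to plant science, has begun to resolve some long-standing problems in plant molecular biology, thanks to the distinctive advantages it offers as a new genetic approach (for example, it can overcome the genetic redundancy and mutant lethality that hamper conventional genetics, and it can provide conditional genetic perturbations of specific strength and timing). This article reviews the general principles and methods of plant chemical genetics, as well as its advantages and characteristics as a new approach to genetic research.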

5.

Genetics: Human aspects

Frederick Hecht 《American journal of human genetics》1983,35(1):159-160

6.

Genetics: Human aspects

Margretta R. Seashore 《American journal of human genetics》1990,47(4):759-760

7.

Simultaneous Selection of Extreme Populations: A Subset Selection Approach

Dr. Satya N. Mishra Prof. Edward J. Dudewicz 《Biometrical journal. Biometrische Zeitschrift》1987,29(4):471-483

The problem of selecting a “best” (largest mean, or smallest mean) population from a collection of k independent populations was formulated and solved by Bechhofer (1954). Gupta (1965) solved another important problem, that of selecting a subset of populations containing the “best” population from the original collection of populations. Since then many variations of the problem have been considered. Tong (1969) and Lewis (1980) have investigated the problem of selecting extreme populations (populations with a largest, and populations with a smallest, mean) with respect to one and two standard populations, respectively. In this paper we study the selection of extreme populations in the absence of any standard population. We formulate subset-selection procedures when variances are known and equal, and also in the most general case when they are unknown and unequal. Nonexistence of a single-stage procedure is noted for this latter case (even if variances are equal). A two-stage procedure and some of its associated properties are discussed. Tables needed for application are provided, as is a worked example.
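
As background for the subset-selection idea mentioned above, the following sketch shows a rule of the kind associated with Gupta for the largest mean: keep every population whose sample mean lies within a margin of the largest observed mean. It assumes normal populations with a known common standard deviation `sigma` and a common sample size `n`; the constant `d` would normally be read from tables to guarantee a chosen probability of correct selection and is treated here as a plain input. This is not the extreme-populations procedure developed in the paper.

```python
import math

def gupta_subset(sample_means, sigma, n, d):
    """Return indices of populations retained as candidates for the largest mean.

    A population is kept if its sample mean is at least
    max(sample_means) - d * sigma / sqrt(n). The value of d is an input
    assumption; no table lookup is performed here.
    """
    threshold = max(sample_means) - d * sigma / math.sqrt(n)
    return [i for i, mean in enumerate(sample_means) if mean >= threshold]

# Illustrative numbers only: four populations, sigma = 1, n = 25, d = 2.
selected = gupta_subset([1.0, 2.0, 2.9, 3.0], sigma=1.0, n=25, d=2.0)
```

With these illustrative inputs the margin is 0.4, so the populations with sample means 2.9 and 3.0 are retained.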

8.

Equivalence or Selection: A Distribution-Free Approach

M. M. Desu David R. Bristol 《Biometrical journal. Biometrische Zeitschrift》1985,27(5):491-500

Using a distribution-free approach, a modification of the usual procedure for selecting the better of two treatments is presented. Here the possibility of no selection when the treatments appear to be ‘equivalent’ is allowed. The sample size and the constant needed to implement the proposed procedure are determined by controlling the probabilities of a correct selection and a wrong selection when the two treatments are not equivalent.

9.

Bias in Butterfly Distribution Maps: The Effects of Sampling Effort

Roger L.H. Dennis Tim H. Sparks Peter B. Hardy 《Journal of Insect Conservation》1999,3(1):33-42

Data from the Greater Manchester Butterfly Atlas (UK) reveal a highly significant and substantial impact of the number of visits on both species richness and species incidence in squares. This effect has been demonstrated for three different zones mapped at different scales. The significant impact of number of visits persists when data are amalgamated for coarser scales. The findings demonstrate that it is essential for distribution mapping projects to record data on recording effort as well as on the target organisms. Suggestions are made as to how distribution mapping may be improved, including a geographically and environmentally representative structure of permanently monitored squares and closer links between distribution mapping and the Butterfly Monitoring Scheme (BMS), which primarily monitors changes in butterfly populations. The benefit to conservation will be data that can be better used to analyse the reasons for changes in ranges and distributions, fundamental for determining priorities and policy decisions.

10.

A Population Genetics Model of Marker-Assisted Selection (cited by 7; self-citations: 0; citations by others: 7)

Z. W. Luo R. Thompson J. A. Woolliams 《Genetics》1997,146(3):1173-1183

A deterministic two-loci model was developed to predict genetic response to marker-assisted selection (MAS) in one generation and in multiple generations. Formulas were derived to relate linkage disequilibrium in a population to the proportion of additive genetic variance used by MAS, and in turn to an extra improvement in genetic response over phenotypic selection. Predictions of the response were compared to those predicted by using an infinite-loci model and the factors affecting efficiency of MAS were examined. Theoretical analyses of the present study revealed the nonlinearity between the selection intensity and genetic response in MAS. In addition to the heritability of the trait and the proportion of the marker-associated genetic variance, the frequencies of the selectively favorable alleles at the two loci, one marker and one quantitative trait locus, were found to play an important role in determining both the short- and long-term efficiencies of MAS. The evolution of linkage disequilibrium and thus the genetic response over several generations were predicted theoretically and examined by simulation. MAS dissipated the disequilibrium more quickly than drift alone. In some cases studied, the rate of dissipation was as large as that to be expected in the circumstance where the true recombination fraction was increased by three times and selection was absent.
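
The selection-free baseline against which the abstract compares the dissipation of linkage disequilibrium can be sketched with the textbook two-locus result that, under random mating with no selection and no drift, disequilibrium shrinks by a factor of (1 - r) each generation, where r is the recombination fraction. The function name and the numbers below are illustrative; the paper's own model additionally includes selection on the marker and the quantitative trait locus, which this sketch does not attempt.

```python
def ld_after(d0, recomb_fraction, generations):
    """Linkage disequilibrium after a number of generations.

    Assumes random mating, no selection and no drift, so that
    D_t = D_0 * (1 - r) ** t.
    """
    return d0 * (1.0 - recomb_fraction) ** generations

# Illustrative comparison: the same starting disequilibrium with r and with 3r.
baseline = ld_after(0.2, 0.05, 5)
tripled = ld_after(0.2, 0.15, 5)
```

Raising the recombination fraction lowers the per-generation retention factor, so `tripled` is smaller than `baseline` after the same number of generations.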

11.

Human Genetics: The Basics

Charles W. Rodgers 《American journal of human genetics》2011,(1):5-6

12.

Human Genetics. Part A: The Unfolding Genome

Margretta R. Seashore 《The Yale journal of biology and medicine》1984,57(2):252-253

13.

Perspectives in Human Genetics

Arno G. Motulsky 《American journal of human genetics》2006,79(2):193

14.

A Morpho-Density Approach to Estimating Neural Connectivity

Michael P. McAssey Fetsje Bijma Bernadetta Tarigan Jaap van Pelt Arjen van Ooyen Mathisca de Gunst 《PloS one》2014,9(1)

Neuronal signal integration and information processing in cortical neuronal networks critically depend on the organization of synaptic connectivity. Because of the challenges involved in measuring a large number of neurons, synaptic connectivity is difficult to determine experimentally. Current computational methods for estimating connectivity typically rely on the juxtaposition of experimentally available neurons and applying mathematical techniques to compute estimates of neural connectivity. However, since the number of available neurons is very limited, these connectivity estimates may be subject to large uncertainties. We use a morpho-density field approach applied to a vast ensemble of model-generated neurons. A morpho-density field (MDF) describes the distribution of neural mass in the space around the neural soma. The estimated axonal and dendritic MDFs are derived from 100,000 model neurons that are generated by a stochastic phenomenological model of neurite outgrowth. These MDFs are then used to estimate the connectivity between pairs of neurons as a function of their inter-soma displacement. Compared with other density-field methods, our approach to estimating synaptic connectivity uses fewer restricting assumptions and produces connectivity estimates with a lower standard deviation. An important requirement is that the model-generated neurons reflect accurately the morphology and variation in morphology of the experimental neurons used for optimizing the model parameters. As such, the method remains subject to the uncertainties caused by the limited number of neurons in the experimental data set and by the quality of the model and the assumptions used in creating the MDFs and in estimating connectivity. 
In summary, MDFs are a powerful tool for visualizing the spatial distribution of axonal and dendritic densities, for estimating the number of potential synapses between neurons with low standard deviation, and for obtaining a greater understanding of the relationship between neural morphology and network connectivity.

15.

Human Genetic Diseases: A Practical Approach

《FEBS letters》1987,214(1):199-200

16.

Human Genetics: Pre-Columbian Pacific Contact

《Current biology : CB》2014,24(21):R1038-R1040

17.

Sampling Bias in Roadsides: The Case of Galling Aphids on Pistacia Trees

J-J. Martinez D. Wool 《Biodiversity and Conservation》2006,15(7):2109-2121

Sampling along roadsides is convenient and widely practiced in insect population research. Ecological conditions in road verges are very different from those prevailing in natural habitats, and they affect the annual growth of plants in semi-arid and arid regions. This in turn may improve the development, survival and abundance of insects feeding on plants growing along roadsides. These trends may bias the results of sampling. To verify this assertion, we quantified the effects of growing at the roadside on the annual growth of Pistacia atlantica trees and Pistacia palaestina shrubs and compared two demographic indexes of nine gall-inducing aphid species on trees growing along roads with those on trees in the open landscape, in Israel. The annual growth of the two host plants was significantly more vigorous along roadsides than away from roads. Tests of Combined Probabilities showed that P. atlantica and P. palaestina were more likely to be parasitized by more galls of Fordini species along roadsides than away from roads. Moreover, in the semi-dry regions of Israel, three aphid species on P. atlantica and five species on P. palaestina induced more galls in plants growing along roads than away from roads, while in the rainy Northern region, the difference between the two habitats was not significant. These results indicate a biased evaluation of population size in roadside habitats, which must be accounted for in research on insect–plant relations.

18.

Current Protocols in Human Genetics

Brian C. Schutte Jeffrey C. Murray 《American journal of human genetics》1995,57(3):735-736

19.

Electromechanical Potentials in Cortical Bone (Phenomenological Approach)

Kh. Kh. Imomnazarov 《Doklady. Biochemistry and biophysics》2003,392(1-6):268-270

20.

A Ranking Approach to Genomic Selection

Mathieu Blondel Akio Onogi Hiroyoshi Iwata Naonori Ueda 《PloS one》2015,10(6)

Background

Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual’s breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used.

Contributions

In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value.

Results

We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
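
The NDCG measure named in this abstract can be sketched as follows. The version below is one common formulation from information retrieval, assuming a linear gain (the true trait value itself, taken to be nonnegative) and a base-2 logarithmic discount; the exact gain function used in the paper may differ, and the function names are illustrative.

```python
import math

def dcg(values_in_rank_order, k):
    """Discounted cumulative gain of the top k items (rank 1 is undiscounted)."""
    return sum(
        value / math.log2(position + 2)
        for position, value in enumerate(values_in_rank_order[:k])
    )

def ndcg(true_values, predicted_scores, k):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    order = sorted(
        range(len(true_values)), key=lambda i: predicted_scores[i], reverse=True
    )
    ranked = [true_values[i] for i in order]
    ideal = sorted(true_values, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Because the discount grows with rank position, misplacing an individual with a high breeding value near the top of the list costs more than an equally large error further down, which is the property the abstract highlights.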