Similar articles
20 similar articles found
1.
The extent to which natural selection shapes diversity within populations is a key question for population genetics. Thus, there is considerable interest in quantifying the strength of selection. A full likelihood approach for inference about selection at a single site within an otherwise neutral fully linked sequence of sites is described here. A coalescent model of evolution is used to model the ancestry of a sample of DNA sequences which have the selected site segregating. The mutation model, for the selected and neutral sites, is the infinitely-many-sites model, in which there is no back or parallel mutation at sites. A unique perfect phylogeny, a gene tree, can be constructed from the configuration of mutations on the sample sequences under this model of mutation. The approach is general and can be used for any bi-allelic selection scheme. Selection is incorporated through modelling the frequency of the selected and neutral allelic classes stochastically back in time, then using a subdivided population model considering the population frequencies through time as variable population sizes. An importance sampling algorithm is then used to explore the space of coalescent trees consistent with the data. The method is applied to a simulated data set and the gene tree presented in Verrelli et al. (2002).

2.
An importance sampling algorithm for computing the likelihood of a sample of genes at loci under a stepwise mutation model in a subdivided population is developed. This allows maximum likelihood estimation of migration rates between subpopulations. The time to the most recent common ancestor of the sample can also be computed. The technique is illustrated by an analysis of a data set of Australian red fox populations.

3.
Numerous simulation studies have investigated the accuracy of phylogenetic inference of gene trees under maximum parsimony, maximum likelihood, and Bayesian techniques. The relative accuracy of species tree inference methods under simulation has received less study. The number of analytical techniques available for inferring species trees is increasing rapidly, and in this paper, we compare the performance of several species tree inference techniques at estimating recent species divergences using computer simulation. Simulating gene trees within species trees of different shapes and with varying tree lengths (T) and population sizes, and evolving sequences on those gene trees, allows us to determine how phylogenetic accuracy changes in relation to different levels of deep coalescence and phylogenetic signal. When the probability of discordance between the gene trees and the species tree is high (i.e., T is small and/or the population size is large), Bayesian species tree inference using the multispecies coalescent (BEST) outperforms other methods. The performance of all methods improves as the total length of the species tree is increased, which reflects the combined benefits of decreasing the probability of discordance between species trees and gene trees and gaining more accurate estimates for gene trees. Decreasing the probability of deep coalescences by reducing the population size also leads to accuracy gains for most methods. Increasing the number of loci from 10 to 100 improves accuracy under difficult demographic scenarios (i.e., coalescent units ≤ 4N_e), but 10 loci are adequate for estimating the correct species tree in cases where deep coalescence is limited or absent. In general, the correlation between the phylogenetic accuracy and the posterior probability values obtained from BEST is high, although posterior probabilities are overestimated when the prior distribution for the population size is misspecified.

4.
We study with extensive numerical simulation the genealogical process of 2N haploid genetic sequences. The sequences are under selective pressure, and fitness values are assigned at random, but with a tunable degree of correlation to the fitness values of closely related sequences. The genealogies that we observe can be classified into three different categories, corresponding to different regimes of the mutation rate. At low mutation rates, the sequences remain localized around a small number of central sequences, which leads to trees with short pairwise distances and slow turnover of the most recent common ancestor of the population. At high mutation rates, we observe trees similar (but not identical) to those of neutral evolution. In this regime, the population drifts rapidly, and selection does not influence the distribution of fitness values in the population. The third regime, for intermediate mutation rates, is only found in strongly correlated landscapes. It resembles the one for high mutation rates in that the population drifts rapidly, but nevertheless selection still shapes the distribution of fitness values.

5.
Inferring Coalescence Times from DNA Sequence Data (cited 17 times: 7 self-citations, 10 by others)
The paper is concerned with methods for the estimation of the coalescence time (time since the most recent common ancestor) of a sample of intraspecies DNA sequences. The methods take advantage of prior knowledge of population demography, in addition to the molecular data. While some theoretical results are presented, a central focus is on computational methods. These methods are easy to implement, and, since explicit formulae tend to be either unavailable or unilluminating, they are also more useful and more informative in most applications. Extensions are presented that allow for the effects of uncertainty in our knowledge of population size and mutation rates, for variability in population sizes, for regions of different mutation rate, and for inference concerning the coalescence time of the entire population. The methods are illustrated using recent data from the human Y chromosome.

6.
F. Tajima. Genetics 1989, 123(1): 229-240
Using the two subpopulation model, the expected numbers of segregating sites in a number of DNA sequences randomly sampled from a subdivided population were examined for several types of population subdivisions. It is shown that, in the case where the pattern of migration is symmetrical such as the finite island model, the expected number of segregating sites is independent of the migration rate when two or three DNA sequences are randomly sampled from the same subpopulation, but depends on the migration rate when more than three DNA sequences are sampled. It is also shown that the population subdivision can increase the amount of DNA polymorphism even in a subpopulation in some cases.

7.
A Model for Analysis of Population Structure (cited 5 times: 3 self-citations, 2 by others)
Arguments have been presented for the appropriateness of a multinomial Dirichlet distribution for describing single-locus genotypic frequencies in a subdivided population. This distribution is defined as a function of allele frequency, the average (over the entire population) inbreeding coefficient and the correlation between genotypes within a subdivision. Alternative parameterizations and their genetic interpretations are given. We then show how information from a sample drawn from this subdivided population, in the absence of pedigrees, can be combined with the multinomial Dirichlet model to form a likelihood function. This likelihood function is then used as the basis for estimation and testing hypotheses concerning the genetic parameters of the model. Comparisons of this approach to the alternative procedure of Cockerham (1969, 1973) are made using human data obtained from Tecumseh, Michigan and Monte Carlo simulations. Finally, implications of these results to statistical inference and to mutation rates are presented.

8.
Blum MG, Rosenberg NA. Genetics 2007, 176(3): 1741-1757
Estimating the number of ancestral lineages of a sample of DNA sequences at time t in the past can be viewed as a variation on the problem of estimating the time to the most recent common ancestor. To estimate the number of ancestral lineages, we develop a maximum-likelihood approach that takes advantage of a prior model of population demography, in addition to the molecular data summarized by the pattern of polymorphic sites. The method relies on a rejection sampling algorithm that is introduced for simulating conditional coalescent trees given a fixed number of ancestral lineages at time t. Computer simulations show that the number of ancestral lineages can be estimated accurately, provided that the number of mutations that occurred since time t is sufficiently large. The method is applied to 986 present-day human sequences located in hypervariable region 1 of the mitochondrion to estimate the number of ancestral lineages of modern humans at the time of potential admixture with the Neanderthal population. Our estimates support a view that the proportion of the modern population consisting of Neanderthal contributions must be relatively small, less than approximately 5%, if the admixture happened as recently as 30,000 years ago.
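The idea of conditioning coalescent trees on a fixed number of ancestral lineages at time t can be sketched with a deliberately naive rejection step. This is not the sampler of the paper, whose details are not given in the abstract; it assumes a constant-size standard coalescent, time in coalescent units, and tracks only coalescence times (no topology). All function names are hypothetical.

```python
import random


def coalescence_event_times(n, rng):
    """Cumulative times (coalescent units) of the n - 1 coalescent events."""
    times = []
    elapsed = 0.0
    for k in range(n, 1, -1):
        # With k lineages, the waiting time is exponential with rate k(k-1)/2.
        elapsed += rng.expovariate(k * (k - 1) / 2.0)
        times.append(elapsed)
    return times


def lineages_at(times, n, t):
    """Number of ancestral lineages remaining at time t in the past."""
    return n - sum(1 for event_time in times if event_time <= t)


def sample_conditional_times(n, t, m, rng, max_tries=100_000):
    """Naive rejection: redraw until exactly m lineages remain at time t."""
    for _ in range(max_tries):
        times = coalescence_event_times(n, rng)
        if lineages_at(times, n, t) == m:
            return times
    raise RuntimeError("no accepted tree; the conditioning event may be too rare")
```

Rejection of this kind becomes slow when the conditioning event is rare, which is one reason a purpose-built conditional sampler is preferable in practice.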

9.
Longitudinal samples of DNA sequences are DNA sequences sampled from the same population at different time points. For fast-evolving organisms, e.g. RNA viruses, these kinds of samples have increasingly been used to study the evolutionary process in action. Longitudinal samples provide some interesting new summary statistics of genetic variation, such as the frequency of mutations of size i in one sample and size j in another, the average number of mutations accumulated since the common ancestor of two sequences each from a different sample, and the number of private, shared and fixed mutations within samples. To make the results more applicable, we used in this study a general two-sample model, which assumes two longitudinal samples were taken from the same measurably evolving population. Inspired by the HIV study, we also studied a two-sample-two-stage model, which is a special case of the two-sample model and assumes that a treatment after the first sampling instantaneously changes the population size. We derived the formulas for calculating statistical properties, e.g. expectations, variances and covariances, of these new summary statistics under the two models. Potential applications of these results were discussed.

10.
M. K. Kuhner, J. Yamato, J. Felsenstein. Genetics 1995, 140(4): 1421-1430
We present a new way to make a maximum likelihood estimate of the parameter 4N_eμ (effective population size times mutation rate per site, or θ) based on a population sample of molecular sequences. We use a Metropolis-Hastings Markov chain Monte Carlo method to sample genealogies in proportion to the product of their likelihood with respect to the data and their prior probability with respect to a coalescent distribution. A specific value of θ must be chosen to generate the coalescent distribution, but the resulting trees can be used to evaluate the likelihood at other values of θ, generating a likelihood curve. This procedure concentrates sampling on those genealogies that contribute most of the likelihood, allowing estimation of meaningful likelihood curves based on relatively small samples. The method can potentially be extended to cases involving varying population size, recombination, and migration.
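The step of reusing trees sampled at one value of θ to evaluate the likelihood at other values can be sketched as a reweighting by the ratio of coalescent prior densities. This is a minimal sketch under stated assumptions, not the authors' program: it assumes a constant-size coalescent, branch lengths measured in expected mutations per site (so the data likelihood given a genealogy does not depend on θ), and that the genealogies were already drawn by some sampler at the driving value theta0. The function names and the (k, t_k) representation are hypothetical.

```python
import math


def log_coalescent_prior(intervals, theta):
    """Log prior density of one genealogy under the constant-size coalescent.

    intervals: list of (k, t_k) pairs, where t_k is the time during which
    k lineages existed, in expected mutations per site (an assumption here).
    """
    return sum(math.log(2.0 / theta) - k * (k - 1) * t_k / theta
               for k, t_k in intervals)


def relative_likelihood(genealogies, theta, theta0):
    """Estimate L(theta) / L(theta0) from genealogies sampled at theta0."""
    ratios = [
        math.exp(log_coalescent_prior(g, theta) - log_coalescent_prior(g, theta0))
        for g in genealogies
    ]
    return sum(ratios) / len(ratios)
```

Evaluating `relative_likelihood` over a grid of θ values traces out a relative likelihood curve; the estimate tends to degrade as θ moves far from theta0, because few sampled genealogies then carry most of the weight.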

11.
In a sample of DNA sequences where recombination can occur to the ancestors of the sample, distinct parts of the sequences may have different most recent common ancestors. This paper presents a Markov chain Monte Carlo algorithm for computing the expected time to the most recent common ancestor along the sequences, conditional on where the mutations occur on the sequences.

12.
13.
Y. X. Fu. Genetics 1996, 144(2): 829-838
The number of segregating sites in a sample of DNA sequences and the age of the most recent common ancestor (MRCA) of the sequences in the sample are positively correlated. The value of the former can be used to estimate the value of the latter. Using the coalescent approach, we derive in this paper the joint probability distribution of the number of segregating sites and the age of the MRCA of a sample under the neutral Wright-Fisher model. From this distribution, we are able to compute the likelihood function of the number of segregating sites and the posterior probability of the age of the MRCA of a sample. Three point estimators and one interval estimator of the age of the MRCA are developed; their relationships and properties are investigated. The estimation of the age of the MRCA of human Y chromosomes from a sample of no variation is discussed.
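The positive correlation between the number of segregating sites and the age of the MRCA can be checked by simulation. The sketch below does not reproduce the joint distribution derived in the paper; it assumes the standard neutral coalescent with time in coalescent units and mutations falling on the tree as a Poisson process with rate θ/2 per lineage. The function names are hypothetical.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def simulate_sites_and_tmrca(n, theta, rng):
    """Return (number of segregating sites, age of the MRCA) for one tree."""
    tmrca = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        # k lineages coalesce at rate k(k-1)/2 and contribute k * t_k of branch.
        t_k = rng.expovariate(k * (k - 1) / 2.0)
        tmrca += t_k
        total_length += k * t_k
    return poisson(theta / 2.0 * total_length, rng), tmrca


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Calling `simulate_sites_and_tmrca` many times and passing the two columns to `pearson` gives an empirical estimate of the correlation for a chosen sample size and θ.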

14.
Inference about population history from DNA sequence data has become increasingly popular. For human populations, questions about whether a population has been expanding and when expansion began are often the focus of attention. For viral populations, questions about the epidemiological history of a virus, e.g., HIV-1 and Hepatitis C, are often of interest. In this paper I address the following question: Can population history be accurately inferred from single locus DNA data? An idealised world is considered in which the tree relating a sample of n non-recombining and selectively neutral DNA sequences is observed, rather than just the sequences themselves. This approach provides an upper limit to the information that possibly can be extracted from a sample. It is shown, based on Kingman's (1982a) coalescent process, that consistent estimation of parameters describing population history (e.g., a growth rate) cannot be achieved for increasing sample size, n. This is worse than often found for estimators of genetic parameters, e.g., the mutation rate typically converges at rate \(\) under the assumption that all historical mutations can be observed in the sample. In addition, various results for the distribution of maximum likelihood estimators are presented.

15.
J. Hey. Genetics 1991, 128(4): 831-840
When two samples of DNA sequences are compared, one way in which they may differ is in the presence of fixed differences, which are defined as sites at which all of the sequences in one sample are different from all of the sequences in a second sample. The probability distribution of the number of fixed differences is developed. The theory employs Wright-Fisher genealogies and the infinite sites mutation model. For the case when both samples are drawn randomly from the same population it is found that genealogies permitting fixed differences are very unlikely. Thus the mere presence of fixed differences between samples is statistically significant, even for small samples. The theory is extended to samples from populations that have been separated for some time. The relationship between a simple Poisson distribution of mutations and the distribution of fixed differences is described as a function of the time since populations have been isolated. It is shown how these results may contribute to improved tests of recent balancing or directional selection.
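The definition of a fixed difference translates directly into a count over aligned sites. The sketch below assumes the two samples are given as lists of aligned, equal-length strings; the function name and the input format are illustrative choices, not taken from the paper.

```python
def count_fixed_differences(sample_a, sample_b):
    """Count sites where every sequence in A differs from every sequence in B."""
    length = len(sample_a[0])
    if any(len(seq) != length for seq in sample_a + sample_b):
        raise ValueError("all sequences must be aligned to the same length")
    count = 0
    for site in range(length):
        alleles_a = {seq[site] for seq in sample_a}
        alleles_b = {seq[site] for seq in sample_b}
        # A fixed difference means the two samples share no allele at this site.
        if alleles_a.isdisjoint(alleles_b):
            count += 1
    return count
```

For example, with sample A = `["ACGT", "ACGT"]` and sample B = `["TCGT", "TCGA"]`, only the first site is a fixed difference: the last site is polymorphic in B but still shares the allele T with A.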

16.
Kim Y, Maruki T. Genetics 2011, 189(1): 213-226
A central problem in population genetics is to detect and analyze positive natural selection by which beneficial mutations are driven to fixation. The hitchhiking effect of a rapidly spreading beneficial mutation, which results in local removal of standing genetic variation, allows such an analysis using DNA sequence polymorphism. However, the current mathematical theory that predicts the pattern of genetic hitchhiking relies on the assumption that a beneficial mutation increases to a high frequency in a single random-mating population, which is certainly violated in reality. Individuals in natural populations are distributed over a geographic space. The spread of a beneficial allele can be delayed by limited migration of individuals over the space and its hitchhiking effect can also be affected. To study this effect of geographic structure on genetic hitchhiking, we analyze a simple model of directional selection in a subdivided population. In contrast to previous studies on hitchhiking in subdivided populations, we mainly investigate the range of sufficiently high migration rates that would homogenize genetic variation at neutral loci. We provide a heuristic mathematical analysis that describes how the genealogical structure at a neutral locus linked to the locus under selection is expected to change in a population divided into two demes. Our results indicate that the overall strength of genetic hitchhiking--the degree to which expected heterozygosity decreases--is diminished by population subdivision, mainly because opportunity for the breakdown of hitchhiking by recombination increases as the spread of the beneficial mutation across demes is delayed when migration rate is much smaller than the strength of selection. 
Furthermore, the amount of genetic variation after a selective sweep is expected to be unequal over demes: a greater reduction in expected heterozygosity occurs in the subpopulation from which the beneficial mutation originates than in its neighboring subpopulations. This raises a possibility of detecting a "hidden" geographic structure of population by carefully analyzing the pattern of a selective sweep.

17.
MOTIVATION: B cells responding to antigenic stimulation can fine-tune their binding properties through a process of affinity maturation composed of somatic hypermutation, affinity-selection and clonal expansion. The mutation rate of the B cell receptor DNA sequence, and the effect of these mutations on affinity and specificity, are of critical importance for understanding immune and autoimmune processes. Unbiased estimates of these properties are currently lacking due to the short time-scales involved and the small numbers of sequences available. RESULTS: We have developed a bioinformatic method based on a maximum likelihood analysis of phylogenetic lineage trees to estimate the parameters of a B cell clonal expansion model, which includes somatic hypermutation with the possibility of lethal mutations. Lineage trees are created from clonally related B cell receptor DNA sequences. Important links between tree shapes and underlying model parameters are identified using mutual information. Parameters are estimated using a likelihood function based on the joint distribution of several tree shapes, without requiring a priori knowledge of the number of generations in the clone (which is not available for rapidly dividing populations in vivo). A systematic validation on synthetic trees produced by a mutating birth-death process simulation shows that our estimates are precise and robust to several underlying assumptions. These methods are applied to experimental data from autoimmune mice to demonstrate the existence of hypermutating B cells in an unexpected location in the spleen.

18.
In a previous paper (Klotz et al., 1979) we described a method for determining evolutionary trees from sequence data when rates of evolution of the sequences might differ greatly. It was shown theoretically that the method always gave the correct topology and root when the exact number of mutation differences between sequences and from their common ancestor was known. However, the method is impractical to use in most situations because it requires some knowledge of the ancestor. In this present paper we describe another method, related to the previous one, in which a present-day sequence can serve temporarily as an ancestor for purposes of determining the evolutionary tree regardless of the rates of evolution of the sequences involved. This new method can be carried out with high precision without the aid of a computer, and it does not increase in difficulty rapidly as the number of sequences involved in the study increases, unlike other methods.

19.
Ori Sargsyan. Genetics 2010, 185(4): 1355-1368
The general coalescent tree framework is a family of models for determining ancestries among random samples of DNA sequences at a nonrecombining locus. The ancestral models included in this framework can be derived under various evolutionary scenarios. Here, a computationally tractable full-likelihood-based inference method for neutral polymorphisms is presented, using the general coalescent tree framework and the infinite-sites model for mutations in DNA sequences. First, an exact sampling scheme is developed to determine the topologies of conditional ancestral trees. However, this scheme has some computational limitations and to overcome these limitations a second scheme based on importance sampling is provided. Next, these schemes are combined with Monte Carlo integrations to estimate the likelihood of full polymorphism data, the ages of mutations in the sample, and the time of the most recent common ancestor. In addition, this article shows how to apply this method for estimating the likelihood of neutral polymorphism data in a sample of DNA sequences completely linked to a mutant allele of interest. This method is illustrated using the data in a sample of DNA sequences at the APOE gene locus.

The interest in analyzing polymorphism data in contemporary samples of DNA sequences under various evolutionary scenarios creates a demand to design computationally tractable full-likelihood-based inference methods. For an evolutionary scenario of interest, an ancestral-mutation model can be used to design such a method. The ancestral-mutation model for a sample of DNA sequences at a nonrecombining locus is a combination of two processes: one is an ancestral process that traces the lineages of the sample back in time until the most recent common ancestor, constructing an ancestral tree for the sample. The second is a mutation process that is superimposed on the ancestral tree. The complexities of ancestral-mutation models make the design of such methods challenging.
Full data are used instead of summary statistics, which can result in loss of important information in the data (see Felsenstein 1992; Donnelly and Tavaré 1995). In addition, current methods use specific features of the underlying ancestral-mutation models, so they lose flexibility to be applicable to other ancestral-mutation models.

More specifically, Griffiths and Tavaré (1994c, 1995) and Kuhner et al. (1995) developed full-likelihood-based inference methods for neutral polymorphisms at a nonrecombining locus. They used the combinations of the standard coalescent (Kingman 1982a,b,c; Hudson 1983; Tajima 1983) with the finite-sites or infinite-sites (Watterson 1975) models as ancestral-mutation models. Stephens and Donnelly (2000) designed an importance sampling method to estimate the full likelihood of the data using the same settings for the ancestral-mutation models. Hobolth et al. (2008) provided another importance sampling scheme restricted to the infinite-sites model. The last two methods are computationally more efficient than the first two methods, but they lose flexibility to be applicable to ancestral models without standard coalescent features with independent coalescence waiting times, such as the coalescent processes with exponential growth (Slatkin and Hudson 1991; Griffiths and Tavaré 1994b).

To incorporate the coalescent processes with exponential growth, Kuhner et al. (1998) and Griffiths and Tavaré (1994a, 1999) extended their previous methods. For example, the method of Griffiths and Tavaré (1994a, 1999) allows one to consider ancestral models based on coalescent processes with variable population sizes. Coop and Griffiths (2004) modified this inference method and made it applicable for analyzing full polymorphism data in a sample of DNA sequences from a nonrecombining locus completely linked to a mutant allele of interest, either neutral or under selection.
Additionally, ancestral models have been developed for this type of sample, where the mutant allele is either neutral (Griffiths and Tavaré 1998, 2003; Wiuf and Donnelly 1999; Stephens 2000) or under selection (Slatkin and Rannala 1997; Stephens and Donnelly 2003). The ancestral model of Slatkin and Rannala (1997) is part of a family of ancestral models derived by Thompson (1975), Nee et al. (1994), and Rannala (1997), using a linear birth–death process as an evolutionary process in a population. Although all the ancestral models mentioned above differ in their properties and evolutionary scenarios, they are part of the general coalescent tree framework (Griffiths and Tavaré 1998). Therefore, a computationally tractable full-likelihood-based inference method based on this general framework is of great interest.

For a sample of n sequences, an ancestral model in the general coalescent tree framework is described as a bifurcating rooted tree with n − 1 internal nodes and n leaves, where the internal nodes are coalescent events that happen one at a time. The tree is a combination of two independent components: the topology and the branch lengths. The topology of the tree is constructed going backward in time by combining two randomly chosen ancestral lineages of the sample at each node; the branch lengths of the tree are defined by the joint distribution function of the coalescence waiting times. Note that any density function for coalescence waiting times can define an ancestral model in the general coalescent tree framework.

The n leaves (and the sequences in the sample) are labeled from 1 to n; and the n − 1 internal nodes of the ancestral tree are labeled from 1 to n − 1 (in order of occurrence of the coalescent events backward in time). Thus, the topology of an ancestral tree is a leaf-labeled bifurcating rooted tree with totally ordered interior vertices.
These trees are called topological trees.

When using the general coalescent tree framework and the infinite-sites model, an evolutionary process that generates polymorphism data in a sample of DNA sequences can be described in the following way. An ancestral tree is constructed, as described above, and mutations are added independently on different branches of the ancestral tree as Poisson processes with equal rates, θ/2, in which θ is the mutation rate at the locus. Then, at the mutation events, the ancestral sequences of the sample are changed according to the infinite-sites model; that is, each mutation occurs at a site of an ancestral sequence at which no previous mutations occurred. Thus, these changes define polymorphism data.

Naively, this probabilistic framework can be used to estimate the likelihood of the full observed data in a sample of n sequences. That is, data sets are simulated independently as described above and each simulated data set is compared to the observed data. The proportion of the simulated data sets that match the observed data is an estimate of the likelihood of the observed data. Although this approach provides an estimate for the likelihood of the observed data, this method is computationally infeasible, because the topologies of the ancestral trees of the generated data sets are sampled from the space of all the possible topological trees with n leaves. This space has size n!(n − 1)!/2^(n − 1) (Edwards 1970), which is huge for moderate values of n. The topologies of the ancestral trees of the generated data sets that match the observed data represent a small portion of that space.
Thus, designing a method that samples topologies of the ancestral trees from this subspace can make the method computationally tractable.

On the basis of this idea, I use the general coalescent tree framework with the infinite-sites model to develop a computationally tractable full-likelihood-based inference method for polymorphisms in DNA sequences at a nonrecombining locus. First, an exact sampling scheme for topologies of the conditional ancestral trees is developed. This method has some computational limitations, so to overcome these limitations a second scheme based on an importance sampling is provided. These sampling schemes are combined with Monte Carlo integrations to estimate the likelihood of the full data, the ages of the mutations in the sample, and the time of the most recent common ancestor of the sample. I describe an application of this method for neutral polymorphism data in a sample of DNA sequences at a nonrecombining locus that is completely linked to a mutant allele of interest, either neutral or under selection. The method is illustrated using the data in a sample of DNA sequences at the APOE gene locus from Fullerton et al. (2000).
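To give a sense of why naive simulation over all topological trees is infeasible, the count n!(n − 1)!/2^(n − 1) attributed above to Edwards (1970) can be evaluated directly. As a cross-check, the sketch also builds the same number by multiplying the ways to choose which pair of lineages coalesces at each step, C(k, 2) for k = n, ..., 2. The function names are hypothetical.

```python
from math import comb, factorial


def count_topological_trees(n):
    """Number of leaf-labeled rooted binary trees with ordered internal nodes."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def count_by_pair_choices(n):
    """Same count, built up one coalescent event at a time."""
    total = 1
    for k in range(2, n + 1):
        # With k lineages there are C(k, 2) possible pairs to merge.
        total *= comb(k, 2)
    return total


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, count_topological_trees(n))
```

Already for n = 10 the space contains 2,571,912,000 topological trees, so the fraction compatible with a given data set is typically tiny.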

20.
trees sifter 1.0 implements an approximate method to estimate the time to the most recent common ancestor (TMRCA) of a set of DNA sequences, using population evolution modelling. In essence, the program simulates genealogies with a user-defined model of coalescence of lineages, and then compares each simulated genealogy to the genealogy inferred from the real data, through two summary statistics: (i) the number of mutations on the genealogy (Mn), and (ii) the number of different sequence types (alleles) observed (Kn). The simulated genealogies are then submitted to a rejection algorithm that keeps only those that are the most likely to have generated the observed sequence data. At the end of the process, the accepted genealogies can be used to estimate the posterior probability distribution of the TMRCA.
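The simulate-compare-reject loop described above can be sketched in a few lines. This is not the program's code and does not use its interface: it assumes a constant-size standard coalescent, a fixed known θ, and matches on the number of mutations (Mn) only, ignoring the second statistic Kn. The function names and the `tolerance` parameter are hypothetical.

```python
import random


def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def simulate_genealogy(n, theta, rng):
    """Return (number of mutations, TMRCA in coalescent units) for one tree."""
    tmrca = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        t_k = rng.expovariate(k * (k - 1) / 2.0)
        tmrca += t_k
        total_length += k * t_k
    # Mutations fall on the tree at rate theta / 2 per unit of branch length.
    return poisson(theta / 2.0 * total_length, rng), tmrca


def rejection_tmrca(n, theta, observed_mutations, n_sims, rng, tolerance=0):
    """Keep the TMRCA of simulations whose mutation count is close to the data."""
    accepted = []
    for _ in range(n_sims):
        mutations, tmrca = simulate_genealogy(n, theta, rng)
        if abs(mutations - observed_mutations) <= tolerance:
            accepted.append(tmrca)
    return accepted
```

The accepted values approximate the posterior distribution of the TMRCA given the observed mutation count; a larger `tolerance` raises the acceptance rate at the cost of a cruder approximation.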
