Similar literature
1.
Some of the assumptions underlying estimates of DNA and protein sequence divergence are examined. A solution for the variance of these estimates that allows for different mutation rates and different population sizes in each species and for an arbitrary structure in the initial population is obtained. It is shown that these conditions do not strongly affect estimates of divergence. In general, they cause the variance of divergence to be smaller than a binomial variance. Thus, the binomial variance that is usually assumed for these estimates is safely conservative. It is shown that variability in the mutation rate among sites can have an effect as large as or larger than variability in the mutation rate among bases. Variability in the mutation rate among bases and among sites causes the number of substitutions between two sequences to be underestimated. Protein and DNA sequences from several species are collected to estimate the variability in mutation rates among sites. When many homologous sequences are known, standard methods to estimate this variability can be used. The estimates of this variability show that this factor is important when considering the spectrum of spontaneous mutations and is strongly reflected in the divergence of sequences. Smaller variability is found for the third position of codons than for the first and second codon positions. This may be because of less selective constraints on this position or because the third position has been saturated with mutations for the sequences examined.
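The underestimation described in this abstract can be illustrated with a small sketch. It assumes the Jukes-Cantor substitution model and gamma-distributed rates among sites, neither of which is named in the abstract; the function names and the values of `p` and `alpha` are hypothetical and chosen only for illustration.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor distance (substitutions per site) from the proportion p
    of differing sites, assuming every site evolves at the same rate."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc69_gamma_distance(p, alpha):
    """Jukes-Cantor distance when rates among sites follow a gamma
    distribution with shape parameter alpha (smaller alpha = more variable)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

# Hypothetical example: 30% of sites differ between two sequences.
p = 0.30
d_equal = jc69_distance(p)                   # ignores rate variation
d_gamma = jc69_gamma_distance(p, alpha=0.5)  # allows strong rate variation

print(f"equal rates: {d_equal:.4f}  gamma rates: {d_gamma:.4f}")
```

Under these assumptions, the distance that ignores rate variation is the smaller of the two, which is the direction of bias the abstract reports.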

2.
The ratio of singletons to the total number of segregating sites is used to estimate a reproduction parameter in a population model of large offspring numbers without having to jointly estimate the mutation rate. For neutral genetic variation, the ratio of singletons to the total number of segregating sites is equivalent to the ratio of total length of external branches to the total length of the gene genealogy. A multinomial maximum likelihood method that takes into account more frequency classes than just the singletons is developed to estimate the parameter of another large offspring number model. The performance of these methods with regard to sample size, mutation rate, and bias is investigated by simulation. The expected value of the ratio of the total length of external branches to the total length of the whole tree is, using simulation, shown to decrease for the Kingman coalescent as sample size increases, but can increase or decrease, depending on parameter values, for Λ coalescents. Considering ratios of tree statistics, as opposed to considering lengths of various subtrees separately, can yield better insight into the dynamics of gene genealogies.
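The Kingman-coalescent part of this result can be explored with a short simulation. The sketch below is not the authors' code: it simulates only the standard Kingman coalescent (Λ coalescents are not covered), and the function names, replicate counts, and seed are hypothetical choices.

```python
import random

def kingman_external_ratio(n, rng):
    """Simulate one Kingman genealogy for n sampled sequences and return
    (total external branch length) / (total tree length)."""
    # A lineage is external while it is still ancestral to exactly one sample.
    is_external = [True] * n
    external = 0.0
    total = 0.0
    k = n
    while k > 1:
        # Waiting time until the next coalescence among k lineages.
        t = rng.expovariate(k * (k - 1) / 2.0)
        total += k * t
        external += sum(is_external) * t
        # Merge two randomly chosen lineages into one internal lineage.
        i, j = rng.sample(range(k), 2)
        for idx in sorted((i, j), reverse=True):
            is_external.pop(idx)
        is_external.append(False)
        k -= 1
    return external / total

def mean_external_ratio(n, reps=2000, seed=1):
    """Monte Carlo estimate of the expected external/total length ratio."""
    rng = random.Random(seed)
    return sum(kingman_external_ratio(n, rng) for _ in range(reps)) / reps

for n in (5, 20, 50):
    print(n, round(mean_external_ratio(n), 3))
```

With these settings the estimated ratio falls as the sample size grows, in line with the behaviour the abstract describes for the Kingman coalescent.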

3.
We would like to use maximum likelihood to estimate parameters such as the effective population size N(e) or, if we do not know mutation rates, the product 4N(e)μ of mutation rate per site and effective population size. To compute the likelihood for a sample of unrecombined nucleotide sequences taken from a random-mating population it is necessary to sum over all genealogies that could have led to the sequences, computing for each one the probability that it would have yielded the sequences, and weighting each one by its prior probability. The genealogies vary in tree topology and in branch lengths. Although the likelihood and the prior are straightforward to compute, the summation over all genealogies seems at first sight hopelessly difficult. This paper reports that it is possible to carry out a Monte Carlo integration to evaluate the likelihoods approximately. The method uses bootstrap sampling of sites to create data sets for each of which a maximum likelihood tree is estimated. The resulting trees are assumed to be sampled from a distribution whose height is proportional to the likelihood surface for the full data. That it will be so is dependent on a theorem which is not proven, but seems likely to be true if the sequences are not short. One can use the resulting estimated likelihood curve to make a maximum likelihood estimate of the parameter of interest, N(e) or 4N(e)μ. The method requires at least 100 times the computational effort required for estimation of a phylogeny by maximum likelihood, but is practical on today's workstations. The method does not at present have any way of dealing with recombination.

4.
Maximum Likelihood Estimation of Population Parameters
Y. X. Fu, W. H. Li. Genetics, 1993, 134(4): 1261-1270
One of the most important parameters in population genetics is θ = 4N(e)μ where N(e) is the effective population size and μ is the rate of mutation per gene per generation. We study two related problems, using the maximum likelihood method and the theory of coalescence. One problem is the potential improvement of accuracy in estimating the parameter θ over existing methods and the other is the estimation of parameter λ which is the ratio of two θ's. The minimum variances of estimates of the parameter θ are derived under two idealized situations. These minimum variances serve as the lower bounds of the variances of all possible estimates of θ in practice. We then show that Watterson's estimate of θ based on the number of segregating sites is asymptotically an optimal estimate of θ. However, for a finite sample of sequences, substantial improvement over Watterson's estimate is possible when θ is large. The maximum likelihood estimate of λ = θ(1)/θ(2) is obtained and the properties of the estimate are discussed.
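For reference, Watterson's estimate mentioned in this abstract divides the number of segregating sites S by the harmonic sum a_n = 1 + 1/2 + ... + 1/(n-1), where n is the number of sampled sequences. A minimal sketch follows; the function name and the example numbers are hypothetical, and the improved estimators studied in the paper are not shown.

```python
def watterson_theta(num_segregating_sites, sample_size):
    """Watterson's estimate of theta = 4*N(e)*mu for a locus, from the number
    of segregating sites S in a sample of n sequences: S / a_n."""
    if sample_size < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, sample_size))
    return num_segregating_sites / a_n

# Hypothetical example: 11 segregating sites observed in 4 sequences.
# a_4 = 1 + 1/2 + 1/3, so the estimate is 11 / (11/6) = 6.
print(watterson_theta(11, 4))
```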

5.
The extent to which natural selection shapes diversity within populations is a key question for population genetics. Thus, there is considerable interest in quantifying the strength of selection. A full likelihood approach for inference about selection at a single site within an otherwise neutral fully linked sequence of sites is described here. A coalescent model of evolution is used to model the ancestry of a sample of DNA sequences which have the selected site segregating. The mutation model, for the selected and neutral sites, is the infinitely many-sites model where there is no back or parallel mutation at sites. A unique perfect phylogeny, a gene tree, can be constructed from the configuration of mutations on the sample sequences under this model of mutation. The approach is general and can be used for any bi-allelic selection scheme. Selection is incorporated through modelling the frequency of the selected and neutral allelic classes stochastically back in time, then using a subdivided population model considering the population frequencies through time as variable population sizes. An importance sampling algorithm is then used to explore over coalescent tree space consistent with the data. The method is applied to a simulated data set and the gene tree presented in Verrelli et al. (2002).

6.
Rannala B, Yang Z. Genetics, 2003, 164(4): 1645-1656
The effective population sizes of ancestral as well as modern species are important parameters in models of population genetics and human evolution. The commonly used method for estimating ancestral population sizes, based on counting mismatches between the species tree and the inferred gene trees, is highly biased as it ignores uncertainties in gene tree reconstruction. In this article, we develop a Bayes method for simultaneous estimation of the species divergence times and current and ancestral population sizes. The method uses DNA sequence data from multiple loci and extracts information about conflicts among gene tree topologies and coalescent times to estimate ancestral population sizes. The topology of the species tree is assumed known. A Markov chain Monte Carlo algorithm is implemented to integrate over uncertain gene trees and branch lengths (or coalescence times) at each locus as well as species divergence times. The method can handle any species tree and allows different numbers of sequences at different loci. We apply the method to published noncoding DNA sequences from the human and the great apes. There are strong correlations between posterior estimates of speciation times and ancestral population sizes. With the use of an informative prior for the human-chimpanzee divergence date, the population size of the common ancestor of the two species is estimated to be approximately 20,000, with a 95% credibility interval (8000, 40,000). Our estimates, however, are affected by model assumptions as well as data quality. We suggest that reliable estimates have yet to await more data and more realistic models.

7.
A maximum likelihood framework for estimating site-specific substitution rates is presented that does not require any prior assumptions about the rate distribution. We show that, when the branching pattern of the underlying tree is known, the analysis of pairs of positions is sufficient to estimate site-specific rates. In the absence of a known topology, we introduce an iterative procedure to estimate simultaneously the branching pattern, the branch lengths, and site-specific substitution rates. Simulations show that the evolutionary rate of fast-evolving sites can be reliably inferred and that the accuracy of rate estimates depends mainly on the number of sequences in the data set. Thus, large sets of aligned sequences are necessary for reliable site-specific rate estimates. The method is applied to the complete mitochondrial DNA sequence of 53 humans, providing a complete picture of the site-specific substitution rates in human mitochondrial DNA.

8.
The frequency distribution of pairwise differences between sequences of mtDNA has recently been used to estimate the size of human populations before and after a hypothetical episode of rapid population growth and the time at which the population grew. To test the internal consistency of this method, we used three different sets of human mtDNA data and the corresponding demographic parameters estimated from the distribution of pairwise differences to determine by simulation the expected number of segregating sites, S, and its empirical distribution. The results indicate that the observed values of S are significantly lower than expected in two of three cases under the assumption of the infinite-sites model. Further simulations in which mutations were allowed to occur more than once at the same site and in which there was variation in mutation rate among sites show that the expected number of segregating sites can be much lower than under the infinite-sites assumption. Nevertheless, the observed value of S is still significantly different from the value expected under the expansion hypothesis in two of three cases.
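The two summary statistics compared in this abstract, the distribution of pairwise differences and the number of segregating sites S, can both be computed directly from an alignment. The sketch below uses a toy alignment invented for illustration; the function names are hypothetical and the demographic simulations of the paper are not reproduced.

```python
from collections import Counter
from itertools import combinations

def segregating_sites(seqs):
    """Number of alignment columns at which the sequences are not all identical."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def mismatch_distribution(seqs):
    """Map each pairwise difference count to the number of sequence pairs showing it."""
    counts = Counter(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in combinations(seqs, 2)
    )
    return dict(sorted(counts.items()))

# Toy alignment (made up for illustration, not real mtDNA data).
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTACGT"]

S = segregating_sites(toy_alignment)
mismatches = mismatch_distribution(toy_alignment)
num_pairs = sum(mismatches.values())
mean_pairwise = sum(k * v for k, v in mismatches.items()) / num_pairs

print(S, mismatches, round(mean_pairwise, 3))
```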

9.
We show that the number of segregating sites is a sufficient statistic for the scaled mutation parameter (θ) in the limit as the number of sites tends to infinity and there is free recombination between sites. We assume that the mutation parameter at each site tends to zero such that the total mutation parameter (θ) is constant in the limit. Our results show that Watterson’s estimator is the maximum likelihood estimator in this case, but that it estimates a composite parameter which is different for different mutation models. Some of our results hold when recombination is limited, because Watterson’s estimator is an unbiased, method-of-moments estimator regardless of the recombination rate. The quantity it estimates depends on the details of how mutations occur at each site.

10.
Statistical Properties of a DNA Sample under the Finite-Sites Model
Z. Yang. Genetics, 1996, 144(4): 1941-1950
Statistical properties of a DNA sample from a random-mating population of constant size are studied under the finite-sites model. It is assumed that there is no migration and no recombination occurs within the locus. A Markov process model is used for nucleotide substitution, allowing for multiple substitutions at a single site. The evolutionary rates among sites are treated as either constant or variable. The general likelihood calculation using numerical integration involves intensive computation and is feasible for three or four sequences only; it may be used for validating approximate algorithms. Methods are developed to approximate the probability distribution of the number of segregating sites in a random sample of n sequences, with either constant or variable substitution rates across sites. Calculations using parameter estimates obtained for human D-loop mitochondrial DNAs show that among-site rate variation has a major effect on the distribution of the number of segregating sites; the distribution under the finite-sites model with variable rates among sites is quite different from that under the infinite-sites model.

11.
Single nucleotide polymorphism (SNP) data can be used for parameter estimation via maximum likelihood methods as long as the way in which the SNPs were determined is known, so that an appropriate likelihood formula can be constructed. We present such likelihoods for several sampling methods. As a test of these approaches, we consider use of SNPs to estimate the parameter Theta = 4N(e)μ (the scaled product of effective population size and per-site mutation rate), which is related to the branch lengths of the reconstructed genealogy. With infinite amounts of data, ML models using SNP data are expected to produce consistent estimates of Theta. With finite amounts of data the estimates are accurate when Theta is high, but tend to be biased upward when Theta is low. If recombination is present and not allowed for in the analysis, the results are additionally biased upward, but this effect can be removed by incorporating recombination into the analysis. SNPs defined as sites that are polymorphic in the actual sample under consideration (sample SNPs) are somewhat more accurate for estimation of Theta than SNPs defined by their polymorphism in a panel chosen from the same population (panel SNPs). Misrepresenting panel SNPs as sample SNPs leads to large errors in the maximum likelihood estimate of Theta. Researchers collecting SNPs should collect and preserve information about the method of ascertainment so that the data can be accurately analyzed.

12.
Mitochondrial D-loop hypervariable region I (HVI) sequences are widely used in human molecular evolutionary studies, and therefore accurate assessment of rate heterogeneity among sites is essential. We used the maximum-likelihood method to estimate the gamma shape parameter alpha for variable substitution rates among sites for HVI from humans and chimpanzees to provide estimates for future studies. The complete data of 839 humans and 224 chimpanzees, as well as many subsets of these data, were analyzed to examine the effect of sequence sampling. The effects of the genealogical tree and the nucleotide substitution model were also examined. The transition/transversion rate ratio (kappa) is estimated to be about 25, although much larger and biased estimates were also obtained from small data sets at low divergences. Estimates of alpha were 0.28-0.39 for human data sets of different sizes and 0.20-0.39 for data sets including different chimpanzee subspecies. The combined data set of both species gave estimates of 0.42-0.45. While all those estimates suggest highly variable substitution rates among sites, smaller samples tend to give smaller estimates of alpha. Possible causes for this pattern were examined, such as biases in the estimation procedure and shifts in the rate distribution along certain lineages. Computer simulations suggest that the estimation procedure is quite reliable for large trees but can be biased for small samples at low divergences. Thus, an alpha of 0.4 appears suitable for both humans and chimpanzees. Estimates of alpha can be affected by the nucleotide sites included in the data, the overall tree length (the amount of sequence divergence), the number of rate classes used for the estimation, and to a lesser extent, the included sequences. The genealogical tree, the substitution model, and demographic processes such as population expansion do not have much effect.
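To see what a gamma shape parameter near 0.4 implies, one can draw relative site rates from a gamma distribution with mean 1 and look at how many sites evolve very slowly. This is only an illustrative sketch: the function names, the 0.1 threshold, the number of sites, and the comparison value alpha = 5 are assumptions, and no estimation of alpha from sequence data is attempted.

```python
import random

def sample_site_rates(alpha, num_sites, seed=0):
    """Draw relative substitution rates from a gamma distribution with
    shape alpha and scale 1/alpha, so that the mean rate is 1."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(num_sites)]

def fraction_slow(rates, threshold=0.1):
    """Fraction of sites whose relative rate is below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)

rates_variable = sample_site_rates(alpha=0.4, num_sites=50000)
rates_uniformish = sample_site_rates(alpha=5.0, num_sites=50000)

print(round(fraction_slow(rates_variable), 3),
      round(fraction_slow(rates_uniformish), 3))
```

With a small alpha a large share of sites is nearly invariant while a few evolve very fast, which is why the abstract describes such estimates as indicating highly variable rates among sites.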

13.
A popular approach to detecting positive selection is to estimate the parameters of a probabilistic model of codon evolution and perform inference based on its maximum likelihood parameter values. This approach has been evaluated intensively in a number of simulation studies and found to be robust when the available data set is large. However, uncertainties in the estimated parameter values can lead to errors in the inference, especially when the data set is small or there is insufficient divergence between the sequences. We introduce a Bayesian model comparison approach to infer whether the sequence as a whole contains sites at which the rate of nonsynonymous substitution is greater than the rate of synonymous substitution. We incorporated this probabilistic model comparison into a Bayesian approach to site-specific inference of positive selection. Using simulated sequences, we compared this approach to the commonly used empirical Bayes approach and investigated the effect of tree length on the performance of both methods. We found that the Bayesian approach outperforms the empirical Bayes method when the amount of sequence divergence is small and is less prone to false-positive inference when the sequences are saturated, while the results are indistinguishable for intermediate levels of sequence divergence.

14.
The longitudinal spread of temperate organisms into refugial populations in Southern Europe is generally assumed to predate the last interglacial. However, few studies have attempted to quantify this process in nonmodel organisms using explicit models and multilocus data. We used sequence data for 20 intron-spanning loci (12 kb per individual) to resolve the history of refugial populations of a widespread western Palaearctic oak gall parasitoid Cecidostiba fungosa (Pteromalidae). Using maximum likelihood and Bayesian methods we assess alternative population tree topologies and estimate divergence times and ancestral population sizes under a model of divergence between three refugia (Middle East, Balkans and Iberia). Both methods support an “Out of the East” history for C. fungosa, matching the pattern previously inferred for their gallwasp hosts. However, coalescent-based estimates of the ages of population divides are much more recent (coinciding with the Eemian interglacial) than nodal ages of single gene trees for C. fungosa and other species. We also find that increasing the sample size from one haploid sequence per refugial population to three only marginally improves parameter estimates. Our results suggest that there is significant information in the minimal samples currently analyzable with maximum likelihood methods, and that similar methods could be applied to multiple species to test alternative models of assemblage evolution.

15.
P. D. Keightley. Genetics, 1994, 138(4): 1315-1322
Parameters of continuous distributions of effects and rates of spontaneous mutation for relative viability in Drosophila are estimated by maximum likelihood from data of two published experiments on accumulation of mutations on protected second chromosomes. A model of equal mutant effects gives a poor fit to the data of the two experiments; higher likelihoods are obtained with leptokurtic distributions or for models in which there is more than one class of mutation effect. Minimum estimates of mutation rates (events per generation) at polygenes affecting viability on chromosome 2 are 0.14 and 0.068, but estimates are strongly confounded with other parameters in the model. Separate information on rates of molecular divergence between Drosophila species and from rates of movement of transposable elements is used to infer the overall genomic mutation rate in Drosophila, and the viability data are analyzed with mutation rate as a known parameter. If, for example, a mutation rate for chromosome 2 of 0.4 is assumed, maximum likelihood estimates of mean mutant effect on relative viability are 0.4% and 1%, but the majority of mutations have very much smaller effects than these values as distributions are highly leptokurtic. The methodology is applied to estimate viability effects of single P element insertional mutations. The mean effect per insertion is found to be higher, and their distribution is found to be less leptokurtic than for spontaneous mutations. The equilibrium genetic variance of viability predicted by a mutation-selection balance model with parameters estimated from the mutation accumulation experiments is similar to laboratory estimates of genetic variance of viability from natural populations of Drosophila.

16.
Mitochondrial DNA (mtDNA) sequences that include (a) a part of the cytochrome b gene, (b) two tRNA genes, and (c) a part of the noncoding D-loop region of 31 Anguilla japonica (Japanese eel) and 1 A. marmorata collected from Taiwan, Japan, and mainland China were determined to evaluate the population structure of Japanese eel. Among 30 genotypes identified from the 31 Japanese eel mtDNAs sequenced, there are 58 variable sites, predominantly clustered at the D-loop region. The phylogenetic tree constructed by the unweighted pair-group method with arithmetic mean shows neither significant genealogical branches nor geographic clusters. Furthermore, the sequence-statistics test reveals little, if any, significant genetic differentiation. These results indicate that the 31 Japanese eels might come from a single population. Analysis of sequence variation in mtDNA by using the relationship between the number of segregating sites and the average number of nucleotide differences under the neutral mutation hypothesis reveals that neutral mutation acts as a major factor influencing the evolutionary divergence of the Japanese eel mitochondrial genome sequenced, especially in the noncoding region.

17.
Wang J. Genetics, 2006, 173(3): 1679-1692
A variety of estimators have been developed to use genetic marker information in inferring the admixture proportions (parental contributions) of a hybrid population. The majority of these estimators used allele frequency data, ignored molecular information that is available in markers such as microsatellites and DNA sequences, and assumed that mutations are absent since the admixture event. As a result, these estimators may fail to deliver an estimate or give rather poor estimates when admixture is ancient and thus mutations are not negligible. A previous molecular estimator based its inference of admixture proportions on the average coalescent times between pairs of genes taken from within and between populations. In this article I propose an estimator that considers the entire genealogy of all of the sampled genes and infers admixture proportions from the numbers of segregating sites in DNA sequence samples. By considering the genealogy of all sequences rather than pairs of sequences, this new estimator also allows the joint estimation of other interesting parameters in the admixture model, such as admixture time, divergence time, population size, and mutation rate. Comparative analyses of simulated data indicate that the new coalescent estimator generally yields better estimates of admixture proportions than the previous molecular estimator, especially when the parental populations are not highly differentiated. It also gives reasonably accurate estimates of other admixture parameters. A human mtDNA sequence data set was analyzed to demonstrate the method, and the analysis results are discussed and compared with those from previous studies.

18.
Several maximum likelihood and distance matrix methods for estimating phylogenetic trees from homologous DNA sequences were compared when substitution rates at sites were assumed to follow a gamma distribution. Computer simulations were performed to estimate the probabilities that various tree estimation methods recover the true tree topology. The case of four species was considered, and a few combinations of parameters were examined. Attention was applied to discriminating among different sources of error in tree reconstruction, i.e., the inconsistency of the tree estimation method, the sampling error in the estimated tree due to limited sequence length, and the sampling error in the estimated probability due to the number of simulations being limited. Compared to the least squares method based on pairwise distance estimates, the joint likelihood analysis is found to be more robust when rate variation over sites is present but ignored and an assumption is thus violated. With limited data, the likelihood method has a much higher probability of recovering the true tree and is therefore more efficient than the least squares method. The concept of statistical consistency of a tree estimation method and its implications were explored, and it is suggested that, while the efficiency (or sampling error) of a tree estimation method is a very important property, statistical consistency of the method over a wide range of, if not all, parameter values is a prerequisite.

19.
A statistical method of estimating population splitting times is developed in this paper. We consider three populations, with an assumed known tree topology for their phylogenetic tree. From simulation studies, we find that the method of moments performs very well in estimating genetic divergence times, in particular when both the number of loci and the divergence times are large. The bias decreases as the number of loci increases. The maximum likelihood method proves to be a good method for constructing an unknown phylogenetic tree, in particular for large divergence times.

20.
We propose a model-based approach to use multiple gene trees to estimate the species tree. The coalescent process requires that gene divergences occur earlier than species divergences when there is any polymorphism in the ancestral species. Under this scenario, speciation times are restricted to be smaller than the corresponding gene split times. The maximum tree (MT) is the tree with the largest possible speciation times in the space of species trees restricted by available gene trees. If all populations have the same population size, the MT is the maximum likelihood estimate of the species tree. It can be shown that the MT is a consistent estimator of the species tree even when the MT is built upon the estimates of the true gene trees if the gene tree estimates are statistically consistent. The MT converges in probability to the true species tree at an exponential rate.
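The constraint that speciation times cannot exceed the corresponding gene split times can be sketched as follows. The code only computes, for each pair of species, the smallest between-species coalescence time seen across gene trees, which is the upper bound on that pair's speciation time; it does not build the full maximum tree, and the data layout, function name, and example times are hypothetical.

```python
def speciation_time_bounds(gene_tree_times):
    """For each species pair, return the smallest between-species coalescence
    time observed across loci: an upper bound on the pair's speciation time.

    gene_tree_times: list with one dict per locus, mapping a species pair
    (a tuple of two names in sorted order) to the gene split time at that locus.
    """
    bounds = {}
    for locus_times in gene_tree_times:
        for pair, t in locus_times.items():
            bounds[pair] = min(t, bounds.get(pair, float("inf")))
    return bounds

# Hypothetical gene split times for three species at two loci.
loci = [
    {("A", "B"): 1.0, ("A", "C"): 3.0, ("B", "C"): 3.0},  # gene tree ((A,B),C)
    {("A", "B"): 2.5, ("A", "C"): 2.0, ("B", "C"): 2.5},  # gene tree ((A,C),B)
]

bounds = speciation_time_bounds(loci)
print(bounds)
```

In this toy example the tightest bound for each pair comes from whichever locus has the most recent split for that pair, so discordant gene trees still contribute information.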

