Similar literature
20 similar articles found (search time: 15 ms)
1.
A series of new results useful to the study of DNA sequences using Markov models of substitution are presented with proofs. General time-reversible distances can be extended to accommodate any fixed distribution of rates across sites by replacing the logarithmic function of a matrix with the inverse of a moment generating function. Estimators are presented assuming a gamma distribution, the inverse Gaussian distribution, or a mixture of either of these with invariant sites. Also considered are the different ways invariant sites may be removed and how these differences may affect estimated distances. Through collaboration, we implemented these distances into PAUP* in 1994. The variance of these new distances is approximated via the delta method. It is also shown how to predict the divergence expected for a pair of sequences given a rate matrix and a distribution of rates across sites, allowing iterated ML estimates of distances under any reversible model. A simple test of whether a rate matrix is time reversible is also presented. These new methods are used to estimate the divergence time of humans and chimps from mtDNA sequence data. These analyses support suggestions that the human lineage has an enhanced transition rate relative to other hominoids. These studies also show that transversion distances differ substantially from the overall distances, which are dominated by transitions. Transversions alone apparently suggest a very recent divergence time for humans versus chimps and/or a very old (>16 myr) divergence time for humans versus orangutans. This work illustrates graphically ways to interpret the reliability of distance-based transformations, using the corrected transition to transversion ratio returned for pairs of sequences which are successively more diverged.
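
As an illustration of replacing the logarithm with the inverse of a moment generating function, the sketch below uses the simplest case, the Jukes-Cantor model, rather than the general time-reversible distances the abstract describes. The function names are hypothetical, and the closed form assumes gamma-distributed rates with mean 1 and shape alpha.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance for equal rates across sites.

    p is the observed proportion of differing sites. With equal rates the
    moment generating function is exp(s), so its inverse is the logarithm.
    """
    x = 1.0 - 4.0 * p / 3.0
    if x <= 0.0:
        raise ValueError("p is too large for the distance to be defined")
    return -0.75 * math.log(x)


def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates (mean 1, shape alpha).

    The gamma moment generating function is M(s) = (1 - s/alpha)**(-alpha).
    Solving 1 - 4p/3 = M(-4d/3) for d replaces the logarithm above with the
    inverse of M.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    x = 1.0 - 4.0 * p / 3.0
    if x <= 0.0:
        raise ValueError("p is too large for the distance to be defined")
    return 0.75 * alpha * (x ** (-1.0 / alpha) - 1.0)
```

For a fixed p the gamma-corrected distance is larger than the equal-rates distance, and the two agree as alpha grows large, which matches the idea that strong rate heterogeneity hides more substitutions.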

2.
Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time $t$ since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when $t \approx \tfrac{1}{2}\,|\lambda_{\max}|^{-1} \log N$, where $\lambda_{\max}$ is the eigenvalue of the substitution rate matrix with the largest absolute value and $N$ is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when $t \approx \tfrac{1}{2}\,|\lambda_{1}|^{-1} \log N$, where $\lambda_{1}$ is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.
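
To make the two thresholds concrete, the following sketch evaluates them for a Kimura two-parameter rate matrix, whose nonzero eigenvalues are -4b and -2(a + b) for transition rate a and transversion rate b. The choice of model, the function names, and the reading of "log" as the natural logarithm are assumptions made for illustration; the abstract does not specify them.

```python
import math


def k2p_nonzero_eigenvalues(a, b):
    """Nonzero eigenvalues of a Kimura two-parameter rate matrix.

    a is the transition rate and b is the rate of each transversion type.
    """
    return (-4.0 * b, -2.0 * (a + b))


def critical_time(eigenvalue, n_sites):
    """Time at which errors become significant: 0.5 * log(N) / |eigenvalue|."""
    return 0.5 * math.log(n_sites) / abs(eigenvalue)


def thresholds(a, b, n_sites):
    """Return (log-transform threshold, fixed-rate ML threshold).

    The first uses the eigenvalue of largest absolute value, the second the
    nonzero eigenvalue of smallest absolute value.
    """
    eigs = k2p_nonzero_eigenvalues(a, b)
    lam_max = max(eigs, key=abs)
    lam_1 = min(eigs, key=abs)
    return critical_time(lam_max, n_sites), critical_time(lam_1, n_sites)
```

With a = 10, b = 1 and N = 1000, the likelihood threshold is 5.5 times the log transform threshold (the ratio of the two eigenvalue magnitudes, 22/4), which illustrates why a large transition to transversion ratio favors the likelihood approach.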

3.
Various estimates of the time at which the human mitochondrial Eve lived have ranged from as little as 60,000 yr to more than 500,000 yr ago. Because of this immense range, it is impossible to distinguish between single-origin and multiple-origins hypotheses for the evolution of our species. In an attempt to reduce the uncertainty, I have examined the largest available body of sequence information, comprising the mitochondrial control region, for clues to how the observed diversity arose. In this region it is possible to show, by examining the distribution of polymorphic sites, that transitions have occurred at some sites at a much higher rate than at others. Computer simulations can, when two rates for transitions are postulated, provide close approximations to the distribution of substitutions seen in the actual data. The “best fit” was obtained when the rate at ¾ of the sites was 4 times the transversion rate, and the rate at the remainder 160 times the transversion rate. The likelihood of such a high rate at some sites helps to explain why tree-building methods employing these data have provided so little phylogenetic information. Furthermore, it is possible to show that transversions do not appear to occur preferentially at these transition “hot-spot” sites and that such huge differences in substitution rates are not seen for transversions, suggesting that the rules governing the mutation and acceptance or rejection of transversions are different from those governing transitions. The great majority of transversions appear to occur at a low rate throughout the region. Thus, methods for determining the age of Eve that are based on recent divergence in human populations, or on applying a mutation probability matrix based on an assumption of uniform mutation rates, are likely to result in underestimates. The rate of accumulation of transversions is shown to be a more accurate estimator of the age of Eve. 
The conclusion is reached that Eve probably lived (depending on when the ancestors of humans and chimpanzees diverged) between 436,000 and 806,000 yr ago.

4.
Invariant sites are a common feature of amino acid sequence evolution. The presence of invariant sites is frequently attributed to the need to preserve function through site-specific conservation of amino acid residues. Amino acid substitution models without a provision for invariant sites often fit the data significantly worse than those that allow for an excess of invariant sites beyond those predicted by models that only incorporate rate variation among sites (e.g., a Gamma distribution). An alternative is epistasis between sites to preserve residue interactions that can create invariant sites. Through computer-simulated sequence evolution, we evaluated the relative effects of site-specific preferences and site-site couplings in the generation of invariant sites and the modulation of the rate of molecular evolution. In an analysis of ten major families of protein domains with diverse sequence and functional properties, we find that the negative selection imposed by epistasis creates many more invariant sites than site-specific residue preferences alone. Further, epistasis plays an increasingly larger role in creating invariant sites over longer evolutionary periods. Epistasis also dictates rates of domain evolution over time by exerting significant additional purifying selection to preserve site couplings. These patterns illuminate the mechanistic role of epistasis in the processes underlying observed site invariance and evolutionary rates.

5.
Mitochondrial D-loop hypervariable region I (HVI) sequences are widely used in human molecular evolutionary studies, and therefore accurate assessment of rate heterogeneity among sites is essential. We used the maximum-likelihood method to estimate the gamma shape parameter alpha for variable substitution rates among sites for HVI from humans and chimpanzees to provide estimates for future studies. The complete data of 839 humans and 224 chimpanzees, as well as many subsets of these data, were analyzed to examine the effect of sequence sampling. The effects of the genealogical tree and the nucleotide substitution model were also examined. The transition/transversion rate ratio (kappa) is estimated to be about 25, although much larger and biased estimates were also obtained from small data sets at low divergences. Estimates of alpha were 0.28-0.39 for human data sets of different sizes and 0.20-0.39 for data sets including different chimpanzee subspecies. The combined data set of both species gave estimates of 0.42-0.45. While all those estimates suggest highly variable substitution rates among sites, smaller samples tend to give smaller estimates of alpha. Possible causes for this pattern were examined, such as biases in the estimation procedure and shifts in the rate distribution along certain lineages. Computer simulations suggest that the estimation procedure is quite reliable for large trees but can be biased for small samples at low divergences. Thus, an alpha of 0.4 appears suitable for both humans and chimpanzees. Estimates of alpha can be affected by the nucleotide sites included in the data, the overall tree length (the amount of sequence divergence), the number of rate classes used for the estimation, and to a lesser extent, the included sequences. The genealogical tree, the substitution model, and demographic processes such as population expansion do not have much effect.

6.
This paper presents a maximum likelihood approach to estimating the variation of substitution rate among nucleotide sites. We assume that the rate varies among sites according to an invariant+gamma distribution, which has two parameters: the gamma parameter $\alpha$ and the proportion of invariable sites $\theta$. Theoretical treatments on three, four, and five sequences have been conducted, and computer programs have been developed. It is shown that $\rho = (1 + \theta\alpha)/(1 + \alpha)$ is a good measure for the rate heterogeneity among sites. Extensive simulations show that (1) if the proportion of invariable sites is negligible, i.e., $\theta = 0$, the gamma parameter $\alpha$ can be satisfactorily estimated, even with three sequences; (2) if the proportion of invariable sites is not negligible, the heterogeneity $\rho$ can still be suitably estimated with four or more sequences; and (3) the distances estimated by the proposed method are almost unbiased and are robust against violation of the assumption of the invariant+gamma distribution.
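
The heterogeneity measure quoted above is simple enough to evaluate directly. The helper below is a minimal sketch with a hypothetical function name; it only computes rho from given values of alpha and theta and does not reproduce the paper's estimation procedure.

```python
def rate_heterogeneity(alpha, theta):
    """Return rho = (1 + theta * alpha) / (1 + alpha).

    alpha is the gamma shape parameter (positive) and theta is the
    proportion of invariable sites (between 0 and 1).
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return (1.0 + theta * alpha) / (1.0 + alpha)
```

Under this formula rho approaches 0 when theta = 0 and alpha is very large (nearly equal rates), and equals 1 when every site is invariable (theta = 1).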

7.
A codon-based model of nucleotide substitution for protein-coding DNA sequences   Total citations: 34 (self-citations: 23; citations by others: 11)
A codon-based model for the evolution of protein-coding DNA sequences is presented for use in phylogenetic estimation. A Markov process is used to describe substitutions between codons. Transition/transversion rate bias and codon usage bias are allowed in the model, and selective restraints at the protein level are accommodated using physicochemical distances between the amino acids coded for by the codons. Analyses of two data sets suggest that the new codon-based model can provide a better fit to data than can nucleotide-based models and can produce more reliable estimates of certain biologically important measures such as the transition/transversion rate ratio and the synonymous/nonsynonymous substitution rate ratio.

8.
Much effort and interest have focused on assessing the importance of natural selection, particularly positive natural selection, in shaping the human genome. Although scans for positive selection have identified candidate loci that may be associated with positive selection in humans, such scans do not indicate whether adaptation is frequent in general in humans. Studies based on the reasoning of the McDonald–Kreitman test, which, in principle, can be used to evaluate the extent of positive selection, suggested that adaptation is detectable in the human genome but that it is less common than in Drosophila or Escherichia coli. Both positive and purifying natural selection at functional sites should affect levels and patterns of polymorphism at linked nonfunctional sites. Here, we search for these effects by analyzing patterns of neutral polymorphism in humans in relation to the rates of recombination, functional density, and functional divergence with chimpanzees. We find that the levels of neutral polymorphism are lower in the regions of lower recombination and in the regions of higher functional density or divergence. These correlations persist after controlling for the variation in GC content, density of simple repeats, selective constraint, mutation rate, and depth of sequencing coverage. We argue that these results are most plausibly explained by the effects of natural selection at functional sites (either recurrent selective sweeps or background selection) on the levels of linked neutral polymorphism. Natural selection at both coding and regulatory sites appears to affect linked neutral polymorphism, reducing neutral polymorphism by 6% genome-wide and by 11% in the gene-rich half of the human genome. These findings suggest that the effects of natural selection at linked sites cannot be ignored in the study of neutral human polymorphism.

9.
Mitochondrial 16S (~550 bp) and cytochrome oxidase I (COI) (~700 bp) sequences were utilized as markers to reconstruct a phylogeography for representative populations or biotypes of Bemisia tabaci. 16S sequences exhibited less divergence than COI sequences. Of the 429 characters examined for COI sequences, 185 sites were invariant, 244 were variable and 108 were informative. COI sequence identities yielded distances ranging from less than 1% to greater than 17%. Whitefly 16S sequences of 456 characters were analysed which consisted of 298 invariant sites, 158 variable sites and 53 informative sites. Phylogenetic analyses conducted by maximum parsimony, maximum-likelihood and neighbour-joining methods yielded almost identical phylogenetic reconstructions of trees that separated whiteflies based on geographical origin. The 16S and COI sequence data indicate that the B-biotype originated in the Old World (Europe, Asia and Africa) and is most closely related to B-like variants from Israel and Yemen, with the next closest relative being a biotype from Sudan. These data confirm the biochemical, genetic and behavioural polymorphisms described previously for B. tabaci. The consideration of all global variants of B. tabaci as a highly cryptic group of sibling species is argued.
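
The counts of invariant, variable and informative sites reported above can be reproduced for any alignment with a short tally. The sketch below is an assumption about how such counts are typically made, not the authors' procedure: it treats "informative" as parsimony-informative (at least two bases each present in at least two sequences), counts informative sites as a subset of variable sites, expects upper-case input, and ignores gaps and ambiguity codes. The function name is hypothetical.

```python
from collections import Counter


def classify_sites(alignment):
    """Count (invariant, variable, parsimony-informative) alignment columns.

    alignment is a list of equal-length, upper-case nucleotide strings.
    Characters other than A, C, G and T are ignored within a column.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must have equal length")

    invariant = variable = informative = 0
    for column in zip(*alignment):
        counts = Counter(base for base in column if base in "ACGT")
        if len(counts) <= 1:
            invariant += 1
            continue
        variable += 1
        # Informative: at least two bases, each seen in two or more sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return invariant, variable, informative
```

With this convention the invariant and variable counts sum to the alignment length, as they do in the abstract (185 + 244 = 429 for COI and 298 + 158 = 456 for 16S).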

10.
We propose and study a new approach for the analysis of families of protein sequences. This method is related to the LogDet distances used in phylogenetic reconstructions; it can be viewed as an attempt to embed these distances into a multidimensional framework. The proposed method starts by associating a Markov matrix to each pairwise alignment deduced from a given multiple alignment. The central objects under consideration here are matrix-valued logarithms L of these Markov matrices, which exist under conditions that are compatible with fairly large divergence between the sequences. These logarithms allow us to compare data from a family of aligned proteins with simple models (in particular, continuous reversible Markov models) and to test the adequacy of such models. If one neglects fluctuations arising from the finite length of sequences, any continuous reversible Markov model with a single rate matrix Q over an arbitrary tree predicts that all the observed matrices L are multiples of Q. Our method exploits this fact, without relying on any tree estimation. We test this prediction on a family of proteins encoded by the mitochondrial genome of 26 multicellular animals, which include vertebrates, arthropods, echinoderms, molluscs, and nematodes. A principal component analysis of the observed matrices L shows that a single rate model can be used as a rough approximation to the data, but that systematic deviations from any such model are unmistakable and related to the evolutionary history of the species under consideration.

11.
As methods of molecular phylogeny have become more explicit and more biologically realistic following the pioneering work of Thomas Jukes, they have had to relax their initial assumption that rates of evolution were equal at all sites. Distance matrix and likelihood methods of inferring phylogenies make this assumption; parsimony, when valid, is less limited by it. Nucleotide sequences, including RNA sequences, can show substantial rate variation; protein sequences show rates that vary much more widely. Assuming a prior distribution of rates such as a gamma distribution or lognormal distribution has deservedly been popular, but for likelihood methods it leads to computational difficulties. These can be resolved using hidden Markov model (HMM) methods which approximate the distribution by one with a modest number of discrete rates. Generalized Laguerre quadrature can be used to improve the selection of rates and their probabilities so as to more nearly approach the desired gamma distribution. A model based on population genetics is presented predicting how the rates of evolution might vary from locus to locus. Challenges for the future include allowing rates at a given site to vary along the tree, as in the "covarion" model, and allowing them to have correlations that reflect three-dimensional structure, rather than position in the coding sequence. Markov chain Monte Carlo likelihood methods may be the only practical way to carry out computations for these models. Received: 8 February 2001 / Accepted: 20 May 2001

12.
A model of nucleotide substitution that allows the transition/transversion rate bias to vary across sites was constructed. We examined the fit of this model using likelihood-ratio tests by analyzing 13 protein coding genes and 1 pseudogene. Likelihood-ratio testing indicated that a model that allows variation in the transition/transversion rate bias across sites provided a significant improvement in fit for most protein coding genes but not for the pseudogene. When the analysis was repeated with parameters estimated separately for first, second, and third codon positions, strong heterogeneity was uncovered for the first and second codon positions; the variation in the transition/transversion rate was generally weaker at the third codon position. The transition rate bias and branch lengths were underestimated when variation in the transition/transversion rate was not accommodated, suggesting that it may be important to accommodate variation in the pattern of nucleotide substitution for accurate estimation of evolutionary parameters. Received: 4 November 1997 / Accepted: 19 May 1998

13.
The hypervariable segment I of the control region of the mtDNA (positions 16024-16383) was amplified from hair roots by PCR and sequenced in 45 unrelated individuals from Anatolia (Asian Turkey). Forty different sequences were found, defined by 56 variable positions, of which only one involves a transversion. The neighbor-joining tree of Kimura's distance matrix for all sequences shows four main clusters. Cluster D was found to be the most statistically robust of the four, and all the sequences in it shared a mutation that is present only in European and West Asian populations. The variability in cluster D could have originated between 37,000 and 107,000 years ago. No branch is unexpectedly long, denoting the absence of sequences that diverged much before the others. The pairwise difference distribution is bell-shaped, in accordance with a population expansion occurring roughly 35,000 to 100,000 years ago. When compared to other Caucasoid populations through the pairwise difference distribution, there is a pattern from the Middle East (older expansion) to the various European populations, with Turkey in an intermediate position; when Turkish sequences are compared through a neighbor-joining tree on a genetic distance matrix of populations, this position is again evidenced. Although there is a very low level of genetic divergence among Caucasoid populations as shown by mtDNA control region sequences, a geographic pattern of genetic variation emerges, denoting a stepping-stone position of Turkey between the Middle East and Europe, which is in agreement with the hypothesis of a replacement of Neanderthals by modern humans, which could be related to the Upper Paleolithic cultural expansion.
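
The pairwise difference distribution mentioned above is a histogram of the number of differing positions over all pairs of sequences. The sketch below shows one straightforward way to tabulate it, assuming aligned sequences of equal length; the function names are hypothetical and no dating or demographic model is implied.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def mismatch_distribution(sequences):
    """Map each difference count to the number of sequence pairs showing it."""
    return Counter(
        pairwise_differences(a, b) for a, b in combinations(sequences, 2)
    )


def mean_pairwise_difference(sequences):
    """Average number of differences over all pairs of sequences."""
    dist = mismatch_distribution(sequences)
    n_pairs = sum(dist.values())
    if n_pairs == 0:
        raise ValueError("need at least two sequences")
    return sum(k * n for k, n in dist.items()) / n_pairs
```

A single-peaked, bell-shaped histogram from such a tally is the pattern the abstract interprets as a signature of population expansion.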

14.
A new class of lowly repetitive DNA sequences has been detected in the primate genome. The renaturation rate of this sequence class is practically indistinguishable from the renaturation rate of single-copy sequences. Consequently, this lowly repetitive sequence class has not been previously observed in DNA renaturation rate studies. This new sequence class is significant in that it might occupy a major fraction of the primate genome. Based on a study of the thermal stabilities of DNA heteroduplexes constructed from human DNA and either bonnet monkey or galago DNAs, we are able to compare the relative mutation rates of repetitive and single-copy sequences in the primate genome. We find that the mutation rate of short, interspersed repetitive sequences is either less than or approximately equal to the mutation rate of single-copy sequences. This implies that the base sequence of these repetitive sequences is important to their biological function. We also find that numerous mutations have accumulated in interspersed repeated sequences since the divergence of galago and human. These mutations are only recognizable because they occur at specific sites in the repeated sequence rather than at random sites in the sequence. Although interspersed repetitive sequences from human and galago can readily cross-hybridize, these site-specific mutations identify them as being two distinct classes. In contrast, far fewer site-specific mutations have occurred since the divergence of human and monkey.

15.
Analyses of complete cytochrome b sequences from all species of cranes (Aves: Gruidae) reveal aspects of sequence evolution in the early stages of divergence. These DNA sequences are ≥89% identical, but expected departures from random substitution are evident. Silent, third-position pyrimidine transitions are the dominant substitution type, with transversions comprising only a small fraction of sequence differences. Substitution patterns are not clearly manifested until divergence has reached a moderate level (>3%), as expected for a stochastic process. Variation in the frequency of mismatch types among lineages decreases at larger divergences, but the level of bias does not decay. Divergence varies up to fivefold among gene regions but is not correlated with structural domain. All protein structural domains except extramembrane 4 display <20% variable residues. Regions corresponding to putative functional domains show the expected conservation of amino acids, although the C-terminal portion of the Q0 reaction center displays several nonconservative replacements. Phylogenetic analyses incorporating substitution asymmetries produced mixed results. Distances estimated with multiple parameters (transition, codon-position, composition, and pyrimidine-transition biases) yielded identical additive tree topologies with comparable bootstrap values, all consistent with uncontroversial species relationships. Maximum likelihood analysis incorporating these biases, as well as equally weighted parsimony analysis, produced similar results. Static, differential weighting for parsimony did not improve the phylogenetic signal but produced unusual trees with low bootstraps. The overall rate of nucleotide substitution varies slightly but significantly among cranes, and calibration of distances against fossil dates suggests divergence rates of 0.7%-1.7% per million years.

16.
Two ways of estimating superimposed fixed mutations in the divergent descent of proteins are examined. One method counts these in terms of a Poisson process operating within selective constraints. The other uses the maximum parsimony method to connect the contemporary sequences through intervening ancestral sequences in an evolutionary tree, and then, from the distribution of fixed mutations in dense regions of this genealogy, estimates how many fixations should be added to sparse regions. An algorithm is described which determines such augmented distances. The two methods yield similar estimates of genetic divergence when tested on a series of cytochrome c amino acid sequences. Within those constraints imposed by Darwinian selection, the dynamic behavior of the evolutionary divergence of proteins is described by the probabilistic pathways of the stochastic model. The parsimony model provides a valid Aufbau-Prinzip for examining which of those pathways occurred along a particular lineage. Concordance of the numerical magnitudes of genetic divergence estimates made by the two methods reveals them as logically consistent complements, not as mutually exclusive antagonists. Both methods indicate that cytochrome c has evolved in a non-uniform manner over geological time and more rapidly than previously estimated.

17.
Fay JC, Benavides JA. Genetics, 2005, 170(4): 1575-1587
Compared to protein-coding sequences, the evolution of noncoding sequences and the selective constraints placed on these sequences are not well characterized. To compare the evolution of coding and noncoding sequences, we have conducted a survey for DNA polymorphism at five randomly chosen loci among a diverse collection of 81 strains of Saccharomyces cerevisiae. Average rates of both polymorphism and divergence are 40% lower at noncoding sites and 90% lower at nonsynonymous sites in comparison to synonymous sites. Although noncoding and coding sequences show substantial variability in ratios of polymorphism to divergence, two of the loci, MLS1 and PDR10, show a higher rate of polymorphism at noncoding compared to synonymous sites. The high rate of polymorphism is not accompanied by a high rate of divergence and is limited to a few small regions. These hypervariable regions include sites with three segregating bases at a single site and adjacent polymorphic sites. We show that this clustering of polymorphic sites is significantly greater than one would expect on the basis of the spacing between polymorphic fourfold degenerate sites. Although hypervariable noncoding sequences could result from selection on regulatory mutations, they could also result from transient mutational hotspots.

18.
Changes in the physical interaction between cis-regulatory DNA sequences and proteins drive the evolution of gene expression. However, it has proven difficult to accurately quantify evolutionary rates of such binding change or to estimate the relative effects of selection and drift in shaping the binding evolution. Here we examine the genome-wide binding of CTCF in four species of Drosophila separated by between ~2.5 and 25 million years. CTCF is a highly conserved protein known to be associated with insulator sequences in the genomes of human and Drosophila. Although the binding preference for CTCF is highly conserved, we find that CTCF binding itself is highly evolutionarily dynamic and has adaptively evolved. Between species, binding divergence increased linearly with evolutionary distance, and CTCF binding profiles are diverging rapidly at the rate of 2.22% per million years (Myr). At least 89 new CTCF binding sites have originated in the Drosophila melanogaster genome since the most recent common ancestor with Drosophila simulans. Comparing these data to genome sequence data from 37 different strains of Drosophila melanogaster, we detected signatures of selection in both newly gained and evolutionarily conserved binding sites. Newly evolved CTCF binding sites show a significantly stronger signature for positive selection than older sites. Comparative gene expression profiling revealed that expression divergence of genes adjacent to CTCF binding sites is significantly associated with the gain and loss of CTCF binding. Further, the birth of new genes is associated with the birth of new CTCF binding sites. Our data indicate that binding of Drosophila CTCF protein has evolved under natural selection, and CTCF binding evolution has shaped both the evolution of gene expression and genome evolution during the birth of new genes.

19.
Mutation rate varies greatly between nucleotide sites of the human genome and depends both on the global genomic location and the local sequence context of a site. In particular, CpG context elevates the mutation rate by an order of magnitude. Mutations also vary widely in their effect on the molecular function, phenotype, and fitness. Independence of the probability of occurrence of a new mutation's effect has been a fundamental premise in genetics. However, highly mutable contexts may be preserved by negative selection at important sites but destroyed by mutation at sites under no selection. Thus, there may be a positive correlation between the rate of mutations at a nucleotide site and the magnitude of their effect on fitness. We studied the impact of CpG context on the rate of human–chimpanzee divergence and on intrahuman nucleotide diversity at non-synonymous coding sites. We compared nucleotides that occupy identical positions within codons of identical amino acids and only differ by being within versus outside CpG context. Nucleotides within CpG context are under a stronger negative selection, as revealed by their lower, proportionally to the mutation rate, rate of evolution and nucleotide diversity. In particular, the probability of fixation of a non-synonymous transition at a CpG site is two times lower than at a non-CpG site. Thus, sites with different mutation rates are not necessarily selectively equivalent. This suggests that the mutation rate may complement sequence conservation as a characteristic predictive of functional importance of nucleotide sites.
