Similar literature
20 similar articles found.
1.
We have investigated the effects of different among-site rate variation models on the estimation of substitution model parameters, branch lengths, topology, and bootstrap proportions under minimum evolution (ME) and maximum likelihood (ML). Specifically, we examined equal rates, invariable sites, gamma-distributed rates, and site-specific rates (SSR) models, using mitochondrial DNA sequence data from three protein-coding genes and one tRNA gene from species of the New Zealand cicada genus Maoricicada. Estimates of topology were relatively insensitive to the substitution model used; however, estimates of bootstrap support, branch lengths, and R-matrices (underlying relative substitution rate matrix) were strongly influenced by the assumptions of the substitution model. We identified one situation where ME and ML tree building became inaccurate when implemented with an inappropriate among-site rate variation model. Despite the fact that SSR models often have a better fit to the data than do invariable sites and gamma rates models, SSR models have some serious weaknesses. First, SSR rate parameters are not comparable across data sets, unlike the proportion of invariable sites or the alpha shape parameter of the gamma distribution. Second, the extreme among-site rate variation within codon positions is problematic for SSR models, which explicitly assume rate homogeneity within each rate class. Third, the SSR models appear to give severe underestimates of R-matrices and branch lengths relative to invariable sites and gamma rates models in this example. We recommend performing phylogenetic analyses under a range of substitution models to test the effects of model assumptions not only on estimates of topology but also on estimates of branch length and nodal support.

2.
This paper presents a maximum likelihood approach to estimating the variation of substitution rate among nucleotide sites. We assume that the rate varies among sites according to an invariant + gamma distribution, which has two parameters: the gamma parameter alpha and the proportion of invariable sites theta. Theoretical treatments on three, four, and five sequences have been conducted, and computer programs have been developed. It is shown that rho = (1 + theta alpha)/(1 + alpha) is a good measure for the rate heterogeneity among sites. Extensive simulations show that (1) if the proportion of invariable sites is negligible, i.e., theta = 0, the gamma parameter alpha can be satisfactorily estimated, even with three sequences; (2) if the proportion of invariable sites is not negligible, the heterogeneity rho can still be suitably estimated with four or more sequences; and (3) the distances estimated by the proposed method are almost unbiased and are robust against violation of the assumption of the invariant + gamma distribution.
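The heterogeneity measure quoted in this abstract is simple enough to evaluate directly. The following is a minimal sketch, not code from the paper: the function name `rate_heterogeneity` and the argument checks are assumptions made here for illustration, and only the formula rho = (1 + theta alpha)/(1 + alpha) comes from the abstract.

```python
def rate_heterogeneity(alpha, theta):
    """Return rho = (1 + theta * alpha) / (1 + alpha).

    alpha: gamma shape parameter (must be positive).
    theta: proportion of invariable sites (between 0 and 1).
    The function name and the validation are illustrative assumptions.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return (1.0 + theta * alpha) / (1.0 + alpha)


if __name__ == "__main__":
    # With no invariable sites, rho depends on alpha alone.
    print(rate_heterogeneity(alpha=1.0, theta=0.0))
    # Adding invariable sites at the same alpha raises rho.
    print(rate_heterogeneity(alpha=1.0, theta=0.3))
```

Under this formula rho approaches 1 as alpha approaches 0 and approaches theta as alpha grows large, so both parameters contribute to a single summary of heterogeneity.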

3.
Previous work has shown that it is often essential to account for the variation in rates at different sites in phylogenetic models in order to avoid phylogenetic artifacts such as long branch attraction. In most current models, the gamma distribution is used for the rates-across-sites distributions and is implemented as an equal-probability discrete gamma. In this article, we introduce discrete distribution estimates with large numbers of equally spaced rate categories allowing us to investigate the appropriateness of the gamma model. With large numbers of rate categories, these discrete estimates are flexible enough to approximate the shape of almost any distribution. Likelihood ratio statistical tests and a nonparametric bootstrap confidence-bound estimation procedure based on the discrete estimates are presented that can be used to test the fit of a parametric family. We applied the methodology to several different protein data sets, and found that although the gamma model often provides a good parametric model for this type of data, rate estimates from an equal-probability discrete gamma model with a small number of categories will tend to underestimate the largest rates. In cases when the gamma model assumption is in doubt, rate estimates coming from the discrete rate distribution estimate with a large number of rate categories provide a robust alternative to gamma estimates. An alternative implementation of the gamma distribution is proposed that, for equal numbers of rate categories, is computationally more efficient during optimization than the standard gamma implementation and can provide more accurate estimates of site rates.

4.
The covarion hypothesis of molecular evolution proposes that selective pressures on an amino acid or nucleotide site change through time, thus causing changes of evolutionary rate along the edges of a phylogenetic tree. Several kinds of Markov models for the covarion process have been proposed. One model, proposed by Huelsenbeck (2002), has 2 substitution rate classes: the substitution process at a site can switch between a single variable rate, drawn from a discrete gamma distribution, and a zero invariable rate. A second model, suggested by Galtier (2001), assumes rate switches among an arbitrary number of rate classes but switching to and from the invariable rate class is not allowed. The latter model allows for some sites that do not participate in the rate-switching process. Here we propose a general covarion model that combines features of both models, allowing evolutionary rates not only to switch between variable and invariable classes but also to switch among different rates when they are in a variable state. We have implemented all 3 covarion models in a maximum likelihood framework for amino acid sequences and tested them on 23 protein data sets. We found significant likelihood increases for all data sets for the 3 models, compared with a model that does not allow site-specific rate switches along the tree. Furthermore, we found that the general model fit the data better than the simpler covarion models in the majority of the cases, highlighting the complexity in modeling the covarion process. The general covarion model can be used for comparing tree topologies, molecular dating studies, and the investigation of protein adaptation.

5.
The use of parameter-rich substitution models in molecular phylogenetics has been criticized on the basis that these models can cause a reduction both in accuracy and in the ability to discriminate among competing topologies. We have explored the relationship between nucleotide substitution model complexity and nonparametric bootstrap support under maximum likelihood (ML) for six data sets for which the true relationships are known with a high degree of certainty. We also performed equally weighted maximum parsimony analyses in order to assess the effects of ignoring branch length information during tree selection. We observed that maximum parsimony gave the lowest mean estimate of bootstrap support for the correct set of nodes relative to the ML models for every data set except one. For several data sets, we established that the exact distribution used to model among-site rate variation was critical for a successful phylogenetic analysis. Site-specific rate models were shown to perform very poorly relative to gamma and invariable sites models for several of the data sets, most likely because of the gross underestimation of branch lengths. The invariable sites model also performed poorly for several data sets where this model had a poor fit to the data, suggesting that addition of the gamma distribution can be critical. Estimates of bootstrap support for the correct nodes often increased under gamma and invariable sites models relative to equal rates models. Our observations are contrary to the prediction that such models cause reduced confidence in phylogenetic hypotheses. Our results raise several issues regarding the process of model selection, and we briefly discuss model selection uncertainty and the role of sensitivity analyses in molecular phylogenetics.

6.
A Space-Time Process Model for the Evolution of DNA Sequences (cited 20 times: 3 self-citations, 17 by others)
Z. Yang. Genetics, 1995, 139(2): 993-1005
We describe a model for the evolution of DNA sequences by nucleotide substitution, whereby nucleotide sites in the sequence evolve over time, whereas the rates of substitution are variable and correlated over sites. The temporal process used to describe substitutions between nucleotides is a continuous-time Markov process, with the four nucleotides as the states. The spatial process used to describe variation and dependence of substitution rates over sites is based on a serially correlated gamma distribution, i.e., an auto-gamma model assuming Markov-dependence of rates at adjacent sites. To achieve computational efficiency, we use several equal-probability categories to approximate the gamma distribution, and the result is an auto-discrete-gamma model for rates over sites. Correlation of rates at sites then is modeled by the Markov chain transition of rates at adjacent sites from one rate category to another, the states of the chain being the rate categories. Two versions of nonparametric models, which place no restrictions on the distributional forms of rates for sites, also are considered, assuming either independence or Markov dependence. The models are applied to data of a segment of mitochondrial genome from nine primate species. Model parameters are estimated by the maximum likelihood method, and models are compared by the likelihood ratio test. Tremendous variation of rates among sites in the sequence is revealed by the analyses, and when rate differences for different codon positions are appropriately accounted for in the models, substitution rates at adjacent sites are found to be strongly (positively) correlated. Robustness of the results to uncertainty of the phylogenetic tree linking the species is examined.

7.
A new method is developed for calculating sequence substitution probabilities using Markov chain Monte Carlo (MCMC) methods. The basic strategy is to use uniformization to transform the original continuous time Markov process into a Poisson substitution process and a discrete Markov chain of state transitions. An efficient MCMC algorithm for evaluating substitution probabilities by this approach using a continuous gamma distribution to model site-specific rates is outlined. The method is applied to the problem of inferring branch lengths and site-specific rates from nucleotide sequences under a general time-reversible (GTR) model and a computer program BYPASSR is developed. Simulations are used to examine the performance of the new program relative to an existing program BASEML that uses a discrete approximation for the gamma distributed prior on site-specific rates. It is found that BASEML and BYPASSR are in close agreement when inferring branch lengths, regardless of the number of rate categories used, but that BASEML tends to underestimate high site-specific substitution rates, and to overestimate intermediate rates, when fewer than 50 rate categories are used. Rate estimates obtained using BASEML agree more closely with those of BYPASSR as the number of rate categories increases. Analyses of the posterior distributions of site-specific rates from BYPASSR suggest that a large number of taxa are needed to obtain precise estimates of site-specific rates, especially when rates are very high or very low. The method is applied to analyze 45 sequences of the alpha 2B adrenergic receptor gene (A2AB) from a sample of eutherian taxa. In general, the pattern expected for regions under negative selection is observed with third codon positions having the highest inferred rates, followed by first codon positions and with second codon positions having the lowest inferred rates. Several sites show exceptionally high substitution rates at second codon positions that may represent the effects of positive selection.

8.
S. Kumar. Genetics, 1996, 143(1): 537-548
Maximum likelihood methods were used to study the differences in substitution rates among the four nucleotides and among different nucleotide sites in mitochondrial protein-coding genes of vertebrates. In the 1st+2nd codon position data, the frequency of nucleotide G is negatively correlated with evolutionary rates of genes, substitution rates vary substantially among sites, and the transition/transversion rate bias (R) is two to five times larger than that expected at random. Generally, the largest transition biases and greatest differences in substitution rates among sites are found in the highly conserved genes. The 3rd positions in placental mammal genes exhibit strong nucleotide composition biases and the transitional rates exceed transversional rates by one to two orders of magnitude. Tamura-Nei and Hasegawa-Kishino-Yano models with gamma distributed variable rates among sites (gamma parameter, α) adequately describe the nucleotide substitution process in 1st+2nd position data. In these data, ignoring differences in substitution rates among sites leads to the largest biases while estimating substitution rates. Kimura's two-parameter model with variable rates among sites performs satisfactorily in likelihood estimation of R, α, and overall amount of evolution for 1st+2nd position data. It can also be used to estimate pairwise distances with appropriate values of α for a majority of genes.
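To make the last point concrete, here is a minimal sketch of a pairwise distance under Kimura's two-parameter model with gamma-distributed rates. It uses the standard closed-form gamma correction, which is an assumption about the kind of estimator meant and not necessarily the exact procedure used in the paper. The function names and the inputs `p` (proportion of transitional differences) and `q` (proportion of transversional differences) are illustrative.

```python
import math


def k2p_distance(p, q):
    """Kimura two-parameter distance assuming equal rates at all sites."""
    x = 1.0 - 2.0 * p - q
    y = 1.0 - 2.0 * q
    if x <= 0.0 or y <= 0.0:
        raise ValueError("sequences too divergent for this estimator")
    return -0.5 * math.log(x) - 0.25 * math.log(y)


def k2p_gamma_distance(p, q, alpha):
    """Kimura two-parameter distance with gamma-distributed rates.

    p: observed proportion of transitional differences.
    q: observed proportion of transversional differences.
    alpha: gamma shape parameter, supplied by the caller.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    x = 1.0 - 2.0 * p - q
    y = 1.0 - 2.0 * q
    if x <= 0.0 or y <= 0.0:
        raise ValueError("sequences too divergent for this estimator")
    inv = -1.0 / alpha
    return 0.5 * alpha * (x ** inv + 0.5 * y ** inv - 1.5)


if __name__ == "__main__":
    # Hypothetical proportions, chosen only to show the effect of alpha.
    print(k2p_distance(0.10, 0.05))
    print(k2p_gamma_distance(0.10, 0.05, alpha=0.5))
```

For a fixed pair of proportions, a smaller alpha gives a larger corrected distance, and the gamma distance tends to the equal-rates distance as alpha grows large.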

9.
Maximum likelihood estimation of phylogenetic trees from nucleotide sequences is completely consistent when nucleotide substitution is governed by the general time reversible (GTR) model with rates that vary over sites according to the invariable sites plus gamma (I + gamma) distribution.

10.
We propose two approximate methods (one based on parsimony and one on pairwise sequence comparison) for estimating the pattern of nucleotide substitution and a parsimony-based method for estimating the gamma parameter for variable substitution rates among sites. The matrix of substitution rates that represents the substitution pattern can be recovered through its relationship with the observable matrix of site pattern frequencies in pairwise sequence comparisons. In the parsimony approach, the ancestral sequences reconstructed by the parsimony algorithm were used, and the two sequences compared are those at the ends of a branch in the phylogenetic tree. The method for estimating the gamma parameter was based on a reinterpretation of the numbers of changes at sites inferred by parsimony. Three data sets were analyzed to examine the utility of the approximate methods compared with the more reliable likelihood methods. The new methods for estimating the substitution pattern were found to produce estimates quite similar to those obtained from the likelihood analyses. The new method for estimating the gamma parameter was effective in reducing the bias in conventional parsimony estimates, although it also overestimated the parameter. The approximate methods are computationally very fast and appear useful for analyzing large data sets, for which use of the likelihood method requires excessive computation.

11.
Models of nucleotide substitution were constructed for combined analyses of heterogeneous sequence data (such as those of multiple genes) from the same set of species. The models account for different aspects of the heterogeneity in the evolutionary process of different genes, such as differences in nucleotide frequencies, in substitution rate bias (for example, the transition/transversion rate bias), and in the extent of rate variation across sites. Model parameters were estimated by maximum likelihood and the likelihood ratio test was used to test hypotheses concerning sequence evolution, such as rate constancy among lineages (the assumption of a molecular clock) and proportionality of branch lengths for different genes. The example data from a segment of the mitochondrial genome of six hominoid species (human, common and pygmy chimpanzees, gorilla, orangutan, and siamang) were analyzed. Nucleotides at the three codon positions in the protein-coding regions and from the tRNA-coding regions were considered heterogeneous data sets. Statistical tests showed that the amount of evolution in the sequence data reflected in the estimated branch lengths can be explained by the codon-position effect and lineage effect of substitution rates. The assumption of a molecular clock could not be rejected when the data were analyzed separately or when the rate variation among sites was ignored. However, significant differences in substitution rate among lineages were found when the data sets were combined and when the rate variation among sites was accounted for in the models. Under the assumption that the orangutan and African apes diverged 13 million years ago, the combined analysis of the sequence data estimated the times for the human-chimpanzee separation and for the separation of the gorilla as 4.3 and 6.8 million years ago, respectively.

12.
Simplifying assumptions made in various tree reconstruction methods, notably rate constancy among nucleotide sites, homogeneity, and stationarity of the substitutional processes, are clearly violated when nucleotide sequences are used to infer distant relationships. Use of tree reconstruction methods based on such oversimplified assumptions can lead to misleading results, as pointed out by previous authors. In this paper, we made use of a (discretized) gamma distribution to account for variable rates of substitution among sites and built models that allowed for unequal base frequencies in different sequences. The models were nonhomogeneous Markov-process models, assuming different patterns of substitution in different parts of the tree. Data of the small-subunit rRNAs from four species were analyzed, where base frequencies were quite different among sequences and rates of substitution were highly variable at sites. Parameters in the models were estimated by maximum likelihood, and models were compared by the likelihood-ratio test. The nonhomogeneous models provided significantly better fit to the data than homogeneous models despite their involvement of many parameters. They also appeared to produce reasonable estimation of the phylogenetic tree; in particular, they seemed able to identify the root of the tree.

13.
Estimating Substitution Rates in Ribosomal RNA Genes (cited 7 times: 0 self-citations, 7 by others)
A. Rzhetsky. Genetics, 1995, 141(2): 771-783
A model is introduced describing nucleotide substitution in ribosomal RNA (rRNA) genes. In this model, substitution in the stem and loop regions of rRNA is modeled with 16- and four-state continuous time Markov chains, respectively. The mean substitution rates at nucleotide sites are assumed to follow gamma distributions that are different for the two types of regions. The simplest formulation of the model allows for explicit expressions for transition probabilities of the Markov processes to be found. These expressions were used to analyze several 16S-like rRNA genes from higher eukaryotes with the maximum likelihood method. Although the observed proportion of invariable sites was only slightly higher in the stem regions, the estimated average substitution rates in the stem regions were almost two times as high as in the loop regions. Therefore, the degree of site heterogeneity of substitution rates in the stem regions seems to be higher than in the loop regions of animal 16S-like rRNAs due to the presence of a few rapidly evolving sites. The model appears to be helpful in understanding the regularities of nucleotide substitution in rRNAs and probably minimizing errors in recovering phylogeny for distantly related taxa from these genes.

14.
Felsenstein's maximum-likelihood approach for inferring phylogeny from DNA sequences assumes that the rate of nucleotide substitution is constant over different nucleotide sites. This assumption is sometimes unrealistic, as has been revealed by analysis of real sequence data. In the present paper Felsenstein's method is extended to the case where substitution rates over sites are described by the gamma distribution. A numerical example is presented to show that the method fits the data better than do previous models.
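The extension described in this abstract can be written compactly. As a sketch in notation assumed here rather than taken from the paper (site index $h$, relative rate $r$, shape parameter $\alpha$, and a gamma density scaled to have mean 1), the likelihood at a site averages the fixed-rate likelihood over the rate distribution:

```latex
L_h(\alpha) = \int_0^{\infty} g(r;\alpha)\, L_h(r)\, dr,
\qquad
g(r;\alpha) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\alpha r}.
```

Here $L_h(r)$ denotes the likelihood of site $h$ computed as in the single-rate case with every branch length multiplied by $r$. The single-rate model is recovered in the limit of large $\alpha$, where $g$ concentrates at $r = 1$.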

15.
Hyracoids have been allied with either perissodactyls or tethytheres (i.e., Proboscidea + Sirenia) based on morphological data. The latter hypothesis, termed Paenungulata, is corroborated by numerous molecular studies. However, molecular studies have failed to support Tethytheria, a group that is supported by morphological data. We examined relationships among living paenungulate orders using a multigene data set that included sequences from four mitochondrial genes (12S rRNA, tRNA valine, 16S rRNA, cytochrome b) and four nuclear genes (aquaporin, A2AB, IRBP, vWF). Nineteen maximum-likelihood models were employed, including models with process partitions for base composition and substitution parameterizations. With the inclusion of partitions with a heterogeneous base composition, 18 of 19 models favored Hyracoidea + Sirenia. All 19 models favored Hyracoidea + Sirenia after excluding heterogeneous base composition partitions. Most of the support for Hyracoidea + Sirenia derived from the mitochondrial genes (bootstrap support ranged from 51 to 99%); Tethytheria, in turn, received 0 to 19% support in different analyses. Bootstrap support deriving from the nuclear genes was more evenly split among the competing hypotheses (3 to 45% for Tethytheria; 17.5 to 62% for Hyracoidea + Sirenia). Lineage-specific rate variation among both mitochondrial and nuclear genes may contribute to the different results that were obtained with mitochondrial versus nuclear data. Whether Tethytheria or a competing hypothesis is correct, short internodes on the molecular phylogenies suggest that paenungulate orders diverged from each other over a 5- to 8-million-year time window extending from the late Paleocene into the early Eocene. We also used likelihood-ratio tests to compare different models of sequence evolution. A gamma distribution of rates results in a greater improvement in likelihood scores than does an allowance for invariant sites. Twenty-one rate partitions corresponding to stems, loops, and codon positions of different genes result in higher likelihood scores than a gamma distribution of rates and/or an allowance for invariant sites. Process partitions of the data that incorporate base composition and substitution parameterizations result in significant improvements in likelihood scores in comparison to models that allow only for relative rate differences among partitions.

16.
Genetic sequence data typically exhibit variability in substitution rates across sites. In practice, there is often too little variation to fit a different rate for each site in the alignment, but the distribution of rates across sites may not be well modeled using simple parametric families. Mixtures of different distributions can capture more complex patterns of rate variation, but are often parameter-rich and difficult to fit. We present a simple hierarchical model in which a baseline rate distribution, such as a gamma distribution, is discretized into several categories, the quantiles of which are estimated using a discretized beta distribution. Although this approach involves adding only two extra parameters to a standard distribution, a wide range of rate distributions can be captured. Using simulated data, we demonstrate that a "beta-" model can reproduce the moments of the rate distribution more accurately than the distribution used to simulate the data, even when the baseline rate distribution is misspecified. Using hepatitis C virus and mammalian mitochondrial sequences, we show that a beta- model can fit as well or better than a model with multiple discrete rate categories, and compares favorably with a model which fits a separate rate category to each site. We also demonstrate this discretization scheme in the context of codon models specifically aimed at identifying individual sites undergoing adaptive or purifying evolution.

17.
We consider three approaches for estimating the rates of nonsynonymous and synonymous changes at each site in a sequence alignment in order to identify sites under positive or negative selection: (1) a suite of fast likelihood-based "counting methods" that employ either a single most likely ancestral reconstruction, weighting across all possible ancestral reconstructions, or sampling from ancestral reconstructions; (2) a random effects likelihood (REL) approach, which models variation in nonsynonymous and synonymous rates across sites according to a predefined distribution, with the selection pressure at an individual site inferred using an empirical Bayes approach; and (3) a fixed effects likelihood (FEL) method that directly estimates nonsynonymous and synonymous substitution rates at each site. All three methods incorporate flexible models of nucleotide substitution bias and variation in both nonsynonymous and synonymous substitution rates across sites, facilitating the comparison between the methods. We demonstrate that the results obtained using these approaches show broad agreement in levels of Type I and Type II error and in estimates of substitution rates. Counting methods are well suited for large alignments, for which there is high power to detect positive and negative selection, but appear to underestimate the substitution rate. A REL approach, which is more computationally intensive than counting methods, has higher power than counting methods to detect selection in data sets of intermediate size but may suffer from higher rates of false positives for small data sets. A FEL approach appears to capture the pattern of rate variation better than counting methods or random effects models, does not suffer from as many false positives as random effects models for data sets comprising few sequences, and can be efficiently parallelized. Our results suggest that previously reported differences between results obtained by counting methods and random effects models arise due to a combination of the conservative nature of counting-based methods, the failure of current random effects models to allow for variation in synonymous substitution rates, and the naive application of random effects models to extremely sparse data sets. We demonstrate our methods on sequence data from the human immunodeficiency virus type 1 env and pol genes and simulated alignments.

18.
Several maximum likelihood and distance matrix methods for estimating phylogenetic trees from homologous DNA sequences were compared when substitution rates at sites were assumed to follow a gamma distribution. Computer simulations were performed to estimate the probabilities that various tree estimation methods recover the true tree topology. The case of four species was considered, and a few combinations of parameters were examined. Attention was applied to discriminating among different sources of error in tree reconstruction, i.e., the inconsistency of the tree estimation method, the sampling error in the estimated tree due to limited sequence length, and the sampling error in the estimated probability due to the number of simulations being limited. Compared to the least squares method based on pairwise distance estimates, the joint likelihood analysis is found to be more robust when rate variation over sites is present but ignored and an assumption is thus violated. With limited data, the likelihood method has a much higher probability of recovering the true tree and is therefore more efficient than the least squares method. The concept of statistical consistency of a tree estimation method and its implications were explored, and it is suggested that, while the efficiency (or sampling error) of a tree estimation method is a very important property, statistical consistency of the method over a wide range of, if not all, parameter values is a prerequisite.

19.
As methods of molecular phylogeny have become more explicit and more biologically realistic following the pioneering work of Thomas Jukes, they have had to relax their initial assumption that rates of evolution were equal at all sites. Distance matrix and likelihood methods of inferring phylogenies make this assumption; parsimony, when valid, is less limited by it. Nucleotide sequences, including RNA sequences, can show substantial rate variation; protein sequences show rates that vary much more widely. Assuming a prior distribution of rates such as a gamma distribution or lognormal distribution has deservedly been popular, but for likelihood methods it leads to computational difficulties. These can be resolved using hidden Markov model (HMM) methods which approximate the distribution by one with a modest number of discrete rates. Generalized Laguerre quadrature can be used to improve the selection of rates and their probabilities so as to more nearly approach the desired gamma distribution. A model based on population genetics is presented predicting how the rates of evolution might vary from locus to locus. Challenges for the future include allowing rates at a given site to vary along the tree, as in the "covarion" model, and allowing them to have correlations that reflect three-dimensional structure, rather than position in the coding sequence. Markov chain Monte Carlo likelihood methods may be the only practical way to carry out computations for these models. Received: 8 February 2001 / Accepted: 20 May 2001

20.
Two approximate methods are proposed for maximum likelihood phylogenetic estimation, which allow variable rates of substitution across nucleotide sites. Three data sets with quite different characteristics were analyzed to examine empirically the performance of these methods. The first, called the discrete gamma model, uses several categories of rates to approximate the gamma distribution, with equal probability for each category. The mean of each category is used to represent all the rates falling in the category. The performance of this method is found to be quite good, and four such categories appear to be sufficient to produce both an optimum, or near-optimum fit by the model to the data, and also an acceptable approximation to the continuous distribution. The second method, called fixed-rates model, classifies sites into several classes according to their rates predicted assuming the star tree. Sites in different classes are then assumed to be evolving at these fixed rates when other tree topologies are evaluated. Analyses of the data sets suggest that this method can produce reasonable results, but it seems to share some properties of a least-squares pairwise comparison; for example, interior branch lengths in nonbest trees are often found to be zero. The computational requirements of the two methods are comparable to that of Felsenstein's (1981, J Mol Evol 17:368–376) model, which assumes a single rate for all the sites.
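The discrete gamma construction described in this abstract (equal-probability categories, each represented by its mean rate) can be sketched with the Python standard library alone. This is an illustrative reimplementation under stated assumptions, not the authors' program: the gamma distribution is taken to have shape alpha and mean 1, the incomplete gamma function is evaluated by its power series, quantiles are found by bisection, and all function names are hypothetical. The series is adequate for moderate alpha and would need a more careful routine for very large shape values.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    # Sum x^n / (a (a+1) ... (a+n)) until the terms become negligible.
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def gamma_quantile(p, shape):
    """Quantile of a gamma distribution with the given shape and scale 1."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, hi) < p:
        hi *= 2.0  # expand the bracket until it contains the quantile
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k):
    """Mean rate of each of k equal-probability categories (overall mean 1)."""
    # Category boundaries, expressed on the scale of a Gamma(alpha, 1) variable.
    cuts = [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # The mean of r over an interval follows from P(alpha + 1, .) at its ends.
    edges = [0.0] + [reg_lower_gamma(alpha + 1.0, c) for c in cuts] + [1.0]
    return [k * (edges[i + 1] - edges[i]) for i in range(k)]


if __name__ == "__main__":
    # Four categories, as suggested in the abstract; alpha = 0.5 is arbitrary.
    for rate in discrete_gamma_rates(alpha=0.5, k=4):
        print(round(rate, 4))
```

Because every category has probability 1/k, the category means average to 1, which keeps branch lengths on the same scale as in the single-rate model.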
