Similar literature
1.

Background

Model selection is a vital part of most phylogenetic analyses, and accounting for the heterogeneity in evolutionary patterns across sites is particularly important. Mixture models and partitioning are commonly used to account for this variation, and partitioning is the most popular approach. Most current partitioning methods require some a priori partitioning scheme to be defined, typically guided by known structural features of the sequences, such as gene boundaries or codon positions. Recent evidence suggests that these a priori boundaries often fail to adequately account for variation in rates and patterns of evolution among sites. Furthermore, new phylogenomic datasets such as those assembled from ultra-conserved elements lack obvious structural features on which to define a priori partitioning schemes. The upshot is that, for many phylogenetic datasets, partitioned models of molecular evolution may be inadequate, thus limiting the accuracy of downstream phylogenetic analyses.

Results

We present a new algorithm that automatically selects a partitioning scheme via the iterative division of the alignment into subsets of similar sites based on their rates of evolution. We compare this method to existing approaches using a wide range of empirical datasets, and show that it consistently leads to large increases in the fit of partitioned models of molecular evolution when measured using AICc and BIC scores. In doing so, we demonstrate that some related approaches to solving this problem may have been associated with a small but important bias.

Conclusions

Our method provides an alternative to traditional approaches to partitioning, such as dividing alignments by gene and codon position. Because our method is data-driven, it can be used to estimate partitioned models for all types of alignments, including those that are not amenable to traditional approaches to partitioning.
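The splitting step described in the Results above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes per-site rate estimates are already available as a list of floats, and it performs a single two-way split using one-dimensional 2-means clustering. The function name `two_means_split` and the iteration cap are hypothetical choices made for illustration.

```python
def two_means_split(rates, iterations=50):
    """Split site indices into a slow and a fast subset by 1-D 2-means on rates.

    Illustrative only: `rates` is an assumed list of per-site rate estimates.
    """
    lo, hi = min(rates), max(rates)
    if lo == hi:
        # All sites evolve at the same estimated rate: nothing to split.
        return list(range(len(rates))), []
    slow, fast = [], []
    for _ in range(iterations):
        # Assign each site to the nearer of the two current centroids.
        slow = [i for i, r in enumerate(rates) if abs(r - lo) <= abs(r - hi)]
        fast = [i for i, r in enumerate(rates) if abs(r - lo) > abs(r - hi)]
        # Recompute centroids as the mean rate of each subset.
        new_lo = sum(rates[i] for i in slow) / len(slow)
        new_hi = sum(rates[i] for i in fast) / len(fast)
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments have stabilised
        lo, hi = new_lo, new_hi
    return slow, fast
```

An iterative scheme of the kind the abstract describes would presumably apply such a split repeatedly to each resulting subset and retain a split only when a criterion such as AICc or BIC improves; that acceptance step is omitted here.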

2.
Li C  Lu G  Ortí G 《Systematic biology》2008,57(4):519-539
Data partitioning, the combined phylogenetic analysis of homogeneous blocks of data, is a common strategy used to accommodate heterogeneities in complex multilocus data sets. Variation in evolutionary rates and substitution patterns among sites is typically addressed by partitioning data by gene, codon position, or both. Excessive partitioning of the data, however, could lead to overparameterization; therefore, it seems critical to define the minimum numbers of partitions necessary to improve the overall fit of the model. We propose a new method, based on cluster analysis, to find an optimal partitioning strategy for multilocus protein-coding data sets. A heuristic exploration of alternative partitioning schemes, based on Bayesian and maximum likelihood (ML) criteria, is shown here to produce an optimal number of partitions. We tested this method using sequence data of 10 nuclear genes collected from 52 ray-finned fish (Actinopterygii) and four tetrapods. The concatenated sequences included 7995 nucleotide sites maximally split into 30 partitions defined a priori based on gene and codon position. Our results show that a model based on only 10 partitions defined by cluster analysis performed better than partitioning by both gene and codon position. Alternative data partitioning schemes also are shown to affect the topologies resulting from phylogenetic analysis, especially when Bayesian methods are used, suggesting that overpartitioning may be of major concern. The phylogenetic relationships among the major clades of ray-finned fish were assessed using the best data-partitioning schemes under ML and Bayesian methods. Some significant results include the monophyly of "Holostei" (Amia and Lepisosteus), the sister-group relationships between (1) esociforms and salmoniforms and (2) osmeriforms and stomiiforms, the polyphyly of Perciformes, and a close relationship of cichlids and atherinomorphs.

3.
Models of codon evolution are useful for investigating the strength and direction of natural selection via a parameter for the nonsynonymous/synonymous rate ratio (omega = d(N)/d(S)). Different codon models are available to account for diversity of the evolutionary patterns among sites. Codon models that specify data partitions as fixed effects allow the most evolutionary diversity among sites but require that site partitions are a priori identifiable. Models that use a parametric distribution to express the variability in the omega ratio across sites do not require a priori partitioning of sites, but they permit less among-site diversity in the evolutionary process. Simulation studies presented in this paper indicate that differences among sites in estimates of omega under an overly simplistic analytical model can reflect more than just natural selection pressure. We also find that the classic likelihood ratio tests for positive selection have a high false-positive rate in some situations. In this paper, we developed a new method for assigning codon sites into groups where each group has a different model, and the likelihood over all sites is maximized. The method, called likelihood-based clustering (LiBaC), can be viewed as a generalization of the family of model-based clustering approaches to models of codon evolution. We report the performance of several LiBaC-based methods, and selected alternative methods, over a wide variety of scenarios. We find that LiBaC, under an appropriate model, can provide reliable parameter estimates when the process of evolution is very heterogeneous among groups of sites. Certain types of proteins, such as transmembrane proteins, are expected to exhibit such heterogeneity. A survey of genes encoding transmembrane proteins suggests that overly simplistic models could be leading to false signal for positive selection among such genes. 
In these cases, LiBaC-based methods offer an important addition to a "toolbox" of methods, thereby helping to uncover robust evidence for the action of positive selection.

4.
The single rate codon model of non-synonymous substitution is ubiquitous in phylogenetic modeling. Indeed, the use of a non-synonymous to synonymous substitution rate ratio parameter has facilitated the interpretation of selection pressure on genomes. Although the single rate model has achieved wide acceptance, we argue that the assumption of a single rate of non-synonymous substitution is biologically unreasonable, given observed differences in substitution rates evident from empirical amino acid models. Some have attempted to incorporate amino acid substitution biases into models of codon evolution and have shown improved model performance versus the single rate model. Here, we show that the single rate model of non-synonymous substitution is easily outperformed by a model with multiple non-synonymous rate classes, yet in which amino acid substitution pairs are assigned randomly to these classes. We argue that, since the single rate model is so easy to improve upon, new codon models should not be validated entirely on the basis of improved model fit over this model. Rather, we should strive to both improve on the single rate model and to approximate the general time-reversible model of codon substitution, with as few parameters as possible, so as to reduce model over-fitting. We hint at how this can be achieved with a Genetic Algorithm approach in which rate classes are assigned on the basis of sequence information content.

5.
Debates over whether molecular sequence data should be partitioned for phylogenetic analysis often confound two types of heterogeneity among partitions. We distinguish historical heterogeneity (i.e., different partitions have different evolutionary relationships) from dynamic heterogeneity (i.e., different partitions show different patterns of sequence evolution) and explore the impact of the latter on phylogenetic accuracy and precision with a two-gene, mitochondrial data set for cranes. The well-established phylogeny of cranes allows us to contrast tree-based estimates of relevant parameter values with estimates based on pairwise comparisons and to ascertain the effects of incorporating different amounts of process information into phylogenetic estimates. We show that codon positions in the cytochrome b and NADH dehydrogenase subunit 6 genes are dynamically heterogenous under both Poisson and invariable-sites + gamma-rates versions of the F84 model and that heterogeneity includes variation in base composition and transition bias as well as substitution rate. Estimates of transition-bias and relative-rate parameters from pairwise sequence comparisons were comparable to those obtained as tree-based maximum likelihood estimates. Neither rate-category nor mixed-model partitioning strategies resulted in a loss of phylogenetic precision relative to unpartitioned analyses. We suggest that weighted-average distances provide a computationally feasible alternative to direct maximum likelihood estimates of phylogeny for mixed-model analyses of large, dynamically heterogenous data sets.

6.
The strength and direction of selection on the identity of an amino acid residue in a protein is typically measured by the ratio of the rate of non-synonymous substitutions to the rate of synonymous substitutions. In attempting to predict positively selected sites from amino acid alignments, we made the unexpected observation that the site likelihood of an alignment column for a given tree tends to be negatively correlated with the posterior probability that the site is in the positive selection class under widely-used codon models. This is likely because positively selected sites tend to be more variable and display more “radical” amino acid changes; both of these features are expected to result in low site log-likelihoods. We explored the efficacy of using the site log-likelihood (SLL) score as a predictor for positive selection. Through simulation we show that an SLL-based test has a low false positive rate and power comparable to that of the codon models. In one case where the simulated data violated the assumption that synonymous substitution rates were constant across the sites, the codon models were not able to detect positive selection in the data while the SLL test did. We applied the new method to ten empirical datasets and found that it made predictions similar to those of the codon models in eight of them. For the tax gene dataset the SLL test seemed to produce more reasonable results. The SLL methods are a valuable complement to codon models, especially for some cases where the assumptions of codon models are likely violated.

7.
We introduce a new model for relaxing the assumption of a strict molecular clock for use as a prior in Bayesian methods for divergence time estimation. Lineage-specific rates of substitution are modeled using a Dirichlet process prior (DPP), a type of stochastic process that assumes lineages of a phylogenetic tree are distributed into distinct rate classes. Under the Dirichlet process, the number of rate classes, assignment of branches to rate classes, and the rate value associated with each class are treated as random variables. The performance of this model was evaluated by conducting analyses on data sets simulated under a range of different models. We compared the Dirichlet process model with two alternative models for rate variation: the strict molecular clock and the independent rates model. Our results show that divergence time estimation under the DPP provides robust estimates of node ages and branch rates without significantly reducing power. Further analyses were conducted on a biological data set, and we provide examples of ways to summarize Markov chain Monte Carlo samples under this model.
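How a Dirichlet process distributes branches into a random number of rate classes can be sketched with its standard Chinese restaurant process representation. The Python below is an illustration under stated assumptions, not the authors' implementation: the function name `crp_assign`, the concentration parameter name `alpha`, and the use of Python's `random.Random` are all choices made here, and the sketch draws only class assignments, not the rate value of each class.

```python
import random


def crp_assign(n_branches, alpha, rng):
    """Draw rate-class labels for branches from a Chinese restaurant process.

    Branch i (0-based) joins an existing class c with probability
    counts[c] / (i + alpha) and opens a new class with probability
    alpha / (i + alpha). `alpha` must be positive.
    """
    assignments = []
    counts = []  # counts[c] = number of branches currently in class c
    for i in range(n_branches):
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        chosen = len(counts)  # default: open a new class
        for c, size in enumerate(counts):
            cumulative += size
            if u < cumulative:
                chosen = c
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        assignments.append(chosen)
    return assignments


if __name__ == "__main__":
    # Hypothetical example: 10 branches, concentration parameter 1.0.
    print(crp_assign(10, 1.0, random.Random(1)))
```

Larger values of `alpha` tend to produce more rate classes, which is why the number of classes can be treated as a random variable rather than fixed in advance.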

8.
In phylogenetic analyses of molecular sequence data, partitioning involves estimating independent models of molecular evolution for different sets of sites in a sequence alignment. Choosing an appropriate partitioning scheme is an important step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for large multilocus DNA data sets. We demonstrate that these methods significantly outperform previous approaches, including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently proposed hierarchical clustering method. We have implemented these methods in an open-source program, PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of information-theoretic metrics (e.g., the Bayesian information criterion, Akaike information criterion [AIC], and corrected AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OS X 10.4 and above. The program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.
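The information-theoretic metrics named above have standard textbook definitions, sketched below in Python. This is not PartitionFinder's code or API: the function names are hypothetical, and taking the sample size `n` to be the number of alignment sites is a common convention assumed here rather than something the abstract states.

```python
import math


def aic(ln_l, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * k - 2.0 * ln_l


def aicc(ln_l, k, n):
    """Corrected AIC: AIC + 2k(k + 1) / (n - k - 1); requires n > k + 1."""
    return aic(ln_l, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(ln_l, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2.0 * ln_l
```

For each criterion, lower scores indicate a better trade-off between fit (`ln_l`, the maximized log-likelihood) and complexity (`k`, the number of free parameters), so candidate partitioning schemes can be ranked by score.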

9.
The study of morphological evolution after the inferred origin of active flight homologous with that in Aves has historically been characterized by an emphasis on anatomically disjunct, mosaic patterns of change. Relatively few prior studies have used discrete morphological character data in a phylogenetic context to quantitatively investigate morphological evolution or mosaic evolution in particular. One such previously employed method, which used summed unambiguously optimized synapomorphies, has been the basis for proposing disassociated and sequential "modernizing" or "fine-tuning" of pectoral and then pelvic locomotor systems after the origin of flight ("pectoral early-pelvic late" hypothesis). We use one of the most inclusive phylogenetic data sets of basal birds to investigate properties of this method and to consider the application of a Bayesian phylogenetic approach. Bayes factor and statistical comparisons of branch length estimates were used to evaluate support for a mosaic pattern of character change and the specific pectoral early-pelvic late hypothesis. Partitions were defined a priori based on anatomical subregion (e.g., pelvic, pectoral) and were based on those hypothesized using the summed synapomorphy approach. We compare 80 models all implementing the M(k) model for morphological data but varying in the number of anatomical subregion partitions, the models for among-partition rate variation and among-character rate variation, as well as the branch length prior. Statistical analysis reveals that partitioning data by anatomical subregion, independently estimating branch lengths for partitioned data, and use of shared or per partition gamma-shaped among-character rate distribution significantly increases estimated model likelihoods. 
Simulation studies reveal that partitioned models where characters are randomly assigned perform significantly worse than both the observed model and the single-partition equal-rate model, suggesting that only partitioning by anatomical subregion increases model performance. The preference for models with partitions defined a priori by anatomical subregion is consistent with a disjunctive pattern of character change for the data set investigated and may have implications for parameterization of Bayesian analyses of morphological data more generally. Statistical tests of differences in estimated branch lengths from the pectoral and pelvic partitions do not support the specific pectoral early-pelvic late hypothesis proposed from the summed synapomorphy approach; however, results suggest limited support for some pectoral branch lengths being significantly longer only early at/after the origin of flight.

10.
Phylogenetic studies incorporating multiple loci, and multiple genomes, are becoming increasingly common. Coincident with this trend in genetic sampling, model-based likelihood techniques including Bayesian phylogenetic methods continue to gain popularity. Few studies, however, have examined model fit and sensitivity to such potentially heterogeneous data partitions within combined data analyses using empirical data. Here we investigate the relative model fit and sensitivity of Bayesian phylogenetic methods when alternative site-specific partitions of among-site rate variation (with and without autocorrelated rates) are considered. Our primary goal in choosing a best-fit model was to employ the simplest model that was a good fit to the data while optimizing topology and/or Bayesian posterior probabilities. Thus, we were not interested in complex models that did not practically affect our interpretation of the topology under study. We applied these alternative models to a four-gene data set including one protein-coding nuclear gene (c-mos), one protein-coding mitochondrial gene (ND4), and two mitochondrial rRNA genes (12S and 16S) for the diverse yet poorly known lizard family Gymnophthalmidae. Our results suggest that the best-fit model partitioned among-site rate variation separately among the c-mos, ND4, and 12S + 16S gene regions. We found this model yielded identical topologies to those from analyses based on the GTR+I+G model, but significantly changed posterior probability estimates of clade support. This partitioned model also produced more precise (less variable) estimates of posterior probabilities across generations of long Bayesian runs, compared to runs employing a GTR+I+G model estimated for the combined data. We use this three-way gamma partitioning in Bayesian analyses to reconstruct a robust phylogenetic hypothesis for the relationships of genera within the lizard family Gymnophthalmidae. 
We then reevaluate the higher-level taxonomic arrangement of the Gymnophthalmidae. Based on our findings, we discuss the utility of nontraditional parameters for modeling among-site rate variation and the implications and future directions for complex model building and testing.

11.
This article generalizes previous models for codon substitution and rate variation in molecular phylogeny. Particular attention is paid to (1) reversibility, (2) acceptance and rejection of proposed codon changes, (3) varying rates of evolution among codon sites, and (4) the interaction of these sites in determining evolutionary rates. To accommodate spatial variation in rates, Markov random fields rather than Markov chains are introduced. Because these innovations complicate maximum likelihood estimation in phylogeny reconstruction, it is necessary to formulate new algorithms for the evaluation of the likelihood and its derivatives with respect to the underlying kinetic, acceptance, and spatial parameters. To derive the most from maximum likelihood analysis of sequence data, it is useful to compute posterior probabilities assigning residues to internal nodes and evolutionary rate classes to codon sites. It is also helpful to search through tree space in a way that respects accepted phylogenetic relationships. Our phylogeny program LINNAEUS implements algorithms realizing these goals. Readers may consult our companion article in this issue for several examples.

12.

Background

Models of codon evolution have proven useful for investigating the strength and direction of natural selection. In some cases, a priori biological knowledge has been used successfully to model heterogeneous evolutionary dynamics among codon sites. These are called fixed-effect models, and they require that all codon sites are assigned to one of several partitions which are permitted to have independent parameters for selection pressure, evolutionary rate, transition to transversion ratio or codon frequencies. For single gene analysis, partitions might be defined according to protein tertiary structure, and for multiple gene analysis partitions might be defined according to a gene's functional category. Given a set of related fixed-effect models, the task of selecting the model that best fits the data is not trivial.

Results

In this study, we implement a set of fixed-effect codon models which allow for different levels of heterogeneity among partitions in the substitution process. We describe strategies for selecting among these models by a backward elimination procedure, Akaike information criterion (AIC) or a corrected Akaike information criterion (AICc). We evaluate the performance of these model selection methods via a simulation study, and make several recommendations for real data analysis. Our simulation study indicates that the backward elimination procedure can provide a reliable method for model selection in this setting. We also demonstrate the utility of these models by application to a single-gene dataset partitioned according to tertiary structure (abalone sperm lysin), and a multi-gene dataset partitioned according to the functional category of the gene (flagellar-related proteins of Listeria).

Conclusion

Fixed-effect models have advantages and disadvantages. Fixed-effect models are desirable when data partitions are known to exhibit significant heterogeneity or when a statistical test of such heterogeneity is desired. They have the disadvantage of requiring a priori knowledge for partitioning sites. We recommend: (i) selection of models by using backward elimination rather than AIC or AICc, (ii) use a stringent cut-off, e.g., p = 0.0001, and (iii) conduct sensitivity analysis of results. With thoughtful application, fixed-effect codon models should provide a useful tool for large scale multi-gene analyses.
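A single step of a backward elimination procedure with a stringent cut-off can be sketched as a likelihood ratio test between a fuller model and a model with one parameter removed. The Python below is a simplified illustration, not the authors' procedure: it assumes the two models are nested and differ by exactly one free parameter (so the test statistic is compared with a chi-square distribution with one degree of freedom), and the function names and log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_df1(ln_l_full, ln_l_reduced):
    """P-value of a likelihood ratio test with one degree of freedom.

    For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (ln_l_full - ln_l_reduced))
    return math.erfc(math.sqrt(stat / 2.0))


def keep_parameter(ln_l_full, ln_l_reduced, cutoff=0.0001):
    """Keep the extra parameter only if dropping it is rejected at `cutoff`."""
    return lrt_pvalue_df1(ln_l_full, ln_l_reduced) < cutoff


if __name__ == "__main__":
    # Hypothetical log-likelihoods for a full and a reduced model.
    print(keep_parameter(-1000.0, -1005.0))  # statistic 10: parameter dropped
    print(keep_parameter(-1000.0, -1010.0))  # statistic 20: parameter kept
```

In a full backward elimination, this test would be repeated, removing the least-supported heterogeneity parameter at each round until every remaining parameter is retained at the chosen cut-off.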

13.
The nonsynonymous to synonymous substitution rate ratio (omega = d(N)/d(S)) provides a sensitive measure of selective pressure at the protein level, with omega values <1, =1, and >1 indicating purifying selection, neutral evolution, and diversifying selection, respectively. Maximum likelihood models of codon substitution developed recently account for variable selective pressures among amino acid sites by employing a statistical distribution for the omega ratio among sites. Those models, called random-sites models, are suitable when we do not know a priori which sites are under what kind of selective pressure. Sometimes prior information (such as the tertiary structure of the protein) might be available to partition sites in the protein into different classes, which are expected to be under different selective pressures. It is then sensible to use such information in the model. In this paper, we implement maximum likelihood models for prepartitioned data sets, which account for the heterogeneity among site partitions by using different omega parameters for the partitions. The models, referred to as fixed-sites models, are also useful for combined analysis of multiple genes from the same set of species. We apply the models to data sets of the major histocompatibility complex (MHC) class I alleles from human populations and of the abalone sperm lysin genes. Structural information is used to partition sites in MHC into two classes: those in the antigen recognition site (ARS) and those outside. Positive selection is detected in the ARS by the fixed-sites models. Similarly, sites in lysin are classified into the buried and solvent-exposed classes according to the tertiary structure, and positive selection was detected at the solvent-exposed sites. The random-sites models identified a number of sites under positive selection in each data set, confirming and elaborating the results of the fixed-sites models. 
The analysis demonstrates the utility of the fixed-sites models, as well as the power of previous random-sites models, which do not use the prior information to partition sites.
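The interpretation of omega given at the start of this abstract can be written as a small lookup, shown here for one omega value per site partition as in a fixed-sites setting. This is only a sketch: the function name, the tolerance parameter `tol`, and the example partition estimates are hypothetical, and a point estimate above 1 is not by itself a statistical test for positive selection.

```python
def classify_omega(omega, tol=0.0):
    """Map an omega = dN/dS estimate to the selection regime it indicates."""
    if omega < 0:
        raise ValueError("omega must be non-negative")
    if abs(omega - 1.0) <= tol:
        return "neutral evolution"
    if omega < 1.0:
        return "purifying selection"
    return "diversifying selection"


if __name__ == "__main__":
    # Hypothetical per-partition estimates, e.g. buried vs. solvent-exposed sites.
    estimates = {"buried": 0.3, "exposed": 2.1}
    for partition, omega in estimates.items():
        print(partition, classify_omega(omega))
```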

14.
As larger, more complex data sets are being used to infer phylogenies, accuracy of these phylogenies increasingly requires models of evolution that accommodate heterogeneity in the processes of molecular evolution. We investigated the effect of improper data partitioning on phylogenetic accuracy, as well as the type I error rate and sensitivity of Bayes factors, a commonly used method for choosing among different partitioning strategies in Bayesian analyses. We also used Bayes factors to test empirical data for the need to divide data in a manner that has no expected biological meaning. Posterior probability estimates are misleading when an incorrect partitioning strategy is assumed. The error was greatest when the assumed model was underpartitioned. These results suggest that model partitioning is important for large data sets. Bayes factors performed well, giving a 5% type I error rate, which is remarkably consistent with standard frequentist hypothesis tests. The sensitivity of Bayes factors was found to be quite high when the across-class model heterogeneity reflected that of empirical data. These results suggest that Bayes factors represent a robust method of choosing among partitioning strategies. Lastly, results of tests for the inclusion of unexpected divisions in empirical data mirrored the simulation results, although the outcome of such tests is highly dependent on accounting for rate variation among classes. We conclude by discussing other approaches for partitioning data, as well as other applications of Bayes factors.
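Comparing two partitioning strategies with a Bayes factor reduces to a difference of log marginal likelihoods, as the following Python sketch shows. It is an illustration under stated assumptions: the marginal likelihoods are taken as already estimated, the function names are hypothetical, and the verbal categories follow the widely used Kass and Raftery (1995) scale, which is not necessarily the threshold used in this study.

```python
def two_ln_bayes_factor(ln_ml_1, ln_ml_0):
    """Return 2 ln(BF_10) from the log marginal likelihoods of models 1 and 0."""
    return 2.0 * (ln_ml_1 - ln_ml_0)


def interpret(two_ln_bf):
    """Describe the evidence for model 1 on the Kass and Raftery (1995) scale."""
    if two_ln_bf < 0:
        return "favors model 0"
    if two_ln_bf < 2:
        return "not worth more than a bare mention"
    if two_ln_bf < 6:
        return "positive"
    if two_ln_bf <= 10:
        return "strong"
    return "very strong"


if __name__ == "__main__":
    # Hypothetical log marginal likelihoods: partitioned vs. unpartitioned model.
    score = two_ln_bayes_factor(-12340.5, -12352.0)
    print(score, interpret(score))
```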

15.
Miyazawa S 《PloS one》2011,6(12):e28892
BACKGROUND: A mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the average fixation probability depending on the type of amino acid replacement, has advantages over nucleotide, amino acid, and empirical codon substitution models in evolutionary analysis of protein-coding sequences. It can approximate a wide range of codon substitution processes. If no selection pressure on amino acids is taken into account, it will become equivalent to a nucleotide substitution model. If mutation rates are assumed not to depend on the codon type, then it will become essentially equivalent to an amino acid substitution model. Mutation at the nucleotide level and selection at the amino acid level can be separately evaluated. RESULTS: The present scheme for single nucleotide mutations is equivalent to the general time-reversible model, but multiple nucleotide changes in infinitesimal time are allowed. Selective constraints on the respective types of amino acid replacements are tailored to each gene in a linear function of a given estimate of selective constraints. Their good estimates are those calculated by maximizing the respective likelihoods of empirical amino acid or codon substitution frequency matrices. Akaike and Bayesian information criteria indicate that the present model performs far better than the other substitution models for all five phylogenetic trees of highly-divergent to highly-homologous sequences of chloroplast, mitochondrial, and nuclear genes. It is also shown that multiple nucleotide changes in infinitesimal time are significant in long branches, although they may be caused by compensatory substitutions or other mechanisms. The variation of selective constraint over sites fits the datasets significantly better than variable mutation rates, except for 10 slow-evolving nuclear genes of 10 mammals. 
A critical finding for phylogenetic analysis is that assuming variable mutation rates over sites leads to the overestimation of branch lengths.

16.
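Liu Chaoyang  Zhuang Wenying 《Mycosystema (菌物学报)》2013,32(3):563-573
In phylogenetic analyses based on rRNA genes, heterogeneity in evolutionary rates among sites may be an important source of systematic error. Using 52 fungi as study subjects, we constructed partitioning strategies from rRNA secondary-structure features and examined how different partitioning strategies affect Bayesian analysis. The results show that the best-fit nucleotide substitution model and its parameters for each structural partition are closely related to the partition type. Compared with the conventional Bayesian approach, partitioning strategies based on structural loops had no significant effect on the results, whereas methods that introduced stem elements led to higher marginal likelihoods and support values. In addition, partitioning strategies that simply increased the number of subpartitions without regard to structural features also increased Bayes factor values but did not improve the ability to resolve relationships, indicating that a sound partitioning strategy should be based on biological function (or secondary-structure features) rather than on purely mathematical considerations.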

17.
The recent development of Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) techniques has facilitated the exploration of parameter-rich evolutionary models. At the same time, stochastic models have become more realistic (and complex) and have been extended to new types of data, such as morphology. Based on this foundation, we developed a Bayesian MCMC approach to the analysis of combined data sets and explored its utility in inferring relationships among gall wasps based on data from morphology and four genes (nuclear and mitochondrial, ribosomal and protein coding). Examined models range in complexity from those recognizing only a morphological and a molecular partition to those having complex substitution models with independent parameters for each gene. Bayesian MCMC analysis deals efficiently with complex models: convergence occurs faster and more predictably for complex models, mixing is adequate for all parameters even under very complex models, and the parameter update cycle is virtually unaffected by model partitioning across sites. Morphology contributed only 5% of the characters in the data set but nevertheless influenced the combined-data tree, supporting the utility of morphological data in multigene analyses. We used Bayesian criteria (Bayes factors) to show that process heterogeneity across data partitions is a significant model component, although not as important as among-site rate variation. More complex evolutionary models are associated with more topological uncertainty and less conflict between morphology and molecules. Bayes factors sometimes favor simpler models over considerably more parameter-rich models, but the best model overall is also the most complex and Bayes factors do not support exclusion of apparently weak parameters from this model. 
Thus, Bayes factors appear to be useful for selecting among complex models, but it is still unclear whether their use strikes a reasonable balance between model complexity and error in parameter estimates.

18.
Phylogenetic analyses of DNA sequences were conducted to evaluate four alternative hypotheses of phrynosomatine sand lizard relationships. Sequences comprising 2871 aligned base pair positions representing the regions spanning ND1-COI and cyt b-tRNA(Thr) of the mitochondrial genome from all recognized sand lizard species were analyzed using unpartitioned parsimony and likelihood methods, likelihood methods with assumed partitions, Bayesian methods with assumed partitions, and Bayesian mixture models. The topology (Uma, (Callisaurus, (Cophosaurus, Holbrookia))) and thus monophyly of the "earless" taxa, Cophosaurus and Holbrookia, is supported by all analyses. Previously proposed topologies in which Uma and Callisaurus are sister taxa and those in which Holbrookia is the sister group to all other sand lizard taxa are rejected using both parsimony and likelihood-based significance tests with the combined, unpartitioned data set. Bayesian hypothesis tests also reject those topologies using six assumed partitioning strategies, and the two partitioning strategies presumably associated with the most powerful tests also reject a third previously proposed topology, in which Callisaurus and Cophosaurus are sister taxa. For both maximum likelihood and Bayesian methods with assumed partitions, those partitions defined by codon position and tRNA stem and nonstems explained the data better than other strategies examined. Bayes factor estimates comparing results of assumed partitions versus mixture models suggest that mixture models perform better than assumed partitions when the latter were not based on functional characteristics of the data, such as codon position and tRNA stem and nonstems. However, assumed partitions performed better than mixture models when functional differences were incorporated. 
We reiterate the importance of accounting for heterogeneous evolutionary processes in the analysis of complex data sets and emphasize the importance of implementing mixed model likelihood methods.

19.
Genetic sequence data typically exhibit variability in substitution rates across sites. In practice, there is often too little variation to fit a different rate for each site in the alignment, but the distribution of rates across sites may not be well modeled using simple parametric families. Mixtures of different distributions can capture more complex patterns of rate variation, but are often parameter-rich and difficult to fit. We present a simple hierarchical model in which a baseline rate distribution, such as a gamma distribution, is discretized into several categories, the quantiles of which are estimated using a discretized beta distribution. Although this approach involves adding only two extra parameters to a standard distribution, a wide range of rate distributions can be captured. Using simulated data, we demonstrate that a "beta-gamma" model can reproduce the moments of the rate distribution more accurately than the distribution used to simulate the data, even when the baseline rate distribution is misspecified. Using hepatitis C virus and mammalian mitochondrial sequences, we show that a beta-gamma model can fit as well as or better than a model with multiple discrete rate categories, and compares favorably with a model which fits a separate rate category to each site. We also demonstrate this discretization scheme in the context of codon models specifically aimed at identifying individual sites undergoing adaptive or purifying evolution.

20.
Evolutionary studies commonly model single nucleotide substitutions and assume that they occur as independent draws from a unique probability distribution across the sequence studied. This assumption is violated for protein-coding sequences, and we consider modeling approaches where codon positions (CPs) are treated as separate categories of sites because within each category the assumption is more reasonable. Such "codon-position" models have been shown to explain the evolution of codon data better than homogeneous models in previous studies. This paper examines the ways in which codon-position models outperform homogeneous models and characterizes the differences in estimates of model parameters across CPs. Using the PANDIT database of multiple species DNA sequence alignments, we quantify the differences in the evolutionary processes at the 3 CPs in a systematic and comprehensive manner, characterizing previously undescribed features of protein evolution. We relate our findings to the functional constraints imposed by the genetic code, protein function, and the types of mutation that cause synonymous and nonsynonymous codon changes. The results increase our understanding of selective constraints and could be incorporated into phylogenetic analyses or gene-finding techniques in the future. The methods used are extended to an overlapping reading frame data set, and we discover that overlapping reading frames do not necessarily cause more stringent evolutionary constraints.
