首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
We develop a new approach to estimate a matrix of pairwise evolutionary distances from a codon-based alignment based on a codon evolutionary model. The method first computes a standard distance matrix for each of the three codon positions. Then these three distance matrices are weighted according to an estimate of the global evolutionary rate of each codon position and averaged into a unique distance matrix. Using a large set of both real and simulated codon-based alignments of nucleotide sequences, we show that this approach leads to distance matrices that have a significantly better treelikeness compared to those obtained by standard nucleotide evolutionary distances. We also propose an alternative weighting to eliminate the part of the noise often associated with some codon positions, particularly the third position, which is known to induce a fast evolutionary rate. Simulation results show that fast distance-based tree reconstruction algorithms on distance matrices based on this codon position weighting can lead to phylogenetic trees that are at least as accurate as, if not better, than those inferred by maximum likelihood. Finally, a well-known multigene dataset composed of eight yeast species and 106 codon-based alignments is reanalyzed and shows that our codon evolutionary distances allow building a phylogenetic tree which is similar to those obtained by non-distance-based methods (e.g., maximum parsimony and maximum likelihood) and also significantly improved compared to standard nucleotide evolutionary distance estimates.  相似文献   

2.
Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time t since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when t approximately 1/2 |lambda(max)|(-1) logN, where lambda(max) is the eigenvalue of the substitution rate matrix with the largest absolute value and N is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when t approximately 1/2 |lambda(1)|(-1) logN, where lambda(1) is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.  相似文献   

3.
Miyazawa S 《PloS one》2011,6(12):e28892
BACKGROUND: A mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the average fixation probability depending on the type of amino acid replacement, has advantages over nucleotide, amino acid, and empirical codon substitution models in evolutionary analysis of protein-coding sequences. It can approximate a wide range of codon substitution processes. If no selection pressure on amino acids is taken into account, it will become equivalent to a nucleotide substitution model. If mutation rates are assumed not to depend on the codon type, then it will become essentially equivalent to an amino acid substitution model. Mutation at the nucleotide level and selection at the amino acid level can be separately evaluated. RESULTS: The present scheme for single nucleotide mutations is equivalent to the general time-reversible model, but multiple nucleotide changes in infinitesimal time are allowed. Selective constraints on the respective types of amino acid replacements are tailored to each gene in a linear function of a given estimate of selective constraints. Their good estimates are those calculated by maximizing the respective likelihoods of empirical amino acid or codon substitution frequency matrices. Akaike and Bayesian information criteria indicate that the present model performs far better than the other substitution models for all five phylogenetic trees of highly-divergent to highly-homologous sequences of chloroplast, mitochondrial, and nuclear genes. It is also shown that multiple nucleotide changes in infinitesimal time are significant in long branches, although they may be caused by compensatory substitutions or other mechanisms. The variation of selective constraint over sites fits the datasets significantly better than variable mutation rates, except for 10 slow-evolving nuclear genes of 10 mammals. An critical finding for phylogenetic analysis is that assuming variable mutation rates over sites lead to the overestimation of branch lengths.  相似文献   

4.
Recently, much attention has been devoted to the construction of phylogenetic networks which generalize phylogenetic trees in order to accommodate complex evolutionary processes. Here, we present an efficient, practical algorithm for reconstructing level-1 phylogenetic networks--a type of network slightly more general than a phylogenetic tree--from triplets. Our algorithm has been made publicly available as the program LEV1ATHAN. It combines ideas from several known theoretical algorithms for phylogenetic tree and network reconstruction with two novel subroutines. Namely, an exponential-time exact and a greedy algorithm both of which are of independent theoretical interest. Most importantly, LEV1ATHAN runs in polynomial time and always constructs a level-1 network. If the data are consistent with a phylogenetic tree, then the algorithm constructs such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent with some level-1 network, it will find such a network. The potential of LEV1ATHAN is explored by means of an extensive simulation study and a biological data set. One of our conclusions is that LEV1ATHAN is able to construct networks consistent with a high percentage of input triplets, even when these input triplets are affected by a low to moderate level of noise.  相似文献   

5.
Selection at the protein-level can influence nucleotide substitution patterns for protein-coding genes, which in turn can affect their performance as phylogenetic characters. In this study, we compare two protein-coding nuclear genes that appear to have evolved under markedly different selective constraints and evaluate how selection has shaped their phylogenetic signal. We sequenced 1,100+ bp of exon 6 of the gene encoding dentin matrix protein 1 (DMP1) from most of the currently recognized genera of New World opossums (family: Didelphidae) and compared these data to an existing matrix of sequences from the interphotoreceptor retinoid-binding protein gene (IRBP) and morphological characters. In comparison to IRBP, DMP1 has far fewer sites under strong purifying selection and exhibits a number of sites under positive directional selection. Furthermore, selection on the DMP1 protein appears to conserve short, acidic, serine-rich domains rather than primary amino acid sequence; as a result, DMP1 has significantly different nucleotide substitution patterns from IRBP. Using Bayesian methods, we determined that DMP1 evolves almost 30% faster than IRBP, has 2.5 times more variable sites, has less among-site rate heterogeneity, is skewed toward A and away from CT (IRBP has relatively even base frequencies), and has a significantly lower rate of change between adenine and any other nucleotide. Despite these different nucleotide substitution patterns, estimates of didelphid relationships based on separate phylogenetic analyses of these genes are remarkably congruent whether patterns of nucleotide substitution are explicitly modeled or not. Nonetheless, DMP1 contains more phylogenetically informative characters per unit sequence and resolves more nodes with higher support than does IRBP. Thus, for these two genes, relaxed functional constraints and positive selection appear to improve the efficiency of phylogenetic estimation without compromising its accuracy.  相似文献   

6.
A major assumption of many molecular phylogenetic methods is the homogeneity of nucleotide frequencies among taxa, which refers to the equality of the nucleotide frequency bias among species. Changes in nucleotide frequency among different lineages in a data set are thought to lead to erroneous phylogenetic inference because unrelated clades may appear similar because of evolutionarily unrelated similarities in nucleotide frequencies. We tested the effects of the heterogeneity of nucleotide frequency bias on phylogenetic inference, along with the interaction between this heterogeneity and stratified taxon sampling, by means of computer simulations using evolutionary parameters derived from genomic databases. We found that the phylogenetic trees inferred from data sets simulated under realistic, observed levels of heterogeneity for mammalian genes were reconstructed with accuracy comparable to those simulated with homogeneous nucleotide frequencies; the results hold for Neighbor-Joining, minimum evolution, maximum parsimony, and maximum-likelihood methods. The LogDet distance method, specifically designed to deal with heterogeneous nucleotide frequencies, does not perform better than distance methods that assume substitution pattern homogeneity among sequences. In these specific simulation conditions, we did not find a significant interaction between phylogenetic accuracy and substitution pattern heterogeneity among lineages, even when the taxon sampling is increased.  相似文献   

7.
Phylogenetic analyses frequently rely on models of sequence evolution that detail nucleotide substitution rates, nucleotide frequencies, and site-to-site rate heterogeneity. These models can influence hypothesis testing and can affect the accuracy of phylogenetic inferences. Maximum likelihood methods of simultaneously constructing phylogenetic tree topologies and estimating model parameters are computationally intensive, and are not feasible for sample sizes of 25 or greater using personal computers. Techniques that initially construct a tree topology and then use this non-maximized topology to estimate ML substitution rates, however, can quickly arrive at a model of sequence evolution. The accuracy of this two-step estimation technique was tested using simulated data sets with known model parameters. The results showed that for a star-like topology, as is often seen in human immunodeficiency virus type 1 (HIV-1) subtype B sequences, a random starting topology could produce nucleotide substitution rates that were not statistically different than the true rates. Samples were isolated from 100 HIV-1 subtype B infected individuals from the United States and a 620 nt region of the env gene was sequenced for each sample. The sequence data were used to obtain a substitution model of sequence evolution specific for HIV-1 subtype B env by estimating nucleotide substitution rates and the site-to-site heterogeneity in 100 individuals from the United States. The method of estimating the model should provide users of large data sets with a way to quickly compute a model of sequence evolution, while the nucleotide substitution model we identified should prove useful in the phylogenetic analysis of HIV-1 subtype B env sequences. Received: 4 October 2000 / Accepted: 1 March 2001  相似文献   

8.
jModelTest: phylogenetic model averaging   总被引:15,自引:0,他引:15  
jModelTest is a new program for the statistical selection of models of nucleotide substitution based on "Phyml" (Guindon and Gascuel 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52:696-704.). It implements 5 different selection strategies, including "hierarchical and dynamical likelihood ratio tests," the "Akaike information criterion," the "Bayesian information criterion," and a "decision-theoretic performance-based" approach. This program also calculates the relative importance and model-averaged estimates of substitution parameters, including a model-averaged estimate of the phylogeny. jModelTest is written in Java and runs under Mac OSX, Windows, and Unix systems with a Java Runtime Environment installed. The program, including documentation, can be freely downloaded from the software section at http://darwin.uvigo.es.  相似文献   

9.
Lake's evolutionary parsimony (EP) method of constructing a phylogenetic tree is primarily applied to four DNA sequences. In this method, three quantities--X, Y, and Z--that correspond to three possible unrooted trees are computed, and an invariance property of these quantities is used for choosing the best tree. However, Lake's method depends on a number of unrealistic assumptions. We therefore examined the theoretical basis of his method and reached the following conclusions: (1) When the rates of two transversional changes from a nucleotide are unequal, his invariance property breaks down. (2) Even if the rates of two transversional changes are equal, the invariance property requires some additional conditions. (3) When Kimura's two- parameter model of nucleotide substitution applies and the rate of nucleotide substitution varies greatly with branch, the EP method is generally better than the standard maximum-parsimony (MP) method in recovering the correct tree but is inferior to the neighbor-joining (NJ) and a few other distance matrix methods. (4) When the rate of nucleotide substitution is the same or nearly the same for all branches, the EP method is inferior to the MP method even if the proportion of transitional changes is high. (5) When Lake's assumptions fail, his chi2 test may identify an erroneous tree as the correct tree. This happens because the test is not for comparing different trees. (6) As long as a proper distance measure is used, the NJ method is better than the EP and MP methods whether there is a transition/transversion bias or whether there is variation in substitution rate among different nucleotide sites.   相似文献   

10.
Elevated substitution rates estimated from ancient DNA sequences   总被引:1,自引:0,他引:1  
Ancient DNA sequences are able to offer valuable insights into molecular evolutionary processes, which are not directly accessible via modern DNA. They are particularly suitable for the estimation of substitution rates because their ages provide calibrating information in phylogenetic analyses, circumventing the difficult task of choosing independent calibration points. The substitution rates obtained from such datasets have typically been high, falling between the rates estimated from pedigrees and species phylogenies. Many of these estimates have been made using a Bayesian phylogenetic method that explicitly accommodates heterochronous data. Stimulated by recent criticism of this method, we present a comprehensive simulation study that validates its performance. For datasets of moderate size, it produces accurate estimates of rates, while appearing robust to assumptions about demographic history. We then analyse a large collection of 749 ancient and 727 modern DNA sequences from 19 species of animals, plants and bacteria. Our new estimates confirm that the substitution rates estimated from ancient DNA sequences are elevated above long-term phylogenetic levels.  相似文献   

11.
An important issue in the phylogenetic analysis of nucleotide sequence data using the maximum likelihood (ML) method is the underlying evolutionary model employed. We consider the problem of simultaneously estimating the tree topology and the parameters in the underlying substitution model and of obtaining estimates of the standard errors of these parameter estimates. Given a fixed tree topology and corresponding set of branch lengths, the ML estimates of standard evolutionary model parameters are asymptotically efficient, in the sense that their joint distribution is asymptotically normal with the variance–covariance matrix given by the inverse of the Fisher information matrix. We propose a new estimate of this conditional variance based on estimation of the expected information using a Monte Carlo sampling (MCS) method. Simulations are used to compare this conditional variance estimate to the standard technique of using the observed information under a variety of experimental conditions. In the case in which one wishes to estimate simultaneously the tree and parameters, we provide a bootstrapping approach that can be used in conjunction with the MCS method to estimate the unconditional standard error. The methods developed are applied to a real data set consisting of 30 papillomavirus sequences. This overall method is easily incorporated into standard bootstrapping procedures to allow for proper variance estimation.  相似文献   

12.
Few estimates of relative substitution rates, and the underlying mutation rates, exist between mitochondrial and nuclear genes in insects. Previous estimates for insects indicate a 2-9 times faster substitution rate in mitochondrial genes relative to nuclear genes. Here we use novel methods for estimating relative rates of substitution, which incorporate multiple substitutions, and apply these methods to a group of insects (lice, Order: Phthiraptera). First, we use a modification of copath analysis (branch length regression) to construct independent comparisons of rates, consisting of each branch in a phylogenetic tree. The branch length comparisons use maximum likelihood models to correct for multiple substitution. In addition, we estimate codon-specific rates under maximum likelihood for the different genes and compare these values. Estimates of the relative synonymous substitution rates between a mitochondrial (COI) and nuclear (EF-1alpha) gene in lice indicate a relative rate of several 100 to 1. This rapid relative mitochondrial rate (>100 times) is at least an order of magnitude faster than previous estimates for any group of organisms. Comparisons using the same methods for another group of insects (aphids) reveals that this extreme relative rate estimate is not simply attributable to the methods we used, because estimates from aphids are substantially lower. Taxon sampling affects the relative rate estimate, with comparisons involving more closely related taxa resulting in a higher estimate. Relative rate estimates also increase with model complexity, indicating that methods accounting for more multiple substitution estimate higher relative rates.  相似文献   

13.
Using evolutionary Expectation Maximization to estimate indel rates   总被引:4,自引:0,他引:4  
MOTIVATION: The Expectation Maximization (EM) algorithm, in the form of the Baum-Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the EM algorithm to estimate not only the probability parameters of the stochastic grammar, but also the instantaneous mutation rates of the underlying evolutionary model (to facilitate the development of stochastic grammars based on phylogenetic trees, also known as Statistical Alignment). Recently, we showed how to do this for the point substitution component of the evolutionary process; here, we extend these results to the indel process. RESULTS: We present an algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the 'TKF91' model). The algorithm converges extremely rapidly, gives accurate results on simulated data that are an improvement over parsimonious estimates (which are shown to underestimate the true indel rate), and gives plausible results on experimental data (coronavirus envelope domains). Owing to the algorithm's close similarity to the Baum-Welch algorithm for training hidden Markov models, it can be used in an 'unsupervised' fashion to estimate rates for unaligned sequences, or estimate several sets of rates for sequences with heterogenous rates. AVAILABILITY: Software implementing the algorithm and the benchmark is available under GPL from http://www.biowiki.org/  相似文献   

14.
15.
We propose two approximate methods (one based on parsimony and one on pairwise sequence comparison) for estimating the pattern of nucleotide substitution and a parsimony-based method for estimating the gamma parameter for variable substitution rates among sites. The matrix of substitution rates that represents the substitution pattern can be recovered through its relationship with the observable matrix of site pattern frequences in pairwise sequence comparisons. In the parsimony approach, the ancestral sequences reconstructed by the parsimony algorithm were used, and the two sequences compared are those at the ends of a branch in the phylogenetic tree. The method for estimating the gamma parameter was based on a reinterpretation of the numbers of changes at sites inferred by parsimony. Three data sets were analyzed to examine the utility of the approximate methods compared with the more reliable likelihood methods. The new methods for estimating the substitution pattern were found to produce estimates quite similar to those obtained from the likelihood analyses. The new method for estimating the gamma parameter was effective in reducing the bias in conventional parsimony estimates, although it also overestimated the parameter. The approximate methods are computationally very fast and appear useful for analyzing large data sets, for which use of the likelihood method requires excessive computation.   相似文献   

16.
In phylogenetic inference by maximum-parsimony (MP), minimum-evolution (ME), and maximum-likelihood (ML) methods, it is customary to conduct extensive heuristic searches of MP, ME, and ML trees, examining a large number of different topologies. However, these extensive searches tend to give incorrect tree topologies. Here we show by extensive computer simulation that when the number of nucleotide sequences (m) is large and the number of nucleotides used (n) is relatively small, the simple MP or ML tree search algorithms such as the stepwise addition (SA) plus nearest neighbor interchange (NNI) search and the SA plus subtree pruning regrafting (SPR) search are as efficient as the extensive search algorithms such as the SA plus tree bisection-reconnection (TBR) search in inferring the true tree. In the case of ME methods, the simple neighbor-joining (NJ) algorithm is as efficient as or more efficient than the extensive NJ+TBR search. We show that when ME methods are used, the simple p distance generally gives better results in phylogenetic inference than more complicated distance measures such as the Hasegawa-Kishino-Yano (HKY) distance, even when nucleotide substitution follows the HKY model. When ML methods are used, the simple Jukes-Cantor (JC) model of phylogenetic inference generally shows a better performance than the HKY model even if the likelihood value for the HKY model is much higher than that for the JC model. This indicates that at least in the present case, selecting of a substitution model by using the likelihood ratio test or the AIC index is not appropriate. When n is small relative to m and the extent of sequence divergence is high, the NJ method with p distance often shows a better performance than ML methods with the JC model. However, when the level of sequence divergence is low, this is not the case.  相似文献   

17.
Tragopogon comprises approximately 150 described species distributed throughout Eurasia from Ireland and the UK to India and China with a few species in North Africa. Most of the species diversity is found in Eastern Europe to Western Asia. Previous phylogenetic analyses identified several major clades, generally corresponding to recognized taxonomic sections, although relationships both among these clades and among species within clades remain largely unresolved. These patterns are consistent with rapid diversification following the origin of Tragopogon, and this study addresses the timing and rate of diversification in Tragopogon. Using BEAST to simultaneously estimate a phylogeny and divergence times, we estimate the age of a major split and subsequent rapid divergence within Tragopogon to be ~2.6 Ma (and 1.7–5.4 Ma using various clock estimates). Based on the age estimates obtained with BEAST (HPD 1.7–5.4 Ma) for the origin of crown group Tragopogon and 200 estimated species (to accommodate a large number of cryptic species), the diversification rate of Tragopogon is approximately 0.84–2.71 species/Myr for the crown group, assuming low levels of extinction. This estimate is comparable in rate to a rapid Eurasian radiation in Dianthus (0.66–3.89 species/Myr), which occurs in the same or similar habitats. Using available data, we show that subclades of various plant taxa that occur in the same semi‐arid habitats of Eurasia also represent rapid radiations occurring during roughly the same window of time (1.7–5.4 Ma), suggesting similar causal events. However, not all species‐rich plant genera from the same habitats diverged at the same time, or at the same tempo. Radiations of several other clades in this same habitat (e.g. Campanula, Knautia, Scabiosa) occurred at earlier dates (45–4.28 Ma). Existing phylogenetic data and diversification estimates therefore indicate that, although some elements of these semi‐arid communities radiated during the Plio‐Pleistocene period, other clades sharing the same habitat appear to have diversified earlier.  相似文献   

18.
Most of the sophisticated methods to estimate evolutionary divergence between DNA sequences assume that the two sequences have evolved with the same pattern of nucleotide substitution after their divergence from their most recent common ancestor (homogeneity assumption). If this assumption is violated, the evolutionary distance estimated will be biased, which may result in biased estimates of divergence times and substitution rates, and may lead to erroneous branching patterns in the inferred phylogenies. Here we present a simple modification for existing distance estimation methods to relax the assumption of the substitution pattern homogeneity among lineages when analyzing DNA and protein sequences. Results from computer simulations and empirical data analyses for human and mouse genes are presented to demonstrate that the proposed modification reduces the estimation bias considerably and that the modified method performs much better than the LogDet methods, which do not require the homogeneity assumption in estimating the number of substitutions per site. We also discuss the relationship of the substitution and mutation rate estimates when the substitution pattern is not the same in the lineages leading to the two sequences compared.  相似文献   

19.
Summary Studies are carried out on the uniqueness of the stationary point on the likelihood function for estimating molecular phylogenetic trees, yielding proof that there exists at most one stationary point, i.e., the maximum point, in the parameter range for the one parameter model of nucleotide substitution. The proof is simple yet applicable to any type of tree topology with an arbitrary number of operational taxonomic units (OTUs). The proof ensures that any valid approximation algorithm be able to reach the unique maximum point under the conditions mentioned above. An algorithm developed incorporating Newton's approximation method is then compared with the conventional one by means of computers simulation. The results show that the newly developed algorithm always requires less CPU time than the conventional one, whereas both algorithms lead to identical molecular phylogenetic trees in accordance with the proof. Contribution No. 1780 from the National Institute of Genetics, Mishima 411, Japan  相似文献   

20.
A comparison of phylogenetic network methods using computer simulation   总被引:1,自引:0,他引:1  

Background

We present a series of simulation studies that explore the relative performance of several phylogenetic network approaches (statistical parsimony, split decomposition, union of maximum parsimony trees, neighbor-net, simulated history recombination upper bound, median-joining, reduced median joining and minimum spanning network) compared to standard tree approaches, (neighbor-joining and maximum parsimony) in the presence and absence of recombination.

Principal Findings

In the absence of recombination, all methods recovered the correct topology and branch lengths nearly all of the time when the substitution rate was low, except for minimum spanning networks, which did considerably worse. At a higher substitution rate, maximum parsimony and union of maximum parsimony trees were the most accurate. With recombination, the ability to infer the correct topology was halved for all methods and no method could accurately estimate branch lengths.

Conclusions

Our results highlight the need for more accurate phylogenetic network methods and the importance of detecting and accounting for recombination in phylogenetic studies. Furthermore, we provide useful information for choosing a network algorithm and a framework in which to evaluate improvements to existing methods and novel algorithms developed in the future.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号