首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
ABSTRACT: BACKGROUND: Distance-based phylogenetic reconstruction methods use evolutionary distances between species in order to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method. RESULTS: This paper continues the line of research which attempts to adjust to each given set of input sequences a distance function which maximizes the expected topological accuracy of the reconstructed tree. We focus here on the effect of systematic error caused by assuming an inadequate model, but consider also the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error based on the notion of deviation from additivity, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the Jukes-Cantor distance function when applied to data generated according to Kimura's two-parameter model with a transition-transversion bias. We provide both a theoretical derivation for this case, and a detailed simulation study on quartet trees. CONCLUSIONS: We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enables statistically inconsistent reconstruction methods to outperform consistent methods.  相似文献   

2.
Two methods are commonly employed for evaluating the extent of the uncertainty of evolutionary distances between sequences: either some estimator of the variance of the distance estimator, or the bootstrap method. However, both approaches can be misleading, particularly when the evolutionary distance is small. We propose using another statistical method which does not have the same defect: interval estimation. We show how confidence intervals may be constructed for the Jukes and Cantor (1969) and Kimura two-parameter (1980) estimators. We compare the exact confidence intervals thus obtained with the approximate intervals derived by the two previous methods, using artificial and biological data. The results show that the usual methods clearly underestimate the variability when the substitution rate is low and when sequences are short. Moreover, our analysis suggests that similar results may be expected for other evolutionary distance estimators.   相似文献   

3.
Most of the sophisticated methods to estimate evolutionary divergence between DNA sequences assume that the two sequences have evolved with the same pattern of nucleotide substitution after their divergence from their most recent common ancestor (homogeneity assumption). If this assumption is violated, the evolutionary distance estimated will be biased, which may result in biased estimates of divergence times and substitution rates, and may lead to erroneous branching patterns in the inferred phylogenies. Here we present a simple modification for existing distance estimation methods to relax the assumption of the substitution pattern homogeneity among lineages when analyzing DNA and protein sequences. Results from computer simulations and empirical data analyses for human and mouse genes are presented to demonstrate that the proposed modification reduces the estimation bias considerably and that the modified method performs much better than the LogDet methods, which do not require the homogeneity assumption in estimating the number of substitutions per site. We also discuss the relationship of the substitution and mutation rate estimates when the substitution pattern is not the same in the lineages leading to the two sequences compared.  相似文献   

4.
The process of inferring phylogenetic trees from molecular sequences almost always starts with a multiple alignment of these sequences but can also be based on methods that do not involve multiple sequence alignment. Very little is known about the accuracy with which such alignment-free methods recover the correct phylogeny or about the potential for increasing their accuracy. We conducted a large-scale comparison of ten alignment-free methods, among them one new approach that does not calculate distances and a faster variant of our pattern-based approach; all distance-based alignment-free methods are freely available from http://www.bioinformatics.org.au (as Python package decaf+py). We show that most methods exhibit a higher overall reconstruction accuracy in the presence of high among-site rate variation. Under all conditions that we considered, variants of the pattern-based approach were significantly better than the other alignment-free methods. The new pattern-based variant achieved a speed-up of an order of magnitude in the distance calculation step, accompanied by a small loss of tree reconstruction accuracy. A method of Bayesian inference from k-mers did not improve on classical alignment-free (and distance-based) methods but may still offer other advantages due to its Bayesian nature. We found the optimal word length k of word-based methods to be stable across various data sets, and we provide parameter ranges for two different alphabets. The influence of these alphabets was analyzed to reveal a trade-off in reconstruction accuracy between long and short branches. We have mapped the phylogenetic accuracy for many alignment-free methods, among them several recently introduced ones, and increased our understanding of their behavior in response to biologically important parameters. In all experiments, the pattern-based approach emerged as superior, at the expense of higher resource consumption. Nonetheless, no alignment-free method that we examined recovers the correct phylogeny as accurately as does an approach based on maximum-likelihood distance estimates of multiply aligned sequences.  相似文献   

5.
The assumption of a molecular clock for dating events from sequence information is often frustrated by the presence of heterogeneity among evolutionary rates due, among other factors, to positively selected sites. In this work, our goal is to explore methods to estimate infection dates from sequence analysis. One such method, based on site stripping for clock detection, was proposed to unravel the clocklike molecular evolution in sequences showing high variability of evolutionary rates and in the presence of positive selection. Other alternatives imply accommodating heterogeneity in evolutionary rates at various levels, without eliminating any information from the data. Here we present the analysis of a data set of hepatitis C virus (HCV) sequences from 24 patients infected by a single individual with known dates of infection. We first used a simple criterion of relative substitution rate for site removal prior to a regression analysis. Time was regressed on maximum likelihood pairwise evolutionary distances between the sequences sampled from the source individual and infected patients. We show that it is indeed the fastest evolving sites that disturb the molecular clock and that these sites correspond to positively selected codons. The high computational efficiency of the regression analysis allowed us to compare the site-stripping scheme with random removal of sites. We demonstrate that removing the fast-evolving sites significantly increases the accuracy of estimation of infection times based on a single substitution rate. However, the time-of-infection estimations improved substantially when a more sophisticated and computationally demanding Bayesian method was used. This method was used with the same data set but keeping all the sequence positions in the analysis. Consequently, despite the distortion introduced by positive selection on evolutionary rates, it is possible to obtain quite accurate estimates of infection dates, a result of especial relevance for molecular epidemiology studies.  相似文献   

6.
We introduce a new approach to estimate the evolutionary distance between two sequences. This approach uses a tree with three leaves: two of them correspond to the studied sequences, whereas the third is chosen to handle long-distance estimation. The branch lengths of this tree are obtained by likelihood maximization and are then used to deduce the desired distance. This approach, called TripleML, improves the precision of evolutionary distance estimates, and thus the topological accuracy of distance-based methods. TripleML can be used with neighbor-joining-like (NJ-like) methods not only to compute the initial distance matrix but also to estimate new distances encountered during the agglomeration process. Computer simulations indicate that using TripleML significantly improves the topological accuracy of NJ, BioNJ, and Weighbor, while conserving a reasonable computation time. With randomly generated 24-taxon trees and realistic parameter values, combining NJ with TripleML reduces the number of wrongly inferred branches by about 11% (against 2.6% and 5.5% for BioNJ and Weighbor, respectively). Moreover, this combination requires only about 1.5 min to infer a phylogeny of 96 sequences composed of 1,200 nucleotides, as compared with 6.5 h for FastDNAml on the same machine (PC 466 MHz).  相似文献   

7.
Reconstructing the evolutionary history of protein sequences will provide a better understanding of divergence mechanisms of protein superfamilies and their functions. Long-term protein evolution often includes dynamic changes such as insertion, deletion, and domain shuffling. Such dynamic changes make reconstructing protein sequence evolution difficult and affect the accuracy of molecular evolutionary methods, such as multiple alignments and phylogenetic methods. Unfortunately, currently available simulation methods are not sufficiently flexible and do not allow biologically realistic dynamic protein sequence evolution. We introduce a new method, indel-Seq-Gen (iSG), that can simulate realistic evolutionary processes of protein sequences with insertions and deletions (indels). Unlike other simulation methods, iSG allows the user to simulate multiple subsequences according to different evolutionary parameters, which is necessary for generating realistic protein families with multiple domains. iSG tracks all evolutionary events including indels and outputs the "true" multiple alignment of the simulated sequences. iSG can also generate a larger sequence space by allowing the use of multiple related root sequences. With all these functions, iSG can be used to test the accuracy of, for example, multiple alignment methods, phylogenetic methods, evolutionary hypotheses, ancestral protein reconstruction methods, and protein family classification methods. We empirically evaluated the performance of iSG against currently available methods by simulating the evolution of the G protein-coupled receptor and lipocalin protein families. We examined their true multiple alignments, reconstruction of the transmembrane regions and beta-strands, and the results of similarity search against a protein database using the simulated sequences. We also presented an example of using iSG for examining how phylogenetic reconstruction is affected by high indel rates.  相似文献   

8.
We present a Bayesian method for the reconstruction of pedigrees in clonal populations using co-dominant genomic markers such as microsatellites and single nucleotide polymorphisms (SNPs). The accuracy of the algorithm is demonstrated for simulated data. We show that the joint estimation of parameters of interest such as the rate of self-fertilization is possible with high accuracy even with marker panels of moderate power. Classical methods can only assign a very limited number of statistically significant parentages in this case and would therefore fail. Statistical confidence is estimated by Markov Chain Monte Carlo (MCMC) sampling. The method is implemented in a fast and easy to use open source software that scales to large datasets with many thousand individuals.  相似文献   

9.
J S Lopes  M Arenas  D Posada  M A Beaumont 《Heredity》2014,112(3):255-264
The estimation of parameters in molecular evolution may be biased when some processes are not considered. For example, the estimation of selection at the molecular level using codon-substitution models can have an upward bias when recombination is ignored. Here we address the joint estimation of recombination, molecular adaptation and substitution rates from coding sequences using approximate Bayesian computation (ABC). We describe the implementation of a regression-based strategy for choosing subsets of summary statistics for coding data, and show that this approach can accurately infer recombination allowing for intracodon recombination breakpoints, molecular adaptation and codon substitution rates. We demonstrate that our ABC approach can outperform other analytical methods under a variety of evolutionary scenarios. We also show that although the choice of the codon-substitution model is important, our inferences are robust to a moderate degree of model misspecification. In addition, we demonstrate that our approach can accurately choose the evolutionary model that best fits the data, providing an alternative for when the use of full-likelihood methods is impracticable. Finally, we applied our ABC method to co-estimate recombination, substitution and molecular adaptation rates from 24 published human immunodeficiency virus 1 coding data sets.  相似文献   

10.
Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time t since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when t approximately 1/2 |lambda(max)|(-1) logN, where lambda(max) is the eigenvalue of the substitution rate matrix with the largest absolute value and N is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when t approximately 1/2 |lambda(1)|(-1) logN, where lambda(1) is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.  相似文献   

11.
We present a novel maximum-likelihood-based algorithm for estimating the distribution of alignment scores from the scores of unrelated sequences in a database search. Using a new method for measuring the accuracy of p-values, we show that our maximum-likelihood-based algorithm is more accurate than existing regression-based and lookup table methods. We explore a more sophisticated way of modeling and estimating the score distributions (using a two-component mixture model and expectation maximization), but conclude that this does not improve significantly over simply ignoring scores with small E-values during estimation. Finally, we measure the classification accuracy of p-values estimated in different ways and observe that inaccurate p-values can, somewhat paradoxically, lead to higher classification accuracy. We explain this paradox and argue that statistical accuracy, not classification accuracy, should be the primary criterion in comparisons of similarity search methods that return p-values that adjust for target sequence length.  相似文献   

12.
The comparative and evolutionary analysis of molecular data has allowed researchers to tackle biological questions that have long remained unresolved. The evolution of DNA and amino acid sequences can now be modeled accurately enough that the information conveyed can be used to reconstruct the past. The methods to infer phylogeny (the pattern of historical relationships among lineages of organisms and/or sequences) range from the simplest, based on parsimony, to more sophisticated and highly parametric ones based on likelihood and Bayesian approaches. In general, molecular systematics provides a powerful statistical framework for hypothesis testing and the estimation of evolutionary processes, including the estimation of divergence times among taxa. The field of molecular systematics has experienced a revolution in recent years, and, although there are still methodological problems and pitfalls, it has become an essential tool for the study of evolutionary patterns and processes at different levels of biological organization. This review aims to present a brief synthesis of the approaches and methodologies that are most widely used in the field of molecular systematics today, as well as indications of future trends and state-of-the-art approaches.  相似文献   

13.
The phylogenetic inference of ancestral protein sequences is a powerful technique for the study of molecular evolution, but any conclusions drawn from such studies are only as good as the accuracy of the reconstruction method. Every inference method leads to errors in the ancestral protein sequence, resulting in potentially misleading estimates of the ancestral protein's properties. To assess the accuracy of ancestral protein reconstruction methods, we performed computational population evolution simulations featuring near-neutral evolution under purifying selection, speciation, and divergence using an off-lattice protein model where fitness depends on the ability to be stable in a specified target structure. We were thus able to compare the thermodynamic properties of the true ancestral sequences with the properties of “ancestral sequences” inferred by maximum parsimony, maximum likelihood, and Bayesian methods. Surprisingly, we found that methods such as maximum parsimony and maximum likelihood that reconstruct a “best guess” amino acid at each position overestimate thermostability, while a Bayesian method that sometimes chooses less-probable residues from the posterior probability distribution does not. Maximum likelihood and maximum parsimony apparently tend to eliminate variants at a position that are slightly detrimental to structural stability simply because such detrimental variants are less frequent. Other properties of ancestral proteins might be similarly overestimated. This suggests that ancestral reconstruction studies require greater care to come to credible conclusions regarding functional evolution. Inferred functional patterns that mimic reconstruction bias should be reevaluated.  相似文献   

14.
Fossil taxa are critical to inferences of historical diversity and the origins of modern biodiversity, but realizing their evolutionary significance is contingent on restoring fossil species to their correct position within the tree of life. For most fossil species, morphology is the only source of data for phylogenetic inference; this has traditionally been analysed using parsimony, the predominance of which is currently challenged by the development of probabilistic models that achieve greater phylogenetic accuracy. Here, based on simulated and empirical datasets, we explore the relative efficacy of competing phylogenetic methods in terms of clade support. We characterize clade support using bootstrapping for parsimony and Maximum Likelihood, and intrinsic Bayesian posterior probabilities, collapsing branches that exhibit less than 50% support. Ignoring node support, Bayesian inference is the most accurate method in estimating the tree used to simulate the data. After assessing clade support, Bayesian and Maximum Likelihood exhibit comparable levels of accuracy, and parsimony remains the least accurate method. However, Maximum Likelihood is less precise than Bayesian phylogeny estimation, and Bayesian inference recaptures more correct nodes with higher support compared to all other methods, including Maximum Likelihood. We assess the effects of these findings on empirical phylogenies. Our results indicate probabilistic methods should be favoured over parsimony.  相似文献   

15.
In recent years, a number of phylogenetic methods have been developed for estimating molecular rates and divergence dates under models that relax the molecular clock constraint by allowing rate change throughout the tree. These methods are being used with increasing frequency, but there have been few studies into their accuracy. We tested the accuracy of several relaxed-clock methods (penalized likelihood and Bayesian inference using various models of rate change) using nucleotide sequences simulated on a nine-taxon tree. When the sequences evolved with a constant rate, the methods were able to infer rates accurately, but estimates were more precise when a molecular clock was assumed. When the sequences evolved under a model of auto-correlated rate change, rates were accurately estimated using penalized likelihood and by Bayesian inference using lognormal and exponential models of rate change, while other models did not perform as well. When the sequences evolved under a model of uncorrelated rate change, only Bayesian inference using an exponential rate model performed well. Collectively, the results provide a strong recommendation for using the exponential model of rate change if a conservative approach to divergence time estimation is required. A case study is presented in which we use a simulation-based approach to examine the hypothesis of elevated rates in the Cambrian period, and it is found that these high rate estimates might be an artifact of the rate estimation method. If this bias is present, then the ages of metazoan divergences would be systematically underestimated. The results of this study have implications for studies of molecular rates and divergence dates.  相似文献   

16.
Distance-based methods are popular for reconstructing evolutionary trees of protein sequences, mainly because of their speed and generality. A number of variants of the classical neighbor-joining (NJ) algorithm have been proposed, as well as a number of methods to estimate protein distances. We here present a large-scale assessment of performance in reconstructing the correct tree topology for the most popular algorithms. The programs BIONJ, FastME, Weighbor, and standard NJ were run using 12 distance estimators, producing 48 tree-building/distance estimation method combinations. These were evaluated on a test set based on real trees taken from 100 Pfam families. Each tree was used to generate multiple sequence alignments with the ROSE program using three evolutionary models. The accuracy of each method was analyzed as a function of both sequence divergence and location in the tree. We found that BIONJ produced the overall best results, although the average accuracy differed little between the tree-building methods (normally less than 1%). A noticeable trend was that FastME performed poorer than the rest on long branches. Weighbor was several orders of magnitude slower than the other programs. Larger differences were observed when using different distance estimators. Protein-adapted Jukes-Cantor and Kimura distance correction produced clearly poorer results than the other methods, even worse than uncorrected distances. We also assessed the recently developed Scoredist measure, which performed equally well as more complex methods.  相似文献   

17.
Unbiased estimation of evolutionary distance between nucleotide sequences   总被引:7,自引:2,他引:5  
A new algorithm for estimating the number of nucleotide substitutions per site (i.e., the evolutionary distance) between two nucleotide sequences is presented. This algorithm can be applied to many estimation methods, such as Jukes and Cantor's method, Kimura's transition/transversion method, and Tajima and Nei's method. Unlike ordinary methods, this algorithm is always applicable. Numerical computations and computer simulations indicate that this algorithm gives an almost unbiased estimate of the evolutionary distance, unless the evolutionary distance is very large. This algorithm should be useful especially when we analyze short nucleotide sequences. It can also be applied to amino acid sequences, for estimating the number of amino acid replacements.   相似文献   

18.
Distance based reconstruction methods of phylogenetic trees consist of two independent parts: first, inter-species distances are inferred assuming some stochastic model of sequence evolution; then the inferred distances are used to construct a tree. In this paper we concentrate on the task of inter-species distance estimation. Specifically, we characterize the family of valid distance functions for the assumed substitution model and show that deliberate selection of distance function significantly improves the accuracy of distance estimates and, consequently, also improves the accuracy of the reconstructed tree.Our contribution consists of three parts: first, we present a general framework for constructing families of additive distance functions for stochastic evolutionary models. Then, we present a method for selecting (near) optimal distance functions, and we conclude by presenting simulation results which support our theoretical analysis.  相似文献   

19.
Controversies over the molecular clock hypothesis were reviewed. Since it is evident that the molecular clock does not hold in an exact sense, accounting for evolution of the rate of molecular evolution is a prerequisite when estimating divergence times with molecular sequences. Recently proposed statistical methods that account for this rate variation are overviewed and one of these procedures is applied to the mitochondrial protein sequences and to the nuclear gene sequences from many mammalian species in order to estimate the time scale of eutherian evolution. This Bayesian method not only takes account of the variation of molecular evolutionary rate among lineages and among genes, but it also incorporates fossil evidence via constraints on node times. With denser taxonomic sampling and a more realistic model of molecular evolution, this Bayesian approach is expected to increase the accuracy of divergence time estimates.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号