共查询到20条相似文献,搜索用时 31 毫秒
1.
Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets 总被引:5,自引:0,他引:5
We introduce a new approach to estimate the evolutionary distance between two sequences. This approach uses a tree with three leaves: two of them correspond to the studied sequences, whereas the third is chosen to handle long-distance estimation. The branch lengths of this tree are obtained by likelihood maximization and are then used to deduce the desired distance. This approach, called TripleML, improves the precision of evolutionary distance estimates, and thus the topological accuracy of distance-based methods. TripleML can be used with neighbor-joining-like (NJ-like) methods not only to compute the initial distance matrix but also to estimate new distances encountered during the agglomeration process. Computer simulations indicate that using TripleML significantly improves the topological accuracy of NJ, BioNJ, and Weighbor, while conserving a reasonable computation time. With randomly generated 24-taxon trees and realistic parameter values, combining NJ with TripleML reduces the number of wrongly inferred branches by about 11% (against 2.6% and 5.5% for BioNJ and Weighbor, respectively). Moreover, this combination requires only about 1.5 min to infer a phylogeny of 96 sequences composed of 1,200 nucleotides, as compared with 6.5 h for FastDNAml on the same machine (PC 466 MHz). 相似文献
2.
Background
Phylogenetic methods which do not rely on multiple sequence alignments are important tools in inferring trees directly from completely sequenced genomes. Here, we extend the recently described Genome BLAST Distance Phylogeny (GBDP) strategy to compute phylogenetic trees from all completely sequenced plastid genomes currently available and from a selection of mitochondrial genomes representing the major eukaryotic lineages. BLASTN, TBLASTX, or combinations of both are used to locate high-scoring segment pairs (HSPs) between two sequences from which pairwise similarities and distances are computed in different ways resulting in a total of 96 GBDP variants. The suitability of these distance formulae for phylogeny reconstruction is directly estimated by computing a recently described measure of "treelikeness", the so-called δ value, from the respective distance matrices. Additionally, we compare the trees inferred from these matrices using UPGMA, NJ, BIONJ, FastME, or STC, respectively, with the NCBI taxonomy tree of the taxa under study. 相似文献3.
Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny 总被引:24,自引:6,他引:18
The relative efficiencies of different protein-coding genes of the
mitochondrial genome and different tree-building methods in recovering a
known vertebrate phylogeny (two whale species, cow, rat, mouse, opossum,
chicken, frog, and three bony fish species) was evaluated. The
tree-building methods examined were the neighbor joining (NJ), minimum
evolution (ME), maximum parsimony (MP), and maximum likelihood (ML), and
both nucleotide sequences and deduced amino acid sequences were analyzed.
Generally speaking, amino acid sequences were better than nucleotide
sequences in obtaining the true tree (topology) or trees close to the true
tree. However, when only first and second codon positions data were used,
nucleotide sequences produced reasonably good trees. Among the 13 genes
examined, Nd5 produced the true tree in all tree-building methods or
algorithms for both amino acid and nucleotide sequence data. Genes Cytb and
Nd4 also produced the correct tree in most tree-building algorithms when
amino acid sequence data were used. By contrast, Co2, Nd1, and Nd41 showed
a poor performance. In general, large genes produced better results, and
when the entire set of genes was used, all tree-building methods generated
the true tree. In each tree-building method, several distance measures or
algorithms were used, but all these distance measures or algorithms
produced essentially the same results. The ME method, in which many
different topologies are examined, was no better than the NJ method, which
generates a single final tree. Similarly, an ML method, in which many
topologies are examined, was no better than the ML star decomposition
algorithm that generates a single final tree. In ML the best substitution
model chosen by using the Akaike information criterion produced no better
results than simpler substitution models. These results question the
utility of the currently used optimization principles in phylogenetic
construction. Relatively simple methods such as the NJ and ML star
decomposition algorithms seem to produce as good results as those obtained
by more sophisticated methods. The efficiencies of the NJ, ME, MP, and ML
methods in obtaining the correct tree were nearly the same when amino acid
sequence data were used. The most important factor in constructing reliable
phylogenetic trees seems to be the number of amino acids or nucleotides
used.
相似文献
4.
Fabio Pardi Sylvain Guillemot Olivier Gascuel 《Bulletin of mathematical biology》2010,72(7):1820-1839
Minimum evolution is the guiding principle of an important class of distance-based phylogeny reconstruction methods, including
neighbor-joining (NJ), which is the most cited tree inference algorithm to date. The minimum evolution principle involves
searching for the tree with minimum length, where the length is estimated using various least-squares criteria. Since evolutionary
distances cannot be known precisely but only estimated, it is important to investigate the robustness of phylogenetic reconstruction
to imprecise estimates for these distances. The safety radius is a measure of this robustness: it consists of the maximum relative deviation that the input distances can have from the
correct distances, without compromising the reconstruction of the correct tree structure. Answering some open questions, we
here derive the safety radius of two popular minimum evolution criteria: balanced minimum evolution (BME) and minimum evolution
based on ordinary least squares (OLS + ME). Whereas BME has a radius of
\frac12\frac{1}{2}, which is the best achievable, OLS + ME has a radius tending to 0 as the number of taxa increases. This difference may explain
the gap in reconstruction accuracy observed in practice between OLS + ME and BME (which forms the basis of popular programs
such as NJ and FastME). 相似文献
5.
Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction 总被引:22,自引:0,他引:22
We introduce a distance-based phylogeny reconstruction method called "weighted neighbor joining," or "Weighbor" for short. As in neighbor joining, two taxa are joined in each iteration; however, the Weighbor criterion for choosing a pair of taxa to join takes into account that errors in distance estimates are exponentially larger for longer distances. The criterion embodies a likelihood function on the distances, which are modeled as correlated Gaussian random variables with different means and variances, computed under a probabilistic model for sequence evolution. The Weighbor criterion consists of two terms, an additivity term and a positivity term, that quantify the implications of joining the pair. The first term evaluates deviations from additivity of the implied external branches, while the second term evaluates confidence that the implied internal branch has a positive branch length. Compared with maximum-likelihood phylogeny reconstruction, Weighbor is much faster, while building trees that are qualitatively and quantitatively similar. Weighbor appears to be relatively immune to the "long branches attract" and "long branch distracts" drawbacks observed with neighbor joining, BIONJ, and parsimony. 相似文献
6.
拓扑树间的通经拓扑距离 总被引:1,自引:1,他引:0
给出了一种新的系统树间的拓扑距离,使用NJ,MP,UPGMA等3种方法对13种动物的线粒体中14个基因(含组合的)DNA序列数据进行系统树的构建,利用分割拓扑距离和本文给出的通经拓扑距离对这14种系统树这间及其与真树进行比较。结果显示,NJ法对获得已知树的有效率最高,MP法次之,UPGMA法最低。这14种DNA序列所构建的系统树与已知树的拓扑距离基本上是随其DNA序列长度增加而减小,但两者的相关系数并未达到显著水平,分割拓扑距离在总体上可反映树间的拓扑结构差异,但其测度精确度比通经拓扑距离要低。 相似文献
7.
Efficient determination of evolutionary distances is important for the correct reconstruction of phylogenetic trees. The performance of the pooled distance required for reconstructing a phylogenetic tree can be improved by applying large weights to appropriate distances for reconstructing phylogenetic trees and small weights to inappropriate distances. We developed two weighting methods, the modified Tajima–Takezaki method and the modified least-squares method, for reconstructing phylogenetic trees from multiple loci. By computer simulations, we found that both of the new methods were more efficient in reconstructing correct topologies than the no-weight method. Hence, we reconstructed hominoid phylogenetic trees from mitochondrial DNA using our new methods, and found that the levels of bootstrap support were significantly increased by the modified Tajima–Takezaki and by the modified least-squares method. 相似文献
8.
The neighbor-joining (NJ) method is widely used in reconstructing large phylogenies because of its computational speed and
the high accuracy in phylogenetic inference as revealed in computer simulation studies. However, most computer simulation
studies have quantified the overall performance of the NJ method in terms of the percentage of branches inferred correctly
or the percentage of replications in which the correct tree is recovered. We have examined other aspects of its performance,
such as the relative efficiency in correctly reconstructing shallow (close to the external branches of the tree) and deep
branches in large phylogenies; the contribution of zero-length branches to topological errors in the inferred trees; and the
influence of increasing the tree size (number of sequences), evolutionary rate, and sequence length on the efficiency of the
NJ method. Results show that the correct reconstruction of deep branches is no more difficult than that of shallower branches.
The presence of zero-length branches in realized trees contributes significantly to the overall error observed in the NJ tree,
especially in large phylogenies or slowly evolving genes. Furthermore, the tree size does not influence the efficiency of
NJ in reconstructing shallow and deep branches in our simulation study, in which the evolutionary process is assumed to be
homogeneous in all lineages.
Received: 7 March 2000 / Accepted: 2 August 2000 相似文献
9.
10.
Maximum parsimony (MP) methods aim to reconstruct the phylogeny of extant species by finding the most parsimonious evolutionary scenario using the species' genome data. MP methods are considered to be accurate, but they are also computationally expensive especially for a large number of species. Several disk-covering methods (DCMs), which decompose the input species to multiple overlapping subgroups (or disks), have been proposed to solve the problem in a divide-and-conquer way. We design a new DCM based on the spectral method and also develop the COGNAC (Comparing Orders of Genes using Novel Algorithms and high-performance Computers) software package. COGNAC uses the new DCM to reduce the phylogenetic tree search space and selects an output tree from the reduced search space based on the MP principle. We test the new DCM using gene order data and inversion distance. The new DCM not only reduces the number of candidate tree topologies but also excludes erroneous tree topologies which can be selected by original MP methods. Initial labeling of internal genomes affects the accuracy of MP methods using gene order data, and the new DCM enables more accurate initial labeling as well. COGNAC demonstrates superior accuracy as a consequence. We compare COGNAC with FastME and the combination of the state of the art DCM (Rec-I-DCM3) and GRAPPA. COGNAC clearly outperforms FastME in accuracy. COGNAC--using the new DCM--also reconstructs a much more accurate tree in significantly shorter time than GRAPPA with Rec-I-DCM3. 相似文献
11.
Background
Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, which often involves some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is here estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used. 相似文献12.
Among the criteria to evaluate the performance of a phylogenetic method, robustness to model violation is of particular practical importance as complete a priori knowledge of evolutionary processes is typically unavailable. For studies of robustness in phylogenetic inference, a utility to add well-defined model violations to the simulated data would be helpful. We therefore introduce ImOSM, a tool to imbed intermittent evolution as model violation into an alignment. Intermittent evolution refers to extra substitutions occurring randomly on branches of a tree, thus changing alignment site patterns. This means that the extra substitutions are placed on the tree after the typical process of sequence evolution is completed. We then study the robustness of widely used phylogenetic methods: maximum likelihood (ML), maximum parsimony (MP), and a distance-based method (BIONJ) to various scenarios of model violation. Violation of rates across sites (RaS) heterogeneity and simultaneous violation of RaS and the transition/transversion ratio on two nonadjacent external branches hinder all the methods recovery of the true topology for a four-taxon tree. For an eight-taxon balanced tree, the violations cause each of the three methods to infer a different topology. Both ML and MP fail, whereas BIONJ, which calculates the distances based on the ML estimated parameters, reconstructs the true tree. Finally, we report that a test of model homogeneity and goodness of fit tests have enough power to detect such model violations. The outcome of the tests can help to actually gain confidence in the inferred trees. Therefore, we recommend using these tests in practical phylogenetic analyses. 相似文献
13.
Bordewich Magnus Gascuel Olivier Huber Katharina T. Moulton Vincent 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2009,6(1):110-117
Many phylogenetic algorithms search the space of possible trees using topological rearrangements and some optimality criterion. FastME is such an approach that uses the {em balanced minimum evolution (BME)} principle, which computer studies have demonstrated to have high accuracy. FastME includes two variants: {em balanced subtree prune and regraft (BSPR)} and {em balanced nearest neighbor interchange (BNNI)}. These algorithms take as input a distance matrix and a putative phylogenetic tree. The tree is modified using SPR or NNI operations, respectively, to reduce the BME length relative to the distance matrix, until a tree with (locally) shortest BME length is found. Following computer simulations, it has been conjectured that BSPR and BNNI are consistent, i.e. for an input distance that is a tree-metric, they converge to the corresponding tree. We prove that the BSPR algorithm is consistent. Moreover, even if the input contains small errors relative to a tree-metric, we show that the BSPR algorithm still returns the corresponding tree. Whether BNNI is consistent remains open. 相似文献
14.
While the maximum-likelihood (ML) method of tree reconstruction is statistically rigorous, it is extremely time-consuming for reconstructing large trees. We previously developed a hybrid method (NJML) that combines the neighbor-joining (NJ) and ML methods and thus is much faster than the ML method and improves the performance of NJ. However, we considered only nucleotide sequence data, so NJML is not suitable for handling amino acid sequence data, which requires even more computer time. NJML+ is an implementation of a further improved method for practical data analyses (including protein sequence data). Our extensive simulations using nucleotide and amino acid sequences showed that NJML+ gave good results in tree reconstruction. Indeed, NJML+ showed substantial improvements over existing methods in terms of both computational times and efficiencies, especially for amino acid sequence data. We also developed a "user-friendly" interface for the NJML+ program, including a simple tree viewer. 相似文献
15.
Background
Some distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix, containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast, e.g., the famous and popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n 3). Unfortunately, the fastest practical algorithms known for Computing the distance matrix, from n sequences of length l, takes time proportional to l·n 2. Since the sequence length typically is much larger than the number of taxa, the distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in reconstruction of large phylogenies or in applications where many trees have to be reconstructed, e.g., bootstrapping and genome wide applications. 相似文献16.
In the field of phylogenetics and comparative genomics, it is important to establish orthologous relationships when comparing homologous sequences. Due to the slight sequence dissimilarity between orthologs and paralogs, it is prone to regarding paralogs as orthologs. For this reason, several methods based on evolutionary distance, phylogeny and BLAST have tried to detect orthologs with more precision. Depending on their algorithmic implementations, each of these methods sometimes has increased false negative or false positive rates. Here, we developed a novel algorithm for orthology detection that uses a distance method based on the phylogenetic criterion of minimum evolution. Our algorithm assumes that sets of sequences exhibiting orthologous relationships are evolutionarily less costly than sets that include one or more paralogous relationships. Calculation of evolutionary cost requires the reconstruction of a neighbor-joining (NJ) tree, but calculations are unaffected by the topology of any given NJ tree. Unlike tree reconciliation, our algorithm appears free from the problem of incorrect topologies of species and gene trees. The reliability of the algorithm was tested in a comparative analysis with two other orthology detection methods using 95 manually curated KOG datasets and 21 experimentally verified EXProt datasets. Sensitivity and specificity estimates indicate that the concept of minimum evolution could be valuable for the detection of orthologs. 相似文献
17.
The most commonly used measure of evolutionary distance in molecular
phylogenetics is the number of nucleotide substitutions per site. However,
this number is not necessarily most efficient for reconstructing a
phylogenetic tree. In order to evaluate the accuracy of evolutionary
distance, D(t), for obtaining the correct tree topology, an accuracy index,
A(t), was proposed. This index is defined as D'(t)/square root of[D(t)],
where D'(t) is the first derivative of D(t) with respect to evolutionary
time and V[D(t)] is the sampling variance of evolutionary distance. Using
A(t), namely, finding the condition under which A(t) gives the maximum
value, we can obtain an evolutionary distance which is efficient for
obtaining the correct topology. Under the assumption that the
transversional changes do not occur as frequently as the transitional
changes, we obtained the evolutionary distances which are expected to give
the correct topology more often than are the other distances.
相似文献
18.
Distance based reconstruction methods of phylogenetic trees consist of two independent parts: first, inter-species distances are inferred assuming some stochastic model of sequence evolution; then the inferred distances are used to construct a tree. In this paper we concentrate on the task of inter-species distance estimation. Specifically, we characterize the family of valid distance functions for the assumed substitution model and show that deliberate selection of distance function significantly improves the accuracy of distance estimates and, consequently, also improves the accuracy of the reconstructed tree.Our contribution consists of three parts: first, we present a general framework for constructing families of additive distance functions for stochastic evolutionary models. Then, we present a method for selecting (near) optimal distance functions, and we conclude by presenting simulation results which support our theoretical analysis. 相似文献
19.
ABSTRACT: BACKGROUND: Distance-based phylogenetic reconstruction methods use evolutionary distances between species in order to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method. RESULTS: This paper continues the line of research which attempts to adjust to each given set of input sequences a distance function which maximizes the expected topological accuracy of the reconstructed tree. We focus here on the effect of systematic error caused by assuming an inadequate model, but consider also the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error based on the notion of deviation from additivity, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the Jukes-Cantor distance function when applied to data generated according to Kimura's two-parameter model with a transition-transversion bias. We provide both a theoretical derivation for this case, and a detailed simulation study on quartet trees. CONCLUSIONS: We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enables statistically inconsistent reconstruction methods to outperform consistent methods. 相似文献
20.
Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa 总被引:13,自引:5,他引:8
The reconstruction of phylogenetic history is predicated on being able to
accurately establish hypotheses of character homology, which involves
sequence alignment for studies based on molecular sequence data. In an
empirical study investigating nucleotide sequence alignment, we inferred
phylogenetic trees for 43 species of the Apicomplexa and 3 of Dinozoa based
on complete small-subunit rDNA sequences, using six different
multiple-alignment procedures: manual alignment based on the secondary
structure of the 18S rRNA molecule, and automated similarity-based
alignment algorithms using the PileUp, ClustalW, TreeAlign, MALIGN, and SAM
computer programs. Trees were constructed using neighboring-joining,
weighted-parsimony, and maximum- likelihood methods. All of the multiple
sequence alignment procedures yielded the same basic structure for the
estimate of the phylogenetic relationship among the taxa, which presumably
represents the underlying phylogenetic signal. However, the placement of
many of the taxa was sensitive to the alignment procedure used; and the
different alignments produced trees that were on average more dissimilar
from each other than did the different tree-building methods used. The
multiple alignments from the different procedures varied greatly in length,
but aligned sequence length was not a good predictor of the similarity of
the resulting phylogenetic trees. We also systematically varied the gap
weights (the relative cost of inserting a new gap into a sequence or
extending an already-existing gap) for the ClustalW program, and this
produced alignments that were at least as different from each other as
those produced by the different alignment algorithms. Furthermore, there
was no combination of gap weights that produced the same tree as that from
the structure alignment, in spite of the fact that many of the alignments
were similar in length to the structure alignment. We also investigated the
phylogenetic information content of the helical and nonhelical regions of
the rDNA, and conclude that the helical regions are the most informative.
We therefore conclude that many of the literature disagreements concerning
the phylogeny of the Apicomplexa are probably based on differences in
sequence alignment strategies rather than differences in data or
tree-building methods.
相似文献