首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
A challenging task in computational biology is the reconstruction of genomic sequences of extinct ancestors, given the phylogenetic tree and the sequences at the leafs. This task is best solved by calculating the most likely estimate of the ancestral sequences, along with the most likely edge lengths. We deal with this problem and also the variant in which the phylogenetic tree in addition to the ancestral sequences need to be estimated. The latter problem is known to be NP-hard, while the computational complexity of the former is unknown. Currently, all algorithms for solving these problems are heuristics without performance guarantees. The biological importance of these problems calls for developing better algorithms with guarantees of finding either optimal or approximate solutions.We develop approximation, fix parameter tractable (FPT), and fast heuristic algorithms for two variants of the problem; when the phylogenetic tree is known and when it is unknown. The approximation algorithm guarantees a solution with a log-likelihood ratio of 2 relative to the optimal solution. The FPT has a running time which is polynomial in the length of the sequences and exponential in the number of taxa. This makes it useful for calculating the optimal solution for small trees. Moreover, we combine the approximation algorithm and the FPT into an algorithm with arbitrary good approximation guarantee (PTAS). We tested our algorithms on both synthetic and biological data. In particular, we used the FPT for computing the most likely ancestral mitochondrial genomes of hominidae (the great apes), thereby answering an interesting biological question. Moreover, we show how the approximation algorithms find good solutions for reconstructing the ancestral genomes for a set of lentiviruses (relatives of HIV). Supplementary material of this work is available at www.nada.kth.se/~isaac/publications/aml/aml.html.  相似文献   

2.
Phylogeny reconstruction is the process of inferring evolutionary relationships from molecular sequences, and methods that are expected to accurately reconstruct trees from sequences of reasonable length are highly desirable. To formalize this concept, the property of fast-convergence has been introduced to describe phylogeny reconstruction methods that, with high probability, recover the true tree from sequences that grow polynomially in the number of taxa n. While provably fast-converging methods have been developed, the neighbor-joining (NJ) algorithm of Saitou and Nei remains one of the most popular methods used in practice. This algorithm is known to converge for sequences that are exponential in n, but no lower bound for its convergence rate has been established. To address this theoretical question, we analyze the performance of the NJ algorithm on a type of phylogeny known as a 'caterpillar tree'. We find that, for sequences of polynomial length in the number of taxa n, the variability of the NJ criterion is sufficiently high that the algorithm is likely to fail even in the first step of the phylogeny reconstruction process, regardless of the degree of polynomial considered. This result demonstrates that, for general n-taxa trees, the exponential bound cannot be improved.  相似文献   

3.
Among the fundamental problems in molecular evolution and in the analysis of homologous sequences are alignment, phylogeny reconstruction, and the reconstruction of ancestral sequences. This paper presents a fast, combined solution to these problems. The new algorithm gives an approximation to the minimal history in terms of a distance function on sequences. The distance function on sequences is a minimal weighted path length constructed from substitutions and insertions-deletions of segments of any length. Substitutions are weighted with an arbitrary metric on the set of nucleotides or amino acids, and indels are weighted with a gap penalty function of the form gk = a + (bxk), where k is the length of the indel and a and b are two positive numbers. A novel feature is the introduction of the concept of sequence graphs and a generalization of the traditional dynamic sequence comparison algorithm to the comparison of sequence graphs. Sequence graphs ease several computational problems. They are used to represent large sets of sequences that can then be compared simultaneously. Furthermore, they allow the handling of multiple, equally good, alignments, where previous methods were forced to make arbitrary choices. A program written in C implemented this method; it was tested first on 22 5S RNA sequences.   相似文献   

4.
MOTIVATION: We developed an algorithm to reconstruct ancestral sequences, taking into account the rate variation among sites of the protein sequences. Our algorithm maximizes the joint probability of the ancestral sequences, assuming that the rate is gamma distributed among sites. Our algorithm probably finds the global maximum. The use of 'joint' reconstruction is motivated by studies that use the sequences at all the internal nodes in a phylogenetic tree, such as, for instance, the inference of patterns of amino-acid replacement, or tracing the biochemical changes that occurred during the evolution of a given protein family. RESULTS: We give an algorithm that guarantees finding the global maximum. The efficient search method makes our method applicable to datasets with large number sequences. We analyze ancestral sequences of five gene families, exploring the effect of the amount of among-site-rate-variation, and the degree of sequence divergence on the resulting ancestral states. AVAILABILITY AND SUPPLEMENTARY INFORMATION: http://evolu3.ism.ac.jp/~tal/ Contact: tal@ism.ac.jp  相似文献   

5.
In this paper we investigate mathematical questions concerning the reliability (reconstruction accuracy) of Fitch's maximum parsimony algorithm for reconstructing the ancestral state given a phylogenetic tree and a character. In particular, we consider the question whether the maximum parsimony method applied to a subset of taxa can reconstruct the ancestral state of the root more accurately than when applied to all taxa, and we give an example showing that this indeed is possible. A surprising feature of our example is that ignoring a taxon closer to the root improves the reliability of the method. On the other hand, in the case of the two-state symmetric substitution model, we answer affirmatively a conjecture of Li, Steel and Zhang which states that under a molecular clock the probability that the state at a single taxon is a correct guess of the ancestral state is a lower bound on the reconstruction accuracy of Fitch's method applied to all taxa.  相似文献   

6.
GeneTRACE-reconstruction of gene content of ancestral species   总被引:4,自引:0,他引:4  
While current computational methods allow the reconstruction of individual ancestral protein sequences, reconstruction of complete gene content of ancestral species is not yet an established task. In this paper, we describe GENETRACE, an efficient linear-time algorithm that allows the reconstruction of evolutionary history of individual protein families as well as the complete gene content of ancestral species. The performance of the method was validated with a simulated evolution program called SimulEv. Our results indicate that given a set of correct phylogenetic profiles and a correct species tree, ancestral gene content can be reconstructed with sensitivity and selectivity of more than 90%. SimulEv simulations were also used to evaluate performance of the reconstruction of gene content-based phylogenetic trees, suggesting that these trees may be accurate at the terminal branches but suffer from long branch attraction near the root of the tree.  相似文献   

7.
A set of experiments based on simulation and analysis found that using the parsimony algorithm for ancestral state estimation can benefit from increased sampling of terminal taxa. Estimation at the base of small clades showed strong sensitivity to tree topology and number of descendent tips. These effects were largely driven by the creation and negation of ambiguity across a topology. Root state and internal state estimation showed similar behavior. We conclude that increased taxon sampling density is generally advisable, and attention to topological effects may be advisable in evaluating the confidence placed in state estimation. We also explore the factors affecting ancestral state estimation and conjecture that as taxa are added to a tree, the total amount of information for root state estimation depends on the tree topology and distance to root state of added taxa. For a pure-birth model tree, we conjecture that the addition of N taxa increases root state information in proportion to log(N).  相似文献   

8.
Accurate reconstruction of ancestral character states on a phylogeny is crucial in many genomics studies. We study how to select species to achieve the best reconstruction of ancestral character states on a phylogeny. We first show that the marginal maximum likelihood has the monotonicity property that more taxa give better reconstruction, but the Fitch method does not have it even on an ultrametric phylogeny. We further validate a greedy approach for species selection using simulation. The validation tests indicate that backward greedy selection outperforms forward greedy selection. In addition, by applying our selection strategy, we obtain a set of the ten most informative species for the reconstruction of the genomic sequence of the so-called boreoeutherian ancestor of placental mammals. This study has broad relevance in comparative genomics and paleogenomics since limited research resources do not allow researchers to sequence the large number of descendant species required to reconstruct an ancestral sequence.  相似文献   

9.
Accurate phylogenetic reconstruction methods are inherently computationally heavy and therefore are limited to relatively small numbers of taxa. Supertree construction is the task of amalgamating small trees over partial sets into a big tree over the complete taxa set. The need for fast and accurate supertree methods has become crucial due to the enormous number of new genomic sequences generated by modern technology and the desire to use them for classification purposes. In particular, the Assembling the Tree of Life (ATOL) program aims at constructing the evolutionary history of all living organisms on Earth. When dealing with unrooted trees, a quartet - an unrooted tree over four taxa - is the most basic piece of phylogenetic information. Therefore, quartet amalgamation stands at the heart of any supertree problem as it concerns combining many minimal pieces of information into a single, coherent, and more comprehensive piece of information.We have devised an extremely fast algorithm for quartet amalgamation and implemented it in a very efficient code. The new code can handle over a hundred millions of quartet trees over several hundreds of taxa with very high accuracy.  相似文献   

10.
Ultrastructural studies of tetrasporalean green algae have suggested the order is polyphyletic. These features, including the absolute orientation of the flagellar apparatus and the bi- versus quadriflagellated motile cell morphology, suggest that Chaetopeltis as well as a number of others, may be ancestral to a group that includes Tetraspora. In this study, we examine the phylogenetic relationships of selected tetrasporalean taxa based on analysis of 18S ribosomal RNA gene sequences. Results show that the tetrasporalean taxa are polyphyletic. Biflagellated genera group with biflagellated volvocalean taxa, whereas the quadriflagellated species compose a distinct monophyletic clade not closely related to the biflagellated taxa. In addition, tetrasporalean taxa group with other chlorophycean algal species with similar flagellar apparatus absolute orientation, but the quadriflagellated Tetrasporales do not appear to be ancestral to the entire Chlorophyceae. These results are concordant with previous conclusions drawn from ultrastructural data and further confirm the utility of (small-subunit) ribosomal RNA gene sequences to discern green algal evolutionary relationships.  相似文献   

11.
MOTIVATION: The diversity of a haplotype, represented as a string of polymorphic sites along a DNA sequence, increases exponentially with the number of sites if recombinations are taking place. Reconstructing the history of recombinations compared with that of the polymorphic sites is thus extremely difficult. However, in the human genome, because of the relatively simple pattern of haplotype diversity dominated by a few ancestral haplotypes, the complexity of the recombinational network can be reduced, thus making its reconstruction feasible. We focus on the problem of inferring the recombination pathways starting with putative ancestral haplotypes and leading to new rare recombinant haplotypes. RESULTS: We describe classes of recombinations that represent the whole set of minimal recombination pathways leading to a new haplotype. We present an O(n(2)) algorithm that outputs such representative recombination pathways. We apply it to haplotypes of the 8 kb dystrophin gene segment dys44. AVAILABILITY: A software implementing the algorithm and some other extentions has been developed on a Java platform (JDK 1.3.1). It is freely available at http://www.iro.umontreal.ca/~mabrouk/  相似文献   

12.
We examined the effect of increasing the number of sampled amplified fragment length polymorphism (AFLP) bands to reconstruct an accurate and well-supported AFLP-based phylogeny. In silico AFLP was performed using simulated DNA sequences evolving along balanced and unbalanced model trees with recent, uniform and ancient radiations and average branch lengths (from the most internal node to the tip) ranging from 0.02 to 0.05 substitutions per site. Trees were estimated by minimum evolution (ME) and maximum parsimony (MP) methods from both DNA sequences and virtual AFLP fingerprints. The comparison of the true tree with the estimated AFLP trees suggests that moderate numbers of AFLP bands are necessary to recover the correct topology with high bootstrap support values (i.e. >70%). Fewer numbers of bands are necessary for shorter tree lengths and for balanced than for unbalanced tree topologies. However, branch length estimation was rather unreliable and did not improve substantially after a certain number of bands were sampled. These results hold for different levels of genome coverage and number of taxa analysed. In silico AFLP using bacterial genomic DNA sequences recovered a well-supported tree topology that mirrored an empirical phylogeny based on a set of 31 orthologous gene sequences when as few as 263 AFLP bands were scored. These results suggest that AFLPs may be an efficient alternative to traditional DNA sequencing for accurate topology reconstruction of shallow trees when not very short ancestral branches exist.  相似文献   

13.
The evolution of the realized climatic niche in the genus Arabidopsis was studied using an almost complete phylogenetic tree based on DNA sequences of the ribosomal internal transcribed spacers. The realized climatic niche (climate space) was determined by the intersections of the distribution ranges of the taxa with climate data and is presented in temperature/precipitation diagrams. A positive correlation exists between the climate spaces of the taxa and their range sizes. The diagrams revealed a core climate; that is, a climate space in which all taxa co-exist. This core climate is almost identical to the most parsimonious reconstruction of the genus' ancestral climate space and may be considered an ancestral state of these characters. Mapping the evolutionary changes occurring in the realized climatic space on the phylogenetic tree from the core climate proved to be the most parsimonious procedure. The character complex is homoplastic; that is, many parallel evolutionary events have occurred in the subclades. With the exception of A. thaliana, which is sister to the other species of the genus and occupies a very large climate space, the late-diverged taxa of the other subclades experienced great evolutionary changes whereas the realized climate space of the taxa that diverged earlier resembles the core climate. The latter also show some parallel contractions in the climate space. It is hypothesized that the diversification of Arabidopsis may have started from small to midsized ranges in a temperate climate.  相似文献   

14.
We used new 18S and 28S rRNA sequences analysed with parsimony, maximum likelihood and Bayesian methods of phylogenetic reconstruction to show that Nemertodermatida, generally classified as the sister group of Acoela within the recently proposed Phylum Acoelomorpha, are a separate basal bilaterian lineage. We used several analytical approaches to control for possible long branch attraction (LBA) artefacts in our results. Parsimony and the model based phylogenetic reconstruction methods that incorporate 'corrections' for substitution rate heterogenities yielded concordant results. When putative long branch taxa were experimentally removed the resulting topologies were consistent with our total evidence analysis. Deletion of fast-evolving nucleotide sites decreased resolution and clade support, but did not support a topology conflicting with the total evidence analysis. Establishment of Acoela and Nemertodermatida as two early lineages facilitates reconstruction of ancestral bilaterian features. The ancestor of extant Bilateria was a small, benthic direct developer without coelom or a planktonic larval stage. The previously proposed Phylum Acoelomorpha is dismissed as paraphyletic.  相似文献   

15.
Exact and heuristic algorithms for the Indel Maximum Likelihood Problem.   总被引:1,自引:0,他引:1  
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing the most likely scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, that we called the Indel Maximum Likelihood Problem (IMLP), is an important step toward the reconstruction of ancestral genomics sequences, and is important for studying evolutionary processes, genome function, adaptation and convergence. We solve the IMLP using a new type of tree hidden Markov model whose states correspond to single-base evolutionary scenarios and where transitions model dependencies between neighboring columns. The standard Viterbi and Forward-backward algorithms are optimized to produce the most likely ancestral reconstruction and to compute the level of confidence associated to specific regions of the reconstruction. A heuristic is presented to make the method practical for large data sets, while retaining an extremely high degree of accuracy. The methods are illustrated on a 1-Mb alignment of the CFTR regions from 12 mammals.  相似文献   

16.
Phylogenetic inference under the pure drift model   总被引:1,自引:1,他引:0  
When pairwise genetic distances are used for phylogenetic reconstruction, it is usually assumed that the genetic distance between two taxa contains information about the time after the two taxa diverged. As a result, upon an appropriate transformation if necessary, the distance usually can be fitted to a linear model such that it is expressed as the sum of lengths of all branches that connect the two taxa in a given phylogeny. This kind of distance is referred to as "additive distance." For a phylogenetic tree exclusively driven by random genetic drift, genetic distances related to coancestry coefficients (theta XY) between any two taxa are more suitable. However, these distances are fundamentally different from the additive distance in that coancestry does not contain any information about the time after two taxa split from a common ancestral population; instead, it reflects the time before the two taxa diverged. In other words, the magnitude of theta XY provides information about how long the two taxa share the same evolutionary pathways. The fundamental difference between the two kinds of distances has led to a different algorithm of evaluating phylogenetic trees when theta XY and related distance measures are used. Here we present the new algorithm using the ordinary- least-squares approach but fitting to a different linear model. This treatment allows genetic variation within a taxon to be included in the model. Monte Carlo simulation for a rooted phylogeny of four taxa has verified the efficacy and consistency of the new method. Application of the method to human population was demonstrated.   相似文献   

17.
Reconstruction of ancestral DNA and amino acid sequences is an important means of inferring information about past evolutionary events. Such reconstructions suggest changes in molecular function and evolutionary processes over the course of evolution and are used to infer adaptation and convergence. Maximum likelihood (ML) is generally thought to provide relatively accurate reconstructed sequences compared to parsimony, but both methods lead to the inference of multiple directional changes in nucleotide frequencies in primate mitochondrial DNA (mtDNA). To better understand this surprising result, as well as to better understand how parsimony and ML differ, we constructed a series of computationally simple "conditional pathway" methods that differed in the number of substitutions allowed per site along each branch, and we also evaluated the entire Bayesian posterior frequency distribution of reconstructed ancestral states. We analyzed primate mitochondrial cytochrome b (Cyt-b) and cytochrome oxidase subunit I (COI) genes and found that ML reconstructs ancestral frequencies that are often more different from tip sequences than are parsimony reconstructions. In contrast, frequency reconstructions based on the posterior ensemble more closely resemble extant nucleotide frequencies. Simulations indicate that these differences in ancestral sequence inference are probably due to deterministic bias caused by high uncertainty in the optimization-based ancestral reconstruction methods (parsimony, ML, Bayesian maximum a posteriori). In contrast, ancestral nucleotide frequencies based on an average of the Bayesian set of credible ancestral sequences are much less biased. The methods involving simpler conditional pathway calculations have slightly reduced likelihood values compared to full likelihood calculations, but they can provide fairly unbiased nucleotide reconstructions and may be useful in more complex phylogenetic analyses than considered here due to their speed and flexibility. To determine whether biased reconstructions using optimization methods might affect inferences of functional properties, ancestral primate mitochondrial tRNA sequences were inferred and helix-forming propensities for conserved pairs were evaluated in silico. For ambiguously reconstructed nucleotides at sites with high base composition variability, ancestral tRNA sequences from Bayesian analyses were more compatible with canonical base pairing than were those inferred by other methods. Thus, nucleotide bias in reconstructed sequences apparently can lead to serious bias and inaccuracies in functional predictions.  相似文献   

18.
A nonhomogeneous, nonstationary stochastic model of DNA sequence evolution allowing varying equilibrium G + C contents among lineages is devised in order to deal with sequences of unequal base compositions. A maximum-likelihood implementation of this model for phylogenetic analyses allows handling of a reasonable number of sequences. The relevance of the model and the accuracy of parameter estimates are theoretically and empirically assessed, using real or simulated data sets. Overall, a significant amount of information about past evolutionary modes can be extracted from DNA sequences, suggesting that process (rates of distinct kinds of nucleotide substitutions) and pattern (the evolutionary tree) can be simultaneously inferred. G + C contents at ancestral nodes are quite accurately estimated. The new method appears to be useful for phylogenetic reconstruction when base composition varies among compared sequences. It may also be suitable for molecular evolution studies.   相似文献   

19.
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing a most parsimonious scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, called the Indel Parsimony Problem, is a crucial component of the problem of ancestral genome reconstruction, and its solution provides valuable information to many genome functional annotation approaches. We first show that the problem is NP-complete. Second, we provide an algorithm, based on the fractional relaxation of an integer linear programming formulation. The algorithm is fast in practice, and the solutions it produces are, in most cases, provably optimal. We describe a divide-and-conquer approach that makes it possible to solve very large instances on a simple desktop machine, while retaining guaranteed optimality. Our algorithms are tested and shown efficient and accurate on a set of 1.8 Mb mammalian orthologous sequences in the CFTR region.  相似文献   

20.
Accurate phylogenetic reconstruction methods are currently limited to a maximum of few dozens of taxa. Supertree methods construct a large tree over a large set of taxa, from a set of small trees over overlapping subsets of the complete taxa set. Hence, in order to construct the tree of life over a million and a half different species, the use of a supertree method over the product of accurate methods, is inevitable. Perhaps the simplest version of this task that is still widely applicable, yet quite challenging, is quartet-based reconstruction. This problem lies at the root of many tree reconstruction methods and theoretical as well as experimental results have been reported. Nevertheless, dealing with false, conflicting quartet trees remains problematic. In this paper, we describe an algorithm for constructing a tree from a set of input quartet trees even with a significant fraction of errors. We show empirically that conflicts in the inputs are handled satisfactorily and that it significantly outperforms and outraces the Matrix Representation with Parsimony (MRP) methods that have previously been most successful in dealing with supertrees. Our algorithm is based on a divide and conquer algorithm where our divide step uses a semidefinite programming (SDP) formulation of MaxCut. We remark that this builds on previous work of ours for piecing together trees from rooted triplet trees. The recursion for unrooted quartets, however, is more complicated in that even with completely consistent set of quartet trees the problem is NP-hard, as opposed to the problem for triples where there is a linear time algorithm. This complexity leads to several issues and some solutions of possible independent interest.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号