Bayesian coestimation of phylogeny and sequence alignment

Gerton Lunter; István Miklós; Alexei Drummond; Jens Ledet Jensen; Jotun Hein 《BMC bioinformatics》2005,6(1):83

Background

Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. The traditional approach to this problem is to first build a multiple alignment of these sequences, followed by a phylogenetic reconstruction step based on this multiple alignment. However, alignment and phylogenetic inference are fundamentally interdependent, and ignoring this fact leads to biased and overconfident estimations. Whether the main interest be in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both.

2.

Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa

Morrison DA; Ellis JT 《Molecular biology and evolution》1997,14(4):428-441

The reconstruction of phylogenetic history is predicated on being able to accurately establish hypotheses of character homology, which involves sequence alignment for studies based on molecular sequence data. In an empirical study investigating nucleotide sequence alignment, we inferred phylogenetic trees for 43 species of the Apicomplexa and 3 of Dinozoa based on complete small-subunit rDNA sequences, using six different multiple-alignment procedures: manual alignment based on the secondary structure of the 18S rRNA molecule, and automated similarity-based alignment algorithms using the PileUp, ClustalW, TreeAlign, MALIGN, and SAM computer programs. Trees were constructed using neighbor-joining, weighted-parsimony, and maximum-likelihood methods. All of the multiple sequence alignment procedures yielded the same basic structure for the estimate of the phylogenetic relationship among the taxa, which presumably represents the underlying phylogenetic signal. However, the placement of many of the taxa was sensitive to the alignment procedure used; and the different alignments produced trees that were on average more dissimilar from each other than did the different tree-building methods used. The multiple alignments from the different procedures varied greatly in length, but aligned sequence length was not a good predictor of the similarity of the resulting phylogenetic trees. We also systematically varied the gap weights (the relative cost of inserting a new gap into a sequence or extending an already-existing gap) for the ClustalW program, and this produced alignments that were at least as different from each other as those produced by the different alignment algorithms. Furthermore, there was no combination of gap weights that produced the same tree as that from the structure alignment, in spite of the fact that many of the alignments were similar in length to the structure alignment.
We also investigated the phylogenetic information content of the helical and nonhelical regions of the rDNA, and conclude that the helical regions are the most informative. We therefore conclude that many of the literature disagreements concerning the phylogeny of the Apicomplexa are probably based on differences in sequence alignment strategies rather than differences in data or tree-building methods.

3.

Effects of sequence alignment and structural domains of ribosomal DNA on phylogeny reconstruction for the protozoan family sarcocystidae

Mugridge NB Morrison DA Jäkel T Heckeroth AR Tenter AM Johnson AM 《Molecular biology and evolution》2000,17(12):1842-1853

Finding correct species relationships using phylogeny reconstruction based on molecular data is dependent on several empirical and technical factors. These include the choice of DNA sequence from which phylogeny is to be inferred, the establishment of character homology within a sequence alignment, and the phylogeny algorithm used. Nevertheless, sequencing and phylogeny tools provide a way of testing certain hypotheses regarding the relationship among the organisms for which phenotypic characters demonstrate conflicting evolutionary information. The protozoan family Sarcocystidae is one such group for which molecular data have been applied phylogenetically to resolve questionable relationships. However, analyses carried out to date, particularly based on small-subunit ribosomal DNA, have not resolved all of the relationships within this family. Analysis of more than one gene is necessary in order to obtain a robust species signal, and some DNA sequences may not be appropriate in terms of their phylogenetic information content. With this in mind, we tested the informativeness of our chosen molecule, the large-subunit ribosomal DNA (lsu rDNA), by using subdivisions of the sequence in phylogenetic analysis through PAUP, fastDNAml, and neighbor joining. The segments of sequence applied correspond to areas of higher nucleotide variation in a secondary-structure alignment involving 21 taxa. We found that subdivision of the entire lsu rDNA is inappropriate for phylogenetic analysis of the Sarcocystidae. There are limited informative nucleotide sites in the lsu rDNA for certain clades, such as the one encompassing the subfamily Toxoplasmatinae. Consequently, the removal of any segment of the alignment compromises the final tree topology. We also tested the effect of using two different alignment procedures (CLUSTAL W and the structure alignment using DCSE) and three different tree-building methods on the final tree topology. 
This work shows that congruence between different methods in the formation of clades may be a feature of robust topology; however, a sequence alignment based on primary structure may not be comparing homologous nucleotides even though the expected topology is obtained. Our results support previous findings showing the paraphyly of the current genera Sarcocystis and Hammondia and again bring to question the relationships of Sarcocystis muris, Isospora felis, and Neospora caninum. In addition, results based on phylogenetic analysis of the structure alignment suggest that Sarcocystis zamani and Sarcocystis singaporensis, which have reptilian definitive hosts, are monophyletic with Sarcocystis species using mammalian definitive hosts if the genus Frenkelia is synonymized with Sarcocystis.

4.

Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance

Hao B Qi J 《Journal of bioinformatics and computational biology》2004,2(1):1-19

This is a review of a new and essentially simple method of inferring phylogenetic relationships from complete genome data without using sequence alignment. The method is based on counting the appearance frequency of oligopeptides of a fixed length (up to K = 6) in the collection of protein sequences of a species. It is a method without fine adjustment and choice of genes. Applied to prokaryotic genomes, it has led to results comparable with the bacteriologists' systematics as reflected in the 2002 outline of Bergey's Manual of Systematic Bacteriology. The method has also been used to compare chloroplast genomes and to infer the phylogeny of coronaviruses, including human SARS-CoV. A key point in our approach is the subtraction of a random background from the original counts by using a Markov model of order K-2 in order to highlight the shaping role of natural selection. The implications of the subtraction procedure are specially analyzed and further development of the new approach is indicated.
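The background subtraction described in this abstract can be illustrated with a minimal sketch. It assumes the commonly cited form of the estimator, in which the expected frequency of a K-mer is predicted from its two overlapping (K-1)-mers and the shared (K-2)-mer, and a cosine-based distance D = (1 - C) / 2; the abstract itself gives neither formula, and the function names below are hypothetical.

```python
import math
from collections import Counter


def kmer_freqs(seqs, k):
    """Relative frequencies of all length-k substrings across the sequences."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def composition_vector(seqs, k):
    """Map each observed k-mer to (p - p0) / p0.

    p0 is the frequency predicted by a Markov model of order k-2 (assumed form):
    p0(a1..ak) = p(a1..a(k-1)) * p(a2..ak) / p(a2..a(k-1)).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    p_k = kmer_freqs(seqs, k)
    p_k1 = kmer_freqs(seqs, k - 1)
    p_k2 = kmer_freqs(seqs, k - 2)
    vec = {}
    for w, p in p_k.items():
        # Every observed k-mer implies its prefix, suffix and middle were observed.
        p0 = p_k1[w[:-1]] * p_k1[w[1:]] / p_k2[w[1:-1]]
        vec[w] = (p - p0) / p0
    return vec


def composition_distance(u, v):
    """Distance (1 - C) / 2, where C is the cosine of the angle between u and v."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.5  # assumption: treat an all-zero vector as uncorrelated
    return (1.0 - dot / (norm_u * norm_v)) / 2.0
```

For a real proteome, `seqs` would be the list of all protein sequences of one species, and the pairwise distances between species would feed a distance-based tree-building method.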

5.

Arthropod phylogeny inferred from partial 12SrRNA revisited: monophyly of the Tracheata depends on sequence alignment

J. W. Wägele G. Stanjek 《Journal of Zoological Systematics and Evolutionary Research》1995,33(3-4):75-80

A new hypothesis on arthropod evolution published by Ballard et al. (1992), based on partial 12SrRNA sequences, is re-analysed using the same data, but using different alignments. It is argued that there is no reason to reject monophyly of the Euarthropoda, Mandibulata and Tracheata.

Summary

Arthropod phylogeny derived from sections of the 12SrRNA, reconsidered: the monophyly of the Tracheata depends on the sequence alignment
A hypothesis published by Ballard et al. (1992) on the position of the Onychophora and Myriapoda within the Arthropoda, based on partial 12SrRNA sequences, is taken up and recalculated using a different alignment. It is shown that there is no reason to doubt the monophyly of the Euarthropoda, Mandibulata and Tracheata.

6.

Arthropod phylogeny inferred from partial 12SrRNA revisited: monophyly of the Tracheata depends on sequence alignment

J. W. Wägele G. Stanjek 《Journal of Zoological Systematics and Evolutionary Research》1995,33(2):75-80

A new hypothesis on arthropod evolution published by Ballard et al. (1992), based on partial 12SrRNA sequences, is re-analysed using the same data, but using different alignments. It is argued that there is no reason to reject monophyly of the Euarthropoda, Mandibulata and Tracheata.

7.

The effect of sequence quality on sequence alignment

Malde K 《Bioinformatics (Oxford, England)》2008,24(7):897-900

Motivation: The nucleotide sequencing process produces not only the sequence of nucleotides, but also associated quality values. Quality values provide valuable information, but are primarily used only for trimming sequences and generally ignored in subsequent analyses. Results: This article describes how the scoring schemes of standard alignment algorithms can be modified to take into account quality values to produce improved alignments and statistically more accurate scores. A prototype implementation is also provided, and used to post-process a set of BLAST results. Quality-adjusted alignment is a natural extension of standard alignment methods, and can be implemented with only a small constant factor performance penalty. The method can also be applied to related methods including heuristic search algorithms like BLAST and FASTA. Availability: Software is available at http://malde.org/~ketil/qaa. Contact: ketil.malde{at}imr.no Supplementary information: Supplementary data are available at Bioinformatics online. Associate Editor: Limsoon Wong

8.

Multiple sequence alignment

D J Bacon W F Anderson 《Journal of molecular biology》1986,191(2):153-161

A method has been developed for aligning segments of several sequences at once. The number of search steps depends only polynomially on the number of sequences, instead of exponentially, because most alignments are rejected without being evaluated explicitly. A data structure herein called the "heap" facilitates this process. For a set of n sequence segments, the overall similarity is taken to be the sum of all the constituent segment pair similarities, which are in turn sums of corresponding residue similarity scores from a Table. The statistical models that test alignments for significance make it possible to group sequences objectively, even when most or all of the interrelationships are weak. These tests are very sensitive, while remaining quite conservative, and discourage the addition of "misfit" sequences to an existing set. The new techniques are applied to a set of five DNA-binding proteins, to a group of three enzymes that employ the coenzyme FAD, and to a control set. The alignment previously proposed for the DNA-binding proteins on the basis of structural comparisons and inspection of sequences is supported quite dramatically, and a highly significant alignment is found for the FAD-binding proteins.

9.

Contact-based sequence alignment

Kleinjung J Romein J Lin K Heringa J 《Nucleic acids research》2004,32(8):2464-2473

This paper introduces the novel method of contact-based protein sequence alignment, where structural information in the form of contact mutation probabilities is incorporated into an alignment routine using contact-mutation matrices (CAO: Contact Accepted mutatiOn). The contact-based alignment routine optimizes the score of matched contacts, which involves four (two per contact) instead of two residues per match in pairwise alignments. The first contact refers to a real side-chain contact in a template sequence with known structure, and the second contact is the equivalent putative contact of a homologous query sequence with unknown structure. An algorithm has been devised to perform a pairwise sequence alignment based on contact information. The contact scores were combined with PAM-type (Point Accepted Mutation) substitution scores after parameterization of gap penalties and score weights by means of a genetic algorithm. We show that owing to the structural information contained in the CAO matrices, significantly improved alignments of distantly related sequences can be obtained. This has allowed us to annotate eight putative Drosophila IGF sequences. Contact-based sequence alignment should therefore prove useful in comparative modelling and fold recognition.

10.

Joint Bayesian estimation of alignment and phylogeny

Redelings BD Suchard MA 《Systematic biology》2005,54(3):401-418

We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.

11.

Constrained sequence alignment

Kun-Mao Chao Ross C. Hardison Webb Miller 《Bulletin of mathematical biology》1993,55(3):503-524

This paper presents a dynamic programming algorithm for aligning two sequences when the alignment is constrained to lie between two arbitrary boundary lines in the dynamic programming matrix. For affine gap penalties, the algorithm requires only O(F) computation time and O(M+N) space, where F is the area of the feasible region and M and N are the sequence lengths. The result extends to concave gap penalties, with somewhat increased time and space bounds. K.-M. C. and W. M. were supported in part by grant R01 LM05110 from the National Library of Medicine. R. C. H. was supported by PHS grant R01 DK27635.
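The idea of restricting the dynamic programming to a feasible region can be sketched as follows. This is a simplified illustration rather than the paper's algorithm: it assumes a linear (not affine) gap penalty, returns only the score, and takes the region as hypothetical per-row column bounds `lo[i]` and `hi[i]`. Only cells inside the region are visited, so the work is proportional to the region's area.

```python
NEG_INF = float("-inf")


def constrained_align_score(a, b, lo, hi, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b using only cells (i, j)
    with lo[i] <= j <= hi[i]; returns -inf if no path fits the region."""
    n, m = len(a), len(b)
    if len(lo) != n + 1 or len(hi) != n + 1:
        raise ValueError("lo and hi need one entry per DP row (len(a) + 1)")
    prev = {}  # scores of the previous row, keyed by column
    for i in range(n + 1):
        cur = {}
        for j in range(max(lo[i], 0), min(hi[i], m) + 1):
            if i == 0 and j == 0:
                cur[j] = 0
                continue
            best = NEG_INF
            if i > 0 and j > 0 and (j - 1) in prev:  # diagonal move
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, prev[j - 1] + s)
            if i > 0 and j in prev:                  # gap in b
                best = max(best, prev[j] + gap)
            if j > 0 and (j - 1) in cur:             # gap in a
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev.get(m, NEG_INF)
```

With `lo = [0] * (len(a) + 1)` and `hi = [len(b)] * (len(a) + 1)` the region covers the whole matrix and the function reduces to ordinary global alignment scoring.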

12.

Homology-extended sequence alignment

Simossis VA Kleinjung J Heringa J 《Nucleic acids research》2005,33(3):816-824

We present a profile–profile multiple alignment strategy that uses database searching to collect homologues for each sequence in a given set, in order to enrich their available evolutionary information for the alignment. For each of the alignment sequences, the putative homologous sequences that score above a pre-defined threshold are incorporated into a position-specific pre-alignment profile. The enriched position-specific profile is used for standard progressive alignment, thereby more accurately describing the characteristic features of the given sequence set. We show that owing to the incorporation of the pre-alignment information into a standard progressive multiple alignment routine, the alignment quality between distant sequences increases significantly and outperforms state-of-the-art methods, such as T-COFFEE and MUSCLE. We also show that although entirely sequence-based, our novel strategy is better at aligning distant sequences when compared with a recent contact-based alignment method. Therefore, our pre-alignment profile strategy should be advantageous for applications that rely on high alignment accuracy such as local structure prediction, comparative modelling and threading.

13.

Multiple sequence alignment

Edgar RC Batzoglou S 《Current opinion in structural biology》2006,16(3):368-373

Multiple sequence alignments are an essential tool for protein structure and function prediction, phylogeny inference and other common tasks in sequence analysis. Recently developed systems have advanced the state of the art with respect to accuracy, ability to scale to thousands of proteins and flexibility in comparing proteins that do not share the same domain architecture. New multiple alignment benchmark databases include PREFAB, SABMARK, OXBENCH and IRMBASE. Although CLUSTALW is still the most popular alignment tool to date, recent methods offer significantly better alignment quality and, in some cases, reduced computational cost.

14.

Effects of long-range correlations in DNA on sequence alignment score statistics.

Philipp W Messer Ralf Bundschuh Martin Vingron Peter F Arndt 《Journal of computational biology》2007,14(5):655-668

Long-range correlations in genomic base composition are a ubiquitous statistical feature among many eukaryotic genomes. In this article, these correlations are shown to substantially influence the statistics of sequence alignment scores. Using a Gaussian approximation to model the correlated score landscape, we calculate the corrections to the scale parameter lambda of the extreme value distribution of alignment scores. Our approximate analytic results are supported by a detailed numerical study based on a simple algorithm to efficiently generate long-range correlated random sequences. We find both the mean and the exponential tail of the score distribution for long-range correlated sequences to be substantially shifted compared to random sequences with independent nucleotides. The significance of measured alignment scores will therefore change upon incorporation of the correlations in the null model. We discuss the magnitude of this effect in a biological context.

15.

Simultaneous statistical multiple alignment and phylogeny reconstruction

Fleissner R Metzler D von Haeseler A 《Systematic biology》2005,54(4):548-561

Although the reconstruction of phylogenetic trees and the computation of multiple sequence alignments are highly interdependent, these two areas of research lead quite separate lives, the former often making use of stochastic modeling, whereas the latter normally does not. Despite the fact that reasonable insertion and deletion models for sequence pairs were already introduced more than 10 years ago, they have only recently been applied to multiple alignment and only in their simplest version. In this paper we present and discuss a strategy based on simulated annealing, which makes use of these models to infer a phylogenetic tree for a set of DNA or protein sequences together with the sequences' indel history, i.e., their multiple alignment augmented with information about the positioning of insertion and deletion events in the tree. Our method is also the first application of the TKF2 model in the context of multiple sequence alignment. We validate the method via simulations and illustrate it using a data set of primate mtDNA.

16.

ITS secondary structure derived from comparative analysis: implications for sequence alignment and phylogeny of the Asteraceae

Goertzen LR Cannone JJ Gutell RR Jansen RK 《Molecular phylogenetics and evolution》2003,29(2):216-234

An RNA secondary structure model is presented for the nuclear ribosomal internal transcribed spacers (ITS) based on comparative analysis of 340 sequences from the angiosperm family Asteraceae. The model based on covariation analysis agrees with structural features proposed in previous studies using mainly thermodynamic criteria and provides evidence for additional structural motifs within ITS1 and ITS2. The minimum structure model suggests that at least 20% of ITS1 and 38% of ITS2 nucleotide positions are involved in base pairing to form helices. The sequence alignment enabled by conserved structural features provides a framework for broadscale molecular evolutionary studies and the first family-level phylogeny of the Asteraceae based on nuclear DNA data. The phylogeny based on ITS sequence data is very well resolved and shows considerable congruence with relationships among major lineages of the family suggested by chloroplast DNA studies, including a monophyletic subfamily Asteroideae and a paraphyletic subfamily Cichorioideae. Combined analyses of ndhF and ITS sequences provide additional resolution and support for relationships in the family.

17.

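Current status of research on biological sequence alignment algorithms

文凤春 王邦菊 肖枝洪 《生物信息学》2010,8(1):64-67

Sequence alignment is an important tool in bioinformatics research. It is widely used in sequence assembly, protein structure prediction, protein structure and function analysis, phylogenetic analysis, database searching, and primer design. This paper describes in detail some of the sequence alignment algorithms commonly used in bioinformatics, compares their computational complexity, advantages, and disadvantages, discusses the scope of application of each, and points out directions for future research on sequence alignment.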

18.

Reduced space sequence alignment

Grice J.Alicia; Hughey Richard; Speck Don 《Bioinformatics (Oxford, England)》1997,13(1):45-53

Motivation: Sequence alignment is the problem of finding the optimal character-by-character correspondence between two sequences. It can be readily solved in O(n²) time and O(n²) space on a serial machine, or in O(n) time with O(n) space per O(n) processing elements on a parallel machine. Hirschberg's divide-and-conquer approach for finding the single best path reduces space use by a factor of n while inducing only a small constant slowdown to the serial version. Results: This paper presents a family of methods for computing sequence alignments with reduced memory that are well suited to serial or parallel implementation. Unlike the divide-and-conquer approach, they can be used in the forward-backward (Baum-Welch) training of linear hidden Markov models, and they avoid data-dependent repartitioning, making them easier to parallelize. The algorithms feature, for an arbitrary integer L, a factor proportional to L slowdown in exchange for reducing space requirement from O(n²) to O(n). A single best path member of this algorithm family matches the quadratic time and linear space of the divide-and-conquer algorithm. Experimentally, the O(n^1.5)-space member of the family is 15–40% faster than the O(n)-space divide-and-conquer algorithm. Availability: The methods will soon be incorporated in the SAM hidden Markov modeling package http://www.cse.ucsc.edu/research/compbio/sam.html. Contact: wzrph{at}cse.ucsc.edu
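The divide-and-conquer baseline mentioned in this abstract rests on one building block: computing a row of the alignment matrix while keeping only two rows in memory, then combining a forward and a reverse pass to find where an optimal path crosses the middle row. The sketch below shows only that step of Hirschberg's approach, not the reduced-space family the paper introduces; it assumes a linear gap penalty, and the function names are hypothetical.

```python
def last_row(a, b, match=1, mismatch=-1, gap=-2):
    """Final row of the global-alignment DP matrix for a versus b,
    computed with only two rows held in memory (O(len(b)) space)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * gap]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def hirschberg_split(a, b, **scores):
    """Return (j, score): the column j of b at which an optimal alignment
    crosses the middle row of a, and the optimal global score."""
    mid = len(a) // 2
    fwd = last_row(a[:mid], b, **scores)
    # Reverse pass: entry j scores a[mid:] against b[j:].
    rev = last_row(a[mid:][::-1], b[::-1], **scores)[::-1]
    totals = [f + r for f, r in zip(fwd, rev)]
    best = max(totals)
    return totals.index(best), best
```

A full implementation would recurse on `(a[:mid], b[:j])` and `(a[mid:], b[j:])` to recover the alignment itself, which roughly doubles the running time while keeping space linear.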

19.

Segment-based multiple sequence alignment

Rausch T Emde AK Weese D Döring A Notredame C Reinert K 《Bioinformatics (Oxford, England)》2008,24(16):i187-i192

20.

Efficient sequence alignment algorithms

M S Waterman 《Journal of theoretical biology》1984,108(3):333-337

Sequence alignments are becoming more important with the increase of nucleic acid data. Fitch and Smith have recently given an example where multiple insertion/deletions (rather than a series of adjacent single insertion/deletions) are necessary to achieve the correct alignment. Multiple insertion/deletions are known to increase computation time from O(n²) to O(n³), although Gotoh has presented an O(n²) algorithm for the case where the multiple insertion/deletion weighting function is linear. It is argued in this paper that it could be desirable to use concave weighting functions. For that case, an algorithm is derived that is conjectured to be O(n²).
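The O(n²) result attributed to Gotoh for linear gap weighting can be illustrated with the textbook three-state recurrence. This is a sketch in the usual modern formulation, not necessarily Gotoh's original notation; it assumes a gap of length k scores `gap_open + k * gap_extend`, and the default parameter values are arbitrary.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score in O(len(a) * len(b)) time, where a gap of
    length k scores gap_open + k * gap_extend."""
    n, m = len(a), len(b)
    # M: last column aligns two residues; X: gap in b; Y: gap in a.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    first = gap_open + gap_extend  # score of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + first,
                          X[i - 1][j] + gap_extend,  # extend an open gap
                          Y[i - 1][j] + first)
            Y[i][j] = max(M[i][j - 1] + first,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + first)
    return max(M[n][m], X[n][m], Y[n][m])
```

Tracking whether the previous column ended in a gap is what lets each cell be filled in constant time, instead of scanning over every possible gap length as a general weighting function would require.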