
1.

The relative sensitivity of different alignment methods and character codings in sensitivity analysis

Mark P. Simmons Kai F. Müller Colleen T. Webb 《Cladistics : the international journal of the Willi Hennig Society》2008,24(6):1039-1050

Sensitivity analysis provides a way to measure robustness of clades in sequence‐based phylogenetic analyses to variation in alignment parameters rather than measuring their branch support. We compared three different approaches to multiple sequence alignment in the context of sensitivity analysis: progressive pairwise alignment, as implemented in MUSCLE; simultaneous multiple alignment of sequence fragments, as implemented in DCA; and direct optimization followed by generation of the implied alignment(s), as implemented in POY. We set out to determine the relative sensitivity of these three alignment methods using rDNA sequences and randomly generated sequences. A total of 36 parameter sets were used to create the alignments, varying the transition, transversion, and gap costs. Tree searches were performed using four different character‐coding and weighting approaches: the cost function used for alignment or equally weighted parsimony with gap positions treated as missing data, separate characters, or as fifth states. POY was found to be as sensitive, or more sensitive, to variation in alignment parameters than DCA and MUSCLE for the three empirical datasets, and POY was found to be more sensitive than MUSCLE, which in turn was found to be as sensitive, or more sensitive, than DCA when applied to the randomly generated sequences when sensitivity was measured using the averaged jackknife values. When significant differences in relative sensitivity were found between the different ways of weighting character‐state changes, equally weighted parsimony, for all three ways of treating gapped positions, was less sensitive than applying the same cost function used in alignment for phylogenetic analysis. 
When branch support is incorporated into the sensitivity criterion, our results favour the use of simultaneous alignment and progressive pairwise alignment using the similarity criterion over direct optimization followed by using the implied alignment(s) to calculate branch support.
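The idea of measuring clade robustness across alignment parameter sets can be illustrated with a minimal sketch. It assumes that a tree has already been inferred for each parameter set and reduced to its set of clades; the cost triples, taxon names, and the function name `clade_stability` are hypothetical and are not taken from the study, which also used averaged jackknife values as a sensitivity measure.

```python
from collections import Counter

def clade_stability(clades_by_param_set):
    """Return, for each clade, the fraction of parameter sets recovering it.

    `clades_by_param_set` maps a parameter set (here a hypothetical
    (transition, transversion, gap) cost triple) to the set of clades
    found in the tree inferred under those costs.
    """
    counts = Counter()
    for clades in clades_by_param_set.values():
        counts.update(set(clades))
    n_sets = len(clades_by_param_set)
    return {clade: count / n_sets for clade, count in counts.items()}

# Hypothetical results for four parameter sets and taxa A-D.
results = {
    (1, 1, 1): {frozenset("AB"), frozenset("ABC")},
    (1, 2, 2): {frozenset("AB"), frozenset("ABC")},
    (1, 2, 4): {frozenset("AB"), frozenset("CD")},
    (2, 1, 4): {frozenset("AC"), frozenset("ABC")},
}
stability = clade_stability(results)
for clade, fraction in sorted(stability.items(), key=lambda kv: -kv[1]):
    print("".join(sorted(clade)), fraction)
```

A clade recovered under most parameter sets is less sensitive to alignment parameters than one recovered under only a few.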

2.

Alignment and topological accuracy of the direct optimization approach via POY and traditional phylogenetics via ClustalW + PAUP*

Ogden TH Rosenberg MS 《Systematic biology》2007,56(2):182-193

Direct optimization frameworks for simultaneously estimating alignments and phylogenies have recently been developed. One such method, implemented in the program POY, is becoming more common for analyses of variable length sequences (e.g., analyses using ribosomal genes) and for combined evidence analyses (morphology + multiple genes). Simulation of sequences containing insertion and deletion events was performed in order to directly compare a widely used method of multiple sequence alignment (ClustalW) and subsequent parsimony analysis in PAUP* with direct optimization via POY. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (clocklike, non-clocklike, and ultrametric). Alignment accuracy scores for the implied alignments from POY and the multiple sequence alignments from ClustalW were calculated and compared. In almost all cases (99.95%), ClustalW produced more accurate alignments than POY-implied alignments, judged by the proportion of correctly identified homologous sites. Topological accuracy (distance to the true tree) for POY topologies and topologies generated under parsimony in PAUP* from the ClustalW alignments were also compared. In 44.94% of the cases, Clustal alignment tree reconstructions via PAUP* were more accurate than POY, whereas in 16.71% of the cases POY reconstructions were more topologically accurate (38.38% of the time they were equally accurate). Comparisons between POY hypothesized alignments and the true alignments indicated that, on average, as alignment error increased, topological accuracy decreased.
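One common way to score "the proportion of correctly identified homologous sites" is a sum-of-pairs comparison between the true and the estimated alignment. The sketch below is an illustration under that assumption, not necessarily the exact score used in the study; the function names and the toy alignments are hypothetical.

```python
from itertools import combinations

def homologous_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap).
    A residue is identified by (sequence index, residue index).
    """
    next_residue = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for seq_index, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_index, next_residue[seq_index]))
                next_residue[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs

def alignment_accuracy(true_aln, est_aln):
    """Fraction of truly homologous residue pairs recovered by est_aln."""
    true_pairs = homologous_pairs(true_aln)
    if not true_pairs:
        return 1.0
    return len(true_pairs & homologous_pairs(est_aln)) / len(true_pairs)

true_aln = ["AC-GT", "ACTGT"]
est_aln = ["ACG-T", "ACTGT"]
print(alignment_accuracy(true_aln, est_aln))  # 0.75
```

In the toy case the estimated alignment misplaces a single gap, so three of the four truly homologous pairs are recovered.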

3.

Phylogeny of sipunculan worms: A combined analysis of four gene regions and morphology

Schulze A Cutler EB Giribet G 《Molecular phylogenetics and evolution》2007,42(1):171-192

The intra-phyletic relationships of sipunculan worms were analyzed based on DNA sequence data from four gene regions and 58 morphological characters. Initially we analyzed the data under direct optimization using parsimony as optimality criterion. An implied alignment resulting from the direct optimization analysis was subsequently utilized to perform a Bayesian analysis with mixed models for the different data partitions. For this we applied a doublet model for the stem regions of the 18S rRNA. Both analyses support monophyly of Sipuncula and most of the same clades within the phylum. The analyses differ with respect to the relationships among the major groups but whereas the deep nodes in the direct optimization analysis generally show low jackknife support, they are supported by 100% posterior probability in the Bayesian analysis. Direct optimization has been useful for handling sequences of unequal length and generating conservative phylogenetic hypotheses whereas the Bayesian analysis under mixed models provided high resolution in the basal nodes of the tree.

4.

No so HoT – heads or tails is not able to reliably compare multiple sequence alignments

Michael J. Wise 《Cladistics : the international journal of the Willi Hennig Society》2010,26(4):438-443

Most phylogenetic‐tree building applications use multiple sequence alignments as a starting point. A recent meta‐level methodology, called Heads or Tails, aims to reveal the quality of multiple sequence alignments by comparing alignments taken in the forward direction with the alignments of the same sequences when the sequences are reversed. Through an examination of a special case for multiple sequence alignment – pair‐wise alignments, where an optimal algorithm exists – and the use of a modified global‐alignment application, it is shown that the forward and reverse alignments, even when they are the same, do not capture all the possible variations in the alignments and when the forward and reverse alignments differ there may be other alignments that remain unaccounted for. The implication is that comparing just the forward and (biologically irrelevant) reverse alignments is not sufficient to capture the variability in multiple sequence alignments, and the Heads or Tails methodology is therefore not suitable as a method for investigating multiple sequence alignment accuracy. Part of the reason is the inability of individual multiple sequence alignment applications to adequately sample the space of possible alignments. A further implication is that the Hall [Hall, B.G., 2008. Mol. Biol. Evol. 25, 1576–1580] methodology may create optimal synthetic multiple sequence alignments that extant aligners will be unable to completely recover ab initio due to alternative alignments being possible at particular sites. In general, it is shown that more divergent sequences will give rise to an increased number of alternative alignments, so sequence sets with a higher degree of similarity are preferable to sets with lower similarity as the starting point for phylogenetic tree building. © The Willi Hennig Society 2009.
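The pairwise special case can be made concrete with a small sketch that counts how many global alignments share the optimal score. It is an illustration only, not the modified application used in the paper: it assumes Needleman-Wunsch scoring with a linear gap cost, and the scoring values and the function name `count_optimal_alignments` are hypothetical.

```python
def count_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (optimal score, number of co-optimal alignment paths).

    Standard Needleman-Wunsch recursion with a linear gap cost; alongside
    each cell's best score we keep the number of paths that achieve it.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    ways = [[0] * (m + 1) for _ in range(n + 1)]
    ways[0][0] = 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # align a[i-1] with b[j-1]
                s = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append((score[i - 1][j - 1] + s, ways[i - 1][j - 1]))
            if i > 0:            # a[i-1] against a gap
                candidates.append((score[i - 1][j] + gap, ways[i - 1][j]))
            if j > 0:            # b[j-1] against a gap
                candidates.append((score[i][j - 1] + gap, ways[i][j - 1]))
            best = max(c[0] for c in candidates)
            score[i][j] = best
            ways[i][j] = sum(w for sc, w in candidates if sc == best)
    return score[n][m], ways[n][m]

forward = count_optimal_alignments("AAG", "AG")
reverse = count_optimal_alignments("AAG"[::-1], "AG"[::-1])
print(forward, reverse)  # (1, 2) (1, 2)
```

Even for these very short sequences two alignments are equally optimal in both directions, whereas a typical aligner reports only one of them per run, so a forward/reverse comparison sees at most two samples from the set of alternatives.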

5.

CLUSTAL: a package for performing multiple sequence alignment on a microcomputer (total citations: 242; self-citations: 0; citations by others: 242)

D G Higgins P M Sharp 《Gene》1988,73(1):237-244

An approach for performing multiple alignments of large numbers of amino acid or nucleotide sequences is described. The method is based on first deriving a phylogenetic tree from a matrix of all pairwise sequence similarity scores, obtained using a fast pairwise alignment algorithm. Then the multiple alignment is achieved from a series of pairwise alignments of clusters of sequences, following the order of branching in the tree. The method is sufficiently fast and economical with memory to be easily implemented on a microcomputer, and yet the results obtained are comparable to those from packages requiring mainframe computer facilities.
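The tree-then-align idea can be sketched as follows. This is a minimal illustration, assuming average-linkage (UPGMA-style) clustering on a precomputed distance matrix; the function name `guide_tree_order`, the sequence names, and the distances are hypothetical, and the pairwise alignment of clusters at each merge is left out.

```python
def guide_tree_order(names, dist):
    """Average-linkage clustering of a symmetric distance matrix.

    Returns the merges in the order they occur; a progressive aligner
    would align the two clusters of each merge in that same order.
    """
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (subtree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        i, j = min(d, key=d.get)          # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        merges.append((tree_i, tree_j))
        for k in clusters:                # size-weighted average distance
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del d[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    return merges

names = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 8],
        [10, 10, 8, 0]]
merges = guide_tree_order(names, dist)
print(merges)
```

With these distances the closest pair A and B is joined first, then C is added to that cluster, and D is added last.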

6.

Joint Bayesian estimation of alignment and phylogeny

Redelings BD Suchard MA 《Systematic biology》2005,54(3):401-418

We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.

7.

Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology (total citations: 1; self-citations: 0; citations by others: 1)

Lutzoni F Wagner P Reeb V Zoller S 《Systematic biology》2000,49(4):628-651

Phylogenetic analyses of non-protein-coding nucleotide sequences such as ribosomal RNA genes, internal transcribed spacers, and introns are often impeded by regions of the alignments that are ambiguously aligned. These regions are characterized by the presence of gaps and their uncertain positions, no matter which optimization criteria are used. This problem is particularly acute in large-scale phylogenetic studies and when aligning highly diverged sequences. Accommodating these regions, where positional homology is likely to be violated, in phylogenetic analyses has been dealt with very differently by molecular systematists and evolutionists, ranging from the total exclusion of these regions to the inclusion of every position regardless of ambiguity in the alignment. We present a new method that allows the inclusion of ambiguously aligned regions without violating homology. In this three-step procedure, first homologous regions of the alignment containing ambiguously aligned sequences are delimited. Second, each ambiguously aligned region is unequivocally coded as a new character, replacing its respective ambiguous region. Third, each of the coded characters is subjected to a specific step matrix to account for the differential number of changes (summing substitutions and indels) needed to transform one sequence to another. The optimal number of steps included in the step matrix is the one derived from the pairwise alignment with the greatest similarity and the least number of steps. In addition to potentially enhancing phylogenetic resolution and support, by integrating previously nonaccessible characters without violating positional homology, this new approach can improve branch length estimations when using parsimony.
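The second and third steps of the procedure can be sketched in a few lines. This is a simplified illustration, assuming that every substitution and every single-base indel costs one step (plain edit distance); the authors' actual costing may differ, and the function names and example fragments are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of substitutions and single-base indels turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def code_region(fragments):
    """Code one ambiguous region as a single multistate character.

    Returns the distinct fragments (the character states), the state
    assigned to each taxon, and the step matrix between states.
    """
    states = sorted(set(fragments))
    codes = [states.index(f) for f in fragments]
    matrix = [[edit_distance(s, t) for t in states] for s in states]
    return states, codes, matrix

# Ungapped sequence of each taxon within one delimited ambiguous region.
fragments = ["ACGT", "ACT", "ACGT", "AGGTT"]
states, codes, matrix = code_region(fragments)
print(states)  # ['ACGT', 'ACT', 'AGGTT']
print(codes)   # [0, 1, 0, 2]
print(matrix)  # [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
```

The resulting matrix is symmetric, and the coded character replaces the original columns of the ambiguous region in the data matrix.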

8.

Exploring data interaction and nucleotide alignment in a multiple gene analysis of Ips (Coleoptera: Scolytinae)

Cognato AI Vogler AP 《Systematic biology》2001,50(6):758-780

The possibility of gene tree incongruence in a species-level phylogenetic analysis of the genus Ips (Coleoptera: Scolytidae) was investigated based on mitochondrial 16S rRNA (16S) and nuclear elongation factor-1 alpha (EF-1 alpha) sequences, and existing cytochrome oxidase I (COI) and nonmolecular data sets. Separate cladistic analyses of the data partitions resulted in partially discordant most-parsimonious trees but revealed only low conflict of the phylogenetic signal. Interactions among data partitions, which differed in the extent of sequence divergence (COI > 16S > EF-1 alpha), base composition, and homoplasy, revealed that much of the branch support emerges only in the simultaneous analysis, particularly for deeper nodes in the tree, which are almost entirely supported through "hidden support" (sensu Gatesy et al., Cladistics 15:271-313, 1999). Apparent incongruence between data partitions is in part due to suboptimal alignments and bias of character transformations, but little evidence supports invoking incongruent phylogenetic histories of genetic loci. There is also no justification for eliminating or downweighting gene partitions on the basis of their apparent homoplasy or incongruence with other partitions, because the signal emerges only in the interaction of all data. In comparison with traditional taxonomy, the pini, plastographus, and perturbatus groups are polyphyletic, whereas the grandicollis group is monophyletic except for inclusion of the (monophyletic) calligraphus group. The latidens group and some European species are distantly related and closer to other genera within Ipini. Our robust cladogram was used to revise the classification of Ips. We provide new diagnoses for Ips and four subgeneric taxa.

9.

Incorporation of gap characters and lineage-specific regions into phylogenetic analyses of gene families from divergent clades: an example from the kinesin superfamily across eukaryotes

Mark P. Simmons Dale Richardson Anireddy S. N. Reddy 《Cladistics : the international journal of the Willi Hennig Society》2008,24(3):372-384

The kinesin superfamily across eukaryotes was used to examine how incorporation of gap characters scored from conserved regions shared by all members of a gene family and incorporation of amino acid and gap characters scored from lineage‐specific regions affect gene‐tree inference of the gene family as a whole. We addressed these two questions in the context of two different densities of sequence sampling, four alignment programs, and two methods of tree construction. Taken together, our findings suggest the following. First, gap characters should be incorporated into gene‐tree inference, even for divergent sequences. Second, gene regions that are not conserved among all or most sequences sampled should not be automatically discarded without evaluation of potential phylogenetic signal that may be contained in gap and/or sequence characters. Third, among the four alignment programs evaluated using their default alignment parameters, Clustal may be expected to output alignments that result in the greatest gene‐tree resolution and support. Yet, this high resolution and support should be regarded as optimistic, rather than conservative, estimates. Fourth, this same conclusion regarding resolution and support holds for Bayesian gene‐tree analyses relative to parsimony‐jackknife gene‐tree analyses. We suggest that a more conservative approach, such as aligning the sequences using DIALIGN‐T or MAFFT, analyzing the appropriate characters using parsimony, and assessing branch support using the jackknife, is more appropriate for inferring gene trees of divergent gene families. © The Willi Hennig Society 2007.

10.

Scaling statistical multiple sequence alignment to large datasets

Nute Michael Warnow Tandy 《BMC genomics》2016,17(10):764-144

Background

Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today.

Results

We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences.

Conclusions

Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.


11.

Multiple sequence alignment accuracy and phylogenetic inference

Ogden TH Rosenberg MS 《Systematic biology》2006,55(2):314-328

Phylogenies are often thought to be more dependent upon the specifics of the sequence alignment rather than on the method of reconstruction. Simulation of sequences containing insertion and deletion events was performed in order to determine the role that alignment accuracy plays during phylogenetic inference. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (ultrametric equal branch length, ultrametric random branch length, nonultrametric random branch length). Comparisons between hypothesized alignments and true alignments enabled determination of two measures of alignment accuracy, that of the total data set and that of individual branches. In general, our results indicate that as alignment error increases, topological accuracy decreases. This trend was much more pronounced for data sets derived from more pectinate topologies. In contrast, for balanced, ultrametric, equal branch length tree shapes, alignment inaccuracy had little average effect on tree reconstruction. These conclusions are based on average trends of many analyses under different conditions, and any one specific analysis, independent of the alignment accuracy, may recover very accurate or inaccurate topologies. Maximum likelihood and Bayesian, in general, outperformed neighbor joining and maximum parsimony in terms of tree reconstruction accuracy. Results also indicated that as the length of the branch and of the neighboring branches increase, alignment accuracy decreases, and the length of the neighboring branches is the major factor in topological accuracy. Thus, multiple-sequence alignment can be an important factor in downstream effects on topological reconstruction.

12.

SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees

Liu K Warnow TJ Holder MT Nelesen SM Yu J Stamatakis AP Linder CR 《Systematic biology》2012,61(1):90-106

Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. 
First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.

13.

A hidden Markov model for progressive multiple alignment (total citations: 4; self-citations: 0; citations by others: 4)

Löytynoja A Milinkovitch MC 《Bioinformatics (Oxford, England)》2003,19(12):1505-1513

MOTIVATION: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic-acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. RESULTS: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus progressively constructing a multiple alignment. Our method allows for the computation of each column's minimum posterior probability, and we show that this value correlates with the correctness of the result, hence providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment.

14.

A phylogenetic analysis of Coelopidae (Diptera) based on morphological and DNA sequence data

Meier R Wiegmann BM 《Molecular phylogenetics and evolution》2002,25(3):171-407

The phylogenetic relationships of 22 species of Coelopidae are reconstructed based on a data matrix consisting of morphological and DNA sequence characters (16S rDNA, EF-1alpha). Optimal gap and transversion costs are determined via a sensitivity analysis and both equal weighting and a transversion cost of 2 are found to perform best based on taxonomic congruence, character incongruence, and tree support. The preferred phylogenetic hypothesis is fully resolved and well-supported by jackknife, bootstrap, and Bremer support values, but it is in conflict with the cladogram based on morphological characters alone. Most notably, the Coelopidae and the genus Coelopa are not monophyletic. However, partitioned Bremer Support and an analysis of node stability under different gap and transversion costs reveal that the critical clades rendering these taxa non-monophyletic are poorly supported. Furthermore, the monophyly of Coelopidae and Coelopa is not rejected in analyses using 16S rDNA that was manually aligned. The resolution of the tree based on this reduced data set is, however, lower than for the tree based on the full data set. Partitioned Bremer support values reveal that 16S rDNA characters provide the largest amount of tree support, but the support values are heavily dependent on analysis conditions. Problems with direct comparison of branch support values for trees derived using fixed alignments with those obtained under optimization alignment are discussed. Biogeographic history and available behavioral and genetic data are also discussed in light of this first cladogram for Coelopidae based on a quantitative phylogenetic analysis.

15.

Consistency of optimal sequence alignments

Osamu Gotoh 《Bulletin of mathematical biology》1990,52(4):509-525

Pairwise optimal alignments between three or more sequences are not necessarily consistent as a whole, but consistent and inconsistent residues are usually distributed in clusters. An efficient method has been developed for locating consistent regions when each pairwise alignment is given in the form of a “skeletal representation” (Bull. math. Biol. 52, 359–373). This method is further extended so that the combination of pairwise alignments that gives the greatest consistency is found when possibly many alignments are equally optimal for each pairwise comparison. A method for acceleration of simultaneous multiple sequence alignment is proposed in which consistent regions serve as “anchor points” limiting application of direct multi-way alignment to the rest of “inconsistent” regions. Dedicated to Prof. Akiyoshi Wada on the occasion of his 60th birthday.
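What "consistent" means for three sequences can be shown with a small check: a residue of A is consistent if following it through B to C lands on the same residue of C as the direct A-C alignment. The sketch below works on gapped strings rather than the skeletal representation used in the paper, and the function names and the toy alignments are hypothetical.

```python
def aligned_map(row_x, row_y):
    """Map residue index in x to residue index in y for residue-residue columns."""
    mapping = {}
    i = j = 0
    for cx, cy in zip(row_x, row_y):
        if cx != "-" and cy != "-":
            mapping[i] = j
        i += cx != "-"   # advance only past real residues
        j += cy != "-"
    return mapping

def consistent_residues(ab, bc, ac):
    """Residues of A whose path A -> B -> C agrees with the direct A -> C match."""
    m_ab, m_bc, m_ac = aligned_map(*ab), aligned_map(*bc), aligned_map(*ac)
    result = []
    for i, j in m_ab.items():
        k = m_bc.get(j)
        if k is not None and m_ac.get(i) == k:
            result.append(i)
    return result

# Pairwise alignments of A = ACGT, B = AGT, C = ACT (each a pair of gapped rows).
ab = ("ACGT", "A-GT")
bc = ("AGT", "ACT")
ac = ("ACGT", "AC-T")
print(consistent_residues(ab, bc, ac))  # [0, 3]
```

Here the first and last residues of A are consistent across all three pairwise alignments and could act as anchor points, while the middle of the sequences forms an inconsistent region that would still need multi-way alignment.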

16.

Multilocus ribosomal RNA phylogeny of the leaf beetles (Chrysomelidae)

Jesús Gómez-Zurita Toby Hunt Alfried P. Vogler 《Cladistics : the international journal of the Willi Hennig Society》2008,24(1):34-50

Basal relationships in the Chrysomelidae (leaf beetles) were investigated using two nuclear (small and partial large subunits) and mitochondrial (partial large subunit) rRNA (≈ 3000 bp total) for 167 taxa covering most major lineages and relevant outgroups. Separate and combined data analyses were performed under parsimony and model‐based tree building algorithms from dynamic (direct optimization) and static (Clustal and BLAST) sequence alignments. The performance of methods differed widely and recovery of well established nodes was erratic, in particular when using single gene partitions, but showed a slight advantage for Bayesian inferences and one of the fast likelihood algorithms (PHYML) over others. Direct optimization greatly gained from simultaneous analysis and provided a valuable hypothesis of chrysomelid relationships. The BLAST‐based alignment, which removes poorly aligned sequence segments, in combination with likelihood and Bayesian analyses, resulted in highly defensible trees obtained in much shorter time than direct optimization, and hence is a viable alternative when data sets grow. The main taxonomic findings include the recognition of three major lineages of Chrysomelidae, including a basal “sagrine” clade (Criocerinae, Donaciinae, Bruchinae), which was sister to the “eumolpine” (Spilopyrinae, Eumolpinae, Cryptocephalinae, Cassidinae) plus “chrysomeline” (Chrysomelinae, Galerucinae) clades. The analyses support a broad definition of subfamilies (i.e., merging previously separated subfamilies) in the case of Cassidinae (cassidines + hispines) and Cryptocephalinae (chlamisines + cryptocephalines + clytrines), whereas two subfamilies, Chrysomelinae and Eumolpinae, were paraphyletic. The surprising separation of monocot feeding Cassidinae (associated with the eumolpine clade) from the other major monocot feeding groups in the sagrine clade was well supported. 
The study highlights the need for thorough taxon sampling, and reveals that morphological data affected by convergence had a great impact when combined with molecular data in previous phylogenetic analyses of Chrysomelidae. © The Willi Hennig Society 2007.

17.

PnpProbs: a better multiple sequence alignment tool by better handling of guide trees

Yongtao Ye Tak-Wah Lam Hing-Fung Ting 《BMC bioinformatics》2016,17(8):285

Background

This paper describes a new MSA tool called PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies sequences into two types: normally related and distantly related. For normally related sequences, it uses an adaptive approach to construct the guide tree needed for progressive alignment; it first estimates the input’s discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the better method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree and uses instead some non-progressive alignment method to generate the alignment.

Results

To evaluate PnpProbs, we have compared it with thirteen other popular MSA tools, and PnpProbs has the best alignment scores in all but one test. We have also used it for phylogenetic analysis, and found that the phylogenetic trees constructed from PnpProbs’ alignments are closest to the model trees.

Conclusions

By combining the strength of the progressive and non-progressive alignment methods, we have developed an MSA tool called PnpProbs. We have compared PnpProbs with thirteen other popular MSA tools and our results showed that our tool usually constructed the best alignments.


18.

Pandit: a database of protein and associated nucleotide domains with inferred trees

Whelan S de Bakker PI Goldman N 《Bioinformatics (Oxford, England)》2003,19(12):1556-1563

MOTIVATION: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. RESULTS: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.

19.

Pair hidden Markov models on tree structures

Sakakibara Y 《Bioinformatics (Oxford, England)》2003,19(Z1):i232-i240

MOTIVATION: Computationally identifying non-coding RNA regions on the genome has much scope for investigation and is essentially harder than gene-finding problems for protein-coding regions. Since comparative sequence analysis is effective for non-coding RNA detection, efficient computational methods are expected for structural alignments of RNA sequences. On the other hand, Hidden Markov Models (HMMs) have played important roles in modeling and analysing biological sequences. In particular, the concept of Pair HMMs (PHMMs) has been examined extensively as a mathematical model for alignments and gene finding. RESULTS: We propose pair HMMs on tree structures (PHMMTSs), an extension of PHMMs defined on alignments of trees that provides a unifying framework and an automata-theoretic model for alignments of trees, structural alignments and pair stochastic context-free grammars. By structural alignment, we mean a pairwise alignment that aligns an unfolded RNA sequence to an RNA sequence of known secondary structure. First, we extend the notion of PHMMs defined on alignments of 'linear' sequences to pair stochastic tree automata, called PHMMTSs, defined on alignments of 'trees'. The PHMMTSs provide various types of alignments of trees such as affine-gap alignments of trees and an automata-theoretic model for alignment of trees. Second, based on the observation that a secondary structure of RNA can be represented by a tree, we apply PHMMTSs to the problem of structural alignments of RNAs. We modify PHMMTSs so that they take as input a pair consisting of a 'linear' sequence and a 'tree' representing a secondary structure of RNA and produce a structural alignment. Further, PHMMTSs that take a pair of linear sequences as input are mathematically equivalent to pair stochastic context-free grammars. We present computational experiments that show the effectiveness of our method for structural alignments, and discuss a complexity issue of PHMMTSs.

20.

Characterization of pairwise and multiple sequence alignment errors

Landan G Graur D 《Gene》2009,441(1-2):141-147

We characterize pairwise and multiple sequence alignment (MSA) errors by comparing true alignments from simulations of sequence evolution with reconstructed alignments. The vast majority of reconstructed alignments contain many errors. Error rates rapidly increase with sequence divergence, thus, for even intermediate degrees of sequence divergence, more than half of the columns of a reconstructed alignment may be expected to be erroneous. In closely related sequences, most errors consist of the erroneous positioning of a single indel event and their effect is local. As sequences diverge, errors become more complex as a result of the simultaneous mis-reconstruction of many indel events, and the lengths of the affected MSA segments increase dramatically. We found a systematic bias towards underestimation of the number of gaps, which leads to the reconstructed MSA being on average shorter than the true one. Alignment errors are unavoidable even when the evolutionary parameters are known in advance. Correct reconstruction can only be guaranteed when the likelihood of true alignment is uniquely optimal. However, true alignment features are very frequently sub-optimal or co-optimal, with the result that optimal albeit erroneous features are incorporated into the reconstructed MSA. Progressive MSA utilizes a guide-tree in the reconstruction of MSAs. The quality of the guide-tree was found to affect MSA error levels only marginally.