1.

Alignment and topological accuracy of the direct optimization approach via POY and traditional phylogenetics via ClustalW + PAUP*

Ogden TH Rosenberg MS 《Systematic biology》2007,56(2):182-193

Direct optimization frameworks for simultaneously estimating alignments and phylogenies have recently been developed. One such method, implemented in the program POY, is becoming more common for analyses of variable length sequences (e.g., analyses using ribosomal genes) and for combined evidence analyses (morphology + multiple genes). Simulation of sequences containing insertion and deletion events was performed in order to directly compare a widely used method of multiple sequence alignment (ClustalW) and subsequent parsimony analysis in PAUP* with direct optimization via POY. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (clocklike, non-clocklike, and ultrametric). Alignment accuracy scores for the implied alignments from POY and the multiple sequence alignments from ClustalW were calculated and compared. In almost all cases (99.95%), ClustalW produced more accurate alignments than POY-implied alignments, judged by the proportion of correctly identified homologous sites. Topological accuracy (distance to the true tree) was also compared between POY topologies and topologies generated under parsimony in PAUP* from the ClustalW alignments. In 44.94% of the cases, Clustal alignment tree reconstructions via PAUP* were more accurate than POY, whereas in 16.71% of the cases POY reconstructions were more topologically accurate (38.38% of the time they were equally accurate). Comparisons between POY hypothesized alignments and the true alignments indicated that, on average, as alignment error increased, topological accuracy decreased.
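The abstract judges alignments by the proportion of correctly identified homologous sites. The sketch below shows one common way such a score can be computed, as the fraction of residue pairs aligned in the true alignment that are also aligned in the estimated one. This pairwise definition and the names `homologous_pairs` and `pair_accuracy` are assumptions for illustration; the abstract does not say which exact score the authors used.

```python
from itertools import combinations

def homologous_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, ungapped position).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != gap:
                present.append((i, counters[i]))
                counters[i] += 1
        # 'present' is ordered by sequence index, so pairs are comparable
        pairs.update(combinations(present, 2))
    return pairs

def pair_accuracy(true_aln, est_aln):
    """Fraction of truly homologous residue pairs recovered by est_aln."""
    true_pairs = homologous_pairs(true_aln)
    if not true_pairs:
        raise ValueError("true alignment contains no aligned residue pairs")
    return len(true_pairs & homologous_pairs(est_aln)) / len(true_pairs)

# Toy example: the estimate misplaces one gap in the first sequence.
true_aln = ["AC-GT", "ACTGT"]
est_aln = ["ACG-T", "ACTGT"]
print(pair_accuracy(true_aln, est_aln))  # 0.75
```

Rows must contain the same ungapped sequences in the same order in both alignments; only gap placement may differ.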

2.

How should gaps be treated in parsimony? A comparison of approaches using simulation

Ogden TH Rosenberg MS 《Molecular phylogenetics and evolution》2007,42(3):817-826

Simulation with indels was used to produce alignments where true site homologies in DNA sequences were known; the gaps from these datasets were removed and the sequences were then aligned to produce hypothesized alignments. Both alignments were then analyzed under three widely used methods of treating gaps during tree reconstruction under the maximum parsimony principle. With the true alignments, for many cases (82%), there was no difference in topological accuracy for the different methods of gap coding. However, in cases where a difference was present, coding gaps as a fifth state character or as separate presence/absence characters outperformed treating gaps as unknown/missing data nearly 90% of the time. For the hypothesized alignments, on average, all gap treatment approaches performed equally well. Data sets with higher sequence divergence and more pectinate tree shapes with variable branch lengths are more affected by gap coding than datasets associated with shallower non-pectinate tree shapes.

3.

Long branch effects distort maximum likelihood phylogenies in simulations despite selection of the correct model

Kück P Mayer C Wägele JW Misof B 《PloS one》2012,7(5):e36593

The aim of our study was to test the robustness and efficiency of maximum likelihood with respect to different long branch effects on multiple-taxon trees. We simulated data of different alignment lengths under two different 11-taxon trees and a broad range of different branch length conditions. The data were analyzed with the true model parameters as well as with estimated and incorrect assumptions about among-site rate variation. If length differences between connected branches strongly increase, tree inference with the correct likelihood model assumptions can fail. We found that incorporating invariant sites together with Γ distributed site rates in the tree reconstruction (Γ+I) increases the robustness of maximum likelihood in comparison with models using only Γ. The results show that for some topologies and branch lengths the reconstruction success of maximum likelihood under the correct model is still low for alignments with a length of 100,000 base positions. Altogether, the high confidence that is put in maximum likelihood trees is not always justified under certain tree shapes even if alignment lengths reach 100,000 base positions.

4.

Using Confidence Set Heuristics During Topology Search Improves the Robustness of Phylogenetic Inference

Pepke SL Butt D Nadeau I Roger AJ Blouin C 《Journal of molecular evolution》2007,64(1):80-89

We examine the impact of likelihood surface characteristics on phylogenetic inference. Amino acid data sets simulated from topologies with branch length features chosen to represent varying degrees of difficulty for likelihood maximization are analyzed. We present situations where the tree found to achieve the global maximum in likelihood is often not equal to the true tree. We use the program covSEARCH to demonstrate how the use of adaptively sized pools of candidate trees that are updated using confidence tests results in solution sets that are highly likely to contain the true tree. This approach requires more computation than traditional maximum likelihood methods, hence covSEARCH is best suited to small to medium-sized alignments or large alignments with some constrained nodes. The majority rule consensus tree computed from the confidence sets also proves to be different from the generating topology. Although low phylogenetic signal in the input alignment can result in large confidence sets of trees, some biological information can still be obtained based on nodes that exhibit high support within the confidence set. Two real data examples are analyzed: mammal mitochondrial proteins and a small tubulin alignment. We conclude that the technique of confidence set optimization can significantly improve the robustness of phylogenetic inference at a reasonable computational cost. Additionally, when either very short internal branches or very long terminal branches are present, confident resolution of specific bipartitions or subtrees, rather than whole-tree phylogenies, may be the most realistic goal for phylogenetic methods. [Reviewing Editor: Dr. Nicolas Galtier]

5.

Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments (total citations: 5; self-citations: 0; citations by others: 5)

Talavera G Castresana J 《Systematic biology》2007,56(4):564-577

Alignment quality may have as much impact on phylogenetic reconstruction as the phylogenetic methods used. Not only the alignment algorithm, but also the method used to deal with the most problematic alignment regions, may have a critical effect on the final tree. Although some authors remove such problematic regions, either manually or using automatic methods, in order to improve phylogenetic performance, others prefer to keep such regions to avoid losing any information. Our aim in the present work was to examine whether phylogenetic reconstruction improves after alignment cleaning or not. Using simulated protein alignments with gaps, we tested the relative performance in diverse phylogenetic analyses of the whole alignments versus the alignments with problematic regions removed with our previously developed Gblocks program. We also tested the performance of more or less stringent conditions in the selection of blocks. Alignments constructed with different alignment methods (ClustalW, Mafft, and Probcons) were used to estimate phylogenetic trees by maximum likelihood, neighbor joining, and parsimony. We show that, in most alignment conditions, and for alignments that are not too short, removal of blocks leads to better trees. That is, despite losing some information, there is an increase in the actual phylogenetic signal. Overall, the best trees are obtained by maximum-likelihood reconstruction of alignments cleaned by Gblocks. In general, a relaxed selection of blocks is better for short alignments, whereas a stringent selection is more adequate for longer ones. Finally, we show that cleaned alignments produce better topologies although, paradoxically, with lower bootstrap support. This indicates that divergent and problematic alignment regions may lead, when present, to apparently better supported although, in fact, more biased topologies.

6.

Joint Bayesian estimation of alignment and phylogeny

Redelings BD Suchard MA 《Systematic biology》2005,54(3):401-418

We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.

7.

The relative contribution of band number to phylogenetic accuracy in AFLP data sets

García-Pereira MJ Caballero A Quesada H 《Journal of evolutionary biology》2011,24(11):2346-2356

We examined the effect of increasing the number of sampled amplified fragment length polymorphism (AFLP) bands to reconstruct an accurate and well-supported AFLP-based phylogeny. In silico AFLP was performed using simulated DNA sequences evolving along balanced and unbalanced model trees with recent, uniform and ancient radiations and average branch lengths (from the most internal node to the tip) ranging from 0.02 to 0.05 substitutions per site. Trees were estimated by minimum evolution (ME) and maximum parsimony (MP) methods from both DNA sequences and virtual AFLP fingerprints. The comparison of the true tree with the estimated AFLP trees suggests that moderate numbers of AFLP bands are necessary to recover the correct topology with high bootstrap support values (i.e. >70%). Fewer bands are necessary for shorter tree lengths and for balanced than for unbalanced tree topologies. However, branch length estimation was rather unreliable and did not improve substantially after a certain number of bands were sampled. These results hold for different levels of genome coverage and number of taxa analysed. In silico AFLP using bacterial genomic DNA sequences recovered a well-supported tree topology that mirrored an empirical phylogeny based on a set of 31 orthologous gene sequences when as few as 263 AFLP bands were scored. These results suggest that AFLPs may be an efficient alternative to traditional DNA sequencing for accurate topology reconstruction of shallow trees when not very short ancestral branches exist.

8.

The geometric mean length, a new statistic to describe the distribution of character steps on a tree

Philippe Grandcolas Tony Robillard Cyrille D'Haese Laure Desutter-Grandcolas Eric Guilbert Jérôme Murienne 《Cladistics : the international journal of the Willi Hennig Society》2004,20(3):219-222

The geometric mean length (GML) is proposed as a new statistic aimed at describing the evenness of character changes on a tree for a given set of character optimizations. It is the geometric mean of the number of steps on each branch of the tree, varying between a maximum value when all branch lengths are equal, and a minimum value when all branches but one have only one character step. It can be scaled according to its theoretical maximum value, thus indicating a relative GML that allows a comparison of the evenness of character steps between different tree topologies.
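The definition above can be illustrated with a short sketch. It takes the number of steps on each branch and returns their geometric mean; the relative version divides by the arithmetic mean of the steps, which is the value the geometric mean reaches when all branches are equal. Using the arithmetic mean as the "theoretical maximum" is an assumption made here for illustration, as are the function names; the abstract does not give the exact scaling.

```python
import math

def geometric_mean_length(steps):
    """Geometric mean of per-branch step counts (each count must be >= 1)."""
    if not steps or any(s < 1 for s in steps):
        raise ValueError("every branch needs at least one character step")
    return math.exp(sum(math.log(s) for s in steps) / len(steps))

def relative_gml(steps):
    """GML scaled by its assumed maximum: the arithmetic mean of the steps."""
    return geometric_mean_length(steps) / (sum(steps) / len(steps))

# Two hypothetical trees with the same total of 12 steps on 4 branches.
even = [3, 3, 3, 3]    # changes spread evenly
skewed = [1, 1, 1, 9]  # all branches but one have a single step
print(relative_gml(even), relative_gml(skewed))
```

With the same total number of steps, the evenly spread tree scores close to 1 and the skewed tree scores lower, which is the behaviour the abstract describes.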

9.

Improving accuracy of multiple sequence alignment algorithms based on alignment of neighboring residues

Yue Lu Sing-Hoi Sze 《Nucleic acids research》2009,37(2):463-472

While most of the recent improvements in multiple sequence alignment accuracy are due to better use of vertical information, which includes the incorporation of consistency-based pairwise alignments and the use of profile alignments, we observe that it is possible to further improve accuracy by taking into account alignment of neighboring residues when aligning two residues, thus making better use of horizontal information. By modifying existing multiple alignment algorithms to make use of horizontal information, we show that this strategy is able to consistently improve over existing algorithms on a few sets of benchmark alignments that are commonly used to measure alignment accuracy, and the average improvements in accuracy can be as much as 1–3% on protein sequence alignment and 5–10% on DNA/RNA sequence alignment. Unlike previous algorithms, consistent average improvements can be obtained across all identity levels.

10.

Phylogenetic position of Salinibacter ruber based on concatenated protein alignments

Soria-Carrasco V Valens-Vadell M Peña A Antón J Amann R Castresana J Rosselló-Mora R 《Systematic and applied microbiology》2007,30(3):171-179

A total of 22 genes from the genome of Salinibacter ruber strain M31 were selected in order to study the phylogenetic position of this species based on protein alignments. The selection of the genes was based on their essential function for the organism, dispersion within the genome, and sufficient informative length of the final alignment. For each gene, an individual phylogenetic analysis was performed and compared with the resulting tree based on the concatenation of the 22 genes, which rendered a single alignment of 10,757 homologous positions. In addition to the manually chosen genes, an automatically selected data set of 74 orthologous genes was used to reconstruct a tree based on 17,149 homologous positions. Although single genes supported different topologies, the tree topology of both concatenated data sets was shown to be identical to that previously observed based on small subunit (SSU) rRNA gene analysis, in which S. ruber was placed together with Bacteroidetes. In both concatenated data sets the bootstrap support was very high, but an analysis with a gradually lower number of genes indicated that the bootstrap support was greatly reduced with fewer than 12 genes. The results indicate that tree reconstructions based on concatenating large numbers of protein coding genes seem to produce tree topologies with similar resolution to that of the single 16S rRNA gene trees. For classification purposes, 16S rRNA gene analysis may remain the most pragmatic approach to infer genealogic relationships.

11.

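Extraction and modeling of branch attributes in larch plantations based on terrestrial laser scanning (total citations: 1; self-citations: 0; citations by others: 1)

Zhang Ying (张颖) Jia Weiwei (贾炜玮) 《应用生态学报 (Chinese Journal of Applied Ecology)》2021,32(7):2505-2513

Terrestrial laser scanning (TLS) makes it possible to collect data in forests non-destructively. Using TLS data and point-cloud processing software in a human-computer interactive workflow, we obtained 1266 sets of branch measurements from 26 larch sample trees, including branch height, chord length, branch length, branch angle, basal diameter, and arch height. The mean of the maximum relative branch height at which branches could be extracted was 0.83. Among the extracted branch attributes, extraction accuracy ranked as branch height > chord length > branch length > basal diameter (for branches with a basal diameter greater than 20 mm) > arch height. Dividing the crown into four sections showed that, as canopy height increased, branch density increased while the branch extraction rate and extraction accuracy decreased. In addition, because the extraction accuracy of branch basal diameter was low, a basal diameter prediction model was built with chord length, branch height, diameter at breast height, and tree height as independent variables. A comparison of measured, extracted, and model-predicted values for different basal diameters showed that the prediction accuracy for branch basal diameter was higher than the extraction accuracy. For timber bucking, the most valuable part of a tree is its middle and lower portion; this method can fairly accurately extract diameter at breast height, tree height, and the attributes of branches below a relative branch height of 0.8, providing the parameters needed to build wood quality models.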

12.

Pandit: a database of protein and associated nucleotide domains with inferred trees

Whelan S de Bakker PI Goldman N 《Bioinformatics (Oxford, England)》2003,19(12):1556-1563

MOTIVATION: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. RESULTS: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.

13.

In vitro RNA random pools are not structurally diverse: a computational analysis (total citations: 2; self-citations: 0; citations by others: 2)

Gevertz J Gan HH Schlick T 《RNA (New York, N.Y.)》2005,11(6):853-863

In vitro selection of functional RNAs from large random sequence pools has led to the identification of many ligand-binding and catalytic RNAs. However, the structural diversity in random pools is not well understood. Such an understanding is a prerequisite for designing sequence pools to increase the probability of finding complex functional RNA by in vitro selection techniques. Toward this goal, we have generated by computer five random pools of RNA sequences of length up to 100 nt to mimic experiments and characterized the distribution of associated secondary structural motifs using sets of possible RNA tree structures derived from graph theory techniques. Our results show that such random pools heavily favor simple topological structures: For example, linear stem-loop and low-branching motifs are favored rather than complex structures with high-order junctions, as confirmed by known aptamers. Moreover, we quantify the rise of structural complexity with sequence length and report the dominant class of tree motifs (characterized by vertex number) for each pool. These analyses show not only that random pools do not lead to a uniform distribution of possible RNA secondary topologies; they also point to avenues for designing pools with specific simple and complex structures in equal abundance, with the goal of broadening the range of functional RNAs discovered by in vitro selection. Specifically, the optimal RNA sequence pool length to identify a structure with x stems is 20x.

14.

Exploring the relationship between sequence similarity and accurate phylogenetic trees

Cantarel BL Morrison HG Pearson W 《Molecular biology and evolution》2006,23(11):2090-2100

We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100-400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than $10^{-60}$, produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees. For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag-like sequencing errors did not significantly decrease phylogenetic accuracy. In general, although less-divergent sequence families produce more accurate trees, the likelihood of estimating an accurate tree is most dependent on whether radiation in the family was ancient or recent. Accuracy can be improved by combining genes from the same organism when creating species trees or by selecting protein families with the best bootstrap values in comprehensive studies.

15.

Approximate likelihood-ratio test for branches: A fast, accurate, and powerful alternative (total citations: 2; self-citations: 0; citations by others: 2)

Anisimova M Gascuel O 《Systematic biology》2006,55(4):539-552

We revisit statistical tests for branches of evolutionary trees reconstructed upon molecular data. A new, fast, approximate likelihood-ratio test (aLRT) for branches is presented here as a competitive alternative to nonparametric bootstrap and Bayesian estimation of branch support. The aLRT is based on the idea of the conventional LRT, with the null hypothesis corresponding to the assumption that the inferred branch has length 0. We show that the LRT statistic is asymptotically distributed as a maximum of three random variables drawn from the $\chi_0^2 + \chi_1^2$ distribution. The new aLRT of interior branch uses this distribution for significance testing, but the test statistic is approximated in a slightly conservative but practical way as $2(l_1 - l_2)$, i.e., double the difference between the maximum log-likelihood values corresponding to the best tree and the second best topological arrangement around the branch of interest. Such a test is fast because the log-likelihood value $l_2$ is computed by optimizing only over the branch of interest and the four adjacent branches, whereas other parameters are fixed at their optimal values corresponding to the best ML tree. The performance of the new test was studied on simulated 4-, 12-, and 100-taxon data sets with sequences of different lengths. The aLRT is shown to be accurate, powerful, and robust to certain violations of model assumptions. The aLRT is implemented within the algorithm used by the recent fast maximum likelihood tree estimation program PHYML (Guindon and Gascuel, 2003).
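As a rough illustration of the test described above, the sketch below computes the statistic as twice the difference between the two log-likelihoods and converts it to an approximate p-value. Two assumptions are made here that the abstract does not state: the mixture is taken to weight its two chi-square components equally, and the three variables whose maximum is taken are treated as independent. The function names and log-likelihood values are hypothetical, and this is not the PHYML implementation.

```python
import math

def alrt_statistic(lnl_best, lnl_second):
    """Twice the log-likelihood gap between the best tree and the second-best
    arrangement around the branch of interest."""
    return 2.0 * (lnl_best - lnl_second)

def chi2_1df_cdf(x):
    """CDF of a chi-square variable with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def approx_p_value(stat):
    """P(max of three mixture draws >= stat).

    ASSUMPTIONS: equal-weight mixture of a point mass at 0 (chi-square with
    0 df) and a chi-square with 1 df; the three draws are independent.
    """
    if stat <= 0:
        return 1.0
    mixture_cdf = 0.5 + 0.5 * chi2_1df_cdf(stat)
    return 1.0 - mixture_cdf ** 3

# Hypothetical log-likelihoods for the best and second-best arrangements.
stat = alrt_statistic(-1000.0, -1003.0)
print(stat, approx_p_value(stat))
```

A larger gap between the two log-likelihoods gives a smaller p-value, that is, stronger support for the branch.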

16.

Barking Up The Wrong Treelength: The Impact of Gap Penalty on Alignment and Tree Accuracy

Liu Kevin Nelesen Serita Raghavan Sindhu Linder C. Randal Warnow Tandy 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2009,6(1):7-21

Several methods have been developed for simultaneous estimation of alignment and tree, of which POY is the most popular. In a 2007 paper published in Systematic Biology, Ogden and Rosenberg reported on a simulation study in which they compared POY to estimating the alignment using ClustalW and then analyzing the resultant alignment using maximum parsimony. They found that ClustalW+MP outperformed POY with respect to alignment and phylogenetic tree accuracy, and they concluded that simultaneous estimation techniques are not competitive with two-phase techniques. Our paper presents a simulation study in which we focus on the NP-hard optimization problem that POY addresses: minimizing treelength. Our study considers the impact of the gap penalty and suggests that the poor performance observed for POY by Ogden and Rosenberg is due to the simple gap penalties they used to score alignment/tree pairs. Our study suggests that optimizing under an affine gap penalty might produce alignments that are better than ClustalW alignments, and competitive with those produced by the best current alignment methods. We also show that optimizing under this affine gap penalty produces trees whose topological accuracy is better than ClustalW+MP, and competitive with the current best two-phase methods.
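The difference between a simple and an affine gap penalty can be sketched as follows. A simple penalty charges every gap position the same amount, whereas an affine penalty charges an opening cost once per run of gaps plus an extension cost per position, so one long indel is cheaper than many scattered ones. The cost values, the function names, and the convention that the first position of a run pays both the opening and the extension cost are illustrative assumptions, not the settings used in the study.

```python
def gap_runs(aligned_seq, gap_char="-"):
    """Lengths of maximal runs of gap characters in one aligned sequence."""
    runs, current = [], 0
    for ch in aligned_seq:
        if ch == gap_char:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def simple_gap_cost(aligned_seq, per_gap=1.0):
    """Every gap position costs the same, regardless of how gaps cluster."""
    return per_gap * sum(gap_runs(aligned_seq))

def affine_gap_cost(aligned_seq, gap_open=4.0, gap_extend=1.0):
    """Each run pays gap_open once plus gap_extend per position (assumed convention)."""
    return sum(gap_open + gap_extend * n for n in gap_runs(aligned_seq))

one_indel = "AC----GT"     # a single indel of length 4
scattered = "A-C-G-T-"     # four indels of length 1
print(simple_gap_cost(one_indel), simple_gap_cost(scattered))
print(affine_gap_cost(one_indel), affine_gap_cost(scattered))
```

Under the simple penalty both sequences cost the same; under the affine penalty the single long indel is cheaper. This only shows how the two penalty families score indel runs; it does not compute a treelength.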

17.

Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments

Daniel A Pollard Alan M Moses Venky N Iyer Michael B Eisen 《BMC bioinformatics》2006,7(1):376-14

Background

Molecular evolutionary studies of noncoding sequences rely on multiple alignments. Yet how multiple alignment accuracy varies across sequence types, tree topologies, divergences and tools, and further how this variation impacts specific inferences, remains unclear.

18.

SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees

Liu K Warnow TJ Holder MT Nelesen SM Yu J Stamatakis AP Linder CR 《Systematic biology》2012,61(1):90-106

Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.

19.

Detecting interspecific recombination with a pruned probabilistic divergence measure

Husmeier D Wright F Milne I 《Bioinformatics (Oxford, England)》2005,21(9):1797-1806

MOTIVATION: A promising sliding-window method for the detection of interspecific recombination in DNA sequence alignments is based on the monitoring of changes in the posterior distribution of tree topologies with a probabilistic divergence measure. However, as the number of taxa in the alignment increases or the sliding-window size decreases, the posterior distribution becomes increasingly diffuse. This diffusion blurs the probabilistic divergence signal and adversely affects the detection accuracy. The present study investigates how this shortcoming can be redeemed with a pruning method based on post-processing clustering, using the Robinson-Foulds distance as a metric in tree topology space. RESULTS: An application of the proposed scheme to three synthetic and two real-world DNA sequence alignments illustrates the amount of improvement that can be obtained with the pruning method. The study also includes a comparison with two established recombination detection methods: Recpars and the DSS (difference of sum of squares) method. AVAILABILITY: Software, data and further supplementary material are available at the following website: http://www.bioss.sari.ac.uk/~dirk/Supplements/
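The Robinson-Foulds distance mentioned above counts the bipartitions (splits) of the taxon set that occur in one tree but not in the other. The sketch below assumes each unrooted tree has already been reduced to its set of non-trivial splits; the helper names and the five-taxon example are hypothetical, and the clustering and pruning steps of the study are not shown.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon,
    so that the same split always maps to the same frozenset."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side

def rf_distance(splits_a, splits_b):
    """Number of splits present in exactly one of the two trees."""
    return len(set(splits_a) ^ set(splits_b))

taxa = {"A", "B", "C", "D", "E"}

# Tree 1: ((A,B),C,(D,E))   Tree 2: ((A,C),B,(D,E))
tree1 = {canonical_split({"A", "B"}, taxa), canonical_split({"D", "E"}, taxa)}
tree2 = {canonical_split({"A", "C"}, taxa), canonical_split({"D", "E"}, taxa)}

print(rf_distance(tree1, tree2))  # 2
```

The two trees share the split separating D and E from the rest and differ in one split each, giving a distance of 2. Some software reports half this value or normalizes by the maximum possible distance.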

20.

Exact and heuristic algorithms for the Indel Maximum Likelihood Problem. (total citations: 1; self-citations: 0; citations by others: 1)

Abdoulaye Banire Diallo Vladimir Makarenkov Mathieu Blanchette 《Journal of computational biology》2007,14(4):446-461

Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing the most likely scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, which we call the Indel Maximum Likelihood Problem (IMLP), is an important step toward the reconstruction of ancestral genomic sequences, and is important for studying evolutionary processes, genome function, adaptation and convergence. We solve the IMLP using a new type of tree hidden Markov model whose states correspond to single-base evolutionary scenarios and where transitions model dependencies between neighboring columns. The standard Viterbi and Forward-backward algorithms are optimized to produce the most likely ancestral reconstruction and to compute the level of confidence associated with specific regions of the reconstruction. A heuristic is presented to make the method practical for large data sets, while retaining an extremely high degree of accuracy. The methods are illustrated on a 1-Mb alignment of the CFTR regions from 12 mammals.