首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
ABSTRACT: BACKGROUND: A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software restricts to the time-reversible models and it is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution rates at different lineages). RESULTS: We present the first package designed to generate MSAs evolving under discrete-time Markov processes on phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm produces DNA alignments of desired length. GenNon-h is publicly available for download. CONCLUSION: The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree. GenNon-h provides the user with the nonstationary or nonhomogeneous phylogenetic data that is well suited for testing complex biological hypotheses, exploring the limits of the reconstruction algorithms and their robustness to such models.  相似文献   

2.
Most evolutionary tree estimation methods for DNA sequences ignore or inefficiently use the phylogenetic information contained within shared patterns of gaps. This is largely due to the computational difficulties in implementing models for insertions and deletions. A simple way to incorporate this information is to treat a gap as a fifth character (with the four nucleotides being the other four) and to incorporate it within a Markov model of nucleotide substitution. This idea has been dismissed in the past, since it treats a multiple-site insertion or deletion as a sequence of independent events rather than a single event. While this is true, we have found that under many circumstances it is better to incorporate gap information inadequately than to ignore it, at least for topology estimation. We propose an extension to a class of nucleotide substitution models to incorporate the gap character and show that, for data sets (both real and simulated) with short and medium gaps, these models do lead to effective use of the information contained within insertions and deletions. We also implement an ad hoc method in which the likelihood at columns containing multiple-site gaps is downweighted in order to avoid giving them undue influence. The precision of the estimated tree, assessed using Markov chain Monte Carlo techniques to find the posterior distribution over tree space, improves under these five-state models compared with standard methods which effectively ignore gaps.  相似文献   

3.
MOTIVATION: Most molecular phylogenies are based on sequence alignments. Consequently, they fail to account for modes of sequence evolution that involve frequent insertions or deletions. Here we present a method for generating accurate gene and species phylogenies from whole genome sequence that makes use of short character string matches not placed within explicit alignments. In this work, the singular value decomposition of a sparse tetrapeptide frequency matrix is used to represent the proteins of organisms uniquely and precisely as vectors in a high-dimensional space. Vectors of this kind can be used to calculate pairwise distance values based on the angle separating the vectors, and the resulting distance values can be used to generate phylogenetic trees. Protein trees so derived can be examined directly for homologous sequences. Alternatively, vectors defining each of the proteins within an organism can be summed to provide a vector representation of the organism, which is then used to generate species trees. RESULTS: Using a large mitochondrial genome dataset, we have produced species trees that are largely in agreement with previously published trees based on the analysis of identical datasets using different methods. These trees also agree well with currently accepted phylogenetic theory. In principle, our method could be used to compare much larger bacterial or nuclear genomes in full molecular detail, ultimately allowing accurate gene and species relationships to be derived from a comprehensive comparison of complete genomes. In contrast to phylogenetic methods based on alignments, sequences that evolve by relative insertion or deletion would tend to remain recognizably similar.  相似文献   

4.
We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100-400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than 10(-60), produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees. For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag-like sequencing errors did not significantly decrease phylogenetic accuracy. In general, although less-divergent sequence families produce more accurate trees, the likelihood of estimating an accurate tree is most dependent on whether radiation in the family was ancient or recent. Accuracy can be improved by combining genes from the same organism when creating species trees or by selecting protein families with the best bootstrap values in comprehensive studies.  相似文献   

5.
We describe an exhaustive and greedy algorithm for improving the accuracy of multiple sequence alignment. A simple progressive alignment approach is employed to provide initial alignments. The initial alignment is then iteratively optimized against an objective function. For any working alignment, the optimization involves three operations: insertions, deletions and shuffles of gaps. The optimization is exhaustive since the algorithm applies the above operations to all eligible positions of an alignment. It is also greedy since only the operation that gives the best improving objective score will be accepted. The algorithms have been implemented in the EGMA (Exhaustive and Greedy Multiple Alignment) package using Java programming language, and have been evaluated using the BAliBASE benchmark alignment database. Although EGMA is not guaranteed to produce globally optimized alignment, the tests indicate that EGMA is able to build alignments with high quality consistently, compared with other commonly used iterative and non-iterative alignment programs. It is also useful for refining multiple alignments obtained by other methods.  相似文献   

6.
SUMMARY: BAli-Phy is a Bayesian posterior sampler that employs Markov chain Monte Carlo to explore the joint space of alignment and phylogeny given molecular sequence data. Simultaneous estimation eliminates bias toward inaccurate alignment guide-trees, employs more sophisticated substitution models during alignment and automatically utilizes information in shared insertion/deletions to help infer phylogenies. AVAILABILITY: Software is available for download at http://www.biomath.ucla.edu/msuchard/bali-phy.  相似文献   

7.

Background  

Phylogenies of rapidly evolving pathogens can be difficult to resolve because of the small number of substitutions that accumulate in the short times since divergence. To improve resolution of such phylogenies we propose using insertion and deletion (indel) information in addition to substitution information. We accomplish this through joint estimation of alignment and phylogeny in a Bayesian framework, drawing inference using Markov chain Monte Carlo. Joint estimation of alignment and phylogeny sidesteps biases that stem from conditioning on a single alignment by taking into account the ensemble of near-optimal alignments.  相似文献   

8.
Abstract Numerous software packages exist to provide support for quantifying peptides and proteins from mass spectrometry (MS) data. However, many support only a subset of experimental methods or instrument types, meaning that laboratories often have to use multiple software packages. The Progenesis LC-MS software package from Nonlinear Dynamics is a software solution for label-free quantitation. However, many laboratories using Progenesis also wish to employ stable isotope-based methods that are not natively supported in Progenesis. We have developed a Java programming interface that can use the output files produced by Progenesis, allowing the basic MS features quantified across replicates to be used in a range of different experimental methods. We have developed post-processing software (the Progenesis Post-Processor) to embed Progenesis in the analysis of stable isotope labeling data and top3 pseudo-absolute quantitation. We have also created export ability to the new data standard, mzQuantML, produced by the Proteomics Standards Initiative to facilitate the development and standardization process. The software is provided to users with a simple graphical user interface for accessing the different features. The underlying programming interface may also be used by Java developers to develop other routines for analyzing data produced by Progenesis.  相似文献   

9.
Motivation: The topic of this paper is the estimation of alignments and mutation rates based on stochastic sequence-evolution models that allow insertions and deletions of subsequences ('fragments') and not just single bases. The model we propose is a variant of a model introduced by Thorne et al., (J. Mol. Evol., 34, 3-16, 1992). The computational tractability of the model depends on certain restrictions in the insertion/deletion process; possible effects we discuss. Results: The process of fragment insertion and deletion in the sequence-evolution model induces a hidden Markov structure at the level of alignments and thus makes possible efficient statistical alignment algorithms. As an example we apply a sampling procedure to assess the variability in alignment and mutation parameter estimates for HVR1 sequences of human and orangutan, improving results of previous work. Simulation studies give evidence that estimation methods based on the proposed model also give satisfactory results when applied to data for which the restrictions in the insertion/deletion process do not hold. Availability: The source code of the software for sampling alignments and mutation rates for a pair of DNA sequences according to the fragment insertion and deletion model is freely available from http://www.math.uni-frankfurt.de/~stoch/software/mcmcsalut under the terms of the GNU public license (GPL, 2000).  相似文献   

10.
We describe a procedure for model averaging of relaxed molecular clock models in Bayesian phylogenetics. Our approach allows us to model the distribution of rates of substitution across branches, averaged over a set of models, rather than conditioned on a single model. We implement this procedure and test it on simulated data to show that our method can accurately recover the true underlying distribution of rates. We applied the method to a set of alignments taken from a data set of 12 mammalian species and uncovered evidence that lognormally distributed rates better describe this data set than do exponentially distributed rates. Additionally, our implementation of model averaging permits accurate calculation of the Bayes factor(s) between two or more relaxed molecular clock models. Finally, we introduce a new computational approach for sampling rates of substitution across branches that improves the convergence of our Markov chain Monte Carlo algorithms in this context. Our methods are implemented under the BEAST 1.6 software package, available at http://beast-mcmc.googlecode.com.  相似文献   

11.
Qian B  Goldstein RA 《Proteins》2001,45(1):102-104
Protein sequence alignment has become a widely used method in the study of newly sequenced proteins. Most sequence alignment methods use an affine gap penalty to assign scores to insertions and deletions. Although affine gap penalties represent the relative ease of extending a gap compared with initializing a gap, it is still an obvious oversimplification of the real processes that occur during sequence evolution. To improve the efficiency of sequence alignment methods and to obtain a better understanding of the process of sequence evolution, we wanted to find a more accurate model of insertions and deletions in homologous proteins. In this work, we extract the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins (sequence identity < 25%) using alignments based on their common structures. We observe a distribution of gaps that can be fitted with a multiexponential with four distinct components. The results suggest new approaches to modeling insertions and deletions in sequence alignments.  相似文献   

12.
Bayesian phylogenetics with BEAUti and the BEAST 1.7   总被引:7,自引:0,他引:7  
Computational evolutionary biology, statistical phylogenetics and coalescent-based population genetics are becoming increasingly central to the analysis and understanding of molecular sequence data. We present the Bayesian Evolutionary Analysis by Sampling Trees (BEAST) software package version 1.7, which implements a family of Markov chain Monte Carlo (MCMC) algorithms for Bayesian phylogenetic inference, divergence time dating, coalescent analysis, phylogeography and related molecular evolutionary analyses. This package includes an enhanced graphical user interface program called Bayesian Evolutionary Analysis Utility (BEAUti) that enables access to advanced models for molecular sequence and phenotypic trait evolution that were previously available to developers only. The package also provides new tools for visualizing and summarizing multispecies coalescent and phylogeographic analyses. BEAUti and BEAST 1.7 are open source under the GNU lesser general public license and available at http://beast-mcmc.googlecode.com and http://beast.bio.ed.ac.uk.  相似文献   

13.
This article presents a statistical method for detecting recombination in DNA sequence alignments, which is based on combining two probabilistic graphical models: (1) a taxon graph (phylogenetic tree) representing the relationship between the taxa, and (2) a site graph (hidden Markov model) representing interactions between different sites in the DNA sequence alignments. We adopt a Bayesian approach and sample the parameters of the model from the posterior distribution with Markov chain Monte Carlo, using a Metropolis-Hastings and Gibbs-within-Gibbs scheme. The proposed method is tested on various synthetic and real-world DNA sequence alignments, and we compare its performance with the established detection methods RECPARS, PLATO, and TOPAL, as well as with two alternative parameter estimation schemes.  相似文献   

14.
This paper proposes a graphical method for detecting interspecies recombination in multiple alignments of DNA sequences. A fixed-size window is moved along a given DNA sequence alignment. For every position, the marginal posterior probability over tree topologies is determined by means of a Markov chain Monte Carlo simulation. Two probabilistic divergence measures are plotted along the alignment, and are used to identify recombinant regions. The method is compared with established detection methods on a set of synthetic benchmark sequences and two real-world DNA sequence alignments.  相似文献   

15.
Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.  相似文献   

16.
SUMMARY: ProfDist is a user-friendly software package using the profile-neighbor-joining method (PNJ) in inferring phylogenies based on profile distances on DNA or RNA sequences. It is a tool for reconstructing and visualizing large phylogenetic trees providing new and standard features with a special focus on time efficency, robustness and accuracy. AVAILABILITY: A Windows version of ProfDist comes with a graphical user interface and is freely available at http://profdist.bioapps.biozentrum.uni-wuerzburg.de  相似文献   

17.
18.
Gap costs for multiple sequence alignment   总被引:6,自引:0,他引:6  
Standard methods for aligning pairs of biological sequences charge for the most common mutations, which are substitutions, deletions and insertions. Because a single mutation may insert or delete several nucleotides, gap costs that are not directly proportional to gap length are usually the most effective. How to extend such gap costs to alignments of three or more sequences is not immediately obvious, and a variety of approaches have been taken. This paper argues that, since gap and substitution costs together specify optimal alignments, they should be defined using a common rationale. Specifically, a new definition of gap costs for multiple alignments is proposed and compared with previous ones. Since the new definition links a multiple alignment's cost to that of its pairwise projections, it allows knowledge gained about two-sequence alignments to bear on the multiple alignment problem. Also, such linkage is a key element of recent algorithms that have rendered practical the simultaneous alignment of as many as six sequences.  相似文献   

19.

Background  

In recent years there has been a trend of leaving the strict molecular clock in order to infer dating of speciations and other evolutionary events. Explicit modeling of substitution rates and divergence times makes formulation of informative prior distributions for branch lengths possible. Models with birth-death priors on tree branching and auto-correlated or iid substitution rates among lineages have been proposed, enabling simultaneous inference of substitution rates and divergence times. This problem has, however, mainly been analysed in the Markov chain Monte Carlo (MCMC) framework, an approach requiring computation times of hours or days when applied to large phylogenies.  相似文献   

20.
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth-death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program dnamlepsilon improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号