Found 20 similar articles.
1.
Tom A. Williams, Sarah E. Heaps, Svetlana Cherlin, Tom M. W. Nye, Richard J. Boys, T. Martin Embley 《Philosophical transactions of the Royal Society of London. Series B, Biological sciences》2015,370(1678)
The root of a phylogenetic tree is fundamental to its biological interpretation, but standard substitution models do not provide any information on its position. Here, we describe two recently developed models that relax the usual assumptions of stationarity and reversibility, thereby facilitating root inference without the need for an outgroup. We compare the performance of these models on a classic test case for phylogenetic methods, before considering two highly topical questions in evolutionary biology: the deep structure of the tree of life and the root of the archaeal radiation. We show that all three alignments contain meaningful rooting information that can be harnessed by these new models, thus complementing and extending previous work based on outgroup rooting. In particular, our analyses exclude the root of the tree of life from the eukaryotes or Archaea, placing it on the bacterial stem or within the Bacteria. They also exclude the root of the archaeal radiation from several major clades, consistent with analyses using other rooting methods. Overall, our results demonstrate the utility of non-reversible and non-stationary models for rooting phylogenetic trees, and identify areas where further progress can be made.
2.
A maximum likelihood framework for estimating site-specific substitution rates is presented that does not require any prior assumptions about the rate distribution. We show that, when the branching pattern of the underlying tree is known, the analysis of pairs of positions is sufficient to estimate site-specific rates. In the absence of a known topology, we introduce an iterative procedure to estimate simultaneously the branching pattern, the branch lengths, and site-specific substitution rates. Simulations show that the evolutionary rate of fast-evolving sites can be reliably inferred and that the accuracy of rate estimates depends mainly on the number of sequences in the data set. Thus, large sets of aligned sequences are necessary for reliable site-specific rate estimates. The method is applied to the complete mitochondrial DNA sequence of 53 humans, providing a complete picture of the site-specific substitution rates in human mitochondrial DNA.
3.
Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences (cited 13 times in total: 0 self-citations, 13 citations by others)
Although phylogenetic inference of protein-coding sequences continues to dominate the literature, few analyses incorporate evolutionary models that consider the genetic code. This problem is exacerbated by the exclusion of codon-based models from commonly employed model selection techniques, presumably due to the computational cost associated with codon models. We investigated an efficient alternative to standard nucleotide substitution models, in which codon position (CP) is incorporated into the model. We determined the most appropriate model for alignments of 177 RNA virus genes and 106 yeast genes, using 11 substitution models including one codon model and four CP models. The majority of analyzed gene alignments are best described by CP substitution models, rather than by standard nucleotide models, and without the computational cost of full codon models. These results have significant implications for phylogenetic inference of coding sequences as they make it clear that substitution models incorporating CPs not only are a computationally realistic alternative to standard models but may also frequently be statistically superior.
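As a rough illustration of what a codon-position (CP) model partitions, the Python sketch below splits an in-frame coding alignment into its first, second, and third codon positions, so that each partition could be given its own substitution parameters. The function name `split_by_codon_position` and the toy alignment are hypothetical and are not taken from the article; the sketch also assumes that the first alignment column is codon position 1.

```python
def split_by_codon_position(alignment):
    """Split aligned, in-frame coding sequences into three codon-position partitions.

    `alignment` maps a sequence name to an aligned sequence. The first column
    is assumed to be codon position 1 (an assumption of this sketch).
    """
    partitions = {1: {}, 2: {}, 3: {}}
    for name, seq in alignment.items():
        for pos in (1, 2, 3):
            # Every third column, starting at the given codon position.
            partitions[pos][name] = seq[pos - 1::3]
    return partitions


# Hypothetical toy alignment of three codons per sequence.
toy = {"taxon_a": "ATGGCCAAA", "taxon_b": "ATGGCTAAG"}
parts = split_by_codon_position(toy)
```

In this toy case the two sequences differ only at third codon positions, which is the kind of heterogeneity a CP model is meant to capture.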
4.
Background
We compared two methods of rooting a phylogenetic tree: the stationary and the nonstationary substitution processes. These methods do not require an outgroup.
5.
Piontkivska H 《Molecular phylogenetics and evolution》2004,31(3):865-873
Choice of a substitution model is a crucial step in the maximum likelihood (ML) method of phylogenetic inference, and investigators tend to prefer complex mathematical models to simple ones. However, when complex models with many parameters are used, the extent of noise in statistical inferences increases, and thus complex models may not produce the true topology with a higher probability than simple ones. This problem was studied using computer simulation. When the number of nucleotides used was relatively large (1000 bp), the HKY+Gamma model showed smaller d(T) (topological distance between the inferred and the true trees) than the JC and Kimura models. In the cases of shorter sequences (300 bp), simpler models and search algorithms, such as the JC model and SA+NNI search, were found to be as efficient as more complicated searches and models in terms of topological distances, although the topologies obtained under the HKY+Gamma model had the highest likelihood values. The performance of the relatively simple search algorithm SA+NNI was found to be essentially the same as that of the more extensive SA+TBR search under all models studied. Similarly to the conclusions reached by Takahashi and Nei [Mol. Biol. Evol. 17 (2000) 1251], our results indicate that simple models can be as efficient as complex models, and that use of complex models does not necessarily give more reliable trees compared with simple models.
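To make the contrast between the simple models concrete, the sketch below computes pairwise distances under the Jukes-Cantor (JC) and Kimura two-parameter models from a pair of aligned sequences. It uses the standard closed-form corrections and is not the simulation code of the article; the function names are hypothetical, and the sketch assumes ungapped sequences of equal length over A, C, G, T.

```python
import math

PURINES = {"A", "G"}


def _site_differences(seq1, seq2):
    """Return (transition proportion P, transversion proportion Q)."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n


def jc69_distance(seq1, seq2):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    P, Q = _site_differences(seq1, seq2)
    p = P + Q
    return -0.75 * math.log(1 - 4 * p / 3)


def k80_distance(seq1, seq2):
    """Kimura two-parameter distance: d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)."""
    P, Q = _site_differences(seq1, seq2)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
```

JC needs only the overall proportion of differing sites, whereas the Kimura model separates transitions from transversions, which is one example of the extra parameters that more complex models must estimate from the same data.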
6.
We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. Measuring the accuracy of multiple alignment algorithms with reference to BAliBASE (a database of structural reference alignments), our substitution matrices consistently outperform the PAM series, with the improvement steadily increasing as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.
7.
We prove that a wide class of Markov models of neighbor-dependent substitution processes on the integer line is solvable. This class contains some models of nucleotide substitutions recently introduced and studied empirically by molecular biologists. We show that the polynucleotide frequencies at equilibrium solve some finite-size linear systems. This provides, to our knowledge for the first time, explicit and algebraic formulas for the stationary frequencies of non-degenerate neighbor-dependent models of DNA substitutions. Furthermore, we show that the dynamics of these stochastic processes and their distribution at equilibrium exhibit some stringent, rather unexpected, independence properties. For example, nucleotide sites at distance at least three evolve independently, and all the sites, when encoded as purines and pyrimidines, evolve independently.
8.
Carreras M, Gianti E, Sartori L, Plyte SE, Isacchi A, Bosotti R 《Genomics, Proteomics & Bioinformatics》2005,3(1):58-60
PoInTree (Polar and Interactive Tree) is an application that allows users to build, visualize, and customize phylogenetic trees in a polar, interactive, and highly flexible view. It takes as input a FASTA file or multiple alignment formats. Phylogenetic tree calculation is based on a sequence distance method and utilizes the Neighbor Joining (NJ) algorithm. It also allows displaying precalculated trees of the major protein families based on Pfam classification. In PoInTree, nodes can be dynamically opened and closed, and distances between genes are graphically represented. The tree root can be centered on a selected leaf. A text search mechanism, color-coding, and labeling display are integrated. The visualizer can be connected to an Oracle database containing information on sequences and other biological data, helping to guide their interpretation within a given protein family across multiple species. The application is written in Borland Delphi and based on the VCL TeeChart Pro 6 graphical component (Steema Software).
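For readers unfamiliar with Neighbor Joining, the sketch below shows only its pair-selection step: from a distance matrix, it picks the pair (i, j) minimizing Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). This is a generic Python illustration of the textbook criterion, not PoInTree's Delphi implementation; the function name and the five-taxon matrix are hypothetical.

```python
def nj_pick_pair(dist):
    """Return ((i, j), q) for the pair with the smallest NJ Q-criterion value.

    `dist` is a symmetric matrix (list of lists) with zeros on the diagonal.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Hypothetical distances among five taxa (indices 0..4).
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q_value = nj_pick_pair(example)
```

Note that the selected pair is taxa 0 and 1 rather than the closest pair (taxa 3 and 4, at distance 3): the criterion corrects each distance for how far both taxa are from everything else. A full NJ run would then merge the pair into a new node, update the matrix, and repeat.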
9.
Richard Gouy, Denis Baurain, Hervé Philippe 《Philosophical transactions of the Royal Society of London. Series B, Biological sciences》2015,370(1678)
This article aims to shed light on difficulties in rooting the tree of life (ToL) and to explore the (sociological) reasons underlying the limited interest in accurately addressing this fundamental issue. First, we briefly review the difficulties plaguing phylogenetic inference and the ways to improve the modelling of the substitution process, which is highly heterogeneous, both across sites and over time. We further observe that enriched taxon samplings, better gene samplings and clever data removal strategies have led to numerous revisions of the ToL, and that these improved shallow phylogenies nearly always relocate simple organisms higher in the ToL provided that long-branch attraction artefacts are kept at bay. Then, we note that, despite the flood of genomic data available since 2000, there has been a surprisingly low interest in inferring the root of the ToL. Furthermore, the rare studies dealing with this question were almost always based on methods dating from the 1990s that have been shown to be inaccurate for much more shallow issues! This leads us to argue that the current consensus about a bacterial root for the ToL can be traced back to the prejudice of Aristotle's Great Chain of Beings, in which simple organisms are ancestors of more complex life forms. Finally, we demonstrate that even the best models cannot yet handle the complexity of the evolutionary process encountered both at shallow depth, when the outgroup is too distant, and at the level of the inter-domain relationships. Altogether, we conclude that the commonly accepted bacterial root is still unproven and that the root of the ToL should be revisited using phylogenomic supermatrices to ensure that new evidence for eukaryogenesis, such as the recently described Lokiarchaeota, is interpreted in a sound phylogenetic framework.
10.
Empirical models of substitution are often used in protein sequence analysis because the large alphabet of amino acids requires that many parameters be estimated in all but the simplest parametric models. When information about structure is used in the analysis of substitutions in structured RNA, a similar situation occurs. The number of parameters necessary to adequately describe the substitution process increases in order to model the substitution of paired bases. We have developed a method to obtain substitution rate matrices empirically from RNA alignments that include structural information in the form of base pairs. Our data consisted of alignments from the European Ribosomal RNA Database of Bacterial and Eukaryotic Small Subunit and Large Subunit Ribosomal RNA (Wuyts et al. 2001. Nucleic Acids Res. 29:175-177; Wuyts et al. 2002. Nucleic Acids Res. 30:183-185). Using secondary structural information, we converted each sequence in the alignments into a sequence over a 20-symbol code: one symbol for each of the four individual bases, and one symbol for each of the 16 ordered pairs. Substitutions in the coded sequences are defined in the natural way, as observed changes between two sequences at any particular site. For given ranges (windows) of sequence divergence, we obtained substitution frequency matrices for the coded sequences. Using a technique originally developed for modeling amino acid substitutions (Veerassamy, Smith, and Tillier. 2003. J. Comput. Biol. 10:997-1010), we were able to estimate the actual evolutionary distance for each window. The actual evolutionary distances were used to derive instantaneous rate matrices, and from these we selected a universal rate matrix. The universal rate matrices were incorporated into the Phylip Software package (Felsenstein 2002. http://evolution.genetics.washington.edu/phylip.html), and we analyzed the ribosomal RNA alignments using both distance and maximum likelihood methods.
The empirical substitution models performed well on simulated data, and produced reasonable evolutionary trees for 16S ribosomal RNA sequences from sequenced Bacterial genomes. Empirical models have the advantage of being easily implemented, and the fact that the code consists of 20 symbols makes the models easily incorporated into existing programs for protein sequence analysis. In addition, the models are useful for simulating the evolution of RNA sequence and structure simultaneously.
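The 20-symbol recoding can be pictured with a small Python sketch. It assumes, purely for illustration, that the secondary structure is given in dot-bracket notation and that every paired site is coded by the ordered pair (own base, partner base); the abstract does not state the input format or how the two members of a pair are laid out, and the function name `encode_with_pairs` is hypothetical.

```python
def encode_with_pairs(seq, structure):
    """Recode an RNA sequence over 4 unpaired-base symbols plus 16 ordered-pair symbols.

    `structure` is a dot-bracket string of the same length as `seq`
    (an assumed input format): '(' and ')' mark paired sites, '.' unpaired ones.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    partner = {}
    open_sites = []
    for i, ch in enumerate(structure):
        if ch == "(":
            open_sites.append(i)
        elif ch == ")":
            j = open_sites.pop()  # most recent unmatched '('
            partner[i], partner[j] = j, i
    coded = []
    for i, base in enumerate(seq):
        if i in partner:
            coded.append(base + seq[partner[i]])  # ordered pair symbol, e.g. "GC"
        else:
            coded.append(base)  # single-base symbol
    return coded


# Hypothetical hairpin: G pairs with C, three unpaired A's in the loop.
coded = encode_with_pairs("GAAAC", "(...)")
```

With four bases there are 4 single-base symbols and 4 x 4 = 16 ordered pairs, which gives the 20 symbols and explains why the coded sequences fit programs written for the 20 amino acids.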
11.
Simon Y.W. Ho 《Biology letters》2009,5(3):421-424
Molecular evolutionary rates can show significant variation among lineages, complicating the task of estimating substitution rates and divergence times using phylogenetic methods. Accordingly, relaxed molecular clock models have been developed to accommodate such rate heterogeneity, but these often make the assumption of rate autocorrelation among lineages. In this paper, I examine the validity of this assumption.
12.
Summary: A method of estimating the number of nucleotide substitutions from amino acid sequence data is developed by using Dayhoff's mutation probability matrix. This method takes into account the effect of nonrandom amino acid substitutions and gives an estimate which is similar to the value obtained by Fitch's counting method, but larger than the estimate obtained under the assumption of random substitutions (Jukes and Cantor's formula). Computer simulations based on Dayhoff's mutation probability matrix have suggested that Jukes and Holmquist's method of estimating the number of nucleotide substitutions gives an overestimate when amino acid substitution is not random and the variance of the estimate is generally very large. It is also shown that when the number of nucleotide substitutions is small, this method tends to give an overestimate even when amino acid substitution is purely at random.
13.
IQPNNI: moving fast through tree space and stopping in time (cited 12 times in total: 0 self-citations, 12 citations by others)
An efficient tree reconstruction method (IQPNNI) is introduced to reconstruct a phylogenetic tree based on DNA or amino acid sequence data. Our approach combines various fast algorithms to generate a list of potential candidate trees. The key ingredient is the definition of so-called important quartets (IQs), which allow the computation of an intermediate tree in O(n^2) time for n sequences. The resulting tree is then further optimized by applying the nearest neighbor interchange (NNI) operation. Subsequently a random fraction of the sequences is deleted from the best tree found so far. The deleted sequences are then re-inserted in the smaller tree using the important quartet puzzling (IQP) algorithm. These steps are repeated several times and the best tree, with respect to the likelihood criterion, is considered as the inferred phylogenetic tree. Moreover, we suggest a rule which indicates when to stop the search. Simulations show that IQPNNI gives a slightly better accuracy than other programs tested. Moreover, we applied the approach to 218 small subunit rRNA sequences and 500 rbcL sequences. We found trees with higher likelihood compared to the results by others. A program to reconstruct DNA or amino acid based phylogenetic trees is available online (http://www.bi.uni-duesseldorf.de/software/iqpnni).
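The nearest neighbor interchange (NNI) operation mentioned above can be sketched in a few lines of Python. For an internal edge separating subtrees (a, b) from (c, d), NNI yields the two alternative arrangements obtained by swapping one subtree across the edge. This shows the rearrangement only, not the IQPNNI program; the nested-tuple tree representation and the function name are assumptions of the sketch.

```python
def nni_neighbors(a, b, c, d):
    """Return the two NNI rearrangements around the edge separating (a, b) from (c, d).

    Each argument is a subtree (a leaf name or a nested tuple); the current
    arrangement is ((a, b), (c, d)).
    """
    return [
        ((a, c), (b, d)),  # swap b and c across the edge
        ((a, d), (b, c)),  # swap b and d across the edge
    ]


neighbors = nni_neighbors("A", "B", "C", "D")
```

A hill-climbing search would score the current arrangement and both neighbors for every internal edge (by likelihood, in the case of IQPNNI) and keep the best one.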
14.
Analysis of the plant architecture via tree-structured statistical models: the hidden Markov tree models (cited 1 time in total: 0 self-citations, 1 citation by others)
Plant architecture is the result of repetitions that occur through growth and branching processes. During plant ontogeny, changes in the morphological characteristics of plant entities are interpreted as the indirect translation of different physiological states of the meristems. Thus connected entities can exhibit either similar or very contrasted characteristics. We propose a statistical model to reveal and characterize homogeneous zones and transitions between zones within tree-structured data: the hidden Markov tree (HMT) model. This model leads to a clustering of the entities into classes sharing the same 'hidden state'. The application of the HMT model to two plant sets (apple trees and bush willows), measured at annual shoot scale, highlights ordered states defined by different morphological characteristics. The model provides a synthetic overview of state locations, pointing out homogeneous zones or ruptures. It also illustrates where within branching structures, and when during plant ontogeny, morphological changes occur. However, the labelling exhibits some patterns that cannot be described by the model parameters. Some of these limitations are addressed by two alternative HMT families.
15.
Simon D. W. Frost, Erik M. Volz 《Philosophical transactions of the Royal Society of London. Series B, Biological sciences》2013,368(1614)
Epidemiological models have highlighted the importance of population structure in the transmission dynamics of infectious diseases. Using HIV-1 as an example of a model evolutionary system, we consider how population structure affects the shape and the structure of a viral phylogeny in the absence of strong selection at the population level. For structured populations, the number of lineages as a function of time is insufficient to describe the shape of the phylogeny. We develop deterministic approximations for the dynamics of tips of the phylogeny over evolutionary time, the number of ‘cherries’, tips that share a direct common ancestor, and Sackin's index, a commonly used measure of phylogenetic imbalance or asymmetry. We employ cherries both as a measure of asymmetry of the tree as well as a measure of the association between sequences from different groups. We consider heterogeneity in infectiousness associated with different stages of HIV infection, and in contact rates between groups of individuals. In the absence of selection, we find that population structure may have relatively little impact on the overall asymmetry of a tree, especially when only a small fraction of infected individuals is sampled, but may have marked effects on how sequences from different subpopulations cluster and co-cluster.
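The two tree-shape statistics named above are easy to compute on a small example. The Python sketch below counts cherries (internal nodes whose two children are both tips) and computes Sackin's index (the sum, over all tips, of the number of edges between the tip and the root). It illustrates the definitions only and is not the deterministic approximation developed in the article; the nested-tuple tree encoding and the function name `tree_stats` are assumptions.

```python
def tree_stats(tree, depth=0):
    """Return (n_tips, n_cherries, sackin_index) for a rooted tree.

    A tree is a tip (any non-tuple value, e.g. a string) or a tuple of subtrees.
    """
    if not isinstance(tree, tuple):
        return 1, 0, depth  # a tip contributes its depth to Sackin's index
    tips = cherries = sackin = 0
    for child in tree:
        t, c, s = tree_stats(child, depth + 1)
        tips += t
        cherries += c
        sackin += s
    # A cherry is an internal node with exactly two children, both tips.
    if len(tree) == 2 and not any(isinstance(ch, tuple) for ch in tree):
        cherries += 1
    return tips, cherries, sackin


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
balanced_stats = tree_stats(balanced)
caterpillar_stats = tree_stats(caterpillar)
```

For four tips, the balanced tree has two cherries and a Sackin's index of 8, while the fully unbalanced (caterpillar) tree has one cherry and an index of 9; more asymmetric trees have fewer cherries and a larger index.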
16.
Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to influence of the genetic code at short timescales and dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models.
17.
Distance-based reconstruction methods of phylogenetic trees consist of two independent parts: first, inter-species distances are inferred assuming some stochastic model of sequence evolution; then the inferred distances are used to construct a tree. In this paper we concentrate on the task of inter-species distance estimation. Specifically, we characterize the family of valid distance functions for the assumed substitution model and show that deliberate selection of distance function significantly improves the accuracy of distance estimates and, consequently, also improves the accuracy of the reconstructed tree. Our contribution consists of three parts: first, we present a general framework for constructing families of additive distance functions for stochastic evolutionary models. Then, we present a method for selecting (near) optimal distance functions, and we conclude by presenting simulation results which support our theoretical analysis.
18.
Examining rates and patterns of nucleotide substitution in plants (cited 19 times in total: 0 self-citations, 19 citations by others)
Muse SV 《Plant molecular biology》2000,42(1):25-43
Driven by rapid improvements in affordable computing power and by the even faster accumulation of genomic data, the statistical analysis of molecular sequence data has become an active area of interdisciplinary research. Maximum likelihood methods have become mainstream because of their desirable properties and, more importantly, their potential for providing statistically sound solutions in complex data analysis settings. In this chapter, a review of recent literature focusing on rates and patterns of nucleotide substitution in the nuclear, chloroplast, and mitochondrial genomes of plants demonstrates the power and flexibility of these new methods. The emerging picture of the nucleotide substitution process in plants is a complex one. Evolutionary rates are seen to be quite variable, both among genes and among plant lineages. However, there are hints, particularly in the chloroplast, that individual factors can have important effects on many genes simultaneously.
19.
Nonhomogeneous substitution models have been introduced for phylogenetic inference when the substitution process is nonstationary, for example, when sequence composition differs between lineages. Existing models can have many parameters, and it is then difficult and computationally expensive to learn the parameters and to select the optimal model complexity. We extend an existing nonhomogeneous substitution model by introducing a reversible jump Markov chain Monte Carlo method for efficient Bayesian inference of the model order along with other phylogenetic parameters of interest. We also introduce a new hierarchical prior which leads to more reasonable results when only a small number of lineages share a particular substitution process. The method is implemented in the PHASE software, which includes specialized substitution models for RNA genes with conserved secondary structure. We apply an RNA-specific nonhomogeneous model to a structure-based alignment of rRNA sequences spanning the entire tree of life. A previous study of the same genes from a similar set of species found robust evidence for a mesophilic last universal common ancestor (LUCA) by inference of the G+C composition at the root of the tree. In the present study, we find that the helical GC composition at the root is strongly dependent on the root position. With a bacterial rooting, we find that there is no longer strong support for either a mesophile or a thermophile LUCA, although a hyperthermophile LUCA remains unlikely. We discuss reasons why results using only RNA helices may differ from results using all aligned sites when applying nonhomogeneous models to RNA genes.
20.
Constructing amino acid residue substitution classes maximally indicative of local protein structure
Using an information theoretic formalism, we optimize classes of amino acid substitution to be maximally indicative of local protein structure. Our statistically-derived classes are loosely identifiable with the heuristic constructions found in previously published work. However, while these other methods provide a more rigid idealization of physicochemically constrained residue substitution, our classes provide substantially more structural information with many fewer parameters. Moreover, these substitution classes are consistent with the paradigmatic view of the sequence-to-structure relationship in globular proteins which holds that the three-dimensional architecture is predominantly determined by the arrangement of hydrophobic and polar side chains with weak constraints on the actual amino acid identities. More specific constraints are imposed on the placement of prolines, glycines, and the charged residues. These substitution classes have been used in highly accurate predictions of residue solvent accessibility. They could also be used in the identification of homologous proteins, the construction and refinement of multiple sequence alignments, and as a means of condensing and codifying the information in multiple sequence alignments for secondary structure prediction and tertiary fold recognition. © 1996 Wiley-Liss, Inc.