Similar articles
 20 similar articles found (search time: 15 ms)
1.
The necessary and sufficient conditions for the existence of linear invariants under semigroups of probability transition matrices are derived. It is found that a biologically meaningful nucleotide substitution model has linear invariants if and only if it is a submodel of one of the three most general models, which include the so-called balanced and unbalanced transversion models. Each of these three general models is a nucleotide substitution model with six parameters.

2.
Statistical models of evolution are algebraic varieties in the space of joint probability distributions on the leaf colorations of a phylogenetic tree. The phylogenetic invariants of a model are the polynomials which vanish on the variety. Several widely used models for biological sequences have transition matrices that can be diagonalized by means of the Fourier transform of an abelian group. Their phylogenetic invariants form a toric ideal in the Fourier coordinates. We determine generators and Gröbner bases for these toric ideals. For the Jukes-Cantor and Kimura models on a binary tree, our Gröbner bases consist of certain explicitly constructed polynomials of degree at most four.

3.
Statistical models of evolution are algebraic varieties in the space of joint probability distributions on the leaf colorations of a phylogenetic tree. The phylogenetic invariants of a model are the polynomials which vanish on the variety. Several widely used models for biological sequences have transition matrices that can be diagonalized by means of the Fourier transform of an Abelian group. Their phylogenetic invariants form a toric ideal in the Fourier coordinates. We determine generators and Gröbner bases for these toric ideals. For the Jukes-Cantor and Kimura models on a binary tree, our Gröbner bases consist of certain explicitly constructed polynomials of degree at most four.

4.
In phylogenetic inference, an evolutionary model describes the substitution processes along each edge of a phylogenetic tree. Misspecification of the model has important implications for the analysis of phylogenetic data. Conventionally, however, the selection of a suitable evolutionary model is based on heuristics or relies on the choice of an approximate input tree. We introduce a method for model Selection in Phylogenetics based on linear INvariants (SPIn), which uses recent insights on linear invariants to characterize a model of nucleotide evolution for phylogenetic mixtures on any number of components. Linear invariants are constraints among the joint probabilities of the bases in the operational taxonomic units that hold irrespective of the tree topologies appearing in the mixtures. SPIn therefore requires no input tree and is designed to deal with nonhomogeneous phylogenetic data consisting of multiple sequence alignments showing different patterns of evolution, for example, concatenated genes, exons, and/or introns. Here, we report on the results of the proposed method evaluated on multiple sequence alignments simulated under a variety of single-tree and mixture settings for both continuous- and discrete-time models. In the simulations, SPIn successfully recovers the underlying evolutionary model and is shown to perform better than existing approaches.

5.
The method of invariants is an approach to the problem of reconstructing the phylogenetic tree of a collection of m taxa using nucleotide sequence data. Models for the respective probabilities of the 4^m possible vectors of bases at a given site will have unknown parameters that describe the random mechanism by which substitution occurs along the branches of a putative phylogenetic tree. An invariant is a polynomial in these probabilities that, for a given phylogeny, is zero for all choices of the substitution mechanism parameters. If the invariant is typically non-zero for another phylogenetic tree, then estimates of the invariant can be used as evidence to support one phylogeny over another. Previous work of Evans and Speed showed that, for certain commonly used substitution models, the problem of finding a minimal generating set for the ideal of invariants can be reduced to the linear algebra problem of finding a basis for a certain lattice (that is, a free Z-module). They also conjectured that the cardinality of such a generating set can be computed using a simple "degrees of freedom" formula. We verify this conjecture. Along the way, we explain in detail how the observations of Evans and Speed lead to a simple, computationally feasible algorithm for constructing a minimal generating set.
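The definition of an invariant given in this abstract can be illustrated numerically. The sketch below is not taken from the paper: as a simplifying assumption it uses a two-state symmetric substitution model (rather than a four-state nucleotide model) on a four-taxon tree, and all function names and edge probabilities are hypothetical. It computes the joint leaf distribution, transforms it to Fourier coordinates, and evaluates one quadratic polynomial that should vanish on the tree 12|34 but typically not on the tree 13|24.

```python
from itertools import product


def flip(p):
    """2x2 symmetric transition matrix with substitution probability p."""
    return [[1 - p, p], [p, 1 - p]]


def quartet_distribution(split, probs):
    """Joint leaf distribution of a quartet tree (two-state symmetric model).

    split: two pairs of leaf indices, e.g. ((0, 1), (2, 3)) for tree 12|34.
    probs: substitution probabilities for the four pendant edges
           (indexed by leaf), followed by the internal edge.
    """
    pendant = [flip(p) for p in probs[:4]]
    inner = flip(probs[4])
    left, right = split
    dist = {}
    for x in product((0, 1), repeat=4):
        total = 0.0
        # Sum over the states a, b of the two internal nodes (uniform root).
        for a, b in product((0, 1), repeat=2):
            term = 0.5 * inner[a][b]
            for leaf in left:
                term *= pendant[leaf][a][x[leaf]]
            for leaf in right:
                term *= pendant[leaf][b][x[leaf]]
            total += term
        dist[x] = total
    return dist


def fourier(dist, g):
    """Fourier coordinate q_g = sum over x of (-1)^(g.x) * p_x."""
    return sum(
        (-1) ** sum(gi * xi for gi, xi in zip(g, x)) * p
        for x, p in dist.items()
    )


def invariant_12_34(dist):
    """Quadratic polynomial expected to vanish on tree 12|34."""
    return (
        fourier(dist, (1, 1, 0, 0)) * fourier(dist, (0, 0, 1, 1))
        - fourier(dist, (0, 0, 0, 0)) * fourier(dist, (1, 1, 1, 1))
    )


# Hypothetical edge parameters, chosen only for illustration.
edge_probs = [0.1, 0.2, 0.15, 0.05, 0.12]
on_tree = invariant_12_34(quartet_distribution(((0, 1), (2, 3)), edge_probs))
off_tree = invariant_12_34(quartet_distribution(((0, 2), (1, 3)), edge_probs))
print(on_tree, off_tree)
```

With these parameters the first value is zero up to floating-point error, while the second is clearly non-zero, which is the property that lets an estimated invariant support one phylogeny over another.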

6.
Counting phylogenetic invariants in some simple cases.   Cited by: 1 (self-citations: 0, citations by others: 1)
An informal degrees of freedom argument is used to count the number of phylogenetic invariants in cases where we have three or four species and can assume a Jukes-Cantor model of base substitution with or without a molecular clock. A number of simple cases are treated and in each the number of invariants can be found. Two new classes of invariants are found: non-phylogenetic cubic invariants testing independence of evolutionary events in different lineages, and linear phylogenetic invariants which occur when there is a molecular clock. Most of the linear invariants found by Cavender (1989, Molec. Biol. Evol. 6, 301-316) turn out in the Jukes-Cantor case to be simple tests of symmetry of the substitution model, and not phylogenetic invariants.

7.
8.
9.
Sample size for a phylogenetic inference.   Cited by: 1 (self-citations: 0, citations by others: 1)
The objective of this work is to describe sample-size calculations for the inference of a nonzero central branch length in an unrooted four-species phylogeny. Attention is restricted to independent binary characters, such as might be obtained from an alignment of the purine-pyrimidine sequences of a nucleic acid molecule. A statistical test based on a multinomial model for character-state configurations is described. The importance of including invariable sites in models for sequence change is demonstrated, and their effect on sample size is quantified. The methods are applied to a four-species alignment of small-subunit rRNA sequences derived from two archaebacteria, a eubacterium and a eukaryote. We conclude that the information in these sequences is not sufficient to resolve the branching order of this tree. Estimates of the number of aligned nucleotide positions required to provide a reasonably powerful test are given.

10.
A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees. This algorithm is similar to the standard EM method for edge-length estimation, except that during iterations of the Structural EM algorithm the topology is improved as well as the edge length. Our algorithm performs iterations of two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-Step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that searching for better topologies inside the M-step can be done efficiently, as opposed to standard methods for topology search. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. This convergence point, however, can be a suboptimal one. To escape from such "local optima," we further enhance our basic EM procedure by incorporating moves in the flavor of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that for protein sequences even our basic algorithm finds more plausible trees than existing methods for searching maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.

11.
Mechanized derivation of linear invariants   Cited by: 1 (self-citations: 0, citations by others: 1)
Linear invariants, discovered by Lake, promise to provide a versatile way of inferring phylogenies on the basis of nucleic acid sequences (the method that he called "evolutionary parsimony"). A semigroup of Markov transition matrices embodies the assumptions underlying the method, and alternative semigroups exist. The set of all linear invariants may be derived from the semigroup by using an algorithm described here. Under assumptions no stronger than Lake's, there are more than 50 independent linear invariants for each of the 15 rooted trees linking four species.

12.
Polytomies and Bayesian phylogenetic inference   Cited by: 16 (self-citations: 0, citations by others: 16)
Bayesian phylogenetic analyses are now very popular in systematics and molecular evolution because they allow the use of much more realistic models than currently possible with maximum likelihood methods. There are, however, a growing number of examples in which large Bayesian posterior clade probabilities are associated with very short branch lengths and low values for non-Bayesian measures of support such as nonparametric bootstrapping. For the four-taxon case when the true tree is the star phylogeny, Bayesian analyses become increasingly unpredictable in their preference for one of the three possible resolved tree topologies as data set size increases. This leads to the prediction that hard (or near-hard) polytomies in nature will cause unpredictable behavior in Bayesian analyses, with arbitrary resolutions of the polytomy receiving very high posterior probabilities in some cases. We present a simple solution to this problem involving a reversible-jump Markov chain Monte Carlo (MCMC) algorithm that allows exploration of all of tree space, including unresolved tree topologies with one or more polytomies. The reversible-jump MCMC approach allows prior distributions to place some weight on less-resolved tree topologies, which eliminates misleadingly high posteriors associated with arbitrary resolutions of hard polytomies. Fortunately, assigning some prior probability to polytomous tree topologies does not appear to come with a significant cost in terms of the ability to assess the level of support for edges that do exist in the true tree. Methods are discussed for applying arbitrary prior distributions to tree topologies of varying resolution, and an empirical example showing evidence of polytomies is analyzed and discussed.

13.
Ribosomal DNA: molecular evolution and phylogenetic inference.   Cited by: 79 (self-citations: 0, citations by others: 79)
Ribosomal DNA (rDNA) sequences have been aligned and compared in a number of living organisms, and this approach has provided a wealth of information about phylogenetic relationships. Studies of rDNA sequences have been used to infer phylogenetic history across a very broad spectrum, from studies among the basal lineages of life to relationships among closely related species and populations. The reasons for the systematic versatility of rDNA include the numerous rates of evolution among different regions of rDNA (both among and within genes), the presence of many copies of most rDNA sequences per genome, and the pattern of concerted evolution that occurs among repeated copies. These features facilitate the analysis of rDNA by direct RNA sequencing, DNA sequencing (either by cloning or amplification), and restriction enzyme methodologies. Constraints imposed by secondary structure of rRNA and concerted evolution need to be considered in phylogenetic analyses, but these constraints do not appear to impede seriously the usefulness of rDNA. An analysis of aligned sequences of the four nuclear and two mitochondrial rRNA genes identified regions of these genes that are likely to be useful to address phylogenetic problems over a wide range of levels of divergence. In general, the small subunit nuclear sequences appear to be best for elucidating Precambrian divergences, the large subunit nuclear sequences for Paleozoic and Mesozoic divergences, and the organellar sequences of both subunits for Cenozoic divergences. Primer sequences were designed for use in amplifying the entire nuclear rDNA array in 15 sections by use of the polymerase chain reaction; these "universal" primers complement previously described primers for the mitochondrial rRNA genes. Pairs of primers can be selected in conjunction with the analysis of divergence of the rRNA genes to address systematic problems throughout the hierarchy of life.

14.

Background

Microbial typing methods are commonly used to study the relatedness of bacterial strains. Sequence-based typing methods are a gold standard for epidemiological surveillance because of the inherent portability of sequence and allelic profile data, fast analysis times, and their capacity to create common nomenclatures for strains or clones. This has led to the development of several novel methods and to databases being made available for many microbial species. With the mainstream use of high-throughput sequencing, these databases are accumulating a huge amount of data and now store thousands of different profiles. At the same time, computing genetic evolutionary distances among a set of typing profiles or taxa dominates the running time of many phylogenetic inference methods. It is also important to note that most definitions of genetic evolutionary distance rely, even if indirectly, on computing the pairwise Hamming distance among sequences or profiles.

Results

We propose here an average-case linear-time algorithm to compute pairwise Hamming distances among a set of taxa under a given Hamming distance threshold. This article includes both a theoretical analysis and extensive experimental results concerning the proposed algorithm. We further show how this algorithm can be successfully integrated into a well-known phylogenetic inference method, and how it can be used to speed up querying local phylogenetic patterns over large typing databases.
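To make the problem statement concrete, here is a minimal brute-force baseline for thresholded pairwise Hamming distances. It is not the average-case linear-time algorithm the abstract proposes (whose details are not given here); it compares every pair and only stops early once the threshold is exceeded. The function names and the toy allelic profiles are hypothetical.

```python
def hamming_within(a, b, k):
    """Return the Hamming distance between a and b if it is <= k, else None.

    Stops scanning as soon as more than k mismatches have been seen.
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return None
    return mismatches


def pairs_within(profiles, k):
    """Map each pair of taxa at Hamming distance <= k to that distance.

    profiles: dict from taxon name to an equal-length sequence or profile.
    This is the quadratic baseline, not the paper's algorithm.
    """
    names = sorted(profiles)
    result = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            d = hamming_within(profiles[u], profiles[v], k)
            if d is not None:
                result[(u, v)] = d
    return result


# Toy allelic profiles (made up for illustration).
profiles = {"A": (1, 2, 3, 4), "B": (1, 2, 3, 5), "C": (9, 9, 9, 9)}
close_pairs = pairs_within(profiles, 1)
print(close_pairs)
```

The quadratic number of pair comparisons in this baseline is the cost that a faster thresholded method aims to avoid on databases with thousands of profiles.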

15.
Phylogenetic reconstructions are a major component of many studies in evolutionary biology, but their accuracy can be reduced under certain conditions. Recent studies showed that the convergent evolution of some phenotypes resulted from recurrent amino acid substitutions in genes belonging to distant lineages. It has been suggested that these convergent substitutions could bias phylogenetic reconstruction toward grouping convergent phenotypes together, but such an effect has never been appropriately tested. We used computer simulations to determine the effect of convergent substitutions on the accuracy of phylogenetic inference. We show that, in some realistic conditions, even a relatively small proportion of convergent codons can strongly bias phylogenetic reconstruction, especially when amino acid sequences are used as characters. The strength of this bias does not depend on the reconstruction method but varies as a function of how much divergence had occurred among the lineages prior to any episodes of convergent substitutions. While the occurrence of this bias is difficult to predict, the risk of spurious groupings is strongly decreased by considering only 3rd codon positions, which are less subject to selection, as long as saturation problems are not present. Therefore, we recommend that, whenever possible, topologies obtained with amino acid sequences and 3rd codon positions be compared to identify potential phylogenetic biases and avoid evolutionarily misleading conclusions.

16.
MRBAYES: Bayesian inference of phylogenetic trees   Cited by: 108 (self-citations: 0, citations by others: 108)
SUMMARY: The program MRBAYES performs Bayesian inference of phylogeny using a variant of Markov chain Monte Carlo. AVAILABILITY: MRBAYES, including the source code, documentation, sample data files, and an executable, is available at http://brahms.biology.rochester.edu/software.html.

17.
Traditionally, phylogenetic analyses over many genes combine data into a contiguous block. Under this concatenated model, all genes are assumed to evolve at the same rate. However, it is clear that genes evolve at very different rates and that accounting for this rate heterogeneity is important if we are to accurately infer phylogenies from heterogeneous multigene data sets. There remain open questions regarding how best to incorporate gene rate parameters into phylogenetic models and which properties of real data correlate with improved fit over the concatenated model. In this study, two methods of accounting for gene rate heterogeneity are compared: the n-parameter method, which allows for each of the n gene partitions to have a gene rate parameter, and the alpha-parameter method, which fits a distribution to the gene rates. Results demonstrate that the n-parameter method is both computationally faster and in general provides a better fit over the concatenated model than the alpha-parameter method. Furthermore, improved model fit over the concatenated model is highly correlated with the presence of a gene with a slow relative rate of evolution.

18.
Quantification of the success of phylogenetic inference in simulations   Cited by: 1 (self-citations: 0, citations by others: 1)
For phylogenetic simulation studies, the accuracy of topological reconstruction obtained from different data matrices or different methods of phylogenetic inference generally needs to be quantified. Two components of performance within this context are: (1) how the inferred tree topology matches or conflicts with the correct tree topology, and (2) the branch support assigned to both correctly and incorrectly resolved clades. We present a method (averaged overall success of resolution) that incorporates both of these components. Branch support is incorporated in the averaged overall success of resolution by linearly scaling the observed support relative to that conferred by uncontradicted synapomorphies. We believe that this method represents an improvement relative to the commonly used approaches of quantifying the percentage of clades that are correctly resolved in the inferred trees or presenting the Robinson–Foulds distance between the inferred trees and the correct tree. In contrast to Bremer support, the averaged overall success of resolution may be applied equally well to distance, likelihood and parsimony analyses. © The Willi Hennig Society 2006.
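The two commonly used baselines this abstract mentions can be sketched in a few lines. This is not the averaged overall success of resolution proposed in the paper, and it ignores branch support entirely. It assumes, for illustration only, that each rooted tree is represented as a set of clades (frozensets of taxon names) and that "correctly resolved" means the fraction of true clades recovered; the function names and example trees are hypothetical.

```python
def robinson_foulds(clades_a, clades_b):
    """Robinson-Foulds distance: number of clades found in exactly one tree."""
    return len(clades_a ^ clades_b)


def fraction_correct(true_clades, inferred_clades):
    """Fraction of the true tree's clades that appear in the inferred tree."""
    if not true_clades:
        raise ValueError("the true tree has no non-trivial clades")
    return len(true_clades & inferred_clades) / len(true_clades)


# Hypothetical four-taxon example: non-trivial clades only.
true_tree = {frozenset("AB"), frozenset("ABC")}       # ((A,B),C),D
inferred_tree = {frozenset("AB"), frozenset("ABD")}   # ((A,B),D),C

rf = robinson_foulds(true_tree, inferred_tree)
recovered = fraction_correct(true_tree, inferred_tree)
print(rf, recovered)
```

Both measures depend only on topology, which is the limitation the abstract's support-weighted measure is meant to address.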

19.
20.
Character-state space versus rate of evolution in phylogenetic inference   Cited by: 1 (self-citations: 0, citations by others: 1)
With only four alternative character states, parallelisms and reversals are expected to occur frequently when using nucleotide characters for phylogenetic inference. Greater available character‐state space has been described as one of the advantages of third codon positions relative to first and second codon positions, as well as amino acids relative to nucleotides. We used simulations to quantify how character‐state space and rate of evolution relate to one another, and how this relationship is affected by differences in: tree topology, branch lengths, rate heterogeneity among sites, probability of change among states, and frequency of character states. Specifically, we examined how inferred tree lengths, consistency and retention indices, and accuracy of phylogenetic inference are affected. Our results indicate that the relatively small increases in the character‐state space evident in empirical data matrices can provide enormous benefits for the accuracy of phylogenetic inference. This advantage may become more pronounced with unequal probabilities of change among states. Although increased character‐state space greatly improved the accuracy of topology inference, improvements in the estimation of the correct tree length were less apparent. Accuracy and inferred tree length improved most when character‐state space increased initially; further increases provided more modest improvements. © The Willi Hennig Society 2004.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  ICP license: 京ICP备09084417号