It is known that if all the Markov transition matrices that govern the substitution of one nucleotide for another satisfy six linear constraints, then equations can be derived that permit one to infer evolutionary trees from nucleic acid sequences by the method of linear invariants. These sufficient conditions are also necessary. Any relaxation of them results in the loss of all linear invariants. Necessary conditions for any given set of linear invariants can be derived by examining conditions a matrix must satisfy to map a certain set of matrices into itself. To the extent that necessary conditions are incorrect, a method is not reliable. In a world where different parts of molecules evolve at different rates, the two-parameter model of Kimura may not be empirically distinguishable from the more general one treated here.

Counting phylogenetic invariants in some simple cases.   Total citations: 1, self-citations: 0, citations by others: 1
An informal degrees of freedom argument is used to count the number of phylogenetic invariants in cases where we have three or four species and can assume a Jukes-Cantor model of base substitution with or without a molecular clock. A number of simple cases are treated and in each the number of invariants can be found. Two new classes of invariants are found: non-phylogenetic cubic invariants testing independence of evolutionary events in different lineages, and linear phylogenetic invariants which occur when there is a molecular clock. Most of the linear invariants found by Cavender (1989, Molec. Biol. Evol. 6, 301-316) turn out in the Jukes-Cantor case to be simple tests of symmetry of the substitution model, and not phylogenetic invariants.

Tests of applicability of several substitution models for DNA sequence data   Total citations: 8, self-citations: 3, citations by others: 5
Using linear invariants for various models of nucleotide substitution, we developed test statistics for examining the applicability of a specific model to a given dataset in phylogenetic inference. The models examined are those developed by Jukes and Cantor (1969), Kimura (1980), Tajima and Nei (1984), Hasegawa et al. (1985), Tamura (1992), Tamura and Nei (1993), and a new model called the eight-parameter model. The first six models are special cases of the last model. The test statistics developed are independent of evolutionary time and phylogeny, although the variances of the statistics contain phylogenetic information. Therefore, these statistics can be used before a phylogenetic tree is estimated. Our objective is to find the simplest model that is applicable to a given dataset, keeping in mind that a simple model usually gives an estimate of evolutionary distance (number of nucleotide substitutions per site) with a smaller variance than a complicated model when the simple model is correct. We have also developed a statistical test of the homogeneity of nucleotide frequencies of a sample of several sequences that takes into account possible phylogenetic correlations. This test is used to examine the stationarity in time of the base frequencies in the sample. For Hasegawa et al.'s and the eight-parameter models, analytical formulas for estimating evolutionary distances are presented. Application of the above tests to several sets of real data has shown that the assumption of stationarity of base composition is usually acceptable when the sequences studied are closely related but otherwise it is rejected. Similarly, the simple models of nucleotide substitution are almost always rejected when actual genes are distantly related and/or the total number of nucleotides examined is large.
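
As an illustration of what "evolutionary distance (number of nucleotide substitutions per site)" means for the two simplest models named above, the following is a minimal sketch of the standard Jukes-Cantor (1969) and Kimura (1980) two-parameter distance formulas. It is not the test statistic of the abstract. The function names and the example sequences are hypothetical, and sites containing anything other than A, C, G, or T are simply skipped, which is an assumption of this sketch.

```python
import math

PURINES = {"A", "G"}


def count_differences(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption of this sketch: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites (p < 3/4)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


# Hypothetical toy alignment: one transition (A->G) and one transversion (C->A).
P, Q = count_differences("ACGTACGTAC", "ACGTGCGTAA")
d_jc = jc69_distance(P + Q)
d_k80 = k80_distance(P, Q)
```

Both estimators are undefined once the logarithm's argument reaches zero, which is one practical sign that the sequences are too divergent for the simple model.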

In phylogenetic inference, an evolutionary model describes the substitution processes along each edge of a phylogenetic tree. Misspecification of the model has important implications for the analysis of phylogenetic data. Conventionally, however, the selection of a suitable evolutionary model is based on heuristics or relies on the choice of an approximate input tree. We introduce a method for model Selection in Phylogenetics based on linear INvariants (SPIn), which uses recent insights on linear invariants to characterize a model of nucleotide evolution for phylogenetic mixtures on any number of components. Linear invariants are constraints among the joint probabilities of the bases in the operational taxonomic units that hold irrespective of the tree topologies appearing in the mixtures. SPIn therefore requires no input tree and is designed to deal with nonhomogeneous phylogenetic data consisting of multiple sequence alignments showing different patterns of evolution, for example, concatenated genes, exons, and/or introns. Here, we report on the results of the proposed method evaluated on multiple sequence alignments simulated under a variety of single-tree and mixture settings for both continuous- and discrete-time models. In the simulations, SPIn successfully recovers the underlying evolutionary model and is shown to perform better than existing approaches.

Miyazawa S. PLoS ONE, 2011, 6(12): e28892
BACKGROUND: A mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the average fixation probability depending on the type of amino acid replacement, has advantages over nucleotide, amino acid, and empirical codon substitution models in evolutionary analysis of protein-coding sequences. It can approximate a wide range of codon substitution processes. If no selection pressure on amino acids is taken into account, it will become equivalent to a nucleotide substitution model. If mutation rates are assumed not to depend on the codon type, then it will become essentially equivalent to an amino acid substitution model. Mutation at the nucleotide level and selection at the amino acid level can be separately evaluated. RESULTS: The present scheme for single nucleotide mutations is equivalent to the general time-reversible model, but multiple nucleotide changes in infinitesimal time are allowed. Selective constraints on the respective types of amino acid replacements are tailored to each gene in a linear function of a given estimate of selective constraints. Their good estimates are those calculated by maximizing the respective likelihoods of empirical amino acid or codon substitution frequency matrices. Akaike and Bayesian information criteria indicate that the present model performs far better than the other substitution models for all five phylogenetic trees of highly-divergent to highly-homologous sequences of chloroplast, mitochondrial, and nuclear genes. It is also shown that multiple nucleotide changes in infinitesimal time are significant in long branches, although they may be caused by compensatory substitutions or other mechanisms. The variation of selective constraint over sites fits the datasets significantly better than variable mutation rates, except for 10 slow-evolving nuclear genes of 10 mammals. 
A critical finding for phylogenetic analysis is that assuming variable mutation rates over sites leads to the overestimation of branch lengths.
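
The Akaike and Bayesian information criteria mentioned in this abstract follow the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The sketch below shows how they rank competing models. The log-likelihoods, parameter counts, and site count are hypothetical numbers invented for illustration and are not results from the paper; the parameter counts exclude branch lengths, which is an assumption of this example.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik


def best_model(fits, n_sites, criterion="aic"):
    """Return (name of the lowest-scoring model, all scores).

    fits maps a model name to (maximized log-likelihood, number of free parameters).
    """
    if criterion == "aic":
        scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    else:
        scores = {name: bic(ll, k, n_sites) for name, (ll, k) in fits.items()}
    return min(scores, key=scores.get), scores


# Hypothetical fits, made up for illustration only.
fits = {"JC69": (-1250.0, 0), "HKY85": (-1200.0, 4), "GTR": (-1198.5, 8)}
winner_aic, scores_aic = best_model(fits, n_sites=500, criterion="aic")
winner_bic, scores_bic = best_model(fits, n_sites=500, criterion="bic")
```

BIC penalizes each extra parameter by ln n rather than 2, so for alignments longer than about seven sites it favors simpler models more strongly than AIC does.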

An analytical method is presented for constructing linear invariants. All linear invariants of a k-species tree can be derived from those of (k-1)-species trees using this method. The new method is simpler than that of Cavender, which relies on numerical computations. Moreover, the new method provides a convenient tool to study the relationships between linear invariants of the same tree or of different trees. All linear invariants of trees of up to five species are derived in this study. For four species, there are 16 independent linear invariants for each of the three possible unrooted trees, 14 of which are shared by two unrooted trees and 12 of these are shared by all three unrooted trees; the last type of linear invariant can be used to construct tests on the assumptions about nucleotide substitutions. The number of linear invariants for a tree is found to increase rapidly with the number of species.

The method of evolutionary parsimony--or operator invariants--is a technique of nucleic acid sequence analysis related to parsimony analysis and explicitly designed for determining evolutionary relationships among four distantly related taxa. The method is independent of substitution rates because it is derived from consideration of the group properties of substitution operators rather than from an analysis of the probabilities of substitution in branches of a tree. In both parsimony and evolutionary parsimony, three patterns of nucleotide substitution are associated one-to-one with the three topologically linked trees for four taxa. In evolutionary parsimony, the three quantities are operator invariants. These invariants are the remnants of substitutions that have occurred in the interior branch of the tree and are analogous to the substitutions assigned to the central branch by parsimony. The two invariants associated with the incorrect trees must equal zero (statistically), whereas only the correct tree can have a nonzero invariant. The chi-square test is used to ascertain the nonzero invariant and the statistically favored tree. Examples, obtained using data calculated with evolutionary rates and branchings designed to camouflage the true tree, show that the method accurately predicts the tree, even when substitution rates differ greatly in neighboring peripheral branches (conditions under which parsimony will consistently fail). As the number of substitutions in peripheral branches becomes fewer, the parsimony and the evolutionary-parsimony solutions converge. The method is robust and easy to use.

The method of invariants is an approach to the problem of reconstructing the phylogenetic tree of a collection of m taxa using nucleotide sequence data. Models for the respective probabilities of the 4^m possible vectors of bases at a given site will have unknown parameters that describe the random mechanism by which substitution occurs along the branches of a putative phylogenetic tree. An invariant is a polynomial in these probabilities that, for a given phylogeny, is zero for all choices of the substitution mechanism parameters. If the invariant is typically non-zero for another phylogenetic tree, then estimates of the invariant can be used as evidence to support one phylogeny over another. Previous work of Evans and Speed showed that, for certain commonly used substitution models, the problem of finding a minimal generating set for the ideal of invariants can be reduced to the linear algebra problem of finding a basis for a certain lattice (that is, a free Z-module). They also conjectured that the cardinality of such a generating set can be computed using a simple "degrees of freedom" formula. We verify this conjecture. Along the way, we explain in detail how the observations of Evans and Speed lead to a simple, computationally feasible algorithm for constructing a minimal generating set.

Invariants are functions of the probabilities of state configurations among lineages, with expected values equal to zero under certain phylogenies. For two-state sequences, the existence of certain quadratic invariants requires a symmetric substitution model. For sequences with more than two states, the necessary condition for the existence of certain quadratic invariants in terms of independent events is much stronger than symmetry. For DNA sequences, only three parameters are allowed in the substitution model, which includes Kimura's two-parameter model as a special case.
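
To make the idea of a quadratic invariant concrete, here is a small sketch for two-state sequences under a symmetric substitution model on a four-taxon tree. Coding the two states as X = +1 and X = -1, the quantity E[X0 X1 X2 X3] - E[X0 X1] E[X2 X3] vanishes on the tree that groups taxa (0,1) against (2,3), whatever the branch parameters, but is generally nonzero on a tree with a different split. The function names and the branch change probabilities are hypothetical, and this particular invariant was chosen for illustration; it is not claimed to be the exact form studied in the abstract.

```python
from itertools import product


def pattern_probs(split, p_leaf, p_int):
    """Joint distribution of four binary leaves on an unrooted quartet tree.

    split: ((a, b), (c, d)) leaf indices on either side of the interior edge.
    p_leaf: change probability on each of the four pendant edges.
    p_int: change probability on the interior edge.
    The substitution model is symmetric with a uniform state distribution.
    """
    (a, b), (c, d) = split
    probs = {}
    for leaves in product((0, 1), repeat=4):
        total = 0.0
        for u, v in product((0, 1), repeat=2):  # states of the two internal nodes
            pr = 0.5  # uniform distribution at internal node u
            pr *= p_int if u != v else 1.0 - p_int
            for leaf, parent in ((a, u), (b, u), (c, v), (d, v)):
                p = p_leaf[leaf]
                pr *= p if leaves[leaf] != parent else 1.0 - p
            total += pr
        probs[leaves] = total
    return probs


def expect(probs, subset):
    """E[product of X_i for i in subset], with X_i = +1 for state 0 and -1 for state 1."""
    return sum(pr * (-1) ** sum(x[i] for i in subset) for x, pr in probs.items())


def quadratic_invariant(probs):
    """Zero for every parameter choice on the tree with split (0,1)|(2,3)."""
    return expect(probs, (0, 1, 2, 3)) - expect(probs, (0, 1)) * expect(probs, (2, 3))


# Hypothetical branch change probabilities.
p_leaf = [0.1, 0.3, 0.05, 0.2]
inv_matching = quadratic_invariant(pattern_probs(((0, 1), (2, 3)), p_leaf, 0.15))
inv_other = quadratic_invariant(pattern_probs(((0, 2), (1, 3)), p_leaf, 0.15))
```

With real data the expectations are replaced by sample averages over sites, so the invariant is only approximately zero on the matching tree and a statistical test is needed to decide.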

MOTIVATION: Neighbor-dependent substitution processes have generated specific patterns of dinucleotide frequencies in the genomes of most organisms. The CpG-methylation-deamination process is, for example, a prominent process in vertebrates (CpG effect). Such processes, often with unknown mechanistic origins, need to be incorporated into realistic models of nucleotide substitutions. RESULTS: Based on a general framework of nucleotide substitutions, we developed a method that is able to identify the most relevant neighbor-dependent substitution processes, estimate their relative frequencies, and judge whether they are important enough to be included in the model. Starting from a model for neighbor-independent nucleotide substitution, we successively added neighbor-dependent substitution processes in the order of their ability to increase the likelihood of the model describing given data. The analysis of neighbor-dependent nucleotide substitutions based on repetitive elements found in the genomes of human, zebrafish and fruit fly is presented. AVAILABILITY: A web server to perform the presented analysis is freely available at: http://evogen.molgen.mpg.de/server/substitution-analysis

The general Markov model (GMM) of nucleotide substitution does not assume the evolutionary process to be stationary, reversible, or homogeneous. The GMM can be simplified by assuming the evolutionary process to be stationary. A stationary GMM is appropriate for analyses of phylogenetic data sets that are compositionally homogeneous; a data set is considered to be compositionally homogeneous if a statistical test does not detect significant differences in the marginal distributions of the sequences. Though the general time-reversible (GTR) model assumes stationarity, it also assumes reversibility and homogeneity. We propose two new stationary and nonhomogeneous models--one constrains the GMM to be reversible, whereas the other does not. The two models, coupled with the GTR model, comprise a set of nested models that can be used to test the assumptions of reversibility and homogeneity for stationary processes. The two models are extended to incorporate invariable sites and used to analyze a seven-taxon hominoid data set that displays compositional homogeneity. We show that within the class of stationary models, a nonhomogeneous model fits the hominoid data better than the GTR model. We note that if one considers a wider set of models that are not constrained to be stationary, then an even better fit can be obtained for the hominoid data. However, the methods for reducing model complexity from an extremely large set of nonstationary models are yet to be developed.
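
The difference between stationarity and reversibility can be checked numerically for a single rate matrix Q with frequency vector pi: the process is stationary when pi Q = 0, and reversible when the detailed-balance condition pi_i Q_ij = pi_j Q_ji also holds. The sketch below builds a GTR rate matrix, which satisfies both, and a cyclic rate matrix that is stationary under uniform frequencies but not reversible. The frequencies, exchangeabilities, and function names are hypothetical, and the sketch covers only one matrix, not the nonhomogeneous (edge-specific) models proposed in the abstract.

```python
def gtr_rate_matrix(pi, exch):
    """Build a 4x4 GTR rate matrix with Q[i][j] = s_ij * pi[j] for i != j.

    pi: base frequencies in the order A, C, G, T.
    exch: symmetric exchangeabilities keyed by (i, j) with i < j.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = exch[(min(i, j), max(i, j))] * pi[j]
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q


def is_stationary(pi, q, tol=1e-12):
    """True if pi Q = 0, i.e. pi is unchanged by the process."""
    return all(abs(sum(pi[i] * q[i][j] for i in range(4))) < tol for j in range(4))


def is_reversible(pi, q, tol=1e-12):
    """True if detailed balance pi_i Q_ij = pi_j Q_ji holds for all pairs."""
    return all(
        abs(pi[i] * q[i][j] - pi[j] * q[j][i]) < tol
        for i in range(4)
        for j in range(4)
    )


# Hypothetical frequencies and exchangeabilities.
pi = [0.1, 0.2, 0.3, 0.4]
exch = {(0, 1): 1.0, (0, 2): 4.0, (0, 3): 1.5, (1, 2): 0.5, (1, 3): 3.0, (2, 3): 1.0}
gtr = gtr_rate_matrix(pi, exch)

# A cyclic process A -> C -> G -> T -> A: stationary under uniform
# frequencies, but substitutions only ever flow in one direction.
uniform = [0.25] * 4
cyclic = [
    [-1.0 if i == j else (1.0 if j == (i + 1) % 4 else 0.0) for j in range(4)]
    for i in range(4)
]
```

The cyclic matrix shows why stationarity alone does not imply reversibility, which is the distinction the nested models in the abstract are designed to test.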

Models of amino acid substitution were developed and compared using maximum likelihood. Two kinds of models are considered. "Empirical" models do not explicitly consider factors that shape protein evolution, but attempt to summarize the substitution pattern from large quantities of real data. "Mechanistic" models are formulated at the codon level and separate mutational biases at the nucleotide level from selective constraints at the amino acid level. They account for features of sequence evolution, such as transition-transversion bias and base or codon frequency biases, and make use of physicochemical distances between amino acids to specify nonsynonymous substitution rates. A general approach is presented that transforms a Markov model of codon substitution into a model of amino acid replacement. Protein sequences from the entire mitochondrial genomes of 20 mammalian species were analyzed using different models. The mechanistic models were found to fit the data better than empirical models derived from large databases. Both the mutational distance between amino acids (determined by the genetic code and mutational biases such as the transition-transversion bias) and the physicochemical distance are found to have strong effects on amino acid substitution rates. A significant proportion of amino acid substitutions appeared to have involved more than one codon position, indicating that nucleotide substitutions at neighboring sites may be correlated. Rates of amino acid substitution were found to be highly variable among sites.

Here we present a model of nucleotide substitution in protein-coding regions that also encode the formation of conserved RNA structures. In such regions, apparent evolutionary context dependencies exist, both between nucleotides occupying the same codon and between nucleotides forming a base pair in the RNA structure. The overlap of these fundamental dependencies is sufficient to cause "contagious" context dependencies which cascade across many nucleotide sites. Such large-scale dependencies challenge the use of traditional phylogenetic models in evolutionary inference because they explicitly assume evolutionary independence between short nucleotide tuples. In our model we address this by replacing context dependencies within codons by annotation-specific heterogeneity in the substitution process. Through a general procedure, we fragment the alignment into sets of short nucleotide tuples based on both the protein coding and the structural annotation. These individual tuples are assumed to evolve independently, and the different tuple sets are assigned different annotation-specific substitution models shared between their members. This allows us to build a composite model of the substitution process from components of traditional phylogenetic models. We applied this to a data set of full-genome sequences from the hepatitis C virus where five RNA structures are mapped within the coding region. This allowed us to partition the effects of selection on different structural elements and to test various hypotheses concerning the relation of these effects. Of particular interest, we found evidence of a functional role of loop and bulge regions, as these were shown to evolve according to a different and more constrained selective regime than the nonpairing regions outside the RNA structures. Other potential applications of the model include comparative RNA structure prediction in coding regions and RNA virus phylogenetics.

Models of amino acid substitution present challenges beyond those often faced with the analysis of DNA sequences. The alignments of amino acid sequences are often small, whereas the number of parameters to be estimated is potentially large when compared with the number of free parameters for nucleotide substitution models. Most approaches to the analysis of amino acid alignments have focused on the use of fixed amino acid models in which all of the potentially free parameters are fixed to values estimated from a large number of sequences. Often, these fixed amino acid models are specific to a gene or taxonomic group (e.g. the Mtmam model, which has parameters that are specific to mammalian mitochondrial gene sequences). Although the fixed amino acid models succeed in reducing the number of free parameters to be estimated--indeed, they reduce the number of free parameters from approximately 200 to 0--it is possible that none of the currently available fixed amino acid models is appropriate for a specific alignment. Here, we present four approaches to the analysis of amino acid sequences. First, we explore the use of a general time reversible model of amino acid substitution using a Dirichlet prior probability distribution on the 190 exchangeability parameters. Second, we then explore the behaviour of prior probability distributions that are 'centred' on the rates specified by the fixed amino acid model. Third, we consider a mixture of fixed amino acid models. Finally, we consider constraints on the exchangeability parameters as partitions, similar to how nucleotide substitution models are specified, and place a Dirichlet process prior model on all the possible partitioning schemes.

Although phylogenetic inference of protein-coding sequences continues to dominate the literature, few analyses incorporate evolutionary models that consider the genetic code. This problem is exacerbated by the exclusion of codon-based models from commonly employed model selection techniques, presumably due to the computational cost associated with codon models. We investigated an efficient alternative to standard nucleotide substitution models, in which codon position (CP) is incorporated into the model. We determined the most appropriate model for alignments of 177 RNA virus genes and 106 yeast genes, using 11 substitution models including one codon model and four CP models. The majority of analyzed gene alignments are best described by CP substitution models, rather than by standard nucleotide models, and without the computational cost of full codon models. These results have significant implications for phylogenetic inference of coding sequences as they make it clear that substitution models incorporating CPs not only are a computationally realistic alternative to standard models but may also frequently be statistically superior.
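
The abstract does not spell out how the four CP models are parameterized, but any such model first needs the alignment columns grouped by codon position. The sketch below shows only that partitioning step. It assumes an in-frame coding alignment whose length is a multiple of three; the function name and the toy sequences are hypothetical.

```python
def split_by_codon_position(alignment):
    """Split an in-frame coding alignment into three per-position alignments.

    alignment maps a sequence name to an aligned coding sequence.
    Returns a list [cp1, cp2, cp3] of dicts with the same keys.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    (length,) = lengths
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")
    # Slicing with a step of 3 picks every site at the same codon position.
    return [
        {name: seq[pos::3] for name, seq in alignment.items()}
        for pos in range(3)
    ]


# Hypothetical toy alignment of two codons per sequence.
toy = {"seq_a": "ATGGCC", "seq_b": "ATGGCA"}
cp1, cp2, cp3 = split_by_codon_position(toy)
```

Each of the three sub-alignments can then be given its own rate or its own substitution parameters, which is far cheaper than a full 61-state codon model.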

Influenza A virus is one of the best-studied viruses and a model organism for the study of molecular evolution; in particular, much research has focused on detecting natural selection on influenza virus proteins. Here, we study the dynamics of the synonymous and nonsynonymous nucleotide composition of influenza A virus genes. In several genes, the nucleotide frequencies at synonymous positions drift away from the equilibria predicted from the synonymous substitution matrices. We investigate possible reasons for this unexpected behavior by fitting several regression models. Relaxation toward a mutation-selection equilibrium following a host jump fails to explain the dynamics of the synonymous nucleotide composition, even if we allow for slow temporal changes in the substitution matrix. Instead, we find that deep internal branches of the phylogeny show distinct patterns of nucleotide substitution and that these branches strongly influence the dynamics of nucleotide composition, suggesting that the observed trends are at least in part a result of natural selection acting on synonymous sites. Moreover, we find that the dynamics of the nucleotide composition at synonymous and nonsynonymous sites are highly correlated, providing evidence that even nonsynonymous sites can be influenced by selection pressure for nucleotide composition.

Tempo and mode of synonymous substitutions in mitochondrial DNA of primates   Total citations: 3, self-citations: 1, citations by others: 2
Nucleotide substitutions of the four-fold degenerate sites and the total third codon positions of mitochondrial DNA from human, common chimpanzee, bonobo, gorilla, and orangutan were examined in detail by three alternative Markov models: (1) Hasegawa, Kishino, and Yano's (1985) model, (2) Tamura and Nei's (1993) model, and (3) the general reversible Markov model. These sites are expected to be relatively free from constraint, and therefore their tempo and mode in evolution should reflect those of mutation. It turned out that, among the alternative models, the general reversible Markov model best approximates the nucleotide substitutions of the four-fold degenerate sites and the total third codon positions, while the maximum likelihood estimates of the numbers of nucleotide substitutions along each branch do not differ significantly among the three models. It was further shown that the transition rate of these sites during evolution, and therefore the transitional mutation rate of mtDNA, is higher in humans than in chimpanzees and gorillas, probably by a factor of about two. However, transversional mutation rate and amino acid substitution rate do not differ significantly between humans and the African apes. These and additional observations suggest heterogeneity of the mutation rate as well as of the constraint operating on the mtDNA-encoded proteins among different lineages of Hominoidea.
