1.
2.
We have estimated phylogenetic patterns and rates of nucleotide substitution in the hominoid primates using two different probabilistic models of molecular evolution as applied to three different data sets of nucleic acid sequences. The orang-utan was found to be the out-group of the other hominoids examined. Within the African apes and human clade the sister-group relationship of chimpanzee and human was found to be statistically the best, although the magnitude of the error estimates (a reflection of random statistical fluctuations) makes this conclusion tentative. The ψη-globin data sets were found to be statistically the most consistent and gave estimates of the times of divergence of chimpanzee and human from gorilla and of chimpanzee from human as 7·7 ± 1·5 Ma (millions of years ago) and 7·4 ± 1·5 Ma respectively, although the speculative nature of these estimates is emphasized. In all cases the calibration point was the assumed divergence of the orang-utan from the remaining hominoids at 14·5 Ma. There was no statistically significant evidence of a slowdown in nucleotide substitution rate for the human lineage, or among the hominoids as a whole with respect to the Old and New World monkeys. We advocate the continued use and development of stochastic models of molecular evolution as a basis for phylogenetic estimation. On this basis one can choose between competing hypotheses of relationship in a statistical manner and can provide estimates of the errors involved in such estimations. The assumptions of all stochastic models are open to test and future refinement.
3.
Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen and that that choice be justified. To date, justification has largely been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this approach certainly represents an important advance over arbitrary model selection, the best fit of a series of models may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and simultaneously compares all models being considered and thus does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects the same or simpler models than conventional LRTs. In order to lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets, which are therefore more complex than any of the models we evaluate. On average, the DT method selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate both in terms of relative error and absolute error than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel.
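The abstract above builds on the Bayesian information criterion. As a rough illustration only, the sketch below ranks candidate models by plain BIC; it does not implement the branch-length-error weighting of the DT method, and the log-likelihoods, parameter counts, and alignment length are hypothetical values chosen for the example. Counting only substitution-model parameters and using the number of sites as the sample size are simplifying assumptions.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: -2 lnL + k ln(n). Lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical (log-likelihood, free substitution parameters) per model.
# These numbers are made up for illustration, not taken from any data set.
candidates = {
    "JC69": (-2450.0, 0),
    "HKY85": (-2400.0, 4),
    "GTR": (-2398.0, 8),
}
n_sites = 1000  # assumed alignment length, used as the sample size

# All candidates are scored at once, so no chain of pairwise tests is needed.
scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: BIC = {score:.2f}")
print("Selected model:", best)
```

With these made-up inputs, the small likelihood gain of the richest model does not offset its larger penalty, so the intermediate model is selected.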
4.
Maximum likelihood phylogeny reconstruction methods are widely used in uncovering and assessing the evolutionary history and relationships of natural systems. However, several simplifying assumptions commonly made in this analysis limit the explanatory power of the results obtained. We present an algorithm that performs the phylogenetic analysis without making the common assumptions for sequence data from at least three leaf nodes in a star phylogeny. In particular, the underlying nucleotide substitution model does not have to be reversible and may include neighbor-dependent processes like the CpG methylation deamination process (CpG-effect). The base composition of the sequences at the external nodes and that of the ancestral sequence may differ from each other, and they do not have to be stationary state distributions of the corresponding substitution model. The algorithm is able to reconstruct the ancestral base composition and accurately estimate substitution frequencies in the branches of the star phylogeny. Extensive tests on simulated data validate the very favorable performance of the algorithm. As an application we present the analysis of aligned genomic sequences from human, mouse, and dog. Different substitution patterns can be observed in the three lineages.
5.
Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites
Felsenstein's maximum-likelihood approach for inferring phylogeny from DNA
sequences assumes that the rate of nucleotide substitution is constant over
different nucleotide sites. This assumption is sometimes unrealistic, as
has been revealed by analysis of real sequence data. In the present paper
Felsenstein's method is extended to the case where substitution rates over
sites are described by the gamma distribution. A numerical example is
presented to show that the method fits the data better than do previous
models.
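To illustrate why gamma-distributed rates across sites matter, the sketch below compares the standard Jukes-Cantor distance with its gamma-corrected form. This is a distance-based illustration under the JC69 model, not the maximum-likelihood method of the abstract; the function names and the example values of `p` and `alpha` are assumptions made for the example.

```python
import math


def jc69_distance(p):
    """JC69 distance from the proportion p of differing sites (equal rates)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p, alpha):
    """JC69 distance when site rates follow a gamma distribution with shape alpha."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


p = 0.3  # hypothetical observed proportion of differing sites
for alpha in (0.5, 1.0, 5.0):
    print(f"alpha={alpha}: gamma distance = {jc69_gamma_distance(p, alpha):.4f}")
print(f"equal rates: distance = {jc69_distance(p):.4f}")
```

Smaller values of `alpha` mean stronger rate variation and a larger corrected distance, so assuming equal rates when rates actually vary tends to underestimate the distance. As `alpha` grows, the gamma-corrected value approaches the equal-rates value.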
6.
7.
Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation
Using real sequence data, we evaluate the adequacy of assumptions made in
evolutionary models of nucleotide substitution and the effects that these
assumptions have on estimation of evolutionary trees. Two aspects of the
assumptions are evaluated. The first concerns the pattern of nucleotide
substitution, including equilibrium base frequencies and the
transition/transversion-rate ratio. The second concerns the variation of
substitution rates over sites. The maximum-likelihood estimate of tree
topology appears quite robust to both these aspects of the assumptions of
the models, but evaluation of the reliability of the estimated tree by
using simpler, less realistic models can be misleading. Branch lengths are
underestimated when simpler models of substitution are used, but the
underestimation caused by ignoring rate variation over nucleotide sites is
much more serious. The goodness of fit of a model is reduced by ignoring
spatial rate variation, but unrealistic assumptions about the pattern of
nucleotide substitution can lead to an extraordinary reduction in the
likelihood. It seems that evolutionary biologists can obtain accurate
estimates of certain evolutionary parameters even with an incorrect
phylogeny, while systematists cannot get the right tree with confidence
even when a realistic, and more complex, model of evolution is assumed.
8.
Most models and algorithms developed to perform statistical inference from DNA data make the assumption that substitution processes affecting distinct nucleotide sites are stochastically independent. This assumption ensures both mathematical and computational tractability but is in disagreement with observed data in many situations, one well-known example being CpG dinucleotide hypermutability in mammalian genomes. In this paper, we consider the class of RN95 + YpR substitution models, which allows neighbor-dependent effects (including CpG hypermutability) to be taken into account, through transitions between pyrimidine-purine dinucleotides. We show that it is possible to adapt inference methods originally developed under the assumption of independence between sites to RN95 + YpR models, using a mathematically rigorous framework provided by specific structural properties of this class of models. We assess how efficient this approach is at inferring the CpG hypermutability rate from aligned DNA sequences. The method is tested on simulated data and compared against several alternatives; the results suggest that it delivers a high degree of accuracy at a low computational cost. We then apply our method to an alignment of 10 DNA sequences from primate species. Model comparisons within the RN95 + YpR class show the importance of taking into account neighbor-dependent effects. An application of the method to the detection of hypomethylated islands is discussed.
9.
We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.
10.
Background
The covarion hypothesis of molecular evolution holds that selective pressures on a given amino acid or nucleotide site are dependent on the identity of other sites in the molecule that change throughout time, resulting in changes of evolutionary rates of sites along the branches of a phylogenetic tree. At the sequence level, covarion-like evolution at a site manifests as conservation of nucleotide or amino acid states among some homologs where the states are not conserved in other homologs (or groups of homologs). Covarion-like evolution has been shown to relate to changes in functions at sites in different clades, and, if ignored, can adversely affect the accuracy of phylogenetic inference.
11.
The standard approach to phylogeny estimation uses two phases, in which the first phase produces an alignment on a set of homologous sequences, and the second phase estimates a tree on the multiple sequence alignment. POY, a method which seeks a tree/alignment pair minimizing the total treelength, is the most widely used alternative to this two-phase approach. The topological accuracy of trees computed under treelength optimization is, however, controversial. In particular, one study showed that treelength optimization using simple gap penalties produced poor trees and alignments, and suggested the possibility that if POY were used with an affine gap penalty, it might be able to be competitive with the best two-phase methods. In this paper we report on a study addressing this possibility. We present a new heuristic for treelength, called BeeTLe (Better Treelength), that is guaranteed to produce trees at least as short as POY. We then use this heuristic to analyze a large number of simulated and biological datasets, and compare the resultant trees and alignments to those produced using POY and also maximum likelihood (ML) and maximum parsimony (MP) trees computed on a number of alignments. In general, we find that trees produced by BeeTLe are shorter and more topologically accurate than POY trees, but that neither POY nor BeeTLe produces trees as topologically accurate as ML trees produced on standard alignments. These findings, taken as a whole, suggest that treelength optimization is not as good an approach to phylogenetic tree estimation as maximum likelihood based upon good alignment methods.
12.
13.
Nucleotide substitution and recombination at orthologous loci in Staphylococcus aureus
The pattern of nucleotide substitution was examined at 2,129 orthologous loci among five genomes of Staphylococcus aureus, which included two sister pairs of closely related genomes (MW2/MSSA476 and Mu50/N315) and the more distantly related MRSA252. A total of 108 loci were unusual in lacking any synonymous differences among the five genomes; most of these were short genes encoding proteins highly conserved at the amino acid sequence level (including many ribosomal proteins) or unknown predicted genes. In contrast, 45 genes were identified that showed anomalously high divergence at synonymous sites. The latter genes were evidently introduced by homologous recombination from distantly related genomes, and in many cases, the pattern of nucleotide substitution made it possible to reconstruct the most probable recombination event involved. These recombination events introduced genes encoding proteins that differed in amino acid sequence and thus potentially in function. Several of the proteins are known or likely to be involved in pathogenesis (e.g., staphylocoagulase, exotoxin, Ser-Asp fibrinogen-binding bone sialoprotein-binding protein, fibrinogen and keratin-10 binding surface-anchored protein, fibrinogen-binding protein ClfA, and enterotoxin P). Therefore, the results support the hypothesis that exchange of homologous genes among S. aureus genomes can play a role in the evolution of pathogenesis in this species.
14.
Poe S. Systematic Biology 1998, 47(1): 18-31
Recent studies have shown that addition or deletion of taxa from a data matrix can change the estimate of phylogeny. I used 29 data sets from the literature to examine the effect of taxon sampling on phylogeny estimation within data sets. I then used multiple regression to assess the effect of number of taxa, number of characters, homoplasy, strength of support, and tree symmetry on the sensitivity of data sets to taxonomic sampling. Sensitivity to sampling was measured by mapping characters from a matrix of culled taxa onto optimal trees for that reduced matrix and onto the pruned optimal tree for the entire matrix, then comparing the length of the reduced tree to the length of the pruned complete tree. Within-data-set patterns can be described by a second-order equation relating fraction of taxa sampled to sensitivity to sampling. Multiple regression analyses found number of taxa to be a significant predictor of sensitivity to sampling; retention index, number of informative characters, total support index, and tree symmetry were nonsignificant predictors. I derived a predictive regression equation relating fraction of taxa sampled and number of taxa potentially sampled to sensitivity to taxonomic sampling and calculated values for this equation within the bounds of the variables examined. The length difference between the complete tree and a subsampled tree was generally small (average difference of 0-2.9 steps), indicating that subsampling taxa is probably not an important problem for most phylogenetic analyses using up to 20 taxa.
15.
Phylogenetic relationships among all of the major decapod infraorders have never been estimated using molecular data, while morphological studies produce conflicting results. In the present study, the phylogenetic relationships among the decapod basal suborder Dendrobranchiata and all of the currently recognized decapod infraorders within the suborder Pleocyemata (Caridea, Stenopodidea, Achelata, Astacidea, Thalassinidea, Anomala, and Brachyura) were inferred using 16S mtDNA, 18S and 28S rRNA, and the histone H3 gene. Phylogenies were reconstructed using the model-based methods of maximum likelihood and Bayesian methods coupled with Markov Chain Monte Carlo inference. The phylogenies revealed that the seven infraorders are monophyletic, with high clade support values (bp>70; pP>0.95) under both methods. The two suborders also were recovered as monophyletic, but with weaker support (bp=70; pP=0.74). Although the nodal support values for infraordinal relationships were low (bp<50; pP<0.77), the Anomala and Brachyura were basal to the rest of the 'Reptantia' in both reconstructions, and alternate morphology-based hypotheses were rejected using Bayesian tree topology tests (P<0.01). Newly developed multi-locus Bayesian and likelihood heuristic rate-smoothing methods to estimate divergence times were compared using eight fossil and geological calibrations. Estimated times revealed that the Decapoda originated earlier than 437 MYA and that the radiation within the group occurred rapidly, with all of the major lineages present by 325 MYA. Node time estimation under both approaches is severely affected by the number and phylogenetic distribution of the fossil calibrations chosen. For analyses incorporating fossils as fixed ages, more consistent results were obtained by using both shallow and deep or clade-related calibration points. Divergence time estimation using fossils as lower and upper limits performed well with as few as one upper limit and a single deep fossil lower limit calibration.
16.
This article generalizes previous models for codon substitution and rate variation in molecular phylogeny. Particular attention is paid to (1) reversibility, (2) acceptance and rejection of proposed codon changes, (3) varying rates of evolution among codon sites, and (4) the interaction of these sites in determining evolutionary rates. To accommodate spatial variation in rates, Markov random fields rather than Markov chains are introduced. Because these innovations complicate maximum likelihood estimation in phylogeny reconstruction, it is necessary to formulate new algorithms for the evaluation of the likelihood and its derivatives with respect to the underlying kinetic, acceptance, and spatial parameters. To derive the most from maximum likelihood analysis of sequence data, it is useful to compute posterior probabilities assigning residues to internal nodes and evolutionary rate classes to codon sites. It is also helpful to search through tree space in a way that respects accepted phylogenetic relationships. Our phylogeny program LINNAEUS implements algorithms realizing these goals. Readers may consult our companion article in this issue for several examples.
17.
We prove that a wide class of Markov models of neighbor-dependent substitution processes on the integer line is solvable. This class contains some models of nucleotidic substitutions recently introduced and studied empirically by molecular biologists. We show that the polynucleotidic frequencies at equilibrium solve some finite-size linear systems. This provides, for the first time to our knowledge, explicit and algebraic formulas for the stationary frequencies of non-degenerate neighbor-dependent models of DNA substitutions. Furthermore, we show that the dynamics of these stochastic processes and their distribution at equilibrium exhibit some stringent, rather unexpected, independence properties. For example, nucleotidic sites at distance at least three evolve independently, and all the sites, when encoded as purines and pyrimidines, evolve independently.
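The purine/pyrimidine encoding mentioned in the abstract above can be sketched as follows. This only shows the recoding step (A and G become R; C and T become Y), not the substitution models or the independence result; the function name `to_ry` and the sample sequence are hypothetical.

```python
# Purines (A, G) map to R; pyrimidines (C, T) map to Y.
RY_CODE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def to_ry(sequence):
    """Recode a DNA string into the two-letter purine/pyrimidine alphabet."""
    return "".join(RY_CODE[base] for base in sequence.upper())


def count_ypr(sequence):
    """Count pyrimidine-purine (YpR) dinucleotides, such as CpG, in a DNA string."""
    ry = to_ry(sequence)
    return sum(1 for i in range(len(ry) - 1) if ry[i:i + 2] == "YR")


example = "ACGTCG"  # made-up sequence for illustration
print(to_ry(example))      # RYRYYR
print(count_ypr(example))  # 2
```

Characters other than A, C, G, and T raise a `KeyError` in this sketch; handling of ambiguity codes is left out deliberately.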
18.
In 1994, Muse and Gaut (MG) and Goldman and Yang (GY) proposed evolutionary models that recognize the coding structure of the nucleotide sequences under study, by defining a Markovian substitution process with a state space consisting of the 61 sense codons (assuming the universal genetic code). Several variations and extensions to their models have since been proposed, but no general and flexible framework for contrasting the relative performance of alternative approaches has yet been applied. Here, we compute Bayes factors to evaluate the relative merit of several MG and GY styles of codon substitution models, including recent extensions acknowledging heterogeneous nonsynonymous rates across sites, as well as selective effects inducing uneven amino acid or codon preferences. Our results on three real data sets support a logical model construction following the MG formulation, allowing for a flexible account of global amino acid or codon preferences, while maintaining distinct parameters governing overall nucleotide propensities. Through posterior predictive checks, we highlight the importance of such a parameterization. Altogether, the framework presented here suggests a broad modeling project in the MG style, stressing the importance of combining and contrasting available model formulations and grounding developments in a sound probabilistic paradigm.
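The state space of 61 sense codons mentioned above can be enumerated directly. The sketch below assumes the universal genetic code, whose stop codons are TAA, TAG, and TGA; the helper `differs_at_one_position` is a hypothetical name, included only because codon models are usually described in terms of changes at a single codon position, and it is not taken from the MG or GY papers.

```python
from itertools import product

# Stop codons under the universal genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

all_codons = ["".join(triplet) for triplet in product("TCAG", repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]


def differs_at_one_position(codon_a, codon_b):
    """Return True if the two codons differ at exactly one nucleotide position."""
    return sum(a != b for a, b in zip(codon_a, codon_b)) == 1


print(len(all_codons))    # 64
print(len(sense_codons))  # 61
```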
19.
Statistical tests of models of DNA substitution
Nick Goldman. Journal of Molecular Evolution 1993, 36(2): 182-198
Summary: Penny et al. have written that "The most fundamental criterion for a scientific method is that the data must, in principle, be able to reject the model. Hardly any [phylogenetic] tree-reconstruction methods meet this simple requirement." The ability to reject models is of such great importance because the results of all phylogenetic analyses depend on their underlying models; to have confidence in the inferences, it is necessary to have confidence in the models. In this paper, a test statistic suggested by Cox is employed to test the adequacy of some statistical models of DNA sequence evolution used in the phylogenetic inference method introduced by Felsenstein. Monte Carlo simulations are used to assess significance levels. The resulting statistical tests provide an objective and very general assessment of all the components of a DNA substitution model; more specific versions of the test are devised to test individual components of a model. In all cases, the new analyses have the additional advantage that values of phylogenetic parameters do not have to be assumed in order to perform the tests.
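The use of Monte Carlo simulation to assess significance can be sketched generically. The function below computes an empirical p-value by ranking an observed test statistic among statistics obtained from data simulated under the null model. It is a generic Monte Carlo recipe, not the specific statistic or procedure of the paper, and the numbers are hypothetical.

```python
def monte_carlo_p_value(observed, simulated):
    """Empirical one-sided p-value of an observed statistic.

    `simulated` holds statistics computed on data sets simulated under the
    null model. The +1 terms count the observed value as one of the draws,
    so the p-value can never be exactly zero.
    """
    n_as_extreme = sum(1 for value in simulated if value >= observed)
    return (1 + n_as_extreme) / (1 + len(simulated))


# Hypothetical test statistics; in practice each simulated value would come
# from refitting the model to a data set simulated under that model.
observed_statistic = 5.0
simulated_statistics = [1.0, 2.0, 3.0, 6.0]
print(monte_carlo_p_value(observed_statistic, simulated_statistics))  # 0.4
```

A small p-value would indicate that the observed statistic is unusual under the model, which is the sense in which the data can reject it.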
20.
On the basis of 1,290 bp sequences of the chloroplast gene rbcL, a molecular phylogeny of seven of nine genera of the Celtidaceae and four of six genera of the Ulmaceae was produced. These data were analyzed together with some other urticalean genera using three methods (i.e., maximum parsimony, maximum likelihood, and neighbor joining methods). Maximum likelihood topology among 18 trees obtained indicated that the Urticales are monophyletic with its common clade splitting basally into two: one leading to a line comprising Ampelocera (traditionally placed in Celtidaceae) and Ulmaceae, and the other leading to a line comprising the remaining genera of Celtidaceae, Moraceae, and other Urticales. Ulmaceae, to which Ampelocera is a sister group, are monophyletic, as supported by many lines of morphological evidence. In contrast to Ulmaceae, the monophyly of Celtidaceae (excluding Ampelocera) was not supported, and resolution of relationships of Celtidaceae with other Urticales, as well as of those within the family, is left for future study.