Similar articles
 Found 20 similar articles; search time: 31 ms
1.
Standard methods of phylogenetic reconstruction are based on models that assume homogeneity of nucleotide composition among taxa. However, this assumption is often violated in biological data sets. In this study, we examine possible effects of nucleotide heterogeneity among lineages on the phylogenetic reconstruction of a bacterial group that spans a wide range of genomic nucleotide contents: obligately endosymbiotic bacteria and free-living or commensal species in the gamma-Proteobacteria. We focus on AT-rich primary endosymbionts to better understand the origins of obligately intracellular lifestyles. Previous phylogenetic analyses of this bacterial group point to the importance of accounting for base compositional variation in estimating relationships, particularly between endosymbiotic and free-living taxa. Here, we develop an approach to compare susceptibility of various phylogenetic reconstruction methods to the effects of nucleotide heterogeneity. First, we identify candidate trees of gamma-Proteobacteria groEL and 16S rRNA using approaches that assume homogeneous and stationary base composition, including Bayesian, maximum likelihood, parsimony, and distance methods. We then create permutations of the resulting candidate trees by varying the placement of the AT-rich endosymbiont Buchnera. These permutations are evaluated under the nonhomogeneous and nonstationary maximum likelihood model of Galtier and Gouy, which allows equilibrium base content to vary among examined lineages. Our results show that commonly used phylogenetic methods produce incongruent trees of the Enterobacteriales, and that the placement of Buchnera is especially unstable. However, under a nonhomogeneous model, various groEL and 16S rRNA phylogenies that separate Buchnera from other AT-rich endosymbionts (Blochmannia and Wigglesworthia) have consistently and significantly higher likelihood scores. Blochmannia and Wigglesworthia appear to have evolved from secondary endosymbionts, and represent an origin of primary endosymbiosis that is independent from Buchnera. This application of a nonhomogeneous model offers a computationally feasible way to test specific phylogenetic hypotheses for taxa with heterogeneous and nonstationary base composition.

2.
Genome-scale phylogeny and the detection of systematic biases   Cited by: 17 (self-citations: 0, citations by others: 17)
Phylogenetic inference from sequences can be misled by both sampling (stochastic) error and systematic error (nonhistorical signals where reality differs from our simplified models). A recent study of eight yeast species using 106 concatenated genes from complete genomes showed that even small internal edges of a tree received 100% bootstrap support. This effective negation of stochastic error from large data sets is important, but longer sequences exacerbate the potential for biases (systematic error) to be positively misleading. Indeed, when we analyzed the same data set using minimum evolution optimality criteria, an alternative tree received 100% bootstrap support. We identified a compositional bias as responsible for this inconsistency and showed that it is reduced effectively by coding the nucleotides as purines and pyrimidines (RY-coding), reinforcing the original tree. Thus, a comprehensive exploration of potential systematic biases is still required, even though genome-scale data sets greatly reduce sampling error.
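
The RY-coding mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual convention that purines (A, G) are written R and pyrimidines (C, T, or U in RNA) are written Y; the function name `ry_code` is hypothetical, and passing gaps and ambiguity symbols through unchanged is a choice made only for this illustration, not something the abstract specifies.

```python
# Purines (A, G) become R; pyrimidines (C, T, U) become Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_code(sequence: str) -> str:
    """Recode a nucleotide sequence into the two-state R/Y alphabet.

    Symbols outside A, C, G, T, U (gaps, N, other ambiguity codes) are
    passed through unchanged; this is an assumption of the sketch.
    """
    return "".join(RY_MAP.get(base, base) for base in sequence.upper())


# Two sequences with very different G+C content can share one RY pattern.
at_rich = "ATATAT"
gc_rich = "GCGCGC"
same_pattern = ry_code(at_rich) == ry_code(gc_rich)
```

Because transitions (A/G and C/T changes) disappear under this recoding, only transversions remain informative, which is one reason the recoding is less sensitive to differences in G+C content among taxa.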

3.
Lercher MJ, Hurst LD. Gene, 2002, 300(1-2): 53-58
One of the most abiding controversies in evolutionary biology concerns the role of neutral processes in molecular evolution. A main focus of the debate has been the evolution of isochores, the strong and systematic variation of base composition in mammalian genomes. One set of hypotheses argues that regions of similar GC are due to localised mutational biases coupled with neutral evolution. The alternatives point to either selection or biased gene conversion as mechanisms to preferentially remove A or T bases, favouring G and C instead. Using a novel method, we compare models including such fixation biases to models based on mutation bias alone, under the assumption that non-coding, non-repetitive human DNA is at compositional equilibrium. Although the models fail to fully explain the allele frequency distributions of recent single nucleotide polymorphism data, we show that the data are best fitted if the mutation bias is assumed to be constant across the genome, while fixation bias varies with GC content. We also attempt to estimate the strength of fixation bias, which increases linearly with increasing GC. Our approximation suggests that this force exists within the necessary parameter range: it is not so weak as to be drowned by random drift, but not so strong as to lead to exclusive use of G and C alone. Together these results demonstrate that mutation bias fails to explain the evolution of isochores, and suggest that either selection or biased gene conversion is involved.

4.
The evolution of isochores: evidence from SNP frequency distributions   Cited by: 4 (self-citations: 0, citations by others: 4)
Lercher MJ, Smith NG, Eyre-Walker A, Hurst LD. Genetics, 2002, 162(4): 1805-1810
The large-scale systematic variation in nucleotide composition along mammalian and avian genomes has been a focus of the debate between neutralist and selectionist views of molecular evolution. Here we test whether the compositional variation is due to mutation bias using two new tests, which do not assume compositional equilibrium. In the first test we assume a standard population genetics model, but in the second we make no assumptions about the underlying population genetics. We apply the tests to single-nucleotide polymorphism data from noncoding regions of the human genome. Both models of neutral mutation bias fit the frequency distributions of SNPs segregating in low- and medium-GC-content regions of the genome adequately, although both suggest compositional nonequilibrium. However, neither model fits the frequency distribution of SNPs from the high-GC-content regions. In contrast, a simple population genetics model that incorporates selection or biased gene conversion cannot be rejected. The results suggest that mutation biases are not solely responsible for the compositional biases found in noncoding regions.

5.
With growing amounts of genome data and constant improvement of models of molecular evolution, phylogenetic reconstruction has become more reliable. However, our knowledge of the real process of molecular evolution is still limited. When sufficiently large data sets are analyzed, any subtle biases in statistical models can support incorrect topologies significantly because of the high signal-to-noise ratio. We propose a procedure to locate sequences in a multidimensional vector space (MVS), in which the geometry of the space is uniquely determined in such a way that the vectors of sequence evolution are orthogonal among different branches. In this paper, the MVS approach is developed to detect and remove biases in models of molecular evolution caused by unrecognized convergent evolution among lineages or unexpected patterns of substitutions. Biases in the estimated pairwise distances are identified as deviations (outliers) of sequence spatial vectors from the expected orthogonality. Modifications to the estimated distances are made by minimizing an index to quantify the deviations. In this way, it becomes possible to reconstruct the phylogenetic tree, taking account of possible biases in the model of molecular evolution. The efficacy of the modification procedure was verified by simulating evolution on various topologies with rate heterogeneity and convergent change. The phylogeny of placental mammals in previous analyses of large data sets has varied according to the genes being analyzed. Systematic deviations caused by convergent evolution were detected by our procedure in all representative data sets and were found to strongly affect the tree structure. However, the bias correction yielded a consistent topology among data sets. The existence of strong biases was validated by examining the sites of convergent evolution between the hedgehog and other species in the mitochondrial data set. This convergent evolution explains why it has been difficult to determine the phylogenetic placement of the hedgehog in previous studies.

6.
Despite the degeneracy of the genetic code, whereby different codons encode the same amino acid, alternative codons and amino acids are utilized nonrandomly within and between genomes. Such biases in codon and amino acid usage have been demonstrated extensively in prokaryote genomes and likely reflect a balance between the action of mutation, selection, and genetic drift. Here, we quantify the effects of selection and mutation drift as causes of codon and amino acid-usage bias in a large collection of nematode partial genomes from 37 species spanning approximately 700 Myr of evolution, as inferred from expressed sequence tag (EST) measures of gene expression and from base composition variation. Average G + C content at silent sites among these taxa ranges from 10% to 63%, and EST counts range more than 100-fold, underlying marked differences between the identities of major codons and optimal codons for a given species as well as influencing patterns of amino acid abundance among taxa. Few species in our sample demonstrate a dominant role of selection in shaping intragenomic codon-usage biases, and these are principally free living rather than parasitic nematodes. This suggests that deviations in effective population size among species, with small effective sizes among parasites, are partly responsible for species differences in the extent to which selection shapes patterns of codon usage. Nevertheless, a consensus set of optimal codons emerges that is common to most taxa, indicating that, with some notable exceptions, selection for translational efficiency and accuracy favors similar sets of codons regardless of the major codon-usage trends defined by base compositional properties of individual nematode genomes.

7.
A nonhomogeneous, nonstationary stochastic model of DNA sequence evolution allowing varying equilibrium G + C contents among lineages is devised in order to deal with sequences of unequal base compositions. A maximum-likelihood implementation of this model for phylogenetic analyses allows handling of a reasonable number of sequences. The relevance of the model and the accuracy of parameter estimates are theoretically and empirically assessed, using real or simulated data sets. Overall, a significant amount of information about past evolutionary modes can be extracted from DNA sequences, suggesting that process (rates of distinct kinds of nucleotide substitutions) and pattern (the evolutionary tree) can be simultaneously inferred. G + C contents at ancestral nodes are quite accurately estimated. The new method appears to be useful for phylogenetic reconstruction when base composition varies among compared sequences. It may also be suitable for molecular evolution studies.

8.
Erroneous estimates of ingroup relationships can be caused by attributes in the outgroup chosen to root the tree. Phylogenetic analyses of DNA sequences frequently yield incorrect estimates of ingroup relationships when the outgroup used to "root" the tree is highly divergent from the ingroup. This is especially the case when the outgroup has a different base composition than the ingroup. Unfortunately, in many instances, alternative less divergent outgroups are not available. In such cases, investigators must either target genes with attributes that minimize the problem (slowly evolving genes with stationary base compositions, which are often not ideal for estimating relationships among the more closely related ingroup taxa) or use inference models that are explicitly tailored to deal with an attenuated historical signal with a superimposed non-stationary base composition. In this paper we explore the problem both empirically and through simulation. For the empirical component we looked at the phylogenetic relationships among elasmobranch fishes (sharks and rays), a group whose closest living outgroup, the holocephalan ghost fishes, is separated from the elasmobranchs by more than 100 million years of evolution. We compiled a data set for analysis comprising 10 single-copy nuclear protein-coding genes (12,096 bp) for representatives of the major lineages within elasmobranchs and holocephalans. For the simulation, we used an evolutionary model on a fixed tree topology to generate DNA sequence data sets which varied both in their distance to the outgroup and in their base compositional difference between ingroup and outgroup. Results from both the empirical data set and the simulation support the idea that deviation from base compositional stationarity and distance from the root can act in concert to compromise the accuracy of estimated relationships within the ingroup. We tested several approaches to mitigate such problems. We found that excluding genes with overall faster rates and heterogeneous base compositions, while the least sophisticated of the methods evaluated, seemed to be the most effective.

9.
Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here, we introduce several improvements to indel modeling: 1) While previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; 2) we introduce numerous summary statistics that allow approximate Bayesian computation-based parameter estimation; 3) we develop a method to correct for biases introduced by alignment programs, when inferring indel parameters from empirical data sets; and 4) using a model-selection scheme, we test whether the richer model better fits biological data compared with the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical data sets and that, for the majority of these data sets, the deletion rate is higher than the insertion rate.

10.
SEGMENT: identifying compositional domains in DNA sequences   Cited by: 2 (self-citations: 0, citations by others: 2)
MOTIVATION: DNA sequences are formed by patches or domains of different nucleotide composition. In a few simple sequences, domains can simply be identified by eye; however, most DNA sequences show a complex compositional heterogeneity (fractal structure), which cannot be properly detected by current methods. Recently, a computationally efficient segmentation method to analyse such nonstationary sequence structures, based on the Jensen-Shannon entropic divergence, has been described. Specific algorithms implementing this method are now needed. RESULTS: Here we describe a heuristic segmentation algorithm for DNA sequences, which was implemented in a Windows program (SEGMENT). The program divides a DNA sequence into compositionally homogeneous domains by iterating a local optimization procedure at a given statistical significance. Once a sequence is partitioned into domains, a global measure of sequence compositional complexity (SCC), accounting for both the sizes and compositional biases of all the domains in the sequence, is derived. SEGMENT computes SCC as a function of the significance level, which provides a multiscale view of sequence complexity.
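
The core quantity behind this kind of segmentation, the Jensen-Shannon divergence between the two halves of a sequence, can be sketched as follows. This is not the SEGMENT program: it only finds the single cut point that maximizes the divergence and leaves out the significance test and the iteration described in the abstract. The function names are hypothetical, and the length-weighted form of the divergence is an assumption of this sketch.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the symbol frequencies in seq."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def js_divergence(left: str, right: str) -> float:
    """Length-weighted Jensen-Shannon divergence between two subsequences."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return (shannon_entropy(left + right)
            - (n1 / n) * shannon_entropy(left)
            - (n2 / n) * shannon_entropy(right))


def best_split(seq: str):
    """Return (position, divergence) of the cut with maximal divergence."""
    best_pos, best_div = None, -1.0
    for i in range(1, len(seq)):  # both sides of the cut must be non-empty
        d = js_divergence(seq[:i], seq[i:])
        if d > best_div:
            best_pos, best_div = i, d
    return best_pos, best_div


# A sequence made of two homogeneous patches is cut at the patch boundary.
position, divergence = best_split("A" * 10 + "G" * 10)
```

A full segmentation would apply `best_split` recursively to each resulting domain and stop when the maximal divergence is no longer significant at the chosen level.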

11.
Compositional bias in DNA   Cited by: 6 (self-citations: 0, citations by others: 6)
Experimental approaches, as well as computer analysis on genomic sequences, have revealed a large variability in base composition between regions in the same genome or between genomes of different species. In most cases, however, the biological causes of these compositional biases remain unknown. The recent large increase in the availability of completely sequenced genomes can give new insight into evolution processes involved in these compositional biases.

12.
Compositional changes are a major feature of genome evolution. Overlooking nucleotide composition differences among sequences can seriously mislead phylogenetic reconstructions. Large compositional variation exists among the members of the family Drosophilidae. Until now, however, base composition differences have been largely neglected in the formulations of the nucleotide substitution process used to reconstruct the phylogeny of this important group of species. The present study adopts a maximum-likelihood framework of phylogenetic inference in order to analyze five nuclear gene regions and shows that (1) the pattern of compositional variation in the Drosophilidae does not match the phylogeny of the species; (2) accounting for the heterogeneous GC content with Galtier and Gouy's nucleotide substitution model leads to a tree that differs in significant aspects from the tree inferred when the nucleotide composition differences are ignored, even though both phylogenetic hypotheses attain strong nodal support in the bootstrap analyses; and (3) the LogDet distance correction cannot completely overcome the distorting effects of the compositional variation that exists among the species of the Drosophilidae. Our analyses confidently place the Chymomyza genus as an outgroup closer than the genus Scaptodrosophila to the Drosophila genus and conclusively support the monophyly of the Sophophora subgenus.
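
For readers unfamiliar with the LogDet correction named in this abstract, the following sketch computes one commonly cited paralinear/LogDet-style form, d = -(1/4) ln( det F / sqrt(prod f_x * prod f_y) ), where F is the 4 x 4 matrix of observed site-pattern proportions between two aligned sequences and f_x, f_y are their base frequencies. The exact variant used in the study is not stated in the abstract, so this formula, the function names, and the handling of gaps (sites with non-ACGT symbols are skipped) are all assumptions of the sketch.

```python
from math import log, prod, sqrt

BASES = "ACGT"


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def logdet_distance(seq_x: str, seq_y: str) -> float:
    """LogDet-style distance between two aligned nucleotide sequences."""
    pairs = [(a, b) for a, b in zip(seq_x.upper(), seq_y.upper())
             if a in BASES and b in BASES]
    if not pairs:
        raise ValueError("no comparable sites")
    f = [[0.0] * 4 for _ in range(4)]
    for a, b in pairs:
        f[BASES.index(a)][BASES.index(b)] += 1.0 / len(pairs)
    freq_x = [sum(row) for row in f]                               # row sums
    freq_y = [sum(f[i][j] for i in range(4)) for j in range(4)]    # column sums
    det_f = determinant(f)
    if det_f <= 0.0 or min(freq_x + freq_y) <= 0.0:
        raise ValueError("distance undefined for this pair")
    return -0.25 * log(det_f / sqrt(prod(freq_x) * prod(freq_y)))


d_identical = logdet_distance("ACGTACGT", "ACGTACGT")
d_one_change = logdet_distance("ACGTACGTACGT", "ACGTACGTACGA")
```

The distance is undefined when the determinant is not positive, which happens for short or highly divergent alignments; real implementations handle that case explicitly.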

13.
The number of distinct functional classes of single-stranded RNAs (ssRNAs) and the number of sequences representing them are substantial and continue to increase. Organizing this data in an evolutionary context is essential, yet traditional comparative sequence analyses require that homologous sites can be identified. This prevents comparative analysis between sequences of different functional classes that share no site-to-site sequence similarity. Analysis within a single evolutionary lineage also limits evolutionary inference because shared ancestry confounds properties of molecular structure and function that are historically contingent with those that are imposed for biophysical reasons. Here, we apply a method of comparative analysis to ssRNAs that is not restricted to homologous sequences, and therefore enables comparison between distantly related or unrelated sequences, minimizing the effects of shared ancestry. This method is based on statistical similarities in nucleotide base composition among different functional classes of ssRNAs. In order to denote base composition unambiguously, we have calculated the fraction G+A and G+U content, in addition to the more commonly used fraction G+C content. These three parameters define RNA composition space, which we have visualized using interactive graphics software. We have examined the distribution of nucleotide composition from 15 distinct functional classes of ssRNAs from organisms spanning the universal phylogenetic tree and artificial ribozymes evolved in vitro. Surprisingly, these distributions are biased consistently in G+A and G+U content, both within and between functional classes, regardless of the more variable G+C content. Additionally, an analysis of the base composition of secondary structural elements indicates that paired and unpaired nucleotides, known to have different evolutionary rates, also have significantly different compositional biases. These universal compositional biases observed among ssRNAs sharing little or no sequence similarity suggest, contrary to current understanding, that base composition biases constitute a convergent adaptation among a wide variety of molecular functions.
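
The three-parameter "composition space" described in this abstract can be made concrete with a short sketch. It computes the G+C, G+A and G+U fractions of an RNA sequence and shows why the three values determine the composition unambiguously: the four base fractions sum to 1, so (G+C) + (G+A) + (G+U) = 2G + 1. The function names are hypothetical, and counting only A, C, G, U (with T read as U) is an assumption of this sketch.

```python
def composition_point(rna: str):
    """Return the (G+C, G+A, G+U) fractions of an RNA sequence."""
    seq = rna.upper().replace("T", "U")  # accept DNA-style input
    counts = {base: seq.count(base) for base in "ACGU"}
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence contains no A, C, G or U")
    g = counts["G"]
    return ((g + counts["C"]) / n,
            (g + counts["A"]) / n,
            (g + counts["U"]) / n)


def base_fractions(point):
    """Recover the individual base fractions from a composition point."""
    gc, ga, gu = point
    g = (gc + ga + gu - 1.0) / 2.0  # from (G+C) + (G+A) + (G+U) = 2G + 1
    return {"G": g, "C": gc - g, "A": ga - g, "U": gu - g}


point = composition_point("ACGU")
fractions = base_fractions(point)
```

G+C content alone leaves the split between G and C, and between A and U, undetermined; adding the two other sums removes that ambiguity.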

14.
Dávalos LM, Perkins SL. Genomics, 2008, 91(5): 433-442
Despite recent genome-based advances in understanding Plasmodium molecular evolution and its relationship to disease mechanisms and potential drug development, the phylogenetics of the group is currently limited to single-gene analyses. Here we develop and analyze a set of >100 putative orthologous genes derived from genome comparisons. We aimed to minimize systematic errors that arise when reconstructing the Plasmodium phylogeny with a genome-scale data set by evaluating the congruence of different genes, optimality criteria, and models of sequence evolution with previous studies encompassing fewer characters and more species. Saturation in substitutions and bias in base frequencies at third-codon positions characterized most Plasmodium genes. Molecular evolution models that partitioned rates of change by codon position were best at accounting for these sequence characteristics, as were analyses of amino acid alignments. These methods also ameliorated, but did not entirely avoid, the impact of reduced taxon sampling on phylogeny. The use of these models and expanded taxon sampling are necessary to maximize detection of multiple substitutions, overcome compositional biases, and, ultimately, resolve with confidence the phylogeny of Plasmodium.

15.
Model-based phylogenetic reconstruction methods traditionally assume homogeneity of nucleotide frequencies among sequence sites and lineages. Yet, heterogeneity in base composition is a characteristic shared by most biological sequences. Compositional variation in time, reflected in the compositional biases among contemporary sequences, has already been extensively studied, and its detrimental effects on phylogenetic estimates are known. However, fewer studies have focused on the effects of spatial compositional heterogeneity within genes. We show here that different sites in an alignment do not always share a unique compositional pattern, and we provide examples where nucleotide frequency trends are correlated with the site-specific rate of evolution in RNA genes. Spatial compositional heterogeneity is shown to affect the estimation of evolutionary parameters. With standard phylogenetic methods, estimates of equilibrium frequencies are found to be biased towards the composition observed at fast-evolving sites. Conversely, the ancestral composition estimates of some time-heterogeneous but spatially homogeneous methods are found to be biased towards frequencies observed at invariant and slow-evolving sites. The latter finding challenges the result of a previous study arguing against a hyperthermophilic last universal ancestor from the low apparent G + C content of its rRNA sequences. We propose a new model to account for compositional variation across sites. A Gaussian process prior is used to allow for a smooth change in composition with evolutionary rate. The model has been implemented in the phylogenetic inference software PHASE, and Bayesian methods can be used to obtain the model parameters. The results suggest that this model can accurately capture the observed trends in present-day RNA sequences.

16.
Two spurious nodes were found in phylogenetic analyses of vertebrate rhodopsin sequences in comparison with well-established vertebrate relationships. These spurious reconstructions were well supported in bootstrap analyses and occurred independently of the method of phylogenetic analysis used (parsimony, distance, or likelihood). Use of this data set of vertebrate rhodopsin sequences allowed us to exploit established vertebrate relationships, as well as the considerable amount known about the molecular evolution of this gene, in order to identify important factors contributing to the spurious reconstructions. Simulation studies using parametric bootstrapping indicate that it is unlikely that the spurious nodes in the parsimony analyses are due to long branches or other topological effects. Rather, they appear to be due to base compositional bias at third positions, codon bias, and convergent evolution at nucleotide positions encoding the hydrophobic residues isoleucine, leucine, and valine. LogDet distance methods, as well as maximum-likelihood methods which allow for nonstationary changes in base composition, reduce but do not entirely eliminate support for the spurious resolutions. Inclusion of five additional rhodopsin sequences in the phylogenetic analyses largely corrected one of the spurious reconstructions while leaving the other unaffected. The additional sequences not only were more proximal to the corrected node, but were also found to have intermediate levels of base composition and codon bias as compared with neighboring sequences on the tree. This study shows that the spurious reconstructions can be corrected either by excluding third positions, as well as those encoding the amino acids Ile, Val, and Leu (which may not be ideal, as these sites can contain useful phylogenetic signal for other parts of the tree), or by the addition of sequences that reduce problems associated with convergent evolution.

17.
It has been claimed that blending processes such as trade and exchange have always been more important in the evolution of cultural similarities and differences among human populations than the branching process of population fissioning. In this paper, we report the results of a novel comparative study designed to shed light on this claim. We fitted the bifurcating tree model that biologists use to represent the relationships of species to 21 biological data sets that have been used to reconstruct the relationships of species and/or higher level taxa and to 21 cultural data sets. We then compared the average fit between the biological data sets and the model with the average fit between the cultural data sets and the model. Given that the biological data sets can be confidently assumed to have been structured by speciation, which is a branching process, our assumption was that, if cultural evolution is dominated by blending processes, the fit between the bifurcating tree model and the cultural data sets should be significantly worse than the fit between the bifurcating tree model and the biological data sets. Conversely, if cultural evolution is dominated by branching processes, the fit between the bifurcating tree model and the cultural data sets should be no worse than the fit between the bifurcating tree model and the biological data sets. We found that the average fit between the cultural data sets and the bifurcating tree model was not significantly different from the fit between the biological data sets and the bifurcating tree model. This indicates that the cultural data sets are not less tree-like than are the biological data sets. As such, our analysis does not support the suggestion that blending processes have always been more important than branching processes in cultural evolution. We conclude from this that, rather than deciding how cultural evolution has proceeded a priori, researchers need to ascertain which model or combination of models is relevant in a particular case and why.

18.
On reduced amino acid alphabets for phylogenetic inference   Cited by: 1 (self-citations: 0, citations by others: 1)
We investigate the use of Markov models of evolution for reduced amino acid alphabets or bins of amino acids. The use of reduced amino acid alphabets can ameliorate effects of model misspecification and saturation. We present algorithms for 2 different ways of automating the construction of bins: minimizing criteria based on properties of rate matrices and minimizing criteria based on properties of alignments. By simulation, we show that in the absence of model misspecification, the loss of information due to binning is found to be insubstantial, and the use of Markov models at the binned level is found to be almost as effective as the more appropriate missing data approach. By applying these approaches to real data sets where compositional heterogeneity and/or saturation appear to be causing biased tree estimation, we find that binning can improve topological estimation in practice.
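
What "binning" amino acids means in practice can be shown with a small recoding sketch. The six groups used here are the widely used Dayhoff classes, chosen only as a familiar example; they are not the bins produced by the algorithms described in the abstract, and the names `DAYHOFF6` and `recode` are hypothetical.

```python
# Six-state Dayhoff grouping, used here purely as an illustrative binning.
DAYHOFF6 = {
    "AGPST": "1",  # small
    "C": "2",      # cysteine
    "DENQ": "3",   # acidic and amide
    "FWY": "4",    # aromatic
    "HKR": "5",    # basic
    "ILMV": "6",   # hydrophobic
}

# Flatten to a per-residue lookup table.
BIN_OF = {aa: label for group, label in DAYHOFF6.items() for aa in group}


def recode(protein: str) -> str:
    """Replace each amino acid by its bin label; other symbols are kept."""
    return "".join(BIN_OF.get(aa, aa) for aa in protein.upper())


# Substitutions within a bin (here I <-> L <-> V) become invisible.
same_after_binning = recode("MKVLI") == recode("MKILV")
```

After recoding, a Markov model is fitted on the reduced alphabet, so exchanges within a bin, which are often the most saturated, no longer contribute to the inference.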

19.
Efficient likelihood computations with nonreversible models of evolution   Cited by: 4 (self-citations: 0, citations by others: 4)
Recent advances in heuristics have made maximum likelihood phylogenetic tree estimation tractable for hundreds of sequences. Notably, these algorithms are currently limited to reversible models of evolution, in which Felsenstein's pulley principle applies. In this paper we show that by reorganizing the way likelihood is computed, one can efficiently compute the likelihood of a tree from any of its nodes with a nonreversible model of DNA sequence evolution, and hence benefit from cutting-edge heuristics. This computational trick can be used with reversible models of evolution without any extra cost. We then introduce nhPhyML, the adaptation of the nonhomogeneous nonstationary model of Galtier and Gouy (1998; Mol. Biol. Evol. 15:871-879) to the structure of PhyML, as well as an approximation of the model in which the set of equilibrium frequencies is limited. This new version shows good results both in terms of exploration of the space of tree topologies and ancestral G+C content estimation. Finally, we apply it to the slowly evolving sites of rRNA sequences and conclude that the model and a wider taxonomic sampling still do not argue for a hyperthermophilic last universal common ancestor.

20.
Galtier N, Bazin E, Bierne N. Genetics, 2006, 172(1): 221-228
The study of base composition evolution in Drosophila has been achieved mostly through the analysis of coding sequences. Third codon position GC content, however, is influenced by both neutral forces (e.g., mutation bias) and natural selection for codon usage optimization. In this article, large data sets of noncoding DNA sequence polymorphism in D. melanogaster and D. simulans were gathered from public databases to try to disentangle these two factors, because noncoding sequences are not affected by selection for codon usage. Allele frequency analyses revealed an asymmetric pattern of AT vs. GC noncoding polymorphisms: AT --> GC mutations are less numerous, and tend to segregate at a higher frequency, than GC --> AT ones, especially at GC-rich loci. This is indicative of nonstationary evolution of base composition and/or of GC-biased allele transmission. Fitting population genetics models to the allele frequency spectra confirmed this result and favored the hypothesis of a biased transmission. These results, together with previous reports, suggest that GC-biased gene conversion has influenced base composition evolution in Drosophila and explain the correlation between intron and exon GC content.
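
The third codon position GC content referred to in this abstract (often abbreviated GC3) is simple to compute, and a short sketch may help. It assumes the input is a coding sequence that starts in frame; the function name `gc3` is hypothetical, and ignoring a trailing incomplete codon and any non-ACGT symbols is a choice made for this illustration.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C among third codon positions of an in-frame CDS."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3          # drop a trailing partial codon
    third_positions = seq[2:usable:3]         # indices 2, 5, 8, ...
    counted = [base for base in third_positions if base in "ACGT"]
    if not counted:
        raise ValueError("no usable third codon positions")
    return sum(base in "GC" for base in counted) / len(counted)


# Codons ATG GCC AAA TTT have third positions G, C, A, T.
example = gc3("ATGGCCAAATTT")
```

Because most third-position changes are synonymous, GC3 reflects both mutation bias and selection on codon usage, which is why the study turns to noncoding polymorphism to separate the two.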
