Similar literature
1.
X Liu  H Liu  W Guo  K Yu 《Gene》2012,509(1):136-141
Codon models are now widely used to draw evolutionary inferences from alignments of homologous sequence data. Incorporating physicochemical properties of amino acids into codon models, two novel codon substitution models describing the evolution of protein-coding DNA sequences are presented based on the similarity scores of amino acids. To describe substitutions between codons, a continuous-time Markov process is used. Transition/transversion rate bias and nonsynonymous codon usage bias are allowed in the models. In our implementation, the parameters are estimated by the maximum-likelihood (ML) method, as in previous studies. Furthermore, instantaneous mutations involving more than one nucleotide position of a codon are considered in the second model. The two suggested models are then applied to five real data sets. The results indicate that the new codon models, which consider physicochemical properties of amino acids, can provide a better fit to the data than existing codon models and produce more reliable estimates of certain biologically important measures than existing methods.

2.
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.  相似文献   

3.
A codon-based model of nucleotide substitution for protein-coding DNA sequences   Cited by: 34 (self-citations: 23, citations by others: 11)
A codon-based model for the evolution of protein-coding DNA sequences is presented for use in phylogenetic estimation. A Markov process is used to describe substitutions between codons. Transition/transversion rate bias and codon usage bias are allowed in the model, and selective restraints at the protein level are accommodated using physicochemical distances between the amino acids coded for by the codons. Analyses of two data sets suggest that the new codon-based model can provide a better fit to data than can nucleotide-based models and can produce more reliable estimates of certain biologically important measures such as the transition/transversion rate ratio and the synonymous/nonsynonymous substitution rate ratio.   相似文献   

4.
Approximate methods for estimating the numbers of synonymous and nonsynonymous substitutions between two DNA sequences involve three steps: counting of synonymous and nonsynonymous sites in the two sequences, counting of synonymous and nonsynonymous differences between the two sequences, and correcting for multiple substitutions at the same site. We examine complexities involved in those steps and propose a new approximate method that takes into account two major features of DNA sequence evolution: transition/transversion rate bias and base/codon frequency bias. We compare the new method with maximum likelihood, as well as several other approximate methods, by examining infinitely long sequences, performing computer simulations, and analyzing a real data set. The results suggest that when there are transition/transversion rate biases and base/codon frequency biases, previously described approximate methods for estimating the nonsynonymous/synonymous rate ratio may involve serious biases, and the bias can be both positive and negative. The new method is, in general, superior to earlier approximate methods and may be useful for analyzing large data sets, although maximum likelihood appears to always be the method of choice.  相似文献   
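The second of the three steps described above, counting synonymous and nonsynonymous differences, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method proposed in the abstract: it assumes the standard genetic code, two gap-free aligned coding sequences of equal length, and it only classifies codon pairs that differ at exactly one position. The function names `is_transition` and `count_differences` are hypothetical. Site counting and the correction for multiple substitutions are not covered.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

PURINES = {"A", "G"}


def is_transition(x, y):
    """True if x -> y stays within purines or within pyrimidines."""
    return x != y and ((x in PURINES) == (y in PURINES))


def count_differences(seq1, seq2):
    """Classify single-nucleotide codon differences between two aligned sequences.

    Assumes equal-length, gap-free sequences whose length is a multiple of 3.
    Codon pairs differing at more than one position are only tallied as
    'multiple', because classifying them requires choosing a mutational pathway.
    """
    counts = {"syn_ts": 0, "syn_tv": 0, "nonsyn_ts": 0, "nonsyn_tv": 0, "multiple": 0}
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        diffs = [(a, b) for a, b in zip(c1, c2) if a != b]
        if not diffs:
            continue
        if len(diffs) > 1:
            counts["multiple"] += 1
            continue
        a, b = diffs[0]
        kind = "syn" if CODON_TABLE[c1] == CODON_TABLE[c2] else "nonsyn"
        change = "ts" if is_transition(a, b) else "tv"
        counts[f"{kind}_{change}"] += 1
    return counts


# TTT->TTC and CTA->CTG are synonymous transitions; GGG->GCG is a nonsynonymous transversion.
print(count_differences("TTTCTAGGG", "TTCCTGGCG"))
```

Keeping transitions and transversions in separate tallies reflects the abstract's point that transition/transversion bias matters for the estimate; how those tallies are then weighted is left open here.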

5.
PAML 4: phylogenetic analysis by maximum likelihood   Cited by: 42 (self-citations: 1, citations by others: 41)
PAML, currently in version 4, is a package of programs for phylogenetic analyses of DNA and protein sequences using maximum likelihood (ML). The programs may be used to compare and test phylogenetic trees, but their main strengths lie in the rich repertoire of evolutionary models implemented, which can be used to estimate parameters in models of sequence evolution and to test interesting biological hypotheses. Uses of the programs include estimation of synonymous and nonsynonymous rates (dN and dS) between two protein-coding DNA sequences, inference of positive Darwinian selection through phylogenetic comparison of protein-coding genes, reconstruction of ancestral genes and proteins for molecular restoration studies of extinct life forms, combined analysis of heterogeneous data sets from multiple gene loci, and estimation of species divergence times incorporating uncertainties in fossil calibrations. This note discusses some of the major applications of the package, which includes example data sets to demonstrate their use. The package is written in ANSI C, and runs under Windows, Mac OSX, and UNIX systems. It is available at http://abacus.gene.ucl.ac.uk/software/paml.html.

6.
Empirical models of substitution are often used in protein sequence analysis because the large alphabet of amino acids requires that many parameters be estimated in all but the simplest parametric models. When information about structure is used in the analysis of substitutions in structured RNA, a similar situation occurs. The number of parameters necessary to adequately describe the substitution process increases in order to model the substitution of paired bases. We have developed a method to obtain substitution rate matrices empirically from RNA alignments that include structural information in the form of base pairs. Our data consisted of alignments from the European Ribosomal RNA Database of Bacterial and Eukaryotic Small Subunit and Large Subunit Ribosomal RNA ( Wuyts et al. 2001. Nucleic Acids Res. 29:175-177; Wuyts et al. 2002. Nucleic Acids Res. 30:183-185). Using secondary structural information, we converted each sequence in the alignments into a sequence over a 20-symbol code: one symbol for each of the four individual bases, and one symbol for each of the 16 ordered pairs. Substitutions in the coded sequences are defined in the natural way, as observed changes between two sequences at any particular site. For given ranges (windows) of sequence divergence, we obtained substitution frequency matrices for the coded sequences. Using a technique originally developed for modeling amino acid substitutions ( Veerassamy, Smith, and Tillier. 2003. J. Comput. Biol. 10:997-1010), we were able to estimate the actual evolutionary distance for each window. The actual evolutionary distances were used to derive instantaneous rate matrices, and from these we selected a universal rate matrix. The universal rate matrices were incorporated into the Phylip Software package ( Felsenstein 2002. http://evolution.genetics.washington.edu/phylip.html), and we analyzed the ribosomal RNA alignments using both distance and maximum likelihood methods. 
The empirical substitution models performed well on simulated data, and produced reasonable evolutionary trees for 16S ribosomal RNA sequences from sequenced Bacterial genomes. Empirical models have the advantage of being easily implemented, and the fact that the code consists of 20 symbols makes the models easily incorporated into existing programs for protein sequence analysis. In addition, the models are useful for simulating the evolution of RNA sequence and structure simultaneously.  相似文献   
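The 20-symbol recoding described above (four symbols for unpaired bases plus 16 for ordered base pairs) might look roughly like the sketch below. The abstract does not say how the structural information is stored, so the dot-bracket input format, the two-letter pair symbols, and the names `pair_table` and `encode` are all assumptions made for illustration.

```python
def pair_table(structure):
    """Map each paired position to its partner, given a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return partner


def encode(seq, structure):
    """Recode an RNA sequence over a 20-symbol alphabet.

    Unpaired positions keep their single base (4 symbols); paired positions
    become the ordered pair (this base, partner base), giving 16 more symbols.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    partner = pair_table(structure)
    return [seq[i] + seq[partner[i]] if i in partner else seq[i]
            for i in range(len(seq))]


# A tiny hairpin: positions 0 and 4 pair, the three loop bases stay unpaired.
print(encode("GAAAC", "(...)"))
```

Because the recoded alphabet has exactly 20 symbols, a recoded alignment can in principle be passed to software that expects amino acid sequences, which is the convenience the abstract points out.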

7.
Patterns of substitution in chloroplast-encoded trnL-F regions were compared between species of Actaea (Ranunculales), Digitalis (Scrophulariales), Drosera (Caryophyllales), Panicoideae (Poales), the small chromosome species clade of Pelargonium (Geraniales), each representing a different order of flowering plants, and Huperzia (Lycopodiales). In total, the study included 265 taxa, each with > 900-bp sequences, totaling 0.24 Mb. Both pairwise and phylogeny-based comparisons were used to assess nucleotide substitution patterns. In all six groups, we found that transition/transversion ratios, as estimated by maximum likelihood on most-parsimonious trees, ranged between 0.8 and 1.0 for ingroups. These values occurred both at low sequence divergences, where substitutional saturation, i.e., multiple substitutions having occurred at the same (homologous) nucleotide position, was not expected, and at higher levels of divergence. This suggests that the angiosperm trnL-F regions evolve in a pattern different from that generally observed for nuclear and animal mtDNA (transition/transversion ratio ≥ 2). Transition/transversion ratios in the intron and the spacer region differed in all alignments compared, yet base compositions between the regions were highly similar in all six groups. A<->T and G<->C transversions were significantly less frequent than the other four substitution types. This correlates with results from studies on fidelity mechanisms in DNA replication that predict A<->T and G<->C transversions to be least likely to occur. It therefore strengthens confidence in the link between mutation bias at the polymerase level and the actual fixation of substitutions as recorded on evolutionary trees, and concomitantly, in the neutrality of nucleotide substitutions as phylogenetic markers.

8.
MOTIVATION: The two mutation processes that have the largest impact on genome evolution at small scales are substitutions, and sequence insertions and deletions (indels). While the former have been studied extensively, indels have received less attention, and in particular, the problem of inferring indel rates between pairs of divergent sequences remains unsolved. Here, I describe a novel and accurate method for estimating neutral indel rates between divergent pairs of genomes. RESULTS: Simulations suggest that the new method for estimating indel rates is accurate to within 2% at divergences corresponding to that of human and mouse. Applying the method to these species, I show that indel rates are up to twice as high as is apparent from alignments, and depend strongly on the local G + C content. These results indicate that at these evolutionary distances, the contribution of indels to sequence divergence is much larger than hitherto appreciated. In particular, the ratio of substitution to indel rates between human and mouse appears to be around gamma = 8, rather than the currently accepted value of about gamma = 14.

9.
Among the fundamental problems in molecular evolution and in the analysis of homologous sequences are alignment, phylogeny reconstruction, and the reconstruction of ancestral sequences. This paper presents a fast, combined solution to these problems. The new algorithm gives an approximation to the minimal history in terms of a distance function on sequences. The distance function on sequences is a minimal weighted path length constructed from substitutions and insertions-deletions of segments of any length. Substitutions are weighted with an arbitrary metric on the set of nucleotides or amino acids, and indels are weighted with a gap penalty function of the form g_k = a + (b × k), where k is the length of the indel and a and b are two positive numbers. A novel feature is the introduction of the concept of sequence graphs and a generalization of the traditional dynamic sequence comparison algorithm to the comparison of sequence graphs. Sequence graphs ease several computational problems. They are used to represent large sets of sequences that can then be compared simultaneously. Furthermore, they allow the handling of multiple, equally good, alignments, where previous methods were forced to make arbitrary choices. A program written in C implemented this method; it was tested first on 22 5S RNA sequences.
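The gap penalty described above, a fixed cost a plus a cost b for each unit of indel length k, can be written out as a small sketch. The function names `gap_penalty` and `path_weight` and the example parameter values are hypothetical; the abstract only requires that a and b be positive.

```python
def gap_penalty(k, a, b):
    """Affine penalty for a single indel of length k: a + b * k."""
    if k < 1:
        raise ValueError("indel length k must be at least 1")
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return a + b * k


def path_weight(substitution_costs, indel_lengths, a, b):
    """Total weight of one path: summed substitution costs plus indel penalties."""
    return sum(substitution_costs) + sum(gap_penalty(k, a, b) for k in indel_lengths)


# One indel of length 5 costs less than two indels of lengths 2 and 3,
# because the fixed cost a is paid once instead of twice.
print(gap_penalty(5, a=2.0, b=1.0))
print(gap_penalty(2, a=2.0, b=1.0) + gap_penalty(3, a=2.0, b=1.0))
```

With this form of penalty, merging adjacent gaps into one longer indel is always cheaper by exactly a, which is why affine penalties favour fewer, longer indels.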

10.
Specificity of mutations induced in transfected DNA by mammalian cells   Cited by: 29 (self-citations: 1, citations by others: 28)
DNA transfected into mammalian cells is subject to the high mutation frequency of approximately 1% per gene. We present data bearing on the derivation of the two main classes of mutations detected, base substitutions and deletions. The DNA sequence change is reported for nearly 100 independent base substitution mutations that occurred in shuttle vectors as a result of passage in simian cells. All of the mutations occur at G:C base pairs and involve either transition to A:T or transversion to T:A. To identify possible mutational intermediates, various topological forms of the vector DNA were introduced separately. Supercoiled and relaxed DNA are mutated at equal frequencies. However, linearized DNA leads to a greatly elevated frequency of deletions. Nicked and gapped templates stimulate both deletions and base substitutions. We discuss a model involving intracellular degradation of the transfected DNA which explains these observations.  相似文献   

11.
Genes that have experienced accelerated evolutionary rates on the human lineage during recent evolution are candidates for involvement in human-specific adaptations. To determine the forces that cause increased evolutionary rates in certain genes, we analyzed alignments of 10,238 human genes to their orthologues in chimpanzee and macaque. Using a likelihood ratio test, we identified protein-coding sequences with an accelerated rate of base substitutions along the human lineage. Exons evolving at a fast rate in humans have a significant tendency to contain clusters of AT-to-GC (weak-to-strong) biased substitutions. This pattern is also observed in noncoding sequence flanking rapidly evolving exons. Accelerated exons occur in regions with elevated male recombination rates and exhibit an excess of nonsynonymous substitutions relative to the genomic average. We next analyzed genes with significantly elevated ratios of nonsynonymous to synonymous rates of base substitution (dN/dS) along the human lineage, and those with an excess of amino acid replacement substitutions relative to human polymorphism. These genes also show evidence of clusters of weak-to-strong biased substitutions. These findings indicate that a recombination-associated process, such as biased gene conversion (BGC), is driving fixation of GC alleles in the human genome. This process can lead to accelerated evolution in coding sequences and excess amino acid replacement substitutions, thereby generating significant results for tests of positive selection.  相似文献   

12.
Zhang Z  Wang Y  Wang L  Gao P 《PLoS ONE》2010,5(12):e14316

Background

In the process of protein evolution, sequence variations within protein families can cause changes in protein structures and functions. However, structures tend to be more conserved than sequences and functions. This leads to an intriguing question: what is the evolutionary mechanism by which sequence variations produce structural changes? To investigate this question, we focused on the most common types of sequence variations: amino acid substitutions and insertions/deletions (indels). Here their combined effects on protein structure evolution within protein families are studied.

Results

Sequence-structure correlation analysis on 75 homologous structure families (from SCOP) that contain 20 or more non-redundant structures shows that in most of these families there is, statistically, a bilinear correlation between the amount of substitutions and indels versus the degree of structure variations. Bilinear regression of percent sequence non-identity (PNI) and standardized number of gaps (SNG) versus RMSD was performed. The coefficients from the regression analysis could be used to estimate the structure changes caused by each unit of substitution (structural substitution sensitivity, SSS) and by each unit of indel (structural indel sensitivity, SIDS). An analysis on 52 families with high bilinear fitting multiple correlation coefficients and statistically significant regression coefficients showed that SSS is mainly constrained by disulfide bonds, which almost have no effects on SIDS.

Conclusions

Structural changes in homologous protein families could be rationally explained by a bilinear model combining amino acid substitutions and indels. These results may further improve our understanding of the evolutionary mechanisms of protein structures.  相似文献   

13.
14.
It has long been known, from the distribution of multiple amino acid replacements, that not all amino acids of a sequence are replaceable. More recently, the phenomenon was observed at the nucleotide level in mitochondrial DNA even after allowing for different rates of transition and transversion substitutions. We have extended the search to globin gene sequences from various organisms, with the following results: (1) Nearly every data set showed evidence of invariable nucleotide positions. (2) In all data sets, substitution rates of transversions and transitions were never in the ratio of 2/1, and rarely was the ratio even constant. (3) Only rarely (e.g., the third codon position of beta hemoglobins) was it possible to fit the data set solely by making allowance for the number of invariable positions and for the relative rates of transversion and transition substitutions. (4) For one data set (the second codon position of beta hemoglobins) we were able to simulate the observed data by making the allowance in (3) and having the set of covariotides (concomitantly variable nucleotides) be small in number and be turned over in a stochastic manner with a probability that was appreciable. (5) The fit in the latter case suggests, if the assumptions are correct and at all common, that current procedures for estimating the total number of nucleotide substitutions in two genes since their divergence from their common ancestor could be low by as much as an order of magnitude. (6) The fact that only a small fraction of the nucleotide positions differ is no guarantee that one is not seriously underestimating the total amount of divergence (substitutions). (7) Most data sets are so heterogeneous in their number of transition and transversion differences that none of the current models of nucleotide substitution seem to fit them even after (a) segregation of coding from noncoding sequences and (b) splitting of the codon into three subsets by codon position. 
(8) These frequently occurring problems cannot be seen unless several reasonably divergent orthologous genes are examined together.   相似文献   

15.
Microstructural changes such as insertions and deletions (indels) are a major driving force in the evolution of non-coding DNA sequences. To better understand the mechanisms by which indel mutations arise, as well as the molecular evolution of non-coding regions, the number and pattern of indels and nucleotide substitutions were compared across whole chloroplast genomes. Comparisons were made for a total of over 38 kb of non-coding DNA sequence from 126 intergenic regions in two data sets representing species with different divergence times: sugarcane and maize, and Oryza sativa var. indica and japonica. The main findings of this study are: (i) Approximately half of all indels are single-nucleotide indels. This observation agrees with previous studies in various organisms. (ii) The distribution and number of indels differed between the two data sets, and different patterns were observed for tandem repeat and non-repeat indels. (iii) The distribution pattern of tandem repeat indels showed a statistically significant bias towards A/T-rich sequences. (iv) The rate of indel mutation was estimated to be approximately (0.8 ± 0.04) × 10^-9 per site per year, which is similar to previous estimates in other organisms. (v) The frequencies of nucleotide substitutions and indels were significantly lower in the inverted repeat (IR).

16.
To test whether gaps resulting from sequence alignment contain phylogenetic signal concordant with those of base substitutions, we analyzed the occurrence of indel mutations upon a well-resolved, substitution-based tree for three nuclear genes in bumble bees (Bombus, Apidae: Bombini). The regions analyzed were exon and intron sequences of long-wavelength rhodopsin (LW Rh), arginine kinase (ArgK), and elongation factor-1alpha (EF-1alpha) F2 copy genes. LW Rh intron had only a few uninformative gaps, ArgK intron had relatively long gaps that were easily aligned, and EF-1alpha intron had many short gaps, resulting in multiple optimal alignments. The unambiguously aligned gaps within ArgK intron sequences showed no homoplasy upon the substitution-based tree, and phylogenetic signals within ambiguously aligned regions of EF-1alpha intron were highly congruent with those of base substitutions. We further analyzed the contribution of gap characters to phylogenetic reconstruction by incorporating them in parsimony analysis. Inclusion of gap characters consistently improved support for nodes recovered by substitutions, and inclusion of ambiguously aligned regions of EF-1alpha intron resolved several additional nodes, most of which were apical on the phylogeny. We conclude that gaps are an exceptionally reliable source of phylogenetic information that can be used to corroborate and refine phylogenies hypothesized by base substitutions, at least at lower taxonomic levels. At present, full use of gaps in phylogenetic reconstruction is best achieved in parsimony analysis, pending development of well-justified and generally applicable methods for incorporating indels in explicitly model-based methods.  相似文献   

17.
Insertions and deletions of lengths not divisible by 3 in protein-coding sequences cause frameshifts that usually induce premature stop codons and may carry a high fitness cost. However, this cost can be partially offset by a second compensatory indel restoring the reading frame. The role of such pairs of compensatory frameshifting mutations (pCFMs) in evolution has not been studied systematically. Here, we use whole-genome alignments of protein-coding genes of 100 vertebrate species, and of 122 insect species, studying the prevalence of pCFMs in their divergence. We detect a total of 624 candidate pCFM genes; six of them pass stringent quality filtering, including three human genes: RAB36, ARHGAP6, and NCR3LG1. In some instances, amino acid substitutions closely predating or following pCFMs restored the biochemical similarity of the frameshifted segment to the ancestral amino acid sequence, possibly reducing or negating the fitness cost of the pCFM. Typically, however, the biochemical similarity of the frameshifted sequence to the ancestral one was not higher than the similarity of a random sequence of a protein-coding gene to its frameshifted version, indicating that pCFMs can uncover radically novel regions of protein space. In total, pCFMs represent an appreciable and previously overlooked source of novel variation in amino acid sequences.  相似文献   
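The reading-frame arithmetic behind a pair of compensatory frameshifting mutations can be illustrated with a short sketch. It assumes each indel is given as a signed length change (positive for an insertion, negative for a deletion); the function names `is_frameshift` and `is_compensatory_pair` are hypothetical, and this is not the detection pipeline used in the study, which works on whole-genome alignments.

```python
def is_frameshift(length_change):
    """An indel shifts the reading frame when its length is not divisible by 3."""
    return length_change % 3 != 0


def is_compensatory_pair(first, second):
    """True if two frameshifting indels together restore the reading frame.

    Each argument is a signed length change: +n for an insertion of n
    nucleotides, -n for a deletion of n nucleotides.
    """
    return (is_frameshift(first)
            and is_frameshift(second)
            and (first + second) % 3 == 0)


print(is_compensatory_pair(+1, -1))   # insertion then deletion of one base
print(is_compensatory_pair(+1, +2))   # two insertions totalling three bases
print(is_compensatory_pair(+1, +1))   # frame remains shifted
```

Only the segment between the two indels is translated in the shifted frame, which is the frameshifted segment whose similarity to the ancestral sequence the abstract discusses.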

18.
12S ribosomal RNA (rRNA) gene sequences from a suite of mammalian taxa (13 placentals, 4 marsupials, 1 monotreme), for which phylogenetic relationships are well established based on independent criteria, were employed to study the evolution of this gene. Phylogenetic analysis of 12S sequences produces a phylogeny that agrees with expectations. Base composition provides evidence for directional symmetrical substitution pressure in loops; in stems, base composition is much more even. Rates of nucleotide substitution are lower in stems than loops. Patterns of nucleotide substitution show an overall preference for transitions over transversions, with this difference more profound in stems than loops. Among different transversion pathways, there is a wide range of transformation frequencies. An analysis of compensatory substitutions shows that there is strong evidence for their occurrence and that a weighting factor of 0.61 should be applied in phylogenetic analyses to account for the dependence of mutations at stem positions relative to positions where changes are independent. Among stem variables (i.e., stem length, interaction distance, substitution rates, G+C content, and the percentage of bases that are paired), several significant correlations were discovered, but stem length and interaction distance are uncorrelated with other variables.   相似文献   

19.
The pattern of amino acid substitutions and sequence conservation over many structure-based alignments of protein sequences was analyzed as a function of percentage sequence identity. The statistics of the amino acid substitutions were converted into the form of log-odds amino acid substitution matrices to which eigenvalue decomposition was applied. It was found that the most important component of the substitution matrices exhibited a sharp transition at the sequence identity of 30-35%, which coincides with the twilight zone. Above the transition point, the most dominant component is related to the mutability of amino acids and it acts to disfavor any substitutions, whereas below the transition point, the most dominant component is related to the hydrophobicity of amino acids and substitutions between residues of similar hydrophobic character are positively favored. Implications for protein evolution and sequence analysis are discussed.  相似文献   

20.
We study to what degree patterns of amino acid substitution vary between genes using two models of protein-coding gene evolution. The first divides the amino acids into groups, with one substitution rate for pairs of residues in the same group and a second for those in differing groups. Unlike previous applications of this model, the groups themselves are estimated from data by simulated annealing. The second model makes substitution rates a function of the physical and chemical similarity between two residues. Because we model the evolution of coding DNA sequences as opposed to protein sequences, artifacts arising from the differing numbers of nucleotide substitutions required to bring about various amino acid substitutions are avoided. Using 10 alignments of related sequences (five of orthologous genes and five gene families), we do find differences in substitution patterns. We also find that, although patterns of amino acid substitution vary temporally within the history of a gene, variation is not greater in paralogous than in orthologous genes. Improved understanding of such gene-specific variation in substitution patterns may have implications for applications such as sequence alignment and phylogenetic inference.  相似文献   
