Similar literature
 20 similar articles found.
1.
A low rate of simultaneous double-nucleotide mutations in primates   Total citations: 1 (self-citations: 0; citations by others: 1)
The occurrence of double-nucleotide (doublet) mutations is contrary to the normal assumption that point mutations affect single nucleotides. Here we develop a new method for estimating the doublet mutation rate and apply it to more than a megabase of human-chimpanzee-baboon genomic DNA alignments and more than a million human single-nucleotide polymorphisms. The new method accounts for the effect of regional variation in evolutionary rates, which may be a confounding factor in previous estimates of the doublet mutation rate. Furthermore we determine sequence context effects by using sequence comparisons over a variety of lineage lengths. This approach yields a new estimate of the doublet mutation rate of 0.3% of the singleton rate, indicating that doublet mutations are far rarer than previously thought. Our results suggest that doublet mutations are unlikely to have caused the correlation between synonymous and nonsynonymous substitution rates in mammals, and also show that regional variation and sequence context effects play an important role in primate DNA sequence evolution.

2.
A Markov analysis of DNA sequences   Total citations: 12 (self-citations: 0; citations by others: 12)
We present a model in which the DNA sequence is treated as a Markov process. Several workers have suggested that basic biological or chemical features of nucleic acids underlie the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russell & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent the 64 trinucleotide (triplet) frequencies measured in a sequence are determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length.
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
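
The "correlation question" in this abstract can be illustrated with a small sketch. If a sequence were generated by a first-order Markov chain, each triplet frequency would be approximately f(xyz) = f(xy) · f(yz) / f(y), so comparing this prediction with the observed triplet counts hints at the order of the process. The function names `kmer_freqs` and `expected_triplets` are hypothetical, and the sketch deliberately omits the finite-length correction that the abstract identifies as essential; it is not the authors' method.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def expected_triplets(seq):
    """Triplet frequencies predicted from doublet frequencies under a
    first-order Markov assumption: f(xyz) ~ f(xy) * f(yz) / f(y).
    No finite-length correction is applied (illustrative only)."""
    f1 = kmer_freqs(seq, 1)
    f2 = kmer_freqs(seq, 2)
    expected = {}
    for x, y, z in product("ACGT", repeat=3):
        fy = f1.get(y, 0.0)
        if fy > 0:
            expected[x + y + z] = f2.get(x + y, 0.0) * f2.get(y + z, 0.0) / fy
        else:
            expected[x + y + z] = 0.0
    return expected
```

Comparing `expected_triplets(seq)` with `kmer_freqs(seq, 3)` on a real sequence gives a rough picture of how much triplet structure the doublets already explain; for short sequences the differences are dominated by statistical noise, which is the problem the abstract addresses.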

3.

Background  

Existing tools for multiple-sequence alignment focus on aligning protein sequence or protein-coding DNA sequence, and are often based on extensions to Needleman-Wunsch-like pairwise alignment methods. We introduce a new tool, Sigma, with a new algorithm and scoring scheme designed specifically for non-coding DNA sequence. This problem acquires importance with the increasing number of published sequences of closely-related species. In particular, studies of gene regulation seek to take advantage of comparative genomics, and recent algorithms for finding regulatory sites in phylogenetically-related intergenic sequence require alignment as a preprocessing step. Much can also be learned about evolution from intergenic DNA, which tends to evolve faster than coding DNA. Sigma uses a strategy of seeking the best possible gapless local alignments (a strategy earlier used by DiAlign), at each step making the best possible alignment consistent with existing alignments, and scores the significance of the alignment based on the lengths of the aligned fragments and a background model which may be supplied or estimated from an auxiliary file of intergenic DNA.

4.

Background  

While most multiple sequence alignment programs expect that all or most of their input is known to be homologous, and penalise insertions and deletions, this is not a reasonable assumption for non-coding DNA, which is much less strongly conserved than protein-coding genes. Arguing that the goal of sequence alignment should be the detection of homology and not similarity, we incorporate an evolutionary model into a previously published multiple sequence alignment program for non-coding DNA, Sigma, as a sensitive likelihood-based way to assess the significance of alignments. Version 1 of Sigma was successful in eliminating spurious alignments but exhibited relatively poor sensitivity on synthetic data. Sigma 1 used a p-value (the probability under the "null hypothesis" of non-homology) to assess the significance of alignments, and, optionally, a background model that captured short-range genomic correlations. Sigma version 2, described here, retains these features, but calculates the p-value using a sophisticated evolutionary model that we describe here, and also allows for a transition matrix for different substitution rates from and to different nucleotides. Our evolutionary model takes separate account of mutation and fixation, and can be extended to allow for locally differing functional constraints on sequence.

5.
Constructing multiple homologous alignments for protein-coding DNA sequences is crucial for a variety of bioinformatic analyses but remains computationally challenging. With the growing amount of sequence data available and the ongoing efforts largely dependent on protein-coding DNA alignments, there is an increasing demand for a tool that can process a large number of homologous groups and generate multiple protein-coding DNA alignments. Here we present a parallel tool, ParaAT, that can construct multiple protein-coding DNA alignments for a large number of homologs in parallel. As tested on empirical datasets, ParaAT is well suited for large-scale data analysis in the high-throughput era, providing good scalability and exhibiting high parallel efficiency for computationally demanding tasks. ParaAT is freely available for academic use only at http://cbb.big.ac.cn/software.

6.
7.
8.
Warden CD, Kim SH, Yi SV. PLoS ONE 2008, 3(2): e1559
Functional RNAs (fRNAs) are being recognized as an important regulatory component in biological processes. Interestingly, recent computational studies suggest that the number and biological significance of functional RNAs within coding regions (coding fRNAs) may have been underestimated. We hypothesized that such coding fRNAs will impose additional constraint on sequence evolution because the DNA primary sequence has to simultaneously code for functional RNA secondary structures on the messenger RNA in addition to the amino acid codons for the protein sequence. To test this prediction, we first utilized computational methods to predict conserved fRNA secondary structures within multiple species alignments of Saccharomyces sensu stricto genomes. We predict that as much as 5% of the genes in the yeast genome contain at least one functional RNA secondary structure within their protein-coding region. We then analyzed the impact of coding fRNAs on the evolutionary rate of protein-coding genes because a decrease in evolutionary rate implies constraint due to biological functionality. We found that our predicted coding fRNAs have a significant influence on evolutionary rates (especially at synonymous sites), independent of other functional measures. Thus, coding fRNA may play a role in sequence evolution. Given that coding regions of humans and flies contain many more predicted coding fRNAs than yeast, the impact of coding fRNAs on sequence evolution may be substantial in genomes of higher eukaryotes.

9.
10.
In this paper we demonstrate a practical approach to constructing progressive multiple alignments using sequence triplet optimizations rather than a conventional pairwise approach. Using the sequence triplet alignments progressively provides scope for the synthesis of a three-residue exchange amino acid substitution matrix. We develop such a 20 × 20 × 20 matrix for the first time and demonstrate how its use in optimal sequence triplet alignments increases the sensitivity of building multiple alignments. Various comparisons were made between alignments generated using the progressive triplet methods and the conventional progressive pairwise procedure. The assessment of these data reveals that, in general, the triplet-based approaches generate more accurate sequence alignments than the traditional pairwise-based procedures, especially between more divergent sets of sequences.

11.
The doublet or nearest-neighbour ratios of the nucleotides in various computer-generated sequences of DNA have been counted to find out which sequences would have the same ratios as those measured for guinea-pig DNA by Russell et al. (1976). Their data show that the ratio patterns for all nuclear DNA fractions except satellite, ribosomal and tRNA coding DNA are similar irrespective of G+C content and are characterised by the amount of the doublet CpG being less than 30% of that expected on a random basis. To construct and analyse such theoretical sequences, methods have been developed for counting the doublet frequencies that random DNA, and the DNA expected to code for any amino acid (AA) sequence, would show if analysed experimentally. These methods permit the C+G content to be altered and the frequency of the doublet CpG to be lowered without affecting the information stored in the DNA. The former is achieved by selecting codons for a given AA that are high or low in G and C, while the latter requires selecting against triplets that either contain CpG or will cause a CpG to occur between codons. The results show that no DNA sequence that we have been able to construct using the unrestricted genetic code has doublet ratios similar to those observed. However, the DNA expected to code for a group of 27 vertebrate proteins (5237 AAs) of diverse functions has doublet ratios virtually identical to those measured experimentally for the 47% G+C fraction, provided that 77.5% of the CpG is eliminated. The data for the 34–43% G+C fractions are matched well by the protein sequence provided that codons low in G and C are selected and, again, that 77.5% of CpG is eliminated. We have been unable to match the data for the satellite DNA.
A perhaps surprising result was that DNA with a random sequence of nucleotides but subjected to the removal of 80% of CpG had doublet ratios that were similar to the experimental data but matched them less well than the doublets of protein-coding DNA. This result probably does no more than emphasise that a significant part of the match of the pattern of doublet ratios of guinea-pig DNA derives from the elimination of CpG. The effect of evolution (random mutation) on the doublet ratios of protein-coding DNA has been investigated by assuming a mutation rate of 3 × 10⁻⁹ per base per generation (the mutation rate of haemoglobin) and seeing how a great many generations of such mutation affect the doublet ratios. The results show that it will take ~1.5 × 10⁶ generations for the number of termination codons to double and ~2 × 10⁷ generations for the doublet ratios to become indistinguishable from those of random DNA. This seems to imply either that selection acts over the whole of the DNA or that the mutation rate of haemoglobin DNA is unusually high. The results, as a whole, support the view that, whether or not all non-satellite DNA actually codes for protein, its sequences are similar to those that would code for proteins.
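
The doublet ratio discussed in this abstract compares how often a dinucleotide is observed with how often it would be expected if neighbouring bases were independent. A minimal sketch of that calculation follows; the function name `doublet_ratio` is hypothetical, and the expectation used here (the product of the two single-base frequencies) is one common convention, assumed rather than taken from the paper.

```python
def doublet_ratio(seq, doublet):
    """Observed / expected frequency of a dinucleotide such as "CG".

    Expected frequency assumes neighbouring bases are independent:
    f(x) * f(y). A ratio well below 1 indicates depletion (as reported
    for CpG); a ratio above 1 indicates enrichment.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    # Count overlapping occurrences of the doublet.
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == doublet) / (n - 1)
    expected = (seq.count(doublet[0]) / n) * (seq.count(doublet[1]) / n)
    return observed / expected if expected > 0 else 0.0
```

Applied to a genomic fragment, `doublet_ratio(fragment, "CG")` below roughly 0.3 would correspond to the "less than 30% of that expected on a random basis" figure quoted above.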

12.
Profile search methods based on protein domain alignments have proven to be useful tools in comparative sequence analysis. Domain alignments used by currently available search methods have been computed by sequence comparison. With the growth of the protein structure database, however, alignments of many domain pairs have also been computed by structure comparison. Here, we examine the extent to which information from these two sources agrees. We measure agreement with respect to identification of homologous regions in each protein, that is, with respect to the location of domain boundaries. We also measure agreement with respect to identification of homologous residue sites by comparing alignments and assessing the accuracy of the molecular models they predict. We find that domain alignments in publicly available collections based on sequence and structure comparison are largely consistent. However, the homologous regions identified by sequence comparison are often shorter than those identified by 3D structure comparison. In addition, when overall sequence similarity is low, alignments from sequence comparison produce less accurate molecular models, suggesting that they less accurately identify homologous sites. These observations suggest that structure comparison results might be used to improve the overall accuracy of domain alignment collections and the performance of profile search methods based on them.

13.
Insertions and deletions of lengths not divisible by 3 in protein-coding sequences cause frameshifts that usually induce premature stop codons and may carry a high fitness cost. However, this cost can be partially offset by a second compensatory indel restoring the reading frame. The role of such pairs of compensatory frameshifting mutations (pCFMs) in evolution has not been studied systematically. Here, we use whole-genome alignments of protein-coding genes of 100 vertebrate species, and of 122 insect species, studying the prevalence of pCFMs in their divergence. We detect a total of 624 candidate pCFM genes; six of them pass stringent quality filtering, including three human genes: RAB36, ARHGAP6, and NCR3LG1. In some instances, amino acid substitutions closely predating or following pCFMs restored the biochemical similarity of the frameshifted segment to the ancestral amino acid sequence, possibly reducing or negating the fitness cost of the pCFM. Typically, however, the biochemical similarity of the frameshifted sequence to the ancestral one was not higher than the similarity of a random sequence of a protein-coding gene to its frameshifted version, indicating that pCFMs can uncover radically novel regions of protein space. In total, pCFMs represent an appreciable and previously overlooked source of novel variation in amino acid sequences.
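
The arithmetic behind a compensatory pair is simple: each indel shifts the frame on its own (length not divisible by 3), yet the net length change of the pair is a multiple of 3. The sketch below encodes just that rule; the function names and the signed-length convention (positive for insertions, negative for deletions) are assumptions for illustration and say nothing about the detection pipeline used in the study.

```python
def is_frameshift(indel_len):
    """True if an indel of this signed length disrupts the reading frame."""
    return indel_len % 3 != 0


def is_compensatory_pair(first, second):
    """True if two frameshifting indels together restore the reading frame.

    Lengths are signed: positive for insertions, negative for deletions.
    """
    return (
        is_frameshift(first)
        and is_frameshift(second)
        and (first + second) % 3 == 0
    )
```

For example, a 1-bp deletion followed downstream by a 1-bp insertion qualifies, as does a 1-bp insertion followed by a 2-bp insertion; two 3-bp indels do not, because neither causes a frameshift in the first place.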

14.
The evolution of homologous sequences affected by recombination or gene conversion cannot be adequately explained by a single phylogenetic tree. Many tree-based methods for sequence analysis, for example, those used for detecting sites evolving nonneutrally, have been shown to fail if such phylogenetic incongruity is ignored. However, it may be possible to propose several phylogenies that can correctly model the evolution of nonrecombinant fragments. We propose a model-based framework that uses a genetic algorithm to search a multiple-sequence alignment for putative recombination break points, quantifies the level of support for their locations, and identifies sequences or clades involved in putative recombination events. The software implementation can be run quickly and efficiently in a distributed computing environment, and various components of the methods can be chosen for computational expediency or statistical rigor. We evaluate the performance of the new method on simulated alignments and on an array of published benchmark data sets. Finally, we demonstrate that prescreening alignments with our method allows one to analyze recombinant sequences for positive selection.

15.
Garel T, Orland H. Biopolymers 2004, 75(6): 453-467
The Poland-Scheraga (PS) model for the helix-coil transition of DNA considers the statistical mechanics of the binding (or hybridization) of two complementary strands of DNA of equal length, with the restriction that only bases with the same index along the strands are allowed to bind. In this article, we extend this model by relaxing these constraints: We propose a generalization of the PS model that allows for the binding of two strands of unequal lengths N1 and N2 with unrelated sequences. We study in particular (i) the effect of mismatches on the hybridization of complementary strands, (ii) the hybridization of noncomplementary strands (as resulting from point mutations) of unequal lengths N1 and N2. The use of a Fixman-Freire scheme scales down the computational complexity of our algorithm from O(N1^2 N2^2) to O(N1 N2). The simulation of complementary strands of a few kilo base pairs yields results almost identical to the PS model. For short strands of equal or unequal lengths, the binding displays a strong sensitivity to mutations. This model may be relevant to the experimental protocol in DNA microarrays, and more generally to the molecular recognition of DNA fragments. It also provides a physical implementation of sequence alignments.

16.
We conducted a genome-wide analysis of variations in guanine plus cytosine (G+C) content at the third codon position at silent substitution sites of orthologous human and mouse protein-coding nucleotide sequences. Alignments of 3776 human protein-coding DNA sequences with mouse orthologs having >50 synonymous codons were analyzed, and nucleotide substitutions were counted by comparing sequences in the alignments extracted from gap-free regions. The G+C content at silent sites in these pairs of genes showed a strong negative correlation (r = -0.93). Some gene pairs showed significant differences in G+C content at the third codon position at silent substitution sites. For example, human thymine-DNA glycosylase was A+T-rich at the silent substitution sites, while the orthologous mouse sequence was G+C-rich at the corresponding sites. In contrast, human matrix metalloproteinase 23B was G+C-rich at silent substitution sites, while the mouse ortholog was A+T-rich. We discuss possible implications of this significant negative correlation of G+C content at silent sites.
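
The quantity at the centre of this abstract, G+C content at the third codon position, can be sketched in a few lines. The function name `gc3` is hypothetical, and the sketch is a simplification: it uses every complete codon of an in-frame coding sequence, whereas the study restricts the count to silent substitution sites in gap-free alignment regions.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame
    coding sequence. Trailing bases that do not form a complete
    codon are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]  # bases at indices 2, 5, 8, ...
    if not third_positions:
        return 0.0
    return sum(base in "GC" for base in third_positions) / len(third_positions)
```

Computing `gc3` for each human sequence and its mouse ortholog would give the paired values whose correlation the abstract reports, subject to the site filtering noted above.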

17.
Approximately 5% of the human genome consists of segmental duplications that can cause genomic mutations and may play a role in gene innovation. Reticulate evolutionary processes, such as unequal crossing-over and gene conversion, are known to occur within specific duplicon families, but the broader contribution of these processes to the evolution of human duplications remains poorly characterized. Here, we use phylogenetic profiling to analyze multiple alignments of 24 human duplicon families that span >8 Mb of DNA. Our results indicate that none of them are evolving independently, with all alignments showing sharp discontinuities in phylogenetic signal consistent with reticulation. To analyze these results in more detail, we have developed a quartet method that estimates the relative contribution of nucleotide substitution and reticulate processes to sequence evolution. Our data indicate that most of the duplications show a highly significant excess of sites consistent with reticulate evolution, compared with the number expected by nucleotide substitution alone, with 15 of 30 alignments showing a >20-fold excess over that expected. Using permutation tests, we also show that at least 5% of the total sequence shares 100% sequence identity because of reticulation, a figure that includes 74 independent tracts of perfect identity >2 kb in length. Furthermore, analysis of a subset of alignments indicates that the density of reticulation events is as high as 1 every 4 kb. These results indicate that phylogenetic relationships within recently duplicated human DNA can be rapidly disrupted by reticulate evolution. This finding has important implications for efforts to finish the human genome sequence, complicates comparative sequence analysis of duplicon families, and could profoundly influence the tempo of gene-family evolution.

18.
Strong purifying selection in transmission of mammalian mitochondrial DNA   Total citations: 5 (self-citations: 3; citations by others: 2)
There is an intense debate concerning whether selection or demographics has been most important in shaping the sequence variation observed in modern human mitochondrial DNA (mtDNA). Purifying selection is thought to be important in shaping mtDNA sequence evolution, but the strength of this selection has been debated, mainly due to the threshold effect of pathogenic mtDNA mutations and an observed excess of new mtDNA mutations in human population data. We experimentally addressed this issue by studying the maternal transmission of random mtDNA mutations in mtDNA mutator mice expressing a proofreading-deficient mitochondrial DNA polymerase. We report a rapid and strong elimination of nonsynonymous changes in protein-coding genes, the hallmark of purifying selection. There are striking similarities between the mutational patterns in our experimental mouse system and human mtDNA polymorphisms. These data show strong purifying selection against mutations within mtDNA protein-coding genes. To our knowledge, our study presents the first direct experimental observations of the fate of random mtDNA mutations in the mammalian germ line and demonstrates the importance of purifying selection in shaping mitochondrial sequence diversity.

19.
Evolutionary studies commonly model single nucleotide substitutions and assume that they occur as independent draws from a unique probability distribution across the sequence studied. This assumption is violated for protein-coding sequences, and we consider modeling approaches where codon positions (CPs) are treated as separate categories of sites because within each category the assumption is more reasonable. Such "codon-position" models have been shown to explain the evolution of codon data better than homogeneous models in previous studies. This paper examines the ways in which codon-position models outperform homogeneous models and characterizes the differences in estimates of model parameters across CPs. Using the PANDIT database of multiple species DNA sequence alignments, we quantify the differences in the evolutionary processes at the 3 CPs in a systematic and comprehensive manner, characterizing previously undescribed features of protein evolution. We relate our findings to the functional constraints imposed by the genetic code, protein function, and the types of mutation that cause synonymous and nonsynonymous codon changes. The results increase our understanding of selective constraints and could be incorporated into phylogenetic analyses or gene-finding techniques in the future. The methods used are extended to an overlapping reading frame data set, and we discover that overlapping reading frames do not necessarily cause more stringent evolutionary constraints.

20.
This article presents a statistical method for detecting recombination in DNA sequence alignments, which is based on combining two probabilistic graphical models: (1) a taxon graph (phylogenetic tree) representing the relationship between the taxa, and (2) a site graph (hidden Markov model) representing interactions between different sites in the DNA sequence alignments. We adopt a Bayesian approach and sample the parameters of the model from the posterior distribution with Markov chain Monte Carlo, using a Metropolis-Hastings and Gibbs-within-Gibbs scheme. The proposed method is tested on various synthetic and real-world DNA sequence alignments, and we compare its performance with the established detection methods RECPARS, PLATO, and TOPAL, as well as with two alternative parameter estimation schemes.
