Similar articles
1.
We develop a reversible jump Markov chain Monte Carlo approach to estimating the posterior distribution of phylogenies based on aligned DNA/RNA sequences under several hierarchical evolutionary models. Using a proper, yet nontruncated and uninformative prior, we demonstrate the advantages of the Bayesian approach to hypothesis testing and estimation in phylogenetics by comparing different models for the infinitesimal rates of change among nucleotides, for the number of rate classes, and for the relationships among branch lengths. We compare the relative probabilities of these models and the appropriateness of a molecular clock using Bayes factors. Our most general model, first proposed by Tamura and Nei, parameterizes the infinitesimal change probabilities among nucleotides (A, G, C, T/U) into six parameters, consisting of three parameters for the nucleotide stationary distribution, two rate parameters for nucleotide transitions, and another parameter for nucleotide transversions. Nested models include the Hasegawa, Kishino, and Yano model with equal transition rates and the Kimura model with a uniform stationary distribution and equal transition rates. To illustrate our methods, we examine simulated data, 16S rRNA sequences from 15 contemporary eubacteria, halobacteria, eocytes, and eukaryotes, 9 primates, and the entire HIV genome of 11 isolates. We find that the Kimura model is too restrictive, that the Hasegawa, Kishino, and Yano model can be rejected for some data sets, that there is evidence for more than one rate class and a molecular clock among similar taxa, and that a molecular clock can be rejected for more distantly related taxa.  相似文献   
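The six-parameter structure described in this abstract (three free stationary frequencies, two transition rates, one transversion rate) can be sketched as a rate matrix. The following is a minimal illustration only, assuming the common parameterization in which the rate from base i to base j is a class-specific multiplier times the stationary frequency of j; the function name `tn93_rate_matrix` and its arguments are hypothetical and are not taken from the paper, and no overall rate normalization is applied.

```python
# Bases in a fixed order; purines are A and G, pyrimidines are C and T.
BASES = "AGCT"
PURINES = {"A", "G"}


def tn93_rate_matrix(pi, alpha_r, alpha_y, beta):
    """Build a Tamura-Nei style rate matrix Q whose rows sum to zero.

    pi      -- stationary frequencies for A, G, C, T (assumed to sum to 1)
    alpha_r -- multiplier for purine transitions (A <-> G)
    alpha_y -- multiplier for pyrimidine transitions (C <-> T)
    beta    -- multiplier for transversions (purine <-> pyrimidine)
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i == j:
                continue
            x_pur, y_pur = x in PURINES, y in PURINES
            if x_pur and y_pur:
                rate = alpha_r
            elif not x_pur and not y_pur:
                rate = alpha_y
            else:
                rate = beta
            q[i][j] = rate * pi[j]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    return q


# Hypothetical parameter values, chosen only for illustration.
pi = [0.3, 0.2, 0.1, 0.4]
Q = tn93_rate_matrix(pi, alpha_r=2.0, alpha_y=3.0, beta=1.0)
```

Under this sketch the nested models mentioned in the abstract correspond to constraints on the arguments: setting `alpha_r == alpha_y` gives equal transition rates (the Hasegawa, Kishino, and Yano case), and additionally setting every entry of `pi` to 0.25 gives the Kimura case.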

2.
It has been suggested that isochores are maintained by mutation biases, and that this leads to variation in the rate of mutation across the genome. A model of DNA replication is presented in which the probabilities of misincorporation and proofreading are affected by the composition and concentration of the free nucleotide pools. The relationship between sequence G+C content and the mutation rate is investigated. It is found that there is very little variation in the mutation rate between sequences of different G+C contents if the total concentration of the free nucleotides remains constant. However, variation in the mutation rate can be arbitrarily large if some mismatches are proofread and the total concentration of free nucleotides varies. Hence the model suggests that the maintenance of isochores by the replication of DNA in free nucleotide pools of biased composition does not lead per se to mutation rate variance. However, it is possible that changes in composition could be accompanied by changes in concentration, thus generating mutation rate variance. Furthermore, there is the possibility that germ-line selection could lead to alterations in the overall free nucleotide concentration through the cell cycle. These findings are discussed with reference to the variance in mammalian silent substitution rates.  相似文献   

3.
A statistical analysis of the occurrence of particular nucleotide runs (1 to 10 nucleotides long) in DNA sequences of different species has been carried out. There are considerable differences in run distributions among the DNA sequences of prokaryotes, invertebrates and vertebrates. The distribution of various types of runs has been found to differ between coding and non-coding sequences. There is an abundance of short runs, 1 to 2 nucleotides long, in coding sequences, and a deficiency of such runs in the non-coding regions. However, some interesting exceptions to this rule exist: the run distribution of adenine in prokaryotes and the distribution of purine-pyrimidine runs in eukaryotes. This may be explained by the fact that the distribution of runs is predetermined by structural peculiarities of the entire DNA molecule. Runs of guanine or cytosine three to six nucleotides long occur predominantly in the non-coding DNA regions of eukaryotes, especially in vertebrates.

4.
A statistical analysis of the occurrence of particular nucleotide runs in DNA sequences of different species has been carried out. There are considerable differences of run distributions in DNA sequences of procaryotes, invertebrates and vertebrates. There is an abundance of short runs (1-2 nucleotides long) in the coding sequences and there is a deficiency of such runs in the noncoding regions. However, some interesting exceptions from this rule exist for the run distribution of adenine in procaryotes and for the arrangement of purine-pyrimidine runs in eucaryotes. The similarity in the distributions of such runs in the coding and noncoding regions may be due to some structural features of the DNA molecule as a whole. Runs of guanine (or cytosine) of three to six nucleotides occur predominantly in noncoding DNA regions in eucaryotes, especially in vertebrates.  相似文献   
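The run statistics discussed in the two abstracts above can be illustrated with a short counting routine. This is a sketch under stated assumptions, not the authors' procedure: a "run" is taken to be a maximal stretch of identical symbols, the helper names `run_length_counts` and `to_purine_pyrimidine` are hypothetical, and the cap of 10 simply mirrors the run lengths mentioned in the text.

```python
from collections import Counter
from itertools import groupby

# Map each base to its chemical class: R = purine, Y = pyrimidine.
_PUR_PYR = str.maketrans("AGCT", "RRYY")


def to_purine_pyrimidine(seq):
    """Recode a DNA string into purine (R) and pyrimidine (Y) symbols."""
    return seq.upper().translate(_PUR_PYR)


def run_length_counts(seq, max_len=10):
    """Count maximal runs of each symbol, keyed by symbol and run length.

    Runs longer than max_len are ignored.
    """
    counts = {}
    for symbol, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length <= max_len:
            counts.setdefault(symbol, Counter())[length] += 1
    return counts


# Toy sequence, used only to show the shape of the result.
toy = "AAGGGCTTTTA"
base_runs = run_length_counts(toy)
class_runs = run_length_counts(to_purine_pyrimidine(toy))
```

Comparing such tables between annotated coding and non-coding regions would be one way to look for the abundance or deficiency of short runs that the abstracts report.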

5.
A 320 nucleotide repeated DNA sequence within the copia coding element of Drosophila melanogaster has been identified and characterized. This sequence has been localized by DNA-DNA hybridization and electron microscopic analysis of heteroduplexes to the approximate middle of the 5 kb copia coding region. The primary sequence of this repeated DNA has been determined. The sequence is composed of three related subunits, 35-37 nucleotides in length (A, B and C). This 105 nucleotide higher order repeat has apparently been duplicated twice to yield a complex repeated sequence, ABCA'B'C'A"B"C", which exhibits divergence among the individual subunits. This sequence is AT rich, as are the direct terminal repeats which flank the copia coding region, but does not contain any apparent homology with the terminal repeats. This repeated sequence contains three presumptive polyadenylation signals and two 25 nucleotide, imperfectly matched, inverted repeat sequences adjacent to two of the polyadenylation sequences.  相似文献   

6.
The nucleotide sequences of cDNAs for the evolutionarily diverged but highly conserved basal H2A isoprotein, H2A.Z, have been determined for the rat, cow, and human. As a basal histone, H2A.Z is synthesized throughout the cell cycle at a constant rate, unlinked to DNA replication, and at a much lower rate in quiescent cells. Each of the cDNA isolates encodes the entire H2A.Z polypeptide. The human isolate is about 1.0 kilobases long. It contains a coding region of 387 nucleotides flanked by 106 nucleotides of 5'UTR and 376 nucleotides of 3'UTR, which contains a polyadenylation signal followed by a poly A tail. The bovine and rat cDNAs have 97 and 94% nucleotide positional identity to the human cDNA in the coding region and 98% in the proximal 376 nucleotides of the 3'UTR which includes the polyadenylation signal. A potential stem-forming sequence imbedded in a direct repeat is found centered at 261 nucleotides into the 3'UTR. Each of the cDNA clones could be transcribed and translated in vitro to yield H2A.Z protein. The mammalian H2A.Z cDNA coding sequences are approximately 80% similar to those in chicken and 75% to those in sea urchin.  相似文献   

7.
An earlier report (Subramanian, Dhar, and Weissman, 1977c) presented the nucleotide sequence of Eco RII-G fragment of SV40 DNA, which contains the origin of DNA replication. The nucleotide sequence of Eco RII-N fragment located next to Eco RII-G on the physical map of SV40 DNA is presented in this report. Eco RII-N is found to be a tandem duplication of the last 55 nucleotides of Eco RII-G. This tandem repeat is immediately preceded by two other reiterated sequences occurring within Eco RII-G, one of them being a tandem repeat of 21 nucleotides and the other a nontandem repeat of 10 nucleotides. These repetitive sequences occur in close proximity to the origin of DNA replication which is known to contain other specialized sequences such as a few palindromes (one of which is 27 long and possesses a perfect 2-fold axis of symmetry), one "true" palindrome, and a long A/T-rich cluster. The repeats (and the replication origin) occur within an untranslated region of SV40 DNA flanked by (the few) structural genes coding for the "late" proteins on the one side and that (those) coding for the "early" protein(s) on the other side. The reiterated sequences are comparable in some respects to repetitive sequences occurring in eucaryotic DNAs. Possible biological functions of the repeats are discussed.  相似文献   

8.
The Eco RI fragment “b” of chicken DNA (Breathnach, Mandel and Chambon, 1977), which contains the sequences coding for the 5′ quarter of ovalbumin mRNA (ov mRNA), has been isolated by molecular cloning using a “shotgun” approach. Electron microscopy and restriction enzyme analysis have revealed that the sequences coding for the 5′ quarter (~500 nucleotides) of ov mRNA are split into four regions separated by three intervening sequences. The cloning procedure seems to be reliable, since the restriction enzyme pattern of the cloned Eco RI fragment “b” is similar to that of the corresponding chromosomal DNA fragment. There is no evidence supporting the existence of a 150–200 nucleotide long sequence at the 5′ end of the ov mRNA similar to the “leader” sequences found at the 5′ end of some adenovirus and SV40 mRNAs.  相似文献   

9.
In analyzing the silent nucleotide substitutions in some mammalian mitochondrial mRNA coding genes, we had found that the frequency of each of the four nucleotides in rat, mouse, and cow, but not in humans, is the same in the silent third codon position (Lanave C, Preparata G, Saccone C, Serio G (1984) J Mol Evol 20:86-93). Because our findings for these three species were compatible with a stationary Markov process for the evolution of nucleotide sequences, we applied such a model to calculate the effective evolutionary silent substitution rate (vs) and the divergence times among the species. In this paper we have analyzed the first and second codon positions in the same mammalian mitochondrial genes. We found that in the first and second codon positions the human mitochondrial genes satisfy the stationarity conditions. This has allowed us to use the stochastic model mentioned above to calculate the divergence times among mouse, rat, cow, and human. Furthermore, we have analyzed the silent substitution rate in one nuclear gene for these four mammals. We found that in this gene the effective silent substitution rate is about 3 times lower than in mitochondrial genes, and that humans are in this case stationary with respect to the other three mammals in the third codon position as well. Application of our Markov model to this latter gene yields divergence times consistent with our previous determinations.  相似文献   

10.
Recombinant DNA plasmids containing sequences coding for the alpha subunit of the bovine pituitary glycoprotein hormones have been isolated. The nucleotide sequences of three different cDNA clones have been determined. The largest alpha-subunit cDNA clone was found to contain 713 bases including 77 nucleotides from the 5'-untranslated region, 72 nucleotides coding for a precursor segment, 288 nucleotides coding for the mature alpha subunit, and 276 nucleotides from the 3'-untranslated region of the mRNA followed by a poly(A) segment. This cDNA likely represents most of the bovine alpha-subunit mRNA sequence. Nucleotide sequences were obtained from the cDNA inserts of two other alpha-subunit clones, and several differences among the three cDNA sequences have been detected. These differences in nucleotide sequence may represent either individual variation in genomic sequence or cloning artifacts. Comparison of the bovine alpha-subunit cDNA sequence to the sequences of human, rat, and mouse alpha-subunit cDNAs reveals that the bovine sequence has greater than 70% homology with the other cDNAs. The cloned alpha-subunit cDNA should provide a useful probe for further studies of the structure and expression of this interesting gene.  相似文献   

11.
Mitochondrial DNA (mtDNA) sequences are widely used for inferring the phylogenetic relationships among species. Clearly, the assumed model of nucleotide or amino acid substitution used should be as realistic as possible. Dependence among neighboring nucleotides in a codon complicates modeling of nucleotide substitutions in protein-encoding genes. It seems preferable to model amino acid substitution rather than nucleotide substitution. Therefore, we present a transition probability matrix of the general reversible Markov model of amino acid substitution for mtDNA-encoded proteins. The matrix is estimated by the maximum likelihood (ML) method from the complete sequence data of mtDNA from 20 vertebrate species. This matrix represents the substitution pattern of the mtDNA-encoded proteins and shows some differences from the matrix estimated from the nuclear-encoded proteins. The use of this matrix would be recommended in inferring trees from mtDNA-encoded protein sequences by the ML method. Received: 3 May 1995 / Accepted: 31 October 1995  相似文献   

12.
A Markov analysis of DNA sequences   Cited by: 12 (self-citations: 0, citations by others: 12)
We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
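The "correlation question" in this abstract can be made concrete with a small sketch. If a sequence behaved as a first-order Markov chain, the expected count of a triplet xyz would be approximately N(xy) · N(yz) / N(y), where N denotes observed counts. The code below computes that estimate so it can be compared with observed triplet counts. It is an illustration only: the function names are hypothetical, and the finite-length correction that the abstract stresses is not implemented here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping substrings of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_triplets(seq):
    """Triplet counts expected from singlet and doublet counts alone.

    Uses the first-order Markov estimate N(xy) * N(yz) / N(y).
    No finite-length correction is applied.
    """
    singles = kmer_counts(seq, 1)
    doublets = kmer_counts(seq, 2)
    expected = {}
    for xy, n_xy in doublets.items():
        for yz, n_yz in doublets.items():
            if xy[1] == yz[0]:
                expected[xy[0] + yz] = n_xy * n_yz / singles[xy[1]]
    return expected


# Toy periodic sequence, used only to show the comparison.
seq = "ACGT" * 50
observed = kmer_counts(seq, 3)
expected = expected_triplets(seq)
```

Large, systematic gaps between `observed` and `expected` would suggest dependence beyond first order, although for short sequences such gaps can also be statistical noise, which is the difficulty the abstract points out.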

14.
A hidden Markov model that finds genes in E. coli DNA.   Cited by: 12 (self-citations: 1, citations by others: 11)
A Krogh, I S Mian, D Haussler. Nucleic Acids Research, 1994, 22(22): 4768-4778
A hidden Markov model (HMM) has been developed to find protein coding genes in E. coli DNA using E. coli genome DNA sequence from the EcoSeq6 database maintained by Kenn Rudd. This HMM includes states that model the codons and their frequencies in E. coli genes, as well as the patterns found in the intergenic region, including repetitive extragenic palindromic sequences and the Shine-Dalgarno motif. To account for potential sequencing errors and/or frameshifts in raw genomic DNA sequence, it allows for the (very unlikely) possibility of insertions and deletions of individual nucleotides within a codon. The parameters of the HMM are estimated using approximately one million nucleotides of annotated DNA in EcoSeq6, and the model is tested on a disjoint set of contigs containing about 325,000 nucleotides. The HMM finds the exact locations of about 80% of the known E. coli genes, and approximate locations for about 10%. It also finds several potentially new genes, and locates several places where insertion or deletion errors and/or frameshifts may be present in the contigs.

15.
The sequence of 1.6 kb of DNA surrounding the alcohol dehydrogenase (Adh) gene from five species of the Planitibia subgroup of the Hawaiian picture-winged Drosophila, with estimated divergence times of 0.4-5.1 Myr, has been determined. The gene trees which were found by using the sequence divergence from different regions of the sequences are generally in accord with the phylogeny proposed for these species when chromosomal inversions and island of origin are used. One of the species (D. picticornis) appears to be more distant from the other species in this group than they are from a member of the Grimshawi group (D. affinidisjuncta) which is chromosomally more distant. Two of the species (D. differens and D. planitibia) show heterogeneity in the nucleotide changes in the Adh coding region, which is interpreted to be due to a gene conversion or recombination after hybridization between the two species. The minimal rates of nucleotide substitution for synonymous nucleotides and for nontranscribed nucleotides downstream from the coding region are estimated as 1.5 × 10⁻⁸ and 1.1 × 10⁻⁸ substitutions/nucleotide/year, respectively. This rate is two to three times the maximal rate estimated for mammalian synonymous substitutions.

16.
The nucleotide sequence of the mitochondrial DNA (mtDNA) in the region coding for the 3' end of the large rRNA has been determined for two human cell lines bearing independent cytoplasmic chloramphenicol-resistant (CAP-r) mutations. Comparison of the sequences of these two phenotypically different CAP-r mutants with their CAP-sensitive (CAP-s) parental cell lines has revealed a single base change for each in a region which is highly conserved among species. One CAP-r mutation is associated with an A to G transition on the coding strand while the second contains a G to T transversion 52 nucleotides away. Comparable sequence changes in this region had previously been found for mouse and yeast cell mitochondrial CAP-r mutants. Thus, changes in the large rRNA gene eliminate the inhibition of the ribosome by CAP and different nucleotide changes may result in variations in the drug-r phenotype.  相似文献   

17.
A model of hole transfer in DNA molecules has been proposed, which takes into account changes in the reorganization energy and orbital coupling between the neighboring bases during the charge transfer in different molecular sequences. It is shown that the rate of hole transfer by the superexchange and hopping transfer mechanisms is limited by the relaxation of the geometries of nucleobases participating in charge migration and the dynamics of solvent molecules. The rate of charge transfer in the DNA molecule is found to be dependent on the height of the potential barriers between the nucleotide and the molecular sequences. The inclusion of the interchain charge transfer, which is characterized by weak coupling between the nucleotides located in opposite strands, does not affect the general charge transport in DNA. The increase in the number of the parallel components of the hopping mechanism leads to a rise in the charge transfer rate in the double helix.  相似文献   

18.
A simple model is put forward to explain the long-known three-base periodicity in coding DNA. We propose the concept of same-phase triplet clustering, i.e. a condition wherein a triplet appears several times in one phase without interruption by the two other possible phases. For instance, in the sequence (i): NTT_GNN_NTT_GNN_NTT_GNN_NNN_NTT_GNN (where N is any nucleotide but combinations producing TTG are excluded) there would be clustering of same-phase TTG because this triplet appears uninterruptedly in phase 2. In contrast, in the sequence (ii): TTG_NTT_GNN_NNT_TGN_NNN_NTT_GNN there is no same-phase clustering because neighboring TTGs are all in different phases. Observe also that in sequence (i) TTG triplets are separated by 3, 3 and 6 nucleotides (3n distances), while in sequence (ii) they are separated by 1, 4 and 5 nucleotides (non-3n distances). In this work, we demonstrate that in coding DNA the 3n distances generated by (i)-type sequences proportionally outnumber the non-3n distances generated by (ii)-type sequences; this condition would be the basis of three-base periodicity. Randomized sequences also contained (i)- and (ii)-type sequences, but their clustering was statistically different. To prove our model we generated (i)-type sequences in a randomized sequence by inducing clustering of same-phase triplets. In agreement with the model, this sequence displayed three-base periodicity. Furthermore, two- and four-base periodicities could also be induced by artificially inducing clustering of duplets and tetraplets.
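The distances quoted for sequences (i) and (ii) can be reproduced with a few lines of code. The sketch below assumes that "separated by" means the number of nucleotides between the end of one occurrence of the triplet and the start of the next; the function names `triplet_gaps` and `fraction_3n` are hypothetical, and `N` is treated as a literal placeholder character rather than a wildcard.

```python
def triplet_gaps(seq, triplet):
    """Gaps, in nucleotides, between consecutive occurrences of a triplet.

    A gap is measured from the end of one occurrence to the start of the
    next, so overlapping occurrences would yield negative values.
    """
    starts = []
    pos = seq.find(triplet)
    while pos != -1:
        starts.append(pos)
        pos = seq.find(triplet, pos + 1)
    return [b - (a + len(triplet)) for a, b in zip(starts, starts[1:])]


def fraction_3n(gaps):
    """Share of gaps that are multiples of three (0.0 if there are none)."""
    if not gaps:
        return 0.0
    return sum(1 for g in gaps if g % 3 == 0) / len(gaps)


# The two example sequences from the abstract, with phase separators removed.
seq_i = "NTT_GNN_NTT_GNN_NTT_GNN_NNN_NTT_GNN".replace("_", "")
seq_ii = "TTG_NTT_GNN_NNT_TGN_NNN_NTT_GNN".replace("_", "")

gaps_i = triplet_gaps(seq_i, "TTG")    # [3, 3, 6]
gaps_ii = triplet_gaps(seq_ii, "TTG")  # [1, 4, 5]
```

Applied to every triplet in a real coding sequence and in a shuffled copy of it, `fraction_3n` would give one rough way to compare the proportion of 3n distances between the two.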

19.
20.
This paper analyzes the nucleotide sequences of three viruses: Kunjin, West Nile, and yellow fever. Each virus has one long open reading frame of greater than 10,200 nucleotides that codes for four structural and seven nonstructural genes. The Kunjin and West Nile viruses are the most closely related pair, when assessed on the basis of matches between their nucleotide sequences. As would be expected, the matching is least for bases at third-position codon sites and is greatest for second-position sites. Statistics are presented for the numbers of mismatches that are transitions or transversions. Nucleotide base usage is also reported. To each of the 33 virus-gene segments, nonhomogeneous Markov chain models have been fitted to describe the sequences of nucleotide bases. The models allow for different transition probabilities ("transition" is used in the mathematical sense here) and for different degrees of dependency, at the three sites in the codons. Reasonably satisfactory fits can be obtained for many of the genes by using models that are first order for both first- and second-position sites in the codon but that are second order for third-position sites. One consequence of such a model is that the correlation between one amino acid and the next is limited to the correlation of the last base of the former with the first base of the latter. Other consequences are that the model can (and does) prohibit the occurrence of stop codons within a gene and that subsequences of only first-position bases, or only third-position bases, are also first-order Markov chains. In theory, second-position subsequences may not be Markov chains at all. In practice, the data suggest that each of these subsequences is effectively a zero-order Markov chain, i.e., bases spaced three apart are statistically independent. Stationarity of nucleotide base distributions can be interpreted in either of two ways: (1) spatially along the sites or (2) temporally at each site.
These interpretations must often be inconsistent, when the former allows for Markov dependence between adjacent sites whereas the latter assumes independence between sites. The inconsistency can be overcome, for these viruses, if subsequences at different codon positions are analyzed separately.
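The per-codon-position match and mismatch statistics mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes two gap-free sequences of equal length that are already aligned and start at the first base of a codon, it assumes only the letters A, C, G and T occur (U would need recoding to T first), and the function name `classify_mismatches` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_mismatches(seq_a, seq_b):
    """Tally matches, transitions and transversions per codon position.

    Returns a list of three dicts, one for codon positions 1, 2 and 3.
    "Transition" here has its biological sense (purine <-> purine or
    pyrimidine <-> pyrimidine), not the Markov-chain sense.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    counts = [
        {"match": 0, "transition": 0, "transversion": 0} for _ in range(3)
    ]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        site = counts[i % 3]
        if a == b:
            site["match"] += 1
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            site["transition"] += 1
        else:
            # Any other pair is counted as a transversion; this is only
            # meaningful under the A/C/G/T-only assumption stated above.
            site["transversion"] += 1
    return counts


# Toy pair of two-codon sequences, for illustration only.
result = classify_mismatches("ATGGCA", "ATAGCT")
```

Keeping the three codon positions in separate tallies follows the abstract's closing remark that subsequences at different codon positions should be analyzed separately.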
