A Markov analysis of DNA sequences   Cited by: 12 (self-citations: 0; citations by others: 12)
We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
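The "correlation question" in this abstract can be illustrated with a small sketch. It assumes the standard first-order Markov estimate N(xyz) ≈ N(xy)·N(yz)/N(y) for expected triplet counts. The function names and the toy sequence are hypothetical, and the finite-length correction that the abstract emphasises is not included, so this is not the authors' method.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def expected_triplets(seq):
    """Expected triplet counts if seq were a first-order Markov chain.

    Uses the estimate N(xyz) ~ N(xy) * N(yz) / N(y); no finite-length correction.
    """
    mono = kmer_counts(seq, 1)
    di = kmer_counts(seq, 2)
    expected = {}
    for xy, n_xy in di.items():
        for yz, n_yz in di.items():
            if xy[1] == yz[0]:  # the two doublets must share the middle base
                expected[xy[0] + yz] = n_xy * n_yz / mono[xy[1]]
    return expected

seq = "ACGTACGTTAGCCGATAGCTAGCTAACG"  # toy sequence, not real data
observed = kmer_counts(seq, 3)
expected = expected_triplets(seq)
for triplet in sorted(expected):
    print(triplet, observed.get(triplet, 0), round(expected[triplet], 2))
```

Large gaps between the observed and expected columns would hint at order higher than one, but on short sequences such gaps may simply be statistical noise, which is the difficulty the abstract points out.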

Recognition of coding regions within eukaryotic genomes is one of the oldest yet still unsolved problems of bioinformatics. New high-accuracy methods of splice site recognition are needed to solve this problem. A question of current interest is how to identify specific features of nucleotide sequences near splice sites and to recognize sites in their sequence context. We performed a statistical analysis of a database of human gene fragments and revealed some characteristics of nucleotide sequences in the neighborhood of splice sites. Frequencies of all nucleotides and dinucleotides in the vicinity of splice sites were computed, and nucleotides and dinucleotides with extremely high or low occurrences were identified. The statistical information obtained in this work can be used in further development of methods for splice site annotation and exon-intron structure recognition.

Computer analysis of DNA and protein sequences.   Cited by: 2 (self-citations: 0; citations by others: 2)
Some recent trends in the development of theoretical methods for DNA and protein sequence analysis are reviewed, with particular emphasis on the design of new databases, motif searches, sequence alignment algorithms and applications of neural networks.

Markov analysis of viral DNA/RNA sequences   Cited by: 1 (self-citations: 0; citations by others: 1)
This work applies a previously published method for determining the order of a Markov chain to the DNA/RNA sequences of φX174, SV40 and MS2. In the first two cases rather long-range order is found (third and second order, respectively), but zero order is the appropriate model for MS2. These results point to some inadequacies in previous informational calculations on virus genomes and lead to insights on features such as gene overlap and secondary structure.

A clustering method for repeat analysis in DNA sequences   Cited by: 1 (self-citations: 0; citations by others: 1)
Volfovsky N, Haas BJ, Salzberg SL. Genome Biology, 2001, 2(8): research0027.1-research0027.11


A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats.


The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi fasta), that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences.


We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
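To make "exact repeats" concrete, here is a minimal sketch. It deliberately swaps in a simpler technique, a hash index of fixed-length words, in place of the suffix tree that the abstract describes, and it does no clustering. The function name and the parameter `k` are hypothetical; this is not RepeatFinder.

```python
from collections import defaultdict

def exact_repeats(seq, k):
    """Map each k-mer that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    # Keep only the words seen at two or more positions.
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}

print(exact_repeats("ACGTTACGTA", 4))
```

A suffix tree finds repeats of every length in one pass, whereas this index only reports repeats of exactly `k` bases, which is why it should be read as an illustration of the output rather than of the data structure.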

Here we propose a weighted measure for the similarity analysis of DNA sequences. It is based on LZ complexity and the (0,1) characteristic sequences of DNA sequences. This weighted measure enables biologists to extract similarity information from biological sequences according to their requirements. For example, with this weighted measure one can obtain either the full similarity information or a similarity analysis from a given biological aspect. Moreover, the length of the DNA sequences is not a limitation. The application of the weighted measure to the similarity analysis of β-globin genes from nine species shows its flexibility.
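The two ingredients named in this abstract can be sketched as follows. The code counts the components of the Lempel-Ziv (1976) exhaustive history and builds (0,1) characteristic sequences from three common two-class groupings of the bases. The groupings, the choice of which class maps to 1, and all names are assumptions; the abstract does not give the weighting formula, so it is not reproduced here.

```python
def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

# Assumed groupings: the bases listed here are encoded as 1, the others as 0.
GROUPS = {
    "purine/pyrimidine": "AG",
    "amino/keto": "AC",
    "weak/strong": "AT",
}

def characteristic_sequence(dna, ones):
    """Encode dna as a (0,1) string: 1 for bases in `ones`, 0 otherwise."""
    return "".join("1" if base in ones else "0" for base in dna)

dna = "ATGGTGCACCTGACTCCTGAGGAG"  # toy sequence for illustration only
for name, ones in GROUPS.items():
    binary = characteristic_sequence(dna, ones)
    print(name, binary, lz_complexity(binary))
```

A similarity measure in this spirit would combine the complexities of the three encodings with user-chosen weights, so that a single biological aspect (for example, purine versus pyrimidine content) can be emphasised or ignored.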

A set of six cloned barley (Hordeum vulgare) repetitive DNA sequences was used for the analysis of phylogenetic relationships among 31 species (46 taxa) of the genus Hordeum, using molecular hybridization techniques. In situ hybridization experiments showed dispersed organization of the sequences over all chromosomes of H. vulgare and the wild barley species H. bulbosum, H. marinum and H. murinum. Southern blot hybridization revealed different levels of polymorphism among barley species, and the RFLP data were used to generate a phylogenetic tree for the genus Hordeum. Our data are in good agreement with the classification system which suggests the division of the genus into four major groups, containing the genomes I, X, Y, and H. However, our investigation also supports previous molecular studies of barley species where the unique position of H. bulbosum has been pointed out. In our experiments, H. bulbosum generally had hybridization patterns different from those of H. vulgare, although both carry the I genome. Based on our results we present a hypothesis concerning the possible origin and phylogeny of the polyploid barley species H. secalinum, H. depressum and the H. brachyantherum complex.

Compilation and analysis of Escherichia coli promoter DNA sequences.   Cited by: 602 (self-citations: 130; citations by others: 472)
The DNA sequences of 168 promoter regions (-50 to +10) for Escherichia coli RNA polymerase were compiled. The complete listing was divided into two groups depending upon whether or not the promoter had been defined by genetic (promoter mutations) or biochemical (5' end determination) criteria. A consensus promoter sequence based on homologies among 112 well-defined promoters was determined that was in substantial agreement with previous compilations. In addition, we have tabulated 98 promoter mutations. Nearly all of the altered base pairs in the mutants conform to the following general rule: down-mutations decrease homology and up-mutations increase homology to the consensus sequence.
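The idea of a consensus sequence and of homology to it can be sketched in a few lines. The example assumes already aligned, equal-length sequences and takes the most frequent base in each column; the four hexamers are toy data rather than entries from the compilation, and the function names are hypothetical.

```python
from collections import Counter

def consensus(aligned):
    """Most frequent base at each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def homology(seq, cons):
    """Number of positions at which seq matches the consensus."""
    return sum(a == b for a, b in zip(seq, cons))

toy_hexamers = ["TATAAT", "TATGAT", "TACAAT", "GATACT"]  # illustrative only
cons = consensus(toy_hexamers)
print(cons)
for h in toy_hexamers:
    print(h, homology(h, cons))
```

Under the rule stated in the abstract, a mutation that lowers this match count would be expected to weaken the promoter, and one that raises it would be expected to strengthen it.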

Prediction of gene sequences and their exon-intron structure in large eukaryotic genomic sequences is one of the central problems of mathematical biology. Solving this problem involves, in particular, high-accuracy splice site recognition. Using statistical analysis of a splice site-containing human gene fragment database, some characteristic features were described for nucleotide sequences in the splicing site neighborhood, the frequencies of all nucleotides and dinucleotides were determined, and those with frequencies increased or decreased in comparison to a random sequence were identified. The results can be used in sequence annotation, splicing site prediction, and the recognition of the gene exon-intron structure.

Simple repetitive DNA sequences from primates: Compilation and analysis   Cited by: 25 (self-citations: 0; citations by others: 25)
Simple repeats composed of tandemly repeated units 1–6 nucleotides (nt) long have been extracted from a selected set of primate genomic DNA sequences. Of the 501 theoretically possible, different types of repeats only 67 were present in the analyzed database in at least two different size ranges over 12 nt. They include all simple repeats known to be polymorphic in the primate genome. A list of moderately expanding and nonexpanding oligonucleotide patterns has also been included. Furthermore, we have compiled statistical data with emphasis on the overall variability of the most abundant 67 types of repeats. We have demonstrated that the expandability of at least some simple repeats may be affected by the overall base composition and by flanking sequences. In particular, the occurrence of tandemly repeated CAG and GCC triplets in exons positively correlates with their G+C content. We also noted that in the vicinity of Alu sequences tetrameric repeats are more abundant than in the total genomic DNA. This paper can be used as a comprehensive guide in identification of the most abundant and potentially polymorphic simple repeats. It is also of broader significance as a step toward understanding the contribution of flanking sequences and the overall sequence composition to variability of simple repeats. Correspondence to: J. Jurka

To study the phylogenetics of sugarcane (Saccharum officinarum L.) and its relatives we sequenced four loci on cytoplasmic genomes (two chloroplast and two mitochondrial) and analyzed mitochondrial RFLPs generated using probes for COXI, COXII, COXIII, Cob, 18S+5S, 26S, ATPase 6, ATPase 9, and ATPase (D'Hont et al. 1993). Approximately 650 bp of DNA in the intergenic spacer region between rbcL and atpB and approximately 150 bp from the chloroplast 16S rDNA through the intergenic spacer region tRNAval gene were sequenced. In the mitochondrial genome, part of the 18S rRNA gene and approximately 150 bp from the 18S gene 3 end, through an intergenic spacer region, to the 5S rRNA gene were sequenced. No polymorphisms were observed between maize, sorghum, and Saccharum complex members for the mitochondrial 18S internal region or for the intergenic tRNAval chloroplast locus. Two polymorphisms (insertion-deletion events, indels) were observed within the 18S-5S mitochondrial locus, which separated the accessions into three groups: one containing all of the Erianthus, Eccoilopus, Imperata, Sorghum, and 1 Miscanthus species; a second containing Saccharum species, Narenga porphyrocoma, Sclerostachya fusca, and 1 presumably hybrid Miscanthus sp. from New Guinea; and a third containing maize. Eighteen accessions were sequenced for the intergenic region between rbcL and atpB, which was the most polymorphic of the regions studied and contained 52 site mutations and 52 indels, across all taxa. Within the Saccharum complex, at most 7 site mutations and 16 indels were informative. The maternal lineage of Erianthus/Eccoilopus was nearly as divergent from the remaining Saccharum complex members as it was from sorghum, in agreement with a previous study. Sequences from the rbcL-atpB spacer were aligned with GENBANK sequences for wheat, rice, barley, and maize, which were used as outgroups in phylogenetic analyses. 
To determine whether limited intra-complex variability was caused by undersampling of taxa, we used seven restriction enzymes to digest the PCR-amplified rbcL-atpB spacer of an additional 36 accessions within the Saccharum complex. This analysis revealed ten restriction sites (none informative) and eight length variants (four informative). The small amount of variation present in the organellar DNAs of this polyploid complex suggests that either the complex is very young or that rates of evolution between the Saccharum complex and outgroup taxa are different. Other phylogenetic information will be required to resolve systematic relationships within the complex. Finally, no variation was observed in commercial sugarcane varieties, implying a world-wide cytoplasmic monoculture for this crop.

Reliable quantification by PCR requires careful experimental design and conditions, often involving sampling of the PCR reactions at different time points or amplifying multiple dilutions of a standard DNA. We describe here an accurate, quantitative and easily automatable solid-phase method based on competitive PCR. The PCR products are analyzed by solid-phase minisequencing after capture of biotinylated PCR products in streptavidin-coated microtiter wells and single-nucleotide extension of a specific detection primer by a radioactively labelled nucleotide. The results are expressed as numeric cpm values, and the incorporated label expresses the relative amount of sequence variants in the original template mixture. We have applied the method to determination of allele frequencies in pooled DNA samples, of mitochondrial heteroplasmy, of gene copy numbers, and to forensic DNA analysis.

A statistical analysis of the occurrence of particular nucleotide runs in DNA sequences of different species has been carried out. There are considerable differences of run distributions in DNA sequences of procaryotes, invertebrates and vertebrates. There is an abundance of short runs (1-2 nucleotides long) in the coding sequences and there is a deficiency of such runs in the noncoding regions. However, some interesting exceptions from this rule exist for the run distribution of adenine in procaryotes and for the arrangement of purine-pyrimidine runs in eucaryotes. The similarity in the distributions of such runs in the coding and noncoding regions may be due to some structural features of the DNA molecule as a whole. Runs of guanine (or cytosine) of three to six nucleotides occur predominantly in noncoding DNA regions in eucaryotes, especially in vertebrates.
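A run-length tabulation of the kind this abstract describes can be sketched as below. The example counts maximal runs of identical symbols, both for the four bases and for a purine (R) / pyrimidine (Y) recoding. The function names and the toy sequence are hypothetical, and the abstract's statistical comparison between coding and noncoding regions is not attempted.

```python
from collections import Counter
from itertools import groupby

def run_length_distribution(seq):
    """Count maximal runs of identical symbols, keyed by (symbol, run length)."""
    return Counter((base, len(list(group))) for base, group in groupby(seq))

def purine_pyrimidine(seq):
    """Recode A/G as R (purine) and every other base as Y (pyrimidine)."""
    return "".join("R" if base in "AG" else "Y" for base in seq)

seq = "AAGGGCATTTTGC"  # toy sequence, not real data
print(run_length_distribution(seq))
print(run_length_distribution(purine_pyrimidine(seq)))
```

Comparing such counters for coding and noncoding regions, normalised by region length, would be one simple way to look for the abundance or deficiency of short runs that the abstract reports.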

Sequences of the cohesive ends and the 3'-terminal regions of phi80 DNA have been determined. Sequences of the cohesive ends were obtained through the use of two standard methods. The first method involved the incorporation of all four labeled deoxyribonucleotides into the phi80 cohesive ends using DNA polymerase I. The DNA was then partially digested with micrococcal nuclease or pancreatic DNase. The products were separated by two-dimensional electrophoresis and characterized by composition, 3'-terminal, and nearest neighbor analyses. The second method involved partial incorporation using one, two, or three labeled deoxyribonucleotides followed by similar analyses. Sequences of the double-stranded regions adjacent to the cohesive ends were determined by three new methods. These methods were: (a) The DNA was specifically labeled at the 3' terminus and then partially degraded. Labeled oligonucleotide products were sequenced by their mobilities on various separation systems. (b) The cohesive ends were enlarged by limited degradation with exonuclease III. After this treatment, the DNA was partially repaired with labeled nucleotides and digested, and the products were analyzed. (c) A synthetic oligonucleotide primer was bound to phi80 DNA which had been repaired with DNA polymerase I and then partially digested with lambda-exonuclease. The primer was extended into the region of interest by partial repair with labeled nucleotides. The extended primer was isolated and analyzed.
