Most gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which require the optimization of a large number of parameters. This paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are its non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful for further analyses of the genomic sequences themselves. The proposed approach has been applied to some complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, revealing that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.
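The abstract above does not define its compression index, so the following is only a rough sketch of the general idea, assuming a general-purpose compressor (zlib) as a stand-in for the dictionary-based scheme. The names `compression_index` and `sliding_indices`, and the window and step sizes, are hypothetical and not taken from the paper.

```python
import zlib


def compression_index(seq: str, level: int = 9) -> float:
    """Compressed size divided by raw size (assumed stand-in index).

    Lower values mean the sequence is more compressible, i.e. more
    repetitive or structured. This is NOT the paper's definition.
    """
    raw = seq.upper().encode("ascii")
    if not raw:
        raise ValueError("empty sequence")
    return len(zlib.compress(raw, level)) / len(raw)


def sliding_indices(seq: str, window: int = 300, step: int = 100):
    """Return (start, index) pairs for overlapping windows along seq."""
    return [
        (start, compression_index(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

A classifier in this spirit would compare each window's index with a cutoff chosen on annotated data; how the original method sets that cutoff is not stated in the abstract.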

Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.

Short, interspersed repetitive DNA sequences in prokaryotic genomes. (Total citations: 42; self-citations: 2; citations by others: 40)

CONSORF is a fully automatic high-accuracy identification system that provides consensus prokaryotic CDS information. It first predicts the CDSs supported by consensus alignments. The alignments are derived from multiple genome-to-proteome comparisons with other prokaryotes using the FASTX program. Then, it fills the empty genomic regions with the CDSs supported by consensus ab initio predictions. From those consensus results, CONSORF provides prediction reliability scores, predicted frame-shifts, alternative start sites and best pair-wise match information against other prokaryotes. These results are easily accessed from a website.

Protein coding regions of a genome fragment can be mathematically predicted by studying variations in the statistical properties or by searching for the signals characteristic of the junctions between the coding and non-coding regions. We propose here a new statistical method using correspondence analysis. This method does not use any reference codon set but takes into account the codon usage homogeneity along the studied genome fragment. Comparison with previously published methods, especially the ‘codon usage method’ of Staden, has been made, and two examples are presented here. Applications to analysis of prokaryotic operon and eukaryotic split genes are also discussed. Use of the method has also shown two structures not previously described: i) in the human prt gene, a strong triplet structure exists in a non-coding region; ii) in the human tp-a codon usage is not uniform between the different exons. Received on September 25, 1986

A coding sequence is defined as a DNA sequence coding the primary structure of a protein (a polypeptide). Such a sequence must satisfy a specific constraint, which consists in coding a functional protein. As the genetic code is degenerate, there exists, for a given polypeptide, a set of synonymous sequences which would code the same polypeptide. Translation conditional models are defined on such sets. The aim of this paper is to give a common formalism. Besides the codon bias model, a few other conditional models will be defined. Statistical estimators and comparison methods will be briefly presented. These models can be used for gene classification, or to identify remarkable features in a real sequence. An example will be presented on Escherichia coli genes.

M Levine  G M Rubin  R Tjian 《Cell》1984,38(3):667-673
Several human DNA sequences were isolated by virtue of homology to a highly conserved region that has been identified in a number of homeotic genes in Drosophila. Structural analysis of the human DNAs indicates that two separate and distinct regions sharing a high degree of homology with the homeo box sequences of Drosophila are separated by only 5 kb in the human genome. Sequence determination of these regions reveals that both human DNA sequences contain a region capable of coding 61 amino acids, which shares greater than 90% homology with the peptide sequences specified by the homeo box domain of Drosophila homeotic genes, Antennapedia, fushi tarazu, and Ultrabithorax. By contrast, the human DNA sequences lying outside of the 190 nucleotide homeo box region share virtually no sequence homology, either with the flanking sequences of the other human clones or with flanking regions of the known Drosophila homeotic genes.

An automated algorithm is presented that delineates protein sequence fragments which display similarity. The method incorporates a selection of a number of local nonoverlapping sequence alignments with the highest similarity scores and a graph-theoretical approach to elucidate the consistent start and end points of the fragments comprising one or more ensembles of related subsequences. The procedure allows the simultaneous identification of different types of repeats within one sequence. A multiple alignment of the resulting fragments is performed and a consensus sequence derived from the ensemble(s). Finally, a profile is constructed from the multiple alignment to detect possible and more distant members within the sequence. The method tolerates mutations in the repeats as well as insertions and deletions. The sequence spans between the various repeats or repeat clusters may be of different lengths. The technique has been applied to a number of proteins where the repeating fragments have been derived from information additional to the protein sequences. © 1993 Wiley-Liss, Inc.

We develop a quantitative method for analyzing repetitions of identical short oligomers in coding and noncoding DNA sequences. We analyze sequences presently available in the GenBank separately for primate, mammal, vertebrate, rodent, invertebrate and plant taxonomic partitions. We find that some oligomers "cluster" more than they would if randomly distributed, while other oligomers "repel" each other. To quantify this degree of clustering, we define clustering measures. We find that (i) clustering significantly differs in coding and noncoding DNA; (ii) in most cases, monomers, dimers and tetramers cluster in noncoding DNA but appear to repel each other in coding DNA; (iii) the degree of clustering for different sources (primates, invertebrates, and plants) is more conserved among these sources in the case of coding DNA than in the case of noncoding DNA; (iv) in contrast to other oligomers, trimers always prefer to cluster; (v) clustering of each particular oligomer is conserved within the same organism.

A procedure for the de novo construction of nucleosome core particles from defined DNA sequences of prokaryotic origin is described. Efficient de novo reconstitution without added carrier DNA is demonstrated. DNase I and exonuclease III analysis of a nucleosome core prepared from a 154 base pair fragment extending from base 853 to base 1006 of pBR322 indicates a non-random positioning of the histone core along the DNA. As bacteria have no histones, their DNA cannot be expected to have a histone core positioning signal encoded in it; the efficient formation of a uniquely positioned core particle is therefore not self-evident. The possibility that a phosphate end group positions DNA fragments on the histone is considered. The de novo reconstitution of carrier-less defined nucleosome core particles should facilitate the physicochemical study of nucleosomes on the fine structural level.

Mishra P  Pandey PN 《Bioinformation》2011,6(10):372-374
The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, UniProt, PIR and others, but the structures of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph-theoretic techniques for clustering amino acid sequences. A similarity graph is defined, and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks to group amino acid sequences into subsets based on the distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity: sequences in the same cluster are highly similar to each other; and separation: sequences in different clusters have low similarity to each other. We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs.
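The abstract above says only that clusters correspond to connected subgraphs of a similarity graph. One simple reading, sketched here under that assumption, links two sequences when their pairwise score reaches a threshold and reports the connected components. The function name, the score dictionary layout, and the threshold are all hypothetical; the authors' actual procedure may differ.

```python
from collections import defaultdict


def cluster_by_similarity(names, scores, threshold):
    """Group names into connected components of a thresholded graph.

    names:     list of sequence identifiers
    scores:    dict mapping (name_a, name_b) -> similarity score
    threshold: pairs scoring at or above this value are linked
    """
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())
```

Connected components favour the separation criterion over homogeneity: a chain of moderately similar pairs can merge sequences that are not directly similar, so the threshold choice matters.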

The isolation and partial characterization of two cloned segments of Drosophila melanogaster DNA containing "heat shock" gene sequences are described. We have inserted sheared embryonic D. melanogaster DNA by the poly(dA-dT) connector method (Lobban and Kaiser, 1973) into the R1 restriction site of the ampicillin-resistant plasmid pSF2124 (So, Gill and Falkow, 1975). A collection of independent hybrid plasmids was screened by colony hybridization (Grunstein and Hogness, 1975) for sequences complementary to in vitro labeled polysomal poly(A)+ heat shock RNA. Two clones were identified which contain sequences complementary to a heat shock mRNA species that directs the in vitro synthesis of the 70,000 dalton heat-induced polypeptide. Both cloned segments hybridize in situ to the heat-induced puff sites located at 87A and 87C of the salivary gland polytene chromosomes.

This paper describes a computer method that uses codon preference to help find protein coding regions in long DNA sequences. The method can distinguish between introns and exons and can help to detect sequencing errors.

The random-breakage mapping method [Game et al. (1990) Nucleic Acids Res., 18, 4453-4461] was applied to DNA sequences in human fibroblasts. The methodology involves NotI restriction endonuclease digestion of DNA from irradiated cells, followed by pulsed-field gel electrophoresis, Southern blotting and hybridization with DNA probes recognizing the single copy sequences of interest. The Southern blots show a band for the unbroken restriction fragments and a smear below this band due to radiation-induced random breaks. This smear pattern contains two discontinuities in intensity at positions that correspond to the distance of the hybridization site to each end of the restriction fragment. By analyzing the positions of those discontinuities we confirmed the previously mapped position of the probe DXS1327 within a NotI fragment on the X chromosome, thus demonstrating the validity of the technique. We were also able to position the probes D21S1 and D21S15 with respect to the ends of their corresponding NotI fragments on chromosome 21. A third chromosome 21 probe, D21S11, has previously been reported to be close to D21S1, although an uncertainty about a second possible location existed. Since both probes D21S1 and D21S11 hybridized to a single NotI fragment and yielded a similar smear pattern, this uncertainty is removed by the random-breakage mapping method.

H Komori  F Matsunaga  Y Higuchi  M Ishiai  C Wada  K Miki 《The EMBO journal》1999,18(17):4597-4607
The initiator protein (RepE) of F factor, a plasmid involved in sexual conjugation in Escherichia coli, has dual functions during the initiation of DNA replication which are determined by whether it exists as a dimer or as a monomer. A RepE monomer functions as a replication initiator, but a RepE dimer functions as an autogenous repressor. We have solved the crystal structure of the RepE monomer bound to an iteron DNA sequence of the replication origin of plasmid F. The RepE monomer consists of topologically similar N- and C-terminal domains related to each other by internal pseudo 2-fold symmetry, despite the lack of amino acid similarities between the domains. Both domains bind to the two major grooves of the iteron (19 bp) with different binding affinities. The C-terminal domain plays the leading role in this binding, while the N-terminal domain has an additional role in RepE dimerization. The structure also suggests that superhelical DNA induced at the origin of plasmid F by four RepEs and one HU dimer has an essential role in the initiation of DNA replication.

We propose a new approach to study protein coding and non-coding regions in DNA sequences, by making use of two complementary statistical methods. Principal component analysis (PCA) is a graphical method for representing DNA sequences that are characterized by some quantitative parameters; it is an aid to intuition. Discriminant analysis (DA) is a quantitative method that allows the DNA sequences to be classified. It leads to an evaluation of the first method and to a decision. The value of this approach has been confirmed, since we have also found some results which had been described recently in the literature. Furthermore, this general methodology has permitted us to show the existence of parameters which identify the nucleic acid sequence functional domains, without having to make use of the properties of the genetic code.

Asparagine-linked protein glycosylation is a prevalent protein modification reaction in eukaryotic systems. This process involves the co-translational transfer of a pre-assembled tetradecasaccharide from a dolichyl-pyrophosphate donor to the asparagine side chain of nascent proteins at the endoplasmic reticulum (ER) membrane. Recently, the first such system of N-linked glycosylation was discovered in the Gram-negative bacterium, Campylobacter jejuni. Glycosylation in this organism involves the transfer of a heptasaccharide from an undecaprenyl-pyrophosphate donor to the asparagine side chain of proteins at the bacterial periplasmic membrane. Here we provide a detailed comparison of the machinery involved in the N-linked glycosylation systems of eukaryotic organisms, exemplified by the yeast Saccharomyces cerevisiae, with that of the bacterial system in C. jejuni. The two systems display significant similarities and the relative simplicity of the bacterial glycosylation process could provide a model system that can be used to decipher the complex eukaryotic glycosylation machinery.

The entropies of protein coding genes from Escherichia coli were calculated according to Boltzmann's formula. Entropies of the coding regions were compared to the entropies of noncoding or miscoding ones. With nucleotides as code units, the entropies of the coding regions, when compared to the entropies of complete sequences (leader and coding region as well as trailer), were seen to be lower but with a marginal statistical significance. With triplets of nucleotides as code units, the entropies of correct reading frames were significantly lower than the entropies of frameshifts +1 and -1. With amino acids as code units, the results were opposite: Biologically functional proteins had significantly higher entropies than proteins translated from the frameshifted sequences. We attempt to explain this paradox with the hypothesis that the genetic code may have the ability of lowering information content (increasing entropy) of proteins while translating them from DNA. This ability might be beneficial to bacteria because it would make the functional proteins more probable (having a higher entropy) than nonfunctional proteins translated from frameshifted sequences.
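To make the comparison of code units concrete, here is a minimal sketch that computes the entropy of a sequence with nucleotides or with triplets as units, for any reading frame. It assumes the Shannon entropy of observed unit frequencies (in bits) as a stand-in, since the abstract does not spell out how Boltzmann's formula was applied; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(units) -> float:
    """Shannon entropy in bits of the empirical unit frequencies."""
    counts = Counter(units)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def nucleotide_entropy(seq: str) -> float:
    """Entropy with single nucleotides as code units."""
    return shannon_entropy(seq.upper())


def codon_entropy(seq: str, frame: int = 0) -> float:
    """Entropy with nucleotide triplets as code units.

    frame 0 is taken as the annotated reading frame; frames 1 and 2
    correspond to the two frameshifted readings of the same strand.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return shannon_entropy(codons)
```

Calling `codon_entropy(gene, 0)`, `codon_entropy(gene, 1)` and `codon_entropy(gene, 2)` on the same gene gives the three values one would compare; short sequences yield noisy estimates, so any comparison would need many genes.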

