Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
F Rodier  J Sallantin 《Biochimie》1985,67(5):533-539
Learning processes are applied to the recognition of protein coding regions in prokaryotes. Non-contradictory, statistical and logical rules are deduced from a set of known examples of coding sequences. These rules make it possible to build characteristic patterns on the mRNA upstream of the initiating codon. The rules are applied successfully to recognize more than 180 coding sequences and to detect and/or eliminate hypothetical reading frames or unknown genes.

2.
Most of the gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which imply the optimization of a high number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are the non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful to perform further analyses on the genomic sequences themselves. The proposed approach has been applied to some complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, which have revealed that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.
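The abstract does not define its compression index, so the following is only a minimal sketch of the general idea that compressibility can separate structured from quasi-random sequence. It assumes the zlib compressed-to-raw size ratio as a stand-in; the function name `compression_ratio` is hypothetical and is not the authors' method.

```python
import zlib


def compression_ratio(seq: str) -> float:
    """Return compressed size / raw size for a DNA sequence.

    Assumption: zlib (DEFLATE) is used here purely as an illustrative
    compressor; the paper's own compression index is not specified
    in the abstract.
    """
    data = seq.upper().encode("ascii")
    if not data:
        raise ValueError("sequence must not be empty")
    # Level 9 asks zlib for its strongest compression.
    return len(zlib.compress(data, 9)) / len(data)
```

A lower ratio means the sequence is more repetitive, or more "structured" in the loose sense used above. A real classifier would still need a threshold or a comparison against reference regions, which this sketch does not attempt.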

3.
There are no well-known properties in regulatory DNA analogous to those in coding sequences; their spatial location is not regular, the consensus regulatory elements are often degenerate and there are no understandable rules governing their evolution. This makes it difficult to recognize regulatory regions within a genome. We review developments in the statistical characterization of regulatory regions and methods of their recognition in eukaryotic genomes.

4.
H Grosjean  W Fiers 《Gene》1982,18(3):199-209
By considering the nucleotide sequence of several highly expressed coding regions in bacteriophage MS2 and mRNAs from Escherichia coli, it is possible to deduce some rules which govern the selection of the most appropriate synonymous codons NNU or NNC read by tRNAs having GNN, QNN or INN as anticodon. The rules fit with the general hypothesis that an efficient in-phase translation is facilitated by proper choice of degenerate codewords promoting a codon-anticodon interaction with intermediate strength (optimal energy) over those with very strong or very weak interaction energy. Moreover, codons corresponding to minor tRNAs are clearly avoided in these efficiently expressed genes. These correlations are clearcut in the normal reading frame but not in the corresponding frameshift sequences +1 and +2. We hypothesize that both the optimization of codon-anticodon interaction energy and the adaptation of the population to codon frequency or vice versa in highly expressed mRNAs of E. coli are part of a strategy that optimizes the efficiency of translation. Conversely, codon usage in weakly expressed genes such as repressor genes follows exactly the opposite rules. It may be concluded that, in addition to the need for coding an amino acid sequence, the energetic consideration for codon-anticodon pairing, as well as the adaptation of codons to the tRNA population, may have been important evolutionary constraints on the selection of the optimal nucleotide sequence.
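The comparison of NNU and NNC codon choice in the normal reading frame versus the +1 and +2 frameshifted sequences can be illustrated with a small tally. This is a sketch under assumptions: the function name `nnu_nnc_counts` and its `frame` parameter are hypothetical, and it only counts codons; it does not model the codon-anticodon interaction energies discussed in the abstract.

```python
from collections import defaultdict


def nnu_nnc_counts(mrna: str, frame: int = 0) -> dict[str, tuple[int, int]]:
    """Count NNU and NNC codons per two-base prefix NN in a given frame.

    frame=0 is the normal reading frame; frame=1 and frame=2 correspond
    to the +1 and +2 frameshifted sequences.
    Returns {prefix: (count of NNU, count of NNC)}.
    """
    # Accept DNA or RNA letters, then shift to the requested frame.
    seq = mrna.upper().replace("T", "U")[frame:]
    counts = defaultdict(lambda: [0, 0])
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon[2] == "U":
            counts[codon[:2]][0] += 1
        elif codon[2] == "C":
            counts[codon[:2]][1] += 1
    return {prefix: (u, c) for prefix, (u, c) in counts.items()}
```

Running the same tally with `frame=0`, `frame=1` and `frame=2` on a coding sequence gives the three tables one would compare to see whether a U/C preference is specific to the true reading frame.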

5.
Genome annotation in differently evolved organisms presents challenges because the lack of sequence-based homology limits the ability to determine the function of putative coding regions. To provide an alternative to annotation by sequence homology, we developed a method that takes advantage of unusual trypanosomatid biology and skews in nucleotide composition between coding regions and upstream regions to rank putative open reading frames based on the likelihood of coding. The method is 93% accurate when tested on known genes. We have applied our method to the full complement of open reading frames on Chromosome I of Trypanosoma brucei, and we can predict with high confidence that 226 putative coding regions are likely to be functional. Methods such as the one described here for discriminating true coding regions are critical for genome annotation when other sources of evidence for function are limited.

6.
Statistical analyses of the positional correlation of physical-stability and base-sequence distribution maps with the genetic map are made for the whole DNA (48502 bases) of lambda-phage. The susceptibility to a double-helix unfolding perturbation and the fraction of the transient opening of a particular region of the double helix are adopted to define this physical stability. The principal features obtained are: A) The DNA double strand of protein coding regions is found to have a homostabilizing propensity around a defined stability which is characteristic of each individual gene. B) The stability of the double helix in non-protein coding regions fluctuates, on average over the whole region, more than that in protein coding regions. C) Boundary regions between protein coding and non-protein coding regions are regions of high stability fluctuation. Stability fluctuates especially on the protein-coding-region side of the boundary. In contrast to the quiet interior of a protein coding region, a rather noisy part exists at its edge. D) One frequently opening region coincides with the attachment site for site-specific recombination between phage and bacterial DNA. There are two possible ways to explain the noisy feature in the stability distribution in non-protein coding regions: 1) The region has been used as the locus of recombination as evolution took place. Thus DNAs which were homostabilized around a different value characteristic of each individual DNA have been joined there many times, so that the noise has accumulated as a remnant of evolutionary history; and/or 2) the base-composition homogenizing or double-helix homostabilizing mechanism does not work in unneeded regions such as non-protein coding regions or introns. Since corresponding characteristics have been found in our previous analyses of other viral and globin-gene DNAs, the rules mentioned above may be comprehensively extended to other DNAs.

7.
The nuclear ribosomal locus coding for the large subunit is represented in tandem arrays in the plant genome. These consecutive gene blocks, consisting of several regions, are widely applied in plant phylogenetics. The regions coding for the subunits of the rRNA have the lowest rate of evolution. The spacer regions, such as the internal transcribed spacers (ITS) and external transcribed spacers (ETS), are also widely utilized in phylogenetics. The fact that these regions are present in many copies in the plant genome is an advantage for laboratory practice but might be a problem for phylogenetic analysis. Besides routine usage, the rDNA regions provide great potential to study complex evolutionary mechanisms, such as reticulate events or array duplications. The understanding of these processes is based on the observation that the multiple copies of rDNA regions are homogenized through concerted evolution. This phenomenon results in paralogous copies, which can be misleading when incorporated in phylogenetic analyses. The fact that non-functional copies or pseudogenes can coexist with orthologues in a single individual also makes the analysis difficult. This article summarizes the information about the structure and utility of the phylogenetically informative spacer regions of the rDNA, namely the internal and external transcribed spacer regions as well as the intergenic spacer (IGS).

8.
Abstract

Statistical analyses of the positional correlation of physical-stability and base-sequence distribution maps with the genetic map are made for the whole DNA (48502 bases) of λ-phage. The susceptibility to a double-helix unfolding perturbation and the fraction of the transient opening of a particular region of the double helix are adopted to define this physical stability.

The principal features obtained are: A) The DNA double strand of protein coding regions is found to have a homostabilizing propensity around a defined stability which is characteristic of each individual gene. B) The stability of the double helix in non-protein coding regions fluctuates, on average over the whole region, more than that in protein coding regions. C) Boundary regions between protein coding and non-protein coding regions are regions of high stability fluctuation. Stability fluctuates especially on the protein-coding-region side of the boundary. In contrast to the quiet interior of a protein coding region, a rather noisy part exists at its edge. D) One frequently opening region coincides with the attachment site for site-specific recombination between phage and bacterial DNA.

There are two possible ways to explain the noisy feature in the stability distribution in non-protein coding regions: 1) The region has been used as the locus of recombination as evolution took place. Thus DNAs which were homostabilized around a different value characteristic of each individual DNA have been joined there many times, so that the noise has accumulated as a remnant of evolutionary history; and/or 2) the base-composition homogenizing or double-helix homostabilizing mechanism does not work in unneeded regions such as non-protein coding regions or introns.

Since corresponding characteristics have been found in our previous analyses of other viral and globin-gene DNAs, the rules mentioned above may be comprehensively extended to other DNAs.

9.
Comparative genomics provides insight into the evolutionary dynamics that shape discrete sequences as well as whole genomes. To advance comparative genomics within the Brassicaceae, we have end sequenced 23,136 medium-sized insert clones from Boechera stricta, a wild relative of Arabidopsis (Arabidopsis thaliana). A significant proportion of these sequences, 18,797, are nonredundant and display highly significant similarity (BLASTn e-value ≤ 10^-30) to low copy number Arabidopsis genomic regions, including more than 9,000 annotated coding sequences. We have used this dataset to identify orthologous gene pairs in the two species and to perform a global comparison of DNA regions 5' to annotated coding regions. On average, the 500 nucleotides upstream to coding sequences display 71.4% identity between the two species. In a similar analysis, 61.4% identity was observed between 5' noncoding sequences of Brassica oleracea and Arabidopsis, indicating that regulatory regions are not as diverged among these lineages as previously anticipated. By mapping the B. stricta end sequences onto the Arabidopsis genome, we have identified nearly 2,000 conserved blocks of microsynteny (bracketing 26% of the Arabidopsis genome). A comparison of fully sequenced B. stricta inserts to their homologous Arabidopsis genomic regions indicates that indel polymorphisms >5 kb contribute substantially to the genome size difference observed between the two species. Further, we demonstrate that microsynteny inferred from end-sequence data can be applied to the rapid identification and cloning of genomic regions of interest from nonmodel species. These results suggest that among diploid relatives of Arabidopsis, small- to medium-scale shotgun sequencing approaches can provide rapid and cost-effective benefits to evolutionary and/or functional comparative genomic frameworks.

10.
Secondary structure of tobacco mosaic virus protein   (total citations: 1; self-citations: 0; citations by others: 1)
A set of rules is proposed for the prediction of α-helices in proteins. These rules lead to the correct assignment to either helical or non-helical regions of over 80% of the amino acid residues in the proteins from which they are derived. Applied to tobacco mosaic virus protein these rules lead to the prediction of five α-helical regions which may be consistent with other data.

11.
12.
Here we present a model of nucleotide substitution in protein-coding regions that also encode the formation of conserved RNA structures. In such regions, apparent evolutionary context dependencies exist, both between nucleotides occupying the same codon and between nucleotides forming a base pair in the RNA structure. The overlap of these fundamental dependencies is sufficient to cause "contagious" context dependencies which cascade across many nucleotide sites. Such large-scale dependencies challenge the use of traditional phylogenetic models in evolutionary inference because they explicitly assume evolutionary independence between short nucleotide tuples. In our model we address this by replacing context dependencies within codons by annotation-specific heterogeneity in the substitution process. Through a general procedure, we fragment the alignment into sets of short nucleotide tuples based on both the protein coding and the structural annotation. These individual tuples are assumed to evolve independently, and the different tuple sets are assigned different annotation-specific substitution models shared between their members. This allows us to build a composite model of the substitution process from components of traditional phylogenetic models. We applied this to a data set of full-genome sequences from the hepatitis C virus where five RNA structures are mapped within the coding region. This allowed us to partition the effects of selection on different structural elements and to test various hypotheses concerning the relation of these effects. Of particular interest, we found evidence of a functional role of loop and bulge regions, as these were shown to evolve according to a different and more constrained selective regime than the nonpairing regions outside the RNA structures. Other potential applications of the model include comparative RNA structure prediction in coding regions and RNA virus phylogenetics.

13.
14.
MOTIVATION: The large volume of single nucleotide polymorphism data now available motivates the development of methods for distinguishing neutral changes from those which have real biological effects. Here, two different machine-learning methods, decision trees and support vector machines (SVMs), are applied for the first time to this problem. In common with most other methods, only non-synonymous changes in protein coding regions of the genome are considered. RESULTS: In detailed cross-validation analysis, both learning methods are shown to compete well with existing methods, and to out-perform them in some key tests. SVMs show better generalization performance, but decision trees have the advantage of generating interpretable rules with robust estimates of prediction confidence. It is shown that the inclusion of protein structure information produces more accurate methods, in agreement with other recent studies, and the effect of using predicted rather than actual structure is evaluated. AVAILABILITY: Software is available on request from the authors.

15.
Dynamic flexibility in the Escherichia coli genome.   (total citations: 2; self-citations: 0; citations by others: 2)
L Tsai  Z Sun 《FEBS letters》2001,507(2):225-230
Empirical rules based on tetranucleotide parameters were presented to predict the structural parameters twist (Ω), roll (ρ), tilt (τ) and slide (D_y). A statistical mechanical model was used to analyze the flexibility of the Escherichia coli genome. The replication terminus region displayed a low level of flexibility. A strong correlation can be seen between G+C content and flexibility. Average flexibilities in the coding regions were found to be significantly larger than those in non-coding regions. The flexible characteristics in the 5'-neighborhood of the coding regions and in three class sigma promoter sequences in the E. coli genome were also analyzed.

16.
17.
Indels in the coding regions of a gene can either cause frameshifts or amino acid insertions/deletions. Frameshifting indels are indels that have a length that is not divisible by 3 and subsequently cause frameshifts. Indels that have a length divisible by 3 cause amino acid insertions/deletions or block substitutions; we call these 3n indels. The new amino acid changes resulting from 3n indels could potentially affect protein function. Therefore, we construct a SIFT Indel prediction algorithm for 3n indels which achieves 82% accuracy, 81% sensitivity, 82% specificity, 82% precision, 0.63 MCC, and 0.87 AUC by 10-fold cross-validation. We have previously published a prediction algorithm for frameshifting indels. The rules for the prediction of 3n indels are different from the rules for the prediction of frameshifting indels and reflect the biological differences of these two different types of variations. SIFT Indel was applied to human 3n indels from the 1000 Genomes Project and the Exome Sequencing Project. We found that common variants are less likely to be deleterious than rare variants. The SIFT indel prediction algorithm for 3n indels is available at http://sift-dna.org/
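The length rule that separates the two indel classes can be written down directly. The sketch below covers only that first triage step, assuming a variant is given as reference and alternate allele strings; the function name `classify_indel` is hypothetical, and nothing here reproduces the SIFT Indel prediction algorithm itself.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding-region indel by the length rule in the abstract.

    The indel length is the difference between allele lengths:
    not divisible by 3 -> "frameshifting"; divisible by 3 -> "3n".
    """
    length = abs(len(alt) - len(ref))
    if length == 0:
        # Equal-length alleles are substitutions, not indels.
        return "not an indel"
    return "3n" if length % 3 == 0 else "frameshifting"
```

For example, `classify_indel("A", "ACGT")` describes a 3-base insertion and so falls in the 3n class, while `classify_indel("ACG", "A")` is a 2-base deletion and is frameshifting.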

18.
We have determined the nucleotide sequence of a 1,200-base pair (bp) genomic fragment that includes the kappa-chain constant-region gene (C kappa) from two species of native Australian rodents, Rattus leucopus cooktownensis and Rattus colletti. Comparison of these sequences with each other and with other rodent C kappa genes shows three surprising features. First, the coding regions are diverging at a rate severalfold higher than that of the nearby noncoding regions. Second, replacement changes within the coding region are accumulating at a rate at least as great as that of silent changes. Third, most of the amino acid replacements are localized in one region of the C kappa domain--namely, the carboxy-terminal "bends" in the alpha-carbon backbone. These three features have previously been described from comparisons of the two allelic forms of C kappa genes in R. norvegicus. These data imply the existence of considerable evolutionary constraints on the noncoding regions (based on as yet undetermined functions) or powerful positive selection to diversify a portion of the constant-region domain (whose physiological significance is not known). These surprising features of C kappa evolution appear to be characteristic only of closely related C kappa genes, since comparison of rodent with human sequences shows the expected greater conservation of coding regions, as well as a predominance of silent nucleotide substitutions within the coding regions.

19.
A genome must locate its coding genes on the chromosomes in a meaningful manner with the help of natural selection, but the mechanism of gene order evolution is poorly understood. To explore the role of selection in shaping the current order of coding genes and their cis-regulatory elements, a comparative genomic approach was applied to the baker's yeast Saccharomyces cerevisiae and its close relatives. S. cerevisiae has experienced a whole-genome duplication followed by an extensive reorganization process of gene order, during which a number of new adjacent gene pairs appeared. We found that the proportion of new adjacent gene pairs in divergent orientation is significantly reduced, suggesting that such new divergent gene pairs may be disfavored, most likely because their coregulation may be deleterious. We also found that such new divergent gene pairs have particularly long intergenic regions. These observations suggest that selection specifically worked against deletions in intergenic regions of new divergent gene pairs, perhaps because they should be physically kept apart so that they are not coregulated. This indicates that gene regulation is one of the major factors determining the order of coding genes.

20.
Prediction of splice junctions in mRNA sequences.   (total citations: 8; self-citations: 6; citations by others: 2)
K Nakata  M Kanehisa    C DeLisi 《Nucleic acids research》1985,13(14):5327-5340
A general method based on the statistical technique of discriminant analysis is developed to distinguish boundaries of coding and non-coding regions in nucleic acid sequences. In particular, the method is applied to the prediction of splicing sites in messenger RNA precursors. Information used for discrimination includes consensus sequence patterns around splice junctions, free energy of snRNA and mRNA base pairing, and statistical differences between coding and non-coding regions such as periodic appearance of specific bases in coding regions reflecting the non-random usage of degenerate codons. Given the reading frame of an exon (but not the exon/intron boundaries), the method will predict the following exon, namely, the intron to be excised out. When applied to human sequences in the GenBank database, the method correctly identified 80% of true splice junctions.
