Similar literature
Found 20 similar documents; search took 15 ms
1.
2.
3.
Structure of the rat prolactin gene   Total citations: 17 (self-citations: 0; citations by others: 17)
The organization and sequence of the rat preprolactin gene have been investigated. Analysis of two different plasmids containing pituitary cDNA inserts has provided the complete 681-nucleotide coding sequence of preprolactin as well as 17 nucleotides preceding the initiation codon and 90 nucleotides following the termination codon. Digestion of rat chromosomal DNA with the restriction endonuclease Eco RI followed by size fractionation and hybridization to a labeled prolactin cDNA probe has demonstrated that prolactin genomic sequences are located on 6.0-, 3.9-, and 2.9-kilobase fragments. The 6.0- and 3.9-kilobase fragments were isolated from a library of cloned rat DNA fragments. The sequence of more than 1800 nucleotides of the cloned DNA has been determined. The sequenced region contains coding regions of 180 and 189 nucleotides which specify the COOH-terminal 123 amino acids of the 227-amino-acid sequence of rat preprolactin. These coding regions are separated by an intervening sequence of 597 nucleotides. At least one other large intervening sequence separates this region from the region coding for the NH2-terminal portion of preprolactin. Hybridization experiments suggested that the intervening sequences of the rat prolactin gene contain DNA sequences which are repeated elsewhere in the rat genome.

4.
The genomes of lungfish, together with those of some urodele amphibians, are the largest of all vertebrate genomes. It has been assumed that the bulk of the DNA making up these large genomes has been derived from repeat elements, like the noncoding DNA of those genomes that have been sequenced (e.g., human). In an attempt to characterize repeat sequences in the lungfish genome, we have isolated, by restriction enzyme digestion of genomic DNA, sequences of a repeat element in Neoceratodus forsteri, the most primitive of the living lungfishes. The fragments sequenced from the EcoRI and BglII digests were used to perform genome walking PCR in order to obtain the full sequence of the repeat element. This element shares homology with the non-LTR (LINE) element, Chicken Repeat 1 (CR1), described for several vertebrates and some invertebrates; we have called it N. forsteri CR1 (NfCR1). NfCR1 shares all the domains of other CR1 elements but it also has several unique features that suggest it may no longer be active in the lungfish genome. It occurs in both full-length and 5'-truncated versions and in its present "inactive" form represents approximately 0.05% of the lungfish genome.

5.
Summary: We have investigated the compositional properties of coding sequences from cold-blooded vertebrates and we have compared them with those from warm-blooded vertebrates. Moreover, we have studied the compositional correlations of coding sequences with the genomes in which they are contained, as well as the compositional correlations among the codon positions of the genes analyzed. The distributions of GC levels of the third codon positions of genes from cold-blooded vertebrates are distinctly different from those of warm-blooded vertebrates in that they do not reach the high values attained by the latter. Moreover, coding sequences from cold-blooded vertebrates are either equal, or, in most cases, lower in GC (not only in third, but also in first and second codon positions) than homologous coding sequences from warm-blooded vertebrates; higher values are exceptional. These results at the gene level are in agreement with the compositional differences between cold-blooded and warm-blooded vertebrates previously found at the whole genome (DNA) level (Bernardi and Bernardi 1990a,b). Two linear correlations were found: one between the GC levels of coding sequences (or of their third codon positions) and the GC levels of the genomes of cold-blooded vertebrates containing them; and another between the GC levels of third and first + second codon positions of genes from cold-blooded vertebrates. The first correlation applies to the genomes (or genome compartments) of all vertebrates and the second to the genes of all living organisms. These correlations are tantamount to a genomic code.
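
The GC levels per codon position discussed in this abstract can be computed directly from an in-frame coding sequence. The following is a minimal sketch, not the authors' procedure: the function name `gc_by_codon_position` is hypothetical, and it assumes the input starts at codon position 1 and contains only A, C, G and T.

```python
from typing import List


def gc_by_codon_position(cds: str) -> List[float]:
    """Return the GC fraction at codon positions 1, 2 and 3.

    Assumes (for illustration) that `cds` is in frame, i.e. its first
    base is the first base of a codon.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("sequence must contain at least one complete codon")
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base, starting at this codon position
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases))
    return fractions
```

For example, `gc_by_codon_position("ATGGCCAAG")` yields one third for positions 1 and 2 and 1.0 for position 3. Averaging the first two values gives a first + second position GC level that could be plotted against the third-position value across a set of genes.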

6.
MOTIVATION: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and to their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA: http://bioinformatics.psb.ugent.be/.

7.
BACKGROUND: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties. RESULTS: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001). CONCLUSIONS: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.

8.
9.
MOTIVATION: Accurate prediction of genes in genomes has always been a challenging task for bioinformaticians and computational biologists. The discovery of distinct scaling relations in coding and non-coding sequences has led to new perspectives in the understanding of DNA sequences. This has motivated us to exploit the differences in the local singularity distributions for characterization and classification of coding and non-coding sequences. RESULTS: The local singularity density distribution in the coding and non-coding sequences of four genomes was first estimated using the wavelet transform modulus maxima methodology. A support vector machine classifier was then trained with the extracted features. The trained classifier is able to provide an average test accuracy of 97.7%. The local singularity features in a DNA sequence can be exploited for successful identification of coding and non-coding sequences. CONTACT: Available on request from bd.kulkarni@ncl.res.in.

10.
Behura SK, Severson DW. Gene, 2012, 504(2): 226-232
We present a detailed genome-scale comparative analysis of simple sequence repeats within protein coding regions among 25 insect genomes. The repetitive sequences in the coding regions primarily represented single codon repeats and codon pair repeats. The CAG triplet is highly repetitive in the coding regions of insect genomes. It is frequently paired with the synonymous codon CAA to code for polyglutamine repeats. The codon pairs that are least repetitive code for polyalanine repeats. The frequency of hexanucleotide and dinucleotide motifs of codon pair repeats is significantly (p < 0.001) different in the Drosophila species compared to the non-Drosophila species. However, the frequency of synonymous and non-synonymous codon pair repeats varies in a correlated manner (r² = 0.79) among all the species. Results further show that perfect and imperfect repeats have significant association with the trinucleotide and hexanucleotide coding repeats in most of these insects. However, only select species show significant association between the numbers of perfect/imperfect hexamers and repeats coding for single amino acid/amino acid pair runs. Our data further suggest that genes containing simple sequence coding repeats may be under negative selection as they tend to be poorly conserved across species. The sequences of coding repeats of orthologous genes vary according to the known phylogeny among the species. In conclusion, the study shows that simple sequence coding repeats are important features of genome diversity among insects.

11.
12.
DNA sequences, potentially coding for histidine-rich proteins, were isolated from a P. falciparum genomic library using an oligonucleotide probe consisting of histidine codon repeats. Sequencing revealed that the different DNA fragments contain long repetitive regions highly homologous to the probe. One clone was fully sequenced and contains two open reading frames that overlap in the repetitive region but are located on opposite strands. Analysis suggests that both are coding. One frame could code for a small histidine-rich protein, the other for a protein containing many aspartic acid residues. Southern blotting revealed that these sequences are conserved in all three P. falciparum strains studied.

13.
With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) from genomic sequences. Despite much progress in the identification of protein coding regions by computational methods during the last two decades, the performance and efficiency of the prediction methods still need to be improved. In addition, it is indispensable to develop different prediction methods since combining different methods may greatly improve the prediction accuracy. A new method to predict protein coding regions is developed in this paper based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not have this unique feature. The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences. Exon and intron sequences can be identified from trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from different organisms show that this method is an effective approach for exon prediction.
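
The idea of measuring 3-base periodicity from nucleotide distributions in the three codon positions can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: it uses the Fourier power at frequency 1/3, which depends only on how often each nucleotide occurs at positions 0, 1 and 2 modulo 3, and it divides by the segment length as an assumed, simple stand-in for the background-noise term. The function names are hypothetical.

```python
import cmath
from typing import Dict, List


def codon_position_counts(seq: str) -> Dict[str, List[int]]:
    """Count each nucleotide at sequence positions 0, 1 and 2 modulo 3."""
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(seq.upper()):
        if base in counts:  # skip ambiguity codes such as N
            counts[base][i % 3] += 1
    return counts


def periodicity3_power(seq: str) -> float:
    """Fourier power at frequency 1/3, summed over the four nucleotides."""
    omega = cmath.exp(-2j * cmath.pi / 3)
    power = 0.0
    for c0, c1, c2 in codon_position_counts(seq).values():
        power += abs(c0 + c1 * omega + c2 * omega ** 2) ** 2
    return power


def periodicity3_ratio(seq: str) -> float:
    """Power at frequency 1/3 divided by segment length (assumed normalisation)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return periodicity3_power(seq) / len(seq)
```

Applied to successive windows along a sequence, the ratio tends to stay high where one nucleotide distribution dominates each codon position and drops toward zero where the three positions have the same composition.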

14.
Heuristic approach to deriving models for gene finding.   Total citations: 21 (self-citations: 2; citations by others: 19)
Computer methods of accurate gene finding in DNA sequences require models of protein coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. Here we propose a new, heuristic method producing fairly accurate inhomogeneous Markov models of protein coding regions. The new method needs such a small amount of DNA sequence data that the model can be built 'on the fly' by a web server for any DNA sequence >400 nt. Tests on 10 complete bacterial genomes performed with the GeneMark.hmm program demonstrated the ability of the new models to detect 93.1% of annotated genes on average, while models built by traditional training predict an average of 93.9% of genes. Models built by the heuristic approach could be used to find genes in small fragments of anonymous prokaryotic genomes and in genomes of organelles, viruses, phages and plasmids, as well as in highly inhomogeneous genomes where adjustment of models to local DNA composition is needed. The heuristic method also gives an insight into the mechanism of codon usage pattern evolution.
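
To make the notion of an inhomogeneous (codon-position-specific) Markov model concrete, here is a deliberately simplified sketch. It is not the heuristic method of this abstract and not how GeneMark.hmm works: it assumes a zeroth-order model trained by plain counting, a uniform background by default, and hypothetical function names.

```python
import math
from typing import Dict, Iterable, List, Optional

BASES = "ACGT"


def train_periodic_model(coding_seqs: Iterable[str],
                         pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Estimate nucleotide frequencies separately for codon positions 1, 2 and 3."""
    counts = [{b: pseudocount for b in BASES} for _ in range(3)]
    for seq in coding_seqs:
        for i, base in enumerate(seq.upper()):
            if base in BASES:
                counts[i % 3][base] += 1
    model = []
    for pos_counts in counts:
        total = sum(pos_counts.values())
        model.append({b: c / total for b, c in pos_counts.items()})
    return model


def coding_log_odds(seq: str, model: List[Dict[str, float]],
                    background: Optional[Dict[str, float]] = None) -> float:
    """Log-odds of `seq` under the periodic model versus a background model."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform non-coding model
    score = 0.0
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            score += math.log(model[i % 3][base] / background[base])
    return score
```

A positive score means the sequence looks more like the training coding sequences in that reading frame than like the background. Scoring the same sequence in all three frames is one simple way to see the frame dependence that a homogeneous model cannot capture.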

15.
Complete chromosome/genome sequences available from humans, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Saccharomyces cerevisiae were analyzed for the occurrence of mono-, di-, tri-, and tetranucleotide repeats. In all of the genomes studied, dinucleotide repeat stretches tended to be longer than other repeats. Additionally, tetranucleotide repeats in humans and trinucleotide repeats in Drosophila also seemed to be longer. Although the trends for different repeats are similar between different chromosomes within a genome, the density of repeats may vary between different chromosomes of the same species. The abundance or rarity of various di- and trinucleotide repeats in different genomes cannot be explained by nucleotide composition of a sequence or potential of repeated motifs to form alternative DNA structures. This suggests that in addition to nucleotide composition of repeat motifs, characteristic DNA replication/repair/recombination machinery might play an important role in the genesis of repeats. Moreover, analysis of complete genome coding DNA sequences of Drosophila, C. elegans, and yeast indicated that expansions of codon repeats corresponding to small hydrophilic amino acids are tolerated more, while strong selection pressures probably eliminate codon repeats encoding hydrophobic and basic amino acids. The locations and sequences of all of the repeat loci detected in genome sequences and coding DNA sequences are available at http://www.ncl-india.org/ssr and could be useful for further studies.
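
A scan for perfect mono-, di-, tri-, or tetranucleotide repeats of the kind surveyed in this abstract can be sketched with a backreferencing regular expression. This is an illustrative sketch only; the function name `find_tandem_repeats` and the minimum-copy threshold are assumptions, not the criteria used in the study.

```python
import re
from typing import List, Tuple


def find_tandem_repeats(seq: str, unit_len: int,
                        min_copies: int) -> List[Tuple[int, str, int]]:
    """Return (start, motif, copies) for perfect tandem repeats of one unit length."""
    if unit_len < 1 or min_copies < 2:
        raise ValueError("unit_len must be >= 1 and min_copies >= 2")
    # One unit captured in group 1, followed by at least (min_copies - 1) exact copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // unit_len
        hits.append((match.start(), motif, copies))
    return hits
```

Calling `find_tandem_repeats("TTCAGCAGCAGCAGTT", 3, 4)` reports a single CAG run of four copies starting at index 2. The sketch does not filter motifs that are themselves built from a shorter unit (a run of ACAC is also a run of AC), and it reports the motif in whichever phase the scan meets first, so a fuller survey would need to normalise both.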

16.
Synonymous codon usage patterns of bacteriophage and host genomes were compared. Two indexes, G + C base composition of a gene (fgc) and fraction of translationally optimal codons of the gene (fop), were used in the comparison. Synonymous codon usage data of all the coding sequences on a genome are represented as a cloud of points in the plane of fop vs. fgc. The Escherichia coli coding sequences appear to exhibit two phases, "rising" and "flat" phases. Genes that are essential for survival and are thought to be native are located in the flat phase, while foreign-type genes from prophages and transposons are found in the rising phase with a slope of nearly unity in the fgc vs. fop plot. Synonymous codon distribution patterns of genes from temperate phages P4, P2, N15 and lambda are similar to the pattern of E. coli rising phase genes. In contrast, genes from the virulent phage T7 or T4, for which a phage-encoded DNA polymerase is identified, fall on a line with a slope of nearly zero in the fop vs. fgc plane. These results may suggest that the G + C contents for T7, T4 and E. coli flat phase genes are subject to directional mutation pressure and are determined by the DNA polymerase used in the replication. There is significant variation in the fop values of the phage genes, suggesting an adjustment to gene expression level. Similar analyses of codon distribution patterns were carried out for Haemophilus influenzae, Bacillus subtilis, Mycobacterium tuberculosis and their phages with complete genomic sequences available.

17.
Starting from two datasets of codon usage in coding sequences from mesophilic and thermophilic bacteria, we used internal correspondence analysis to study the variability of codon usage within and between species, and within and between amino acids. The first dataset included 18,958,458 codons from 58,482 coding sequences from completely sequenced genomes of 25 species, along with 6,793,581 dinucleotides from 21,876 intergenic spaces. The second dataset, with partially sequenced genomes, included 97,095,873 codons from 293 bacterial species. Results were consistent between the two datasets. The trend for the amino-acid composition of thermophilic proteins was found to be under the control of a pressure at the nucleic acid level, not a selection at the protein level. This effect was not present in intergenic spaces, ruling out a pressure at the DNA level. The pattern at the mRNA level was more complex than a simple purine enrichment of the sense strand of coding sequences. Outliers in the partial genome dataset introduced a note of caution about the interpretation of temperature as the direct determinant of the trend observed in thermophiles. The surprising lack of selection on the amino-acid content of thermophilic proteins suggests that the amino-acid repertoire was set up in a hot environment.

18.
19.
Static DNA curvature distributions of fully sequenced genomes and large DNA contigs from different organisms were calculated. Very distinctive differences among histogram profiles coming from archaebacteria, eubacteria, and eukaryotes were observed. Eubacterial profiles were, on average, more curved than were archaeal and eukaryotic profiles. A comparative analysis between real and randomized DNA sequences revealed that eubacterial genomes presented, overall, higher curvature values than random sequences. The opposite picture was exhibited by archaeal and eukaryotic genomes: they displayed a lower frequency of curved regions than their corresponding randomized sequences. The contributions of coding and intergenic regions to the curvature profile were also analyzed. Intergenic regions, on average, were found to be more curved than the overall genomic sequences, especially in prokaryotic organisms. Nevertheless, because of their small size with respect to coding regions, the contribution of intergenic sequences to the overall curvature profile tended to be minor. A clear relationship between codon usage and DNA curvature was demonstrated, and a proposal of the possible coevolution of both systems is discussed. Finally, we present a procedure to quantify the deviation of a curvature profile from randomness through a formal statistical analysis.

20.
With the continuing accomplishments of the human genome project, high-throughput strategies to identify DNA sequences that are important in mammalian gene regulation are becoming increasingly feasible. In contrast to the historic, labour-intensive, wet-laboratory methods for identifying regulatory sequences, many modern approaches are heavily focused on the computational analysis of large genomic data sets. Data from inter-species genomic sequence comparisons and genome-wide expression profiling, integrated with various computational tools, are poised to contribute to the decoding of genomic sequence and to the identification of those sequences that orchestrate gene regulation. In this review, we highlight several genomic approaches that are being used to identify regulatory sequences in mammalian genomes.
