Similar literature
 20 similar documents found (search time: 31 ms)
1.
2.
The nucleotide sequence running from the genetic left end of bacteriophage T7 DNA to within the coding sequence of gene 4 is given, except for the internal coding sequence for the gene 1 protein, which has been determined elsewhere. The sequence presented contains nucleotides 1 to 3342 and 5654 to 12,100 of the approximately 40,000 base-pairs of T7 DNA. This sequence includes: the three strong early promoters and the termination site for Escherichia coli RNA polymerase; eight promoter sites for T7 RNA polymerase; six RNAase III cleavage sites; the primary origin of replication of T7 DNA; the complete coding sequences for 13 previously known T7 proteins, including the anti-restriction protein, protein kinase, DNA ligase, the gene 2 inhibitor of E. coli RNA polymerase, single-strand DNA binding protein, the gene 3 endonuclease, and lysozyme (which is actually an N-acetylmuramyl-L-alanine amidase); the complete coding sequences for eight potential new T7-coded proteins; and two apparently independent initiation sites that produce overlapping polypeptide chains of gene 4 primase. More than 86% of the first 12,100 base-pairs of T7 DNA appear to be devoted to specifying amino acid sequences for T7 proteins, and the arrangement of coding sequences and other genetic elements is very efficient. There is little overlap between coding sequences for different proteins, but junctions between adjacent coding sequences are typically close, the termination codon for one protein often overlapping the initiation codon for the next. For almost half of the potential T7 proteins, the sequence in the messenger RNA that can interact with 16 S ribosomal RNA in initiation of protein synthesis is part of the coding sequence for the preceding protein. The longest non-coding region, about 900 base-pairs, is at the left end of the DNA. The right half of this region contains the strong early promoters for E. coli RNA polymerase and the first RNAase III cleavage site. The left end contains the terminal repetition (nucleotides 1 to 160), followed by a striking array of repeated sequences (nucleotides 175 to 340) that might have some role in packaging the DNA into phage particles, and an A·T-rich region (nucleotides 356 to 492) that contains a promoter for T7 RNA polymerase and might function as a replication origin.

3.
In the present study, we developed a method for detecting sequences whose similarity to a target sequence is statistically significant, and we examined the distribution of these sequences in the E. coli K-12 genome. The target sequences examined are as follows: (i) short repeats: Crossover hot-spot instigator (Chi) sequence, replication termination (Ter) sequence, and DnaA binding sequence (DnaA box); (ii) potential stem-loop structure repeats: palindromic unit (PU), boxC sequences, and intergenic repeat unit (IRU); (iii) potential RNA coding repeats: rRNAs, PAIR, TRIP, and QUAD; and (iv) potential protein coding repeats: insertion elements (ISs) and Long Direct Repeats (LDRs). We also examined the distribution of these sequences on leading and lagging strands. We obtained another four statistically significant LDR sequences with more than 187 bp matched to LDR-A near the LDR loci, suggesting that these regions might be used as high recombination hot spots for LDR. Adaptation of individual LDRs to the E. coli genome is also discussed on the basis of codon usage.

4.
It is proven that, under the independent codon model, the likelihood of a DNA coding sequence read in the correct frame is asymptotically larger than that read in an incorrect frame. Based on this proposition, a single set of codon usage probabilities is enough to discriminate among the six frames of coding sequences under the independent codon model. The direct coding sequence of the Escherichia coli genome is taken as an example to examine codon independence by using mutual information and χ² analysis. The contrast between the coding frame and the two offset frames is evident. A self-learning approach for generating a training set is proposed to estimate the probability parameters.
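The frame-discrimination idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `frame_log_likelihoods`, the toy codon table, and the `floor` probability for unseen codons are all assumptions made for illustration, and only the three forward frames are scored (the reverse-complement frames would be handled the same way).

```python
import math

def frame_log_likelihoods(seq, codon_probs, floor=1e-6):
    """Log-likelihood of seq read in each of the three forward frames,
    treating codons as independent draws from codon_probs.
    `floor` (an assumed value) stands in for codons missing from the table."""
    scores = []
    for offset in range(3):
        total = 0.0
        # Walk the sequence codon by codon, starting at this frame offset.
        for i in range(offset, len(seq) - 2, 3):
            total += math.log(codon_probs.get(seq[i:i + 3], floor))
        scores.append(total)
    return scores

# Hypothetical, heavily simplified codon usage table (not real E. coli data).
toy_probs = {"ATG": 0.3, "GCT": 0.3, "GAA": 0.3, "TAA": 0.1}
seq = "ATGGCTGAAGCTGAATAA"

scores = frame_log_likelihoods(seq, toy_probs)
best_frame = max(range(3), key=lambda k: scores[k])
```

In this toy case every codon of frame 0 is in the table while the offset frames only hit the floor probability, so frame 0 receives the largest score.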

5.
6.
Codon usage tables have been produced for E. coli, yeast, human, and mouse. The nonrandom employment of codons allows assignment of probability values to trinucleotides in any DNA sequence. These values represent the probability that a given trinucleotide is used as a codon in the organism from which the table is derived. For the graphical delineation of coding areas in DNA sequences, a probability is assigned to each trinucleotide equal to its frequency in the codon table. Averaging and smoothing procedures then greatly enhance the detectability of areas of high average codon probability and better represent the mean codon probability. These manipulations increase graphical clarity without altering the overall magnitude of probabilities. Averaging introduces an error of less than 0.5% between "raw" and smoothed data. This graphical delineation of coding sequences does not depend on the presence of punctuation, ribosomal binding sites, etc.; moreover, the delineation of introns and exons is also possible.
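A rough sketch of the profile described above: each trinucleotide is given its codon-table frequency, and a moving average smooths the resulting curve. The function name, the centered window, and the toy frequency table are assumptions; the abstract does not specify the exact averaging procedure.

```python
def codon_probability_profile(seq, codon_freq, window=5):
    """Return (raw, smoothed) codon-probability profiles for seq.
    raw[i] is the table frequency of the trinucleotide starting at i;
    smoothed[i] is a centered moving average of raw (assumed scheme)."""
    raw = [codon_freq.get(seq[i:i + 3], 0.0) for i in range(len(seq) - 2)]
    half = window // 2
    smoothed = []
    for i in range(len(raw)):
        # Shrink the window at the sequence ends instead of padding.
        lo, hi = max(0, i - half), min(len(raw), i + half + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    return raw, smoothed

# Hypothetical frequencies, for illustration only.
toy_freq = {"ATG": 0.6, "GCT": 0.3}
raw, smoothed = codon_probability_profile("ATGGCT", toy_freq, window=3)
```

Plotting `smoothed` against position would give the kind of curve in which stretches of high average codon probability stand out.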

7.
The identification of genes involved in host-pathogen interactions is important for the elucidation of mechanisms of disease resistance and host susceptibility. A traditional way to classify the origin of genes sampled from a pool of mixed cDNA is through sequence similarity to known genes from either the pathogen or host organism or other closely related species. This approach does not work when the identified sequence has no close homologues in the sequence databases. In our previous studies, we classified genes using their codon frequencies. This method, however, explicitly required the prediction of CDS regions and thus could not be applied to sequences composed from the non-coding regions of genes. In this study, we show that the use of sliding-window triplet frequencies extends the application of the algorithm to both coding and non-coding sequences and also increases the prediction accuracy of a Support Vector Machine classifier from 95.6 ± 0.3 to 96.5 ± 0.2. Thus the use of the triplet frequencies increased the prediction accuracy of the new method by more than 20% compared to our previous approach. A functional analysis of the sequences is described; it detected gene families with a significantly higher or lower probability of being correctly classified than the average accuracy of the method. The server to perform classification of EST sequences using triplet frequencies is available at http://mips.gsf.de/proj/est3.
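The feature extraction mentioned in this abstract, sliding-window triplet frequencies, might look roughly like the sketch below. The function name and the handling of ambiguous bases are assumptions; the abstract does not describe the exact preprocessing. The resulting 64-element vector is the sort of input a Support Vector Machine could be trained on.

```python
from collections import Counter
from itertools import product

# Fixed ordering of the 64 triplets so every sequence maps to the same layout.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]

def triplet_frequencies(seq):
    """Relative frequency of each triplet, counted with a window that
    slides one base at a time (so no reading frame is required).
    Triplets containing non-ACGT characters are ignored (an assumption)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts[t] for t in TRIPLETS)
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]

vec = triplet_frequencies("ATGATG")
```

Because the window advances by one base rather than three, the same function applies to coding and non-coding sequences alike, which is the property the abstract emphasizes.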

8.
Our previous study demonstrated the anti-apoptosis function of the human immunodeficiency virus type 1 (HIV-1) vpu gene product in normal CD4+ T lymphocytes. In this study, using sequences obtained from the HIV sequence database, we compared vpu sequences from 184 preparations of various subtypes of HIV-1 from diverse geographical regions. Our analysis revealed that CRF01_AE isolates had premature stop codon mutations at the vpu gene at a much higher rate (36%) than other subtypes (0-9%). The premature stop codon mutations in vpu existed mostly at two amino acid residues: the methionine initiation codon and the boundary between the transmembrane (TM) and cytoplasmic domains. The mutations at the latter site were more often detected in CRF01_AE. The higher mutation rates at vpu in CRF01_AE were confirmed by sequence comparison of polymerase chain reaction products newly obtained directly from the DNA extracted from peripheral blood mononuclear cells (PBMCs), but not from the RNA from the plasma, in CRF01_AE- and subtype B-infected individuals. This finding may indicate that a larger proportion of the HIV-1 CRF01_AE population is able to induce apoptosis in CD4+ T lymphocytes than of the populations of other subtypes.

9.
BACKGROUND: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties. RESULTS: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001). CONCLUSIONS: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.

10.
11.
Chen C  Montelaro RC 《Journal of virology》2003,77(19):10280-10287
Synthesis of Gag-Pol polyproteins of retroviruses requires ribosomes to shift translational reading frame once or twice in a -1 direction to read through the stop codon in the gag reading frame. It is generally believed that a slippery sequence and a downstream RNA structure are required for the programmed -1 ribosomal frameshifting. However, the mechanism regulating the Gag-Pol frameshifting remains poorly understood. In this report, we have defined specific mRNA elements required for sufficient ribosomal frameshifting in equine infectious anemia virus (EIAV) by using full-length provirus replication and Gag/Gag-Pol expression systems. The results of these studies revealed that frameshifting efficiency and viral replication were dependent on a characteristic slippery sequence, a five-base-paired GC stretch, and a pseudoknot structure. Heterologous slippery sequences from human immunodeficiency virus type 1 and visna virus were able to substitute for the EIAV slippery sequence in supporting EIAV replication. Disruption of the GC-paired stretch abolished the frameshifting required for viral replication, and disruption of the pseudoknot reduced the frameshifting efficiency by 60%. Our data indicated that maintenance of the essential RNA signals (slippery sequences and structural elements) in this region of the genomic mRNA was critical for sufficient ribosomal frameshifting and EIAV replication, while concomitant alterations in the amino acids translated from the same region of the mRNA could be tolerated during replication. The data further indicated that proviral mutations that reduced frameshifting efficiency by as much as 50% continued to sustain viral replication and that greater reductions in frameshifting efficiency led to replication defects. These studies define for the first time the RNA sequence and structural determinants of Gag-Pol frameshifting necessary for EIAV replication, reveal novel aspects relative to frameshifting elements described for other retroviruses, and provide new genetic determinants that can be evaluated as potential antiviral targets.

12.
Structure of the rat prolactin gene   Total citations: 17 (self-citations: 0; citations by others: 17)
The organization and sequence of the rat preprolactin gene have been investigated. Analysis of two different plasmids containing pituitary cDNA inserts has provided the complete 681-nucleotide coding sequence of preprolactin as well as 17 nucleotides preceding the initiation codon and 90 nucleotides following the termination codon. Digestion of rat chromosomal DNA with the restriction endonuclease Eco RI followed by size fractionation and hybridization to a labeled prolactin cDNA probe has demonstrated that prolactin genomic sequences are located on 6.0-, 3.9-, and 2.9-kilobase fragments. The 6.0- and 3.9-kilobase fragments were isolated from a library of cloned rat DNA fragments. The sequence of more than 1800 nucleotides of the cloned DNA has been determined. The sequenced region contains coding regions of 180 and 189 nucleotides which specify the COOH-terminal 123 amino acids of the 227-amino-acid sequence of rat preprolactin. These coding regions are separated by an intervening sequence of 597 nucleotides. At least one other large intervening sequence separates this region from the region coding for the NH2-terminal portion of preprolactin. Hybridization experiments suggested that the intervening sequences of the rat prolactin gene contain DNA sequences which are repeated elsewhere in the rat genome.

13.
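Cloning, sequence analysis, and comparison of the Chinese human interleukin-12 cDNA genes   Total citations: 3 (self-citations: 0; citations by others: 3)
Jiao Hongyuan (焦宏远), Zhan Meiyun (詹美云) 《病毒学报》 (Chinese Journal of Virology) 2000,16(4):336-340
To study the genetic characteristics of Chinese human IL-12, the cDNAs of the two subunits, P35 and P40, were each cloned from Chinese umbilical cord blood mononuclear cells by reverse transcription nested polymerase chain reaction (RT-nPCR). Both clones include the complete precursor-protein coding sequence: the P35 cDNA encodes a polypeptide of 219 amino acids and the P40 cDNA a polypeptide of 328 amino acids. Comparison with published foreign sequences (NKSF, CLMF) showed that, in the cloned P35 sequence relative to NKSF, the codon for amino acid 44 changes from GTC (Val) to GTG (Val), but this did not change …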

14.
One of the main advantages of de novo gene synthesis is the fact that it frees the researcher from any limitations imposed by the use of natural templates. To make the most out of this opportunity, efficient algorithms are needed to calculate a coding sequence, combining different requirements, such as adapted codon usage or avoidance of restriction sites, in the best possible way. We present an algorithm where a “variation window” covering several amino acid positions slides along the coding sequence. Candidate sequences are built comprising the already optimized part of the complete sequence and all possible combinations of synonymous codons representing the amino acids within the window. The candidate sequences are assessed with a quality function, and the first codon of the best candidate’s variation window is fixed. Subsequently the window is shifted by one codon position. As an example of freely accessible software implementing the algorithm, we present the Mr. Gene web application. Additionally, two experimental applications of the algorithm are shown.
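The sliding "variation window" procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the Mr. Gene implementation: the function names, the tiny synonymous-codon table, the codon weights, and the quality function (summed codon weights minus a penalty for an EcoRI site, GAATTC) are all hypothetical.

```python
from itertools import product

def optimize_coding_sequence(protein, synonymous, quality, window=3):
    """Greedy sliding-window optimization: at each position, try every
    combination of synonymous codons inside the window, score the
    already-fixed prefix plus that combination, and fix only the first
    codon of the best-scoring candidate before shifting by one codon."""
    fixed = []
    for pos in range(len(protein)):
        residues = protein[pos:pos + window]
        best_combo, best_score = None, None
        for combo in product(*(synonymous[aa] for aa in residues)):
            candidate = "".join(fixed) + "".join(combo)
            score = quality(candidate)
            if best_score is None or score > best_score:
                best_combo, best_score = combo, score
        fixed.append(best_combo[0])
    return "".join(fixed)

# Hypothetical codon table and weights, for illustration only.
SYNONYMOUS = {"M": ["ATG"], "E": ["GAA", "GAG"], "F": ["TTT", "TTC"]}
WEIGHTS = {"ATG": 1.0, "GAA": 1.0, "GAG": 0.4, "TTT": 0.6, "TTC": 1.0}

def toy_quality(dna):
    """Sum of codon weights, with a large penalty per EcoRI site."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return sum(WEIGHTS[c] for c in codons) - 10.0 * dna.count("GAATTC")

result = optimize_coding_sequence("MEF", SYNONYMOUS, toy_quality)
```

In this toy case the individually preferred codons GAA and TTC would together create a GAATTC site, so the search settles on a combination that avoids it. The window size trades search cost (exponential in the window length) against how far ahead such conflicts can be anticipated.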

15.
Nagase T  Nishio S  Itoh T 《Plasmid》2008,59(1):36-44
Translation initiation of mRNA encoding the plasmid-specified initiator protein (Rep) required for initiation of the ColE2 plasmid DNA replication is fairly efficient in Escherichia coli despite the absence of a canonical Shine-Dalgarno (SD) sequence. Although a GA cluster sequence exists upstream of the initiation codon, its activity as the SD sequence has been shown to be very inefficient. Deletion analyses have shown that there are sequences important for the Rep translation in the regions upstream of the GA cluster sequence and downstream of the initiation codon. To further define regions important for translation of the Rep mRNA, a set of the ColE2 rep genes bearing single-base substitution mutations in the coding region near the initiation codon was generated and their translation activities examined. We showed that translation of the Rep mRNA was reduced by some of these mutations in a region ranging at least 70 nucleotides from the initiation codon in the coding region, indicating the presence of translation enhancer(s) outside the translation initiation region which is covered by the ribosome bound to the initiation codon. Some of them seem to be essential and specific for translation of the ColE2 Rep mRNA due to the absence of a canonical SD sequence.

16.
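A new measure of DNA sequence information   Total citations: 4 (self-citations: 3; citations by others: 1)
Based on information theory, a new method for measuring the information in DNA sequences is presented, yielding measures of information content at four levels: Ib, If(1), If(2), and If(3). These four measures can be used to quantify the information content of, respectively, the DNA base sequence, the codon sequence, the protein-coding sequence, and the functional-protein sequence. Results computed from two relatively short protein-coding DNA sequences in the mitochondrial genome of M. edulis, and from simulated DNA sequences composed of degenerate codon groups with different degrees of degeneracy, show that these information measures can indeed be used to reveal …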

17.
MOTIVATION: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and by their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA: http://bioinformatics.psb.ugent.be/.

18.
Recombinant protein translation in Escherichia coli may be limited by stable (i.e. low free energy) secondary structures in the mRNA translation initiation region. To circumvent this issue, we have set up a computer tool called 'ExEnSo' (Expression Enhancer Software) that generates a random library of 8192 sequences, calculates the free energy of secondary structures of each sequence in the -70/+96 region (base 1 is the translation initiation codon), and then selects the sequence having the highest free energy. The software uses this 'optimized' sequence to create a 5' primer that can be used in PCR experiments to amplify the coding sequence of interest prior to sub-cloning into a prokaryotic expression vector. In this article, we report how ExEnSo was set up and the results obtained with nine coding sequences with low expression levels in E. coli. The free energy of the -70/+96 region of all these coding sequences was increased compared to the non-optimized sequences. Moreover, the protein expression of eight out of nine of these coding sequences was increased in E. coli, indicating a good correlation between in silico and in vivo results. ExEnSo is available as a free online tool.

19.
A frequently used approach for detecting potential coding regions is to search for stop codons. In the standard genetic code 3 out of 64 trinucleotides are stop codons. Hence, in random or non-coding DNA one can expect every 21st trinucleotide to have the same sequence as a stop codon. In contrast, the open reading frames (ORFs) of most protein-coding genes are considerably longer. Thus, the stop codon frequency in coding sequences deviates from the background frequency of the corresponding trinucleotides. This has been utilized for gene prediction, in particular, in detecting protein-coding ORFs. Traditional methods based on stop codon frequency rest on the assumption that the GC content is about 50%. However, many genomes show significant deviations from that value. With the presented method we can describe the effects of GC content on the selection of appropriate length thresholds of potentially coding ORFs. Conversely, for a given length threshold, we can calculate the probability of observing it in a random sequence. Thus, we can derive the maximum GC content for which ORF length is practicable as a feature for gene prediction methods and the resulting false positive rates. A rough estimate for an upper limit is a GC content of 80%. This estimate can be made more precise by including further parameters and by taking into account start codons as well. We demonstrate the feasibility of this method by applying it to the genomes of the bacteria Rickettsia prowazekii, Escherichia coli and Caulobacter crescentus, exemplifying the effect of GC content variations according to our predictions. We have adapted the method for predicting coding ORFs by stop codon frequency to the case of GC contents different from 50%. Usually, several methods for gene finding need to be combined. Thus, our results concern a specific part within a package of methods. Interestingly, for genomes with low GC content such as that of R. prowazekii, the presented method provides remarkably good results even when applied alone.
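The dependence of stop codon frequency on GC content can be worked through with a simple background model. The sketch below assumes independent bases with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, which is a simplification and not necessarily the model used in the paper; the function names and the default false positive rate `alpha` are likewise assumptions.

```python
import math

def stop_codon_probability(gc):
    """Probability that a random trinucleotide is TAA, TAG, or TGA
    under the assumed independent-base model with GC content `gc`."""
    at = (1.0 - gc) / 2.0   # P(A) = P(T)
    g = gc / 2.0            # P(G) = P(C)
    # TAA: at*at*at, TAG: at*at*g, TGA: at*g*at
    return at * at * (at + 2.0 * g)

def min_orf_codons(gc, alpha=0.01):
    """Smallest ORF length (in codons) such that a stop-free run of that
    length occurs by chance with probability at most `alpha`."""
    p = stop_codon_probability(gc)
    return math.ceil(math.log(alpha) / math.log(1.0 - p))

p_half = stop_codon_probability(0.5)       # equals 3/64 at 50% GC
threshold_low_gc = min_orf_codons(0.3)
threshold_high_gc = min_orf_codons(0.7)
```

At 50% GC the model reproduces the 3-in-64 figure quoted in the abstract. As GC content rises, stop codons (which are AT-rich) become rarer in random sequence, so the length threshold needed for a given false positive rate grows, which is why ORF length loses discriminating power in GC-rich genomes.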

20.
The DNA sequence organization of the protein-encoding region of the gene for silk fibroin has been analyzed. The accompanying paper (Manning, R. F., and Gage, L. P. (1980) J. Biol. Chem. 255, 9451-9457) shows that the total length of the gene and its protein, as well as the pattern of restriction sites in the gene, is highly polymorphic among inbred stocks of Bombyx mori. In this paper, those features of fibroin gene structure which are invariant among these alleles are presented. Fibroin is composed primarily of relatively short "crystalline" and "amorphous" peptides of known sequence whose arrangement in the protein is unknown. Knowledge of the codons most commonly used in fibroin mRNA allowed utilization of particular restriction enzymes as a means for determining the nature and organization of crystalline and amorphous coding sequences in the fibroin gene. Three restriction endonucleases were identified that cleave sequences coding for amorphous region peptides. Their cleavage pattern revealed that the repetitive coding sequence of the gene core (approximately 15 kilobases) is divided into at least 10 large crystalline coding domains interrupted by smaller amorphous coding domains. Many restriction endonucleases do not cleave the fibroin core at all, three of them with four-base recognition sequences. Specific deductions as to codon usage and repetitive sequence homogeneity in the gene follow from these results. One novel finding is the rigorous exclusion of the glycine codon GGA prior to serine codons even though this glycine codon is used frequently prior to alanine codons. The sequence homogeneity and the regularly alternating arrangement of crystalline and amorphous coding sequences of the gene are discussed in terms of the function of fibroin protein and the evolution of highly repetitive DNA.
