Similar literature
20 similar articles found
1.
Seligmann H 《Bio Systems》2011,105(3):271-285
Genomic amino acid usages coevolve with cloverleaf formation capacities of corresponding primate mitochondrial tRNAs, also for antisense tRNAs, suggesting translational function for sense and antisense tRNAs. Some antisense tRNAs are antitermination tRNAs (anticodons match stops (UAR: UAA, UAG; AGR: AGA, AGG)). Genomes possessing antitermination tRNAs avoid corresponding stops in frames 0 and +1, preventing translational antitermination. In frame +2, AGR stop frequencies and corresponding antisense antitermination tRNAs coevolve positively. This suggests expression of frameshifted overlapping genes, potentially shortening genomes and increasing metabolic efficiency. In BLAST analyses, hypothetical proteins translated from one +1-frameshifted and seven +2-frameshifted human mitochondrial protein-coding genes align with eleven GenBank sequences (31% of the mitochondrial coding regions). These putative overlap genes contain few UARs, and their AGRs align with arginine. Overlap gene numbers increase in the presence of, and with time since the evolution of, antitermination tRNA AGR in 57 primate mitochondrial genomes. Numbers of putative proteins translated from antisense protein-coding sequences and detected by BLAST also coevolve positively with antitermination tRNAs; expression of two of these ‘antisense’ mRNAs increases under low resource availability. Although more direct evidence is still lacking for the existence of proteins translated from overlapping mitochondrial genes and for antisense tRNA activity, coevolution between predicted overlap genes and the antitermination tRNAs required to translate them suggests expression of overlapping genes by an overlapping genetic code. Functions of overlapping genes remain unknown, perhaps originating from dual lifestyles of ancestral free living-parasitic mitochondria. Their amino acid composition suggests expression under anaerobic conditions.

2.
3.
Polyoma virus. The early region and its T-antigens.   Total citations: 12 (self-citations: 2; citations by others: 10)
The DNA sequence of the early coding region of polyoma virus is presented. It consists of 2739 nucleotides. The sequence predicts that more than one reading frame can be used to code for the three known polyoma virus early proteins (designated small, middle and large T-antigens). From the DNA sequence, the 'splicing' signals used in the processing of viral RNA to functional messenger RNAs can be predicted, as well as the sizes and sequences of the three proteins. Other unusual aspects of the DNA sequence are noted. Comparisons are made between the DNA sequences and the predicted amino acid sequences of the respective large T-antigens of polyoma virus and the related virus Simian Virus (SV) 40.

4.
If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign two bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than fivefold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: (i) predict the next amino acid based on inexact polypeptide matches, and (ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
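
The baseline mentioned in this abstract, the entropy obtained from the frequency statistics of individual nucleotides, can be illustrated with a minimal sketch. The function name `order0_entropy` is hypothetical, and this is only the zeroth-order estimate, not the authors' inexact-match model.

```python
import math
from collections import Counter


def order0_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol, using single-symbol frequencies only."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    # Sum of -p * log2(p) over the symbols that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # A uniform composition reaches the two-bit maximum for a four-letter alphabet.
    print(order0_entropy("ACGT" * 10))
    # A skewed composition gives a lower estimate.
    print(order0_entropy("AAAAAACG"))
```

Any model that also exploits context, such as repeats or near matches, can only lower this figure, which is why the abstract treats it as the reference point.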

5.
The frequencies of occurrence of mono- and dinucleotides in sequenced E. coli DNA fragments were analysed. DNA sequences with a total length of 135 000 nucleotides were considered. It was demonstrated that DNA fragments with different functional properties also have different neighbouring-nucleotide correlation parameters. Moreover, a periodic positional dependence of the correlation parameters was found in coding regions. The evolutionary significance of these observations is discussed, as is the possibility of using them in a dedicated model of nucleotide sequences, which is needed to develop computer algorithms for recognizing genomic functional units.
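
One common way to quantify the correlation between neighbouring nucleotides is the ratio of an observed dinucleotide frequency to the product of its mononucleotide frequencies. The sketch below assumes that measure; the abstract does not state which correlation parameter was used, and the name `neighbour_correlation` is hypothetical.

```python
from collections import Counter


def neighbour_correlation(seq: str) -> dict:
    """Map each observed dinucleotide xy to f(xy) / (f(x) * f(y)).

    Values above 1 mean the pair occurs more often than expected if
    neighbouring positions were independent; values below 1 mean less often.
    """
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    return {
        pair: (count / n_di) / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }


if __name__ == "__main__":
    for pair, ratio in sorted(neighbour_correlation("ATGGCGCGTTAACGCG").items()):
        print(pair, round(ratio, 3))
```

Computing the same ratios separately for each codon position would be one way to look for the positional periodicity described for coding regions.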

6.
Xiao M  Zhu ZZ  Liu J  Zhang CY 《Acta biotheoretica》2002,50(3):155-165
We have refined entropy theory to explore the meaning of the increasing sequence data on nucleic acids and proteins more conveniently. The concept of selection constraint was not introduced; only the analyzed sequences themselves were considered. The refined theory serves as a basis for deriving a method to analyze non-coding regions (NCRs) as well as coding regions. Positions with maximal entropy might play the most important role in genome functions as opposed to positions with minimal entropy. This method was tested in the well-characterized coding regions of 12 strains of Classical Swine Fever Virus (CSFV) and non-coding regions of 20 strains of CSFV. It is suitable for analyzing nucleic acid sequences of a complete genome and for detecting sensitive positions for mutagenesis. As such, the method serves to formulate the basis for elucidating the functional mechanism.
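
A per-position entropy over a set of aligned sequences can be sketched as follows. This is the plain Shannon column entropy, offered as an assumption about the general idea and not as the refined theory of the paper; `column_entropies` is a hypothetical name.

```python
import math
from collections import Counter


def column_entropies(aligned):
    """Return the Shannon entropy (bits) of each column of equal-length sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    entropies = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column)
        entropies.append(
            -sum((n / total) * math.log2(n / total) for n in counts.values())
        )
    return entropies


if __name__ == "__main__":
    strains = ["ACGT", "ACGA", "ACCT", "ATGT"]
    # Fully conserved columns score 0; variable columns score higher.
    print(column_entropies(strains))
```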

7.
Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.

8.
9.
Properties of mRNA leading regions that modulate protein synthesis are little known (besides effects of their secondary structure). Here I explore how coding properties of leading regions may account for their disparate efficiencies. Trinucleotides that form off-frame stop codons decrease costs of ribosomal slippages during protein synthesis: protein activity (as a proxy of gene expression, and as measured in experiments using artificial variants of 5' leading sequences of beta galactosidase in Escherichia coli) increases proportionally to the number of stop motifs in any frame in the 5' leading region. This suggests that stop codons in the 5' leading region, upstream of the recognized coding sequence, terminate eventual translations that sometimes start before ribosomes reach the mRNA's recognized start codon, increasing efficiency. This hypothesis is confirmed by further analyses: mRNAs with 5' leading regions containing in the same frame a start preceding a stop codon (in any frame) produce less enzymatic activity than those with the stop preceding the start. Hence coding properties, in addition to other properties, such as the secondary structure of the 5' leading region, regulate translation. This experimentally (a) confirms that within coding regions, off-frame stops increase protein synthesis efficiency by early stopping of frameshifted translation; (b) suggests that this occurs for all frames also in 5' leading regions; and (c) suggests that several alternative start codons that function at different probabilities should routinely be considered for all genes in the region of the recognized initiation codon. An unknown number of short peptides might be translated from coding and non-coding regions of RNAs.
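
Counting stop motifs per reading frame in a leading region can be sketched in a few lines. The sketch assumes the three stop codons of the standard genetic code and numbers frames relative to the first base of the supplied sequence; the name `stop_counts_by_frame` is hypothetical.

```python
# Stop codons of the standard genetic code, written as DNA.
STOPS = {"TAA", "TAG", "TGA"}


def stop_counts_by_frame(seq: str) -> list:
    """Count stop trinucleotides in frames 0, 1 and 2 of the given sequence."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    counts = [0, 0, 0]
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOPS:
            counts[i % 3] += 1
    return counts


if __name__ == "__main__":
    leader = "ATGTAAGTGA"
    # One stop in frame 0 (TAA at offset 3) and one in frame 1 (TGA at offset 7).
    print(stop_counts_by_frame(leader))
```

Summing the three counts gives the "number of stop motifs in any frame" that the abstract relates to protein activity.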

10.
Two independent methods are used to evaluate the protein-coding information content in different classes of DNA sequences. The first method evaluates the statistical relevance of finding unidentified reading frames, longer than 100 codons, on both DNA strands of: a) 117 DNA sequences that code for 142 nuclear proteins; b) 39 stable RNA coding sequences; and c) 36 other DNA sequences, which include regulatory sequences and sequences of as yet unknown function. The finding of 50 reading frames longer than 100 codons (complementary inverted proteins or c.i.p. genes) located on the DNA strand complementary to the protein-coding one is drastically in excess of the number predicted by chance alone. An independent method (testcode) applied to c.i.p. gene sequences, which assigns the probability of coding to a given sequence, predicts that more than 50% of these genes are translated into a functional product. These analyses indicate the existence of a new class of protein-coding genes, located on the DNA sequences complementary to the protein-coding DNA strand.
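
Scanning both strands for long reading frames can be sketched as below. The sketch assumes that a "reading frame" means a run of consecutive codons without a standard stop codon, with no start codon required; the abstract does not specify this, and the names `reverse_complement` and `open_frames` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def open_frames(seq: str, min_codons: int = 100) -> list:
    """List (strand, frame, start, codon_count) for stop-free runs of codons.

    Positions on the "-" strand are offsets into the reverse complement.
    """
    hits = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start, run_len = frame, 0
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if run_len >= min_codons:
                        hits.append((strand, frame, run_start, run_len))
                    run_start, run_len = i + 3, 0
                else:
                    run_len += 1
            if run_len >= min_codons:  # run that reaches the end of the sequence
                hits.append((strand, frame, run_start, run_len))
    return hits


if __name__ == "__main__":
    demo = "GCA" * 5 + "TAA" + "GCA" * 2
    for hit in open_frames(demo, min_codons=4):
        print(hit)
```

Judging whether the number of hits on the complementary strand exceeds chance would additionally need a null model, for example the same scan applied to shuffled sequences.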

11.
The Shannon information entropy of protein sequences.   Total citations: 6 (self-citations: 1; citations by others: 5)
A comprehensive database is analyzed to determine the Shannon information content of a protein sequence. This information entropy is estimated by three methods: a k-tuplet analysis, a generalized Zipf analysis, and a "Chou-Fasman gambler." The k-tuplet analysis is a "letter" analysis, based on conditional sequence probabilities. The generalized Zipf analysis demonstrates the statistical linguistic qualities of protein sequences and uses the "word" frequency to determine the Shannon entropy. The Zipf analysis and k-tuplet analysis give Shannon entropies of approximately 2.5 bits/amino acid. This entropy is much smaller than the value of 4.18 bits/amino acid obtained from the nonuniform composition of amino acids in proteins. The "Chou-Fasman" gambler is an algorithm based on the Chou-Fasman rules for protein structure. It uses both sequence and secondary structure information to guess at the number of possible amino acids that could appropriately substitute into a sequence. As in the case for the English language, the gambler algorithm gives significantly lower entropies than the k-tuplet analysis. Using these entropies, the number of most probable protein sequences can be calculated. The number of most probable protein sequences is much less than the number of possible sequences but is still much larger than the number of sequences thought to have existed throughout evolution. Implications of these results for mutagenesis experiments are discussed.
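
A "letter" analysis based on conditional probabilities is often estimated as a difference of block entropies, H(k) - H(k-1), where H(k) is the entropy of overlapping k-letter blocks. The sketch below assumes that textbook estimator, which may differ from the paper's k-tuplet procedure; `block_entropy` and `conditional_entropy` are hypothetical names.

```python
import math
from collections import Counter


def block_entropy(seq: str, k: int) -> float:
    """Shannon entropy (bits) of the overlapping k-letter blocks in seq."""
    if k == 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conditional_entropy(seq: str, k: int) -> float:
    """Estimated entropy of one letter given the k-1 letters before it."""
    return block_entropy(seq, k) - block_entropy(seq, k - 1)


if __name__ == "__main__":
    # Illustrative string only, not real protein data.
    peptide = "MKVLAAGIVGLLLAQMKVLAAGIVGLLLAQ"
    for k in (1, 2, 3):
        print(k, round(conditional_entropy(peptide, k), 3))
```

For k = 1 the estimate reduces to the composition-only entropy; larger k needs far more data than a single short sequence, since the number of possible blocks grows as 20 to the power k.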

12.
We perform spectral entropy and GC content analyses in the beta-esterase gene cluster, including the Est-6 gene and the psiEst-6 putative pseudogene, in seven species of the Drosophila melanogaster species subgroup. psiEst-6 combines features of functional and nonfunctional genes. The spectral entropies show distinctly lower structural ordering for psiEst-6 than for Est-6 in all species studied. Our observations agree with previous results for D. melanogaster and provide additional support to our hypothesis that after the duplication event Est-6 retained the esterase-coding function and its role during copulation, while psiEst-6 lost that function but now operates in conjunction with Est-6 as an intergene. Entropy accumulation is not a completely random process for either gene. Structural entropy is nucleotide dependent. The relative normalized deviations for structural entropy are higher for G than for C nucleotides. The entropy values are similar for Est-6 and psiEst-6 in the case of A and T but are lower for Est-6 in the case of G and C. The GC content in synonymous positions is uniformly higher in Est-6 than in psiEst-6, which agrees with the reduced GC content generally observed in pseudogenes and nonfunctional sequences. The observed differences in entropy and GC content reflect an evolutionary shift associated with the process of pseudogenization and subsequent functional divergence of psiEst-6 and Est-6 after the duplication event.
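
GC content at synonymous positions is frequently approximated by the GC fraction at third codon positions (GC3). The sketch below uses that approximation as an assumption, since not every third-position change is synonymous, and the name `gc3` is hypothetical.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C at third codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    third = cds[2:usable:3]
    if not third:
        raise ValueError("need at least one complete codon")
    return sum(base in "GC" for base in third) / len(third)


if __name__ == "__main__":
    # Codons ATG GCC AAA TTT have third bases G, C, A, T.
    print(gc3("ATGGCCAAATTT"))
```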

13.
The nucleotide sequence of recombinant plasmids representing a full-size cDNA of cow alpha s1-casein was investigated. The corresponding mRNA consists of 1133 nucleotides, excluding poly(A), and includes 642 nucleotides of the coding region, 63 nucleotides of the 5'-noncoding region and 428 nucleotides of the 3'-noncoding region. A comparative analysis of the nucleotide sequences of cow alpha s1-casein and guinea pig B-casein showed that the homology in the 5'-nontranslatable region is 90.5%, that of the precasein signal peptide is 82.22%, and that of the major polypeptide in the coding region is 64%, without taking gaps into account. The homology is higher in the 3'-noncoding region than in the coding region, reaching 72%. The data obtained testify to the high degree of conservation of sequences in casein mRNA noncoding regions, as well as to a functional and regulatory role of these sequences in the expression of casein genes.

14.
Late SV40 16S and 19S mRNAs were found to contain an average of three m6A residues per mRNA molecule. The methylated residues of both the viral and cellular mRNAs occur in two sequences: Gpm6ApC and (Ap)nm6ApC, where n = 1-4. More than 60% of the m6A residues in SV40 16S and 19S mRNAs occur in Gpm6ApC even though there are twice as many (A)nAC as GAC sequences in these messengers. The m6A-containing oligonucleotides of late SV40 mRNAs were localized in the viral messengers. In the 16S mRNA two m6A oligonucleotides were located at the 5' coding region between 0.95--0.0 map units. The third m6A residue was mapped between 0.0--0.14 map units in the translated portion of this mRNA. The overall pattern of internal methylation in the 19S mRNA is similar. However, some differences between 16S and 19S mRNAs were observed in both the content and location of the longer (Ap)n m6AC nucleotides. These results provide the first example of precise localization of internal methylation sequences in mRNA species with defined coding specificity. They imply that a) the location of m6A residues is not random but specific to a particular region of the RNA, b) apart from sequence specificity, other structural features of the mRNA may influence internal methylation and c) m6A residues are present in coding regions of SV40 mRNAs.

15.
16.
A cloned histone gene cluster of the highly reiterated type from the sea urchin Psammechinus miliaris was analyzed by DNA sequencing. More than half of the 6 kb repeat was sequenced, including coding regions of all five histones, some prelude and trailing sequences lying adjacent to the structural genes, and segments of the AT-rich spacer DNA. The gene cluster does not code for gonad-specific histone variants but may instead be active in early sea urchin development, as indicated by comparison to reference histones. The encoded histones seem not to be derived from longer precursor proteins, nor is there any evidence for insert sequences within the coding regions. Sequence similarities exist among the putative ribosome-binding sites adjacent to the initiator codons of individual genes. The AT-rich spacer segments between the genes differ from each other, are made up from relatively simple nucleotide arrangements, but are not repetitious, and apparently do not code for additional large proteins.

17.
The 18S defective interfering RNA of Semliki Forest virus has been reverse transcribed to cDNA, which was shown to be heterogeneous by restriction enzyme analysis. After transformation into E. coli, using pBR322 as a vector, two clones, pKTH301 and pKTH309, with inserts of 1.7 kb and 2 kb respectively, were characterized. The restriction maps of the two clones were different but suggested that both contained repeating units. At the 3' terminus, pKTH301 had preserved 106 nucleotides and pKTH309 102 nucleotides from the 3' end of the viral 42S genome. The conserved 3' terminal sequence was joined to a different sequence in the two clones, and these sequences were not derived from the region coding for the viral structural proteins. The DI RNAs represented by the two clones are generated from the viral 42S RNA by several noncontinuous internal deletions, since the largest colinear regions with 42S RNA are 320 nucleotides in pKTH301, and 430 and 340 nucleotides in pKTH309. All these fragments had unique RNase T1 oligonucleotide fingerprints, suggesting that they were derived from different regions of 42S RNA.

18.
In viruses an increased coding ability is provided by overlapping genes, in which two alternative open reading frames (ORFs) may be translated to yield two distinct proteins. The identification of signature sequences in overlapping genes is a topic of particular interest, since additional out-of-frame coding regions can be nested within known genes. In this work, a novel feature peculiar to overlapping coding regions is presented. It was detected by analysis of a sample set of 21 virus genomic sequences and consisted of the repeated occurrence of a cluster of basic amino acid residues, encoded by one frame, combined with a stretch of acidic residues, encoded by the corresponding overlapping frame. A computer scan of an additional set of virus sequences demonstrated that this feature is common to several other known overlapping ORFs and led to the prediction of a novel overlapping gene in hepatitis G virus (HGV). The occurrence of a bifunctional coding region in HGV was also supported by its much lower rate of synonymous nucleotide substitutions compared to that observed in the other gene regions of the HGV genome. Analysis of the amino acid sequence deduced from the putative overlapping gene revealed a high content of basic residues and the presence of a nuclear targeting signal; these characteristics suggest that a core-like protein may be expressed by this novel ORF. Received: 21 July 1999 / Accepted: 26 October 1999

19.
A large protein sequence database with over 31,000 sequences and 10 million residues has been analysed. The pair probabilities have been converted to entropies using Boltzmann’s law of statistical thermodynamics. A scoring weight corresponding to the “mixing entropy” of the amino acid pairs has been developed, from which the entropies of the protein sequences have been calculated. The entropy values of natural sequences are lower than those of their random counterparts of the same length and similar amino acid composition. Based on the results, it has been proposed that natural sequences are a special set of polypeptides with the additional qualification of biological functionality, which can be quantified using the entropy concept as worked out in this paper.

20.
Cloned DNAs encoding four different proteins have been isolated from recombinant cDNA libraries constructed with Glycine max seed mRNAs. Two cloned DNAs code for the alpha and alpha'-subunits of the 7S seed storage protein (conglycinin). The other cloned cDNAs code for proteins which are synthesized in vitro as 68,000 d., 60,000 d. or 53,000 d. polypeptides. Hybrid selection experiments indicate that, under low stringency hybridization conditions, all four cDNAs hybridize with mRNAs for the alpha and alpha'-subunits and the 68,000 d., 60,000 d. and 53,000 d. in vitro translation products. Within three of the mRNAs, there is a conserved sequence of 155 nucleotides which is responsible for this hybridization. The conserved nucleotides in the alpha and alpha'-subunit cDNAs and the 68,000 d. polypeptide cDNAs span both coding and noncoding sequences. The differences in the coding nucleotides outside the conserved region are extensive. This suggests that selective pressure to maintain the 155 conserved nucleotides has been influenced by the structure of the seed mRNA. RNA blot hybridizations demonstrate that mRNA encoding the other major subunit (beta) of the 7S seed storage protein also shares sequence homology with the conserved 155 nucleotide sequence of the alpha and alpha'-subunit mRNAs, but not with other coding sequences.
