Similar articles
 Found 20 similar articles (search time: 46 ms)
1.
With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) from genomic sequences. Despite considerable progress in the computational identification of protein coding regions during the last two decades, the performance and efficiency of the prediction methods still need to be improved. In addition, it is indispensable to develop different prediction methods, since combining different methods may greatly improve the prediction accuracy. A new method to predict protein coding regions is developed in this paper based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not have this unique feature. The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences. Exon and intron sequences can be identified from trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from different organisms show that this method is an effective approach for exon prediction.
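
The ratio of 3-base periodicity to background noise mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the authors' exact definitions, so the choices here are assumptions: the periodicity is taken as the summed spectral power of the four base indicator sequences at frequency 1/3, and the background noise as the mean power over all non-zero frequencies. The function name `period3_snr` is hypothetical.

```python
import numpy as np

def period3_snr(segment):
    """Ratio of spectral power at frequency 1/3 to the mean power over
    all non-zero frequencies (assumed definition, not the paper's)."""
    n = len(segment)
    if n == 0 or n % 3 != 0:
        raise ValueError("segment length must be a positive multiple of 3")
    spectrum = np.zeros(n)
    for base in "ACGT":
        # Binary indicator sequence: 1 where the base occurs, 0 elsewhere.
        indicator = np.array([ch == base for ch in segment], dtype=float)
        spectrum += np.abs(np.fft.fft(indicator)) ** 2
    signal = spectrum[n // 3]    # bin n/3 corresponds to frequency 1/3
    noise = spectrum[1:].mean()  # skip the DC bin (k = 0)
    return signal / noise
```

Under these assumptions, a segment with strong codon-position bias gives a large ratio, while a segment without period-3 structure gives a ratio near or below 1. Tracking the ratio over successive segments is one plausible way to look for the trends the abstract refers to.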

2.
Localizing triplet periodicity in DNA and cDNA sequences   Total citations: 1, self-citations: 0, citations by others: 1

Background  

The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.
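
The windowed Fourier approach described in the background above (not the higher-resolution technique the authors report) can be sketched as follows. The window length, step size, and function names are illustrative assumptions.

```python
import numpy as np

def period3_power(segment):
    """Spectral power at frequency 1/3, summed over the four base
    indicator sequences of the segment."""
    n = len(segment)
    phase = np.exp(-2j * np.pi * np.arange(n) / 3)
    total = 0.0
    for base in "ACGT":
        indicator = np.array([ch == base for ch in segment], dtype=float)
        total += abs(np.dot(indicator, phase)) ** 2
    return total

def sliding_period3(seq, window=90, step=3):
    """Return (start, power) pairs for each window along the sequence."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

The sketch also makes the localization problem visible: a window that straddles a boundary mixes periodic and non-periodic sequence, so the power changes gradually over roughly one window length rather than abruptly at the boundary.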

3.
Sorimachi K  Okayasu T 《Amino acids》2008,34(4):661-668
When nucleotide (G, C, T and A) contents were plotted against each nucleotide, their relationships were clearly expressed by a linear formula, y = αx + β, in the coding and non-coding regions. This linear relationship was obtained from the complete single-stranded DNA. Similarly, nucleotide contents at all three codon positions were expressed by linear regression lines based on the content of each nucleotide. In addition, 64 codon usages were also expressed by linear formulas against nucleotide content. Thus, the nucleotide content not only in coding sequence but also in non-coding sequence can be expressed by a linear formula, y = αx + β, in 145 organisms (112 bacteria, 15 archaea and 18 eukaryotes). Based on these results, from the ratio C/T, G/T, C/A or G/A one can essentially estimate all four nucleotide contents in the complete single-stranded DNA, and the determination of any ratio of two kinds of nucleotides can essentially estimate the four nucleotide contents, the nucleotide contents at the three different codon positions and the codon distributions at 64 codons in the coding region. The maximum and minimum values of G content were ∼0.35 and ∼0.15, respectively, among various organisms examined. Codon evolution occurs according to linear formulas between these two values.

4.
Complementary DNA sequence data of ΦX174, fd, f1, G4, M13, MS2, λ and T7 phages of Escherichia coli are analysed at mono-, di-, tri- and tetranucleotide levels. Our analysis shows that (i) mononucleotides have certain preferences to occur at specific positions X1, X2, X3 of the codon; (ii) these nucleotides interact nonlinearly to form a dinucleotide, and this dinucleotide also interacts nonlinearly with a third nucleotide to form a codon; (iii) however, nonlinear interactions are negligible at the tetranucleotide level, suggesting that coding regions of complementary DNA are Markov chains of order two. Trinucleotide potential values in three frames have suggested that at least thirteen different trinucleotides can be used as markers to locate coding regions in DNA of prokaryotes; (iv) parallel paired codons are expressed in such a way that one of the codons in the pair is expressed with high frequency while the other is expressed with low frequency. On the other hand, the complementary codon pairs are expressed with a small frequency difference; (v) in the synonymous codon groups, codons ending with T are found to be expressed with higher frequency.

5.
J L Weber 《Gene》1987,52(1):103-109
The genome of the human malaria parasite Plasmodium falciparum has an A + T content of about 82%, higher than any other organism whose DNA has been characterized. Computer analysis of 36 kb of available nucleotide sequences from this species showed that the coding regions, with an A + T content of 69.0%, are flanked by more A + T-rich regions of 86.0% A + T. Within the coding sequences, the A/T ratio was 1.68 in the mRNA sense strand, and overall A + T content in the three codon positions increased in the order 1st-2nd-3rd position. Codons with T or especially A in the third position were strongly preferred. Codon usage among individual parasite genes was very similar compared to genes from other species. Dinucleotide frequencies for the parasite DNA were close to those expected for a random sequence with the known base composition, except that the CpG frequency in the coding sequences was low.

6.
7.
Sajid Marhon 《Bio Systems》2010,101(3):185-676
In a previous paper (Yin and Yau, 2005), a novel method was proposed to measure the power spectrum of a DNA sequence at frequency N/3 in order to distinguish protein-coding and non-coding regions in DNA sequences. This was accomplished by computing the distribution of the four nucleotides in the three reading frames (codon positions) and identifying variance as an indicator of 3-base periodicity. That work included an empirical justification for the claim that there exists a linear, 3:2 correlation between the variance and the power spectrum. In this note, we provide a theoretical justification for that observation in the form of a mathematical proof of this correlation. This work thus provides a more rigorous justification for the use of the variance instead of the more computationally expensive power spectrum, allowing users of this technique to apply it with absolute confidence that no compromise in accuracy is incurred.
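
The relation between the power spectrum and the codon-position counts can be checked numerically. The sketch below assumes that "variance" means the sum of squared deviations of each base's three codon-position counts from their mean; the paper's exact normalization is not given in the abstract. With that convention, the power at frequency 1/3 equals 3/2 times the summed squared deviations, which follows from expanding |f0 + f1·ω + f2·ω²|² with ω = exp(−2πi/3). All function names are hypothetical.

```python
import numpy as np

def position_counts(seq):
    """Count each base at the three positions i mod 3."""
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, ch in enumerate(seq):
        if ch in counts:
            counts[ch][i % 3] += 1
    return counts

def power_at_one_third(seq):
    """Direct Fourier power at frequency 1/3 over the four indicators."""
    phase = np.exp(-2j * np.pi * np.arange(len(seq)) / 3)
    total = 0.0
    for base in "ACGT":
        indicator = np.array([ch == base for ch in seq], dtype=float)
        total += abs(np.dot(indicator, phase)) ** 2
    return total

def squared_deviation_sum(seq):
    """Sum over bases of squared deviations of position counts from their mean."""
    total = 0.0
    for counts in position_counts(seq).values():
        mean = sum(counts) / 3
        total += sum((c - mean) ** 2 for c in counts)
    return total
```

The count-based quantity needs only one pass over the sequence and no complex arithmetic, which is the practical advantage the abstract points to.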

8.
De novo origin of coding sequence remains an obscure issue in molecular evolution. One of the possible paths for addition (subtraction) of DNA segments to (from) a gene is stop codon shift. Single nucleotide substitutions can destroy the existing stop codon, leading to uninterrupted translation up to the next stop codon in the gene’s reading frame, or create a premature stop codon via a nonsense mutation. Furthermore, frameshifts caused by short indels near a gene’s end may lead to premature stop codons or to translation past the existing stop codon. Here, we describe the evolution of the length of coding sequence of prokaryotic genes by change of positions of stop codons. We observed cases of addition of regions of 3′UTR to genes due to mutations at the existing stop codon, and cases of subtraction of C-terminal coding segments due to nonsense mutations upstream of the stop codon. Many of the observed stop codon shifts cannot be attributed to sequencing errors or rare deleterious variants segregating within bacterial populations. The additions of regions of 3′UTR tend to occur in those genes in which they are facilitated by nearby downstream in-frame triplets which may serve as new stop codons. Conversely, subtractions of coding sequence often give rise to in-frame stop codons located nearby. The amino acid composition of the added region is significantly biased, compared to the overall amino acid composition of the genes. Our results show that in prokaryotes, shift of stop codon is an underappreciated contributor to functional evolution of gene length.

9.
The aim of this paper is to give measurements indicative of the evolutionary stages of species. Two types of statistics of trinucleotides in coding regions are analysed for 27 species. The first one is the codon space, the nucleotide ratio for each of the three codon positions. We apply principal component analysis to this space and extract two principal components faithfully describing the original distribution of the codon space. The first principal component corresponds to the GC content. The second principal component classifies the species into three evolutionary groups: Archaea, Bacteria and Eukaryota. The second statistic is the real and theoretical frequency of amino acids. The real frequency of an amino acid in a coding sequence is its frequency in the translated protein. The theoretical frequency is the expected frequency calculated from the ratio of nucleotides. We introduce the discrepancy between these two frequencies as an index of non-randomness of nucleotides in the sequence. This index of non-randomness divides the species into two groups: eukaryotes having smaller non-randomness (i.e. being more random) and prokaryotes having higher non-randomness.

10.
MOTIVATION: At the core of most protein gene-finding algorithms are the coding measures used to make a decision on coding/non-coding. Of the protein coding measures, the Fourier measure is one of the most important. However, due to the limited length of the windows usually used, the accuracy of the measure is not satisfactory. This paper is devoted to improving the accuracy by lengthening the sequence to amplify the periodicity of 3 in the coding regions. RESULTS: A new algorithm is presented called the lengthen-shuffle Fourier transform algorithm. For the same window length, the percentage accuracy of the new algorithm is 6-7% higher than that of the ordinary Fourier transform algorithm. The resulting percentage accuracy (average of specificity and sensitivity) of the new measure is 84.9% for the window length 162 bp. AVAILABILITY: The program is available on request from C.-T. Zhang. Contact: ctzhang@tju.edu.cn

11.
In Neurospora crassa, the expression of unlinked structural genes which encode nitrogen catabolic enzymes is subject to genetic and metabolic regulation. The negative-acting nmr regulatory gene appears to play a role in nitrogen catabolite repression. Using the N. crassa nmr gene as a probe, homologous sequences were identified in a variety of other filamentous fungi. The polymerase chain reaction was used to isolate the nmr-like gene from the exotic Mauriceville strain of N. crassa and from the two related species, N. intermedia and N. sitophila. Sequence comparisons were carried out with a 1.7-kb DNA segment which includes the entire coding region of nmr plus 5' and 3' noncoding sequences. The size of the nmr coding region was identical in all three Neurospora species. Approximately 30 nucleotide base substitutions were found in the coding region of the nmr gene of each of the sister species when compared to the standard N. crassa sequence. However, most of the base changes occurred in third codon positions and were silent. The NMR proteins of N. sitophila and of N. intermedia display only three and four amino acid substitutions, respectively, from the N. crassa protein. Two regions of high variability, which include deletions and insertions of bases, were found in the 5' and 3' noncoding regions of the gene.

12.
13.
The beta-globin gene cluster of human, gorilla and chimpanzee contains the same number and organization of beta-type globin genes: 5'-epsilon (embryonic)-G gamma and A gamma (fetal)-psi beta (inactive)-delta and beta (adult)-3'. We have isolated the psi beta-globin gene regions from the three species and determined their nucleotide sequences. These three pseudogenes each share the same substitutions in the initiator codon (ATG→GTA), a substitution in codon 15 which generates a termination signal (TGG→TGA), and a nucleotide deletion in codon 20 with the resulting frame shift, which yields many termination signals in exons 2 and 3. The basic structure of these psi beta-globin genes, however, remains consistent with that found for functional beta-globin genes: their coding regions are split by two introns, IVS 1 (which splits codon 30, 121 base-pairs in length) and IVS 2 (which splits codon 104, 840 to 844 base-pairs in length). These introns retain the normal splice junctions found in other eukaryotic split genes. The three hominoid psi beta-globin genes show a high degree of sequence correspondence, with the number of differences found among them being only about one-third of that predicted for DNA sites evolving at the neutral rate (i.e. for sites evolving in the absence of purifying selection). Thus, there appears to be a deceleration in the rate of evolution of the psi beta-globin locus in higher primates.

14.
15.
16.
The linear algebraic concept of subspace plays a significant role in recent techniques of spectrum estimation. In this article, the authors have utilized the noise subspace concept for finding hidden periodicities in DNA sequences. With the vast growth of genomic sequences, the demand to accurately identify the protein-coding regions in DNA is rising. Several DNA feature extraction techniques drawing on various fields have emerged in the recent past, among which the application of digital signal processing tools is of prime importance. It is known that coding segments have a 3-base periodicity, while non-coding regions do not have this unique feature. One of the most important spectrum analysis techniques based on the concept of subspace is the least-norm method. The least-norm estimator developed in this paper shows sharp period-3 peaks in coding regions, completely eliminating background noise. The proposed method is compared with the existing sliding discrete Fourier transform (SDFT) method, popularly known as the modified periodogram method, on several genes from various organisms, and the results show that the proposed method is a better and more effective approach to gene prediction. Resolution, quality factor, sensitivity, specificity, miss rate, and wrong rate are used to establish the superiority of the least-norm gene prediction method over the existing method.

17.
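Digital signal processing methods were used to study how different types of single-base mutations (substitution, deletion, and insertion) affect the period-3 power spectrum of DNA sequences. The results show that, for coding sequences of different lengths, substitutions have a relatively small effect on the power spectrum, whereas deletions and insertions have a larger effect; as the length of the coding region decreases, the effect of substitutions, deletions, and insertions on the power spectrum of the coding region grows. For exons of medium length, insertions have the greatest effect on the period-3 power spectrum; for short exons, deletions have the greatest effect. These results can serve as a reference for identifying and detecting the coding regions of genes that contain mutations.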
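
A toy calculation illustrates why single-base deletions and insertions disturb the period-3 power spectrum more than substitutions do: they shift the codon phase of every downstream base. This is a minimal sketch on an artificial repeat, not the data or procedure of the study above. The power at frequency 1/3 is computed from codon-position counts f0, f1, f2 via the identity |f0 + f1·ω + f2·ω²|² = f0² + f1² + f2² − f0·f1 − f1·f2 − f0·f2, and all function names are hypothetical.

```python
def period3_power(seq):
    """Power at frequency 1/3, computed from per-base counts at
    positions i mod 3 instead of an explicit Fourier transform."""
    total = 0.0
    for base in "ACGT":
        f = [0, 0, 0]
        for i, ch in enumerate(seq):
            if ch == base:
                f[i % 3] += 1
        total += (f[0] ** 2 + f[1] ** 2 + f[2] ** 2
                  - f[0] * f[1] - f[1] * f[2] - f[0] * f[2])
    return total

def substitute(seq, pos, base):
    """Replace the base at index pos."""
    return seq[:pos] + base + seq[pos + 1:]

def delete(seq, pos):
    """Remove the base at index pos, shifting everything downstream."""
    return seq[:pos] + seq[pos + 1:]

def insert(seq, pos, base):
    """Insert a base before index pos, shifting everything downstream."""
    return seq[:pos] + base + seq[pos:]

original = "ATG" * 30  # artificial, perfectly periodic sequence
powers = {
    "original": period3_power(original),
    "substitution": period3_power(substitute(original, 45, "C")),
    "deletion": period3_power(delete(original, 45)),
    "insertion": period3_power(insert(original, 45, "C")),
}
```

In this toy case the substitution barely changes the power, while the mid-sequence deletion and insertion each remove roughly three quarters of it, because the two halves of the sequence end up in different phases and partly cancel.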

18.
The complete nucleotide sequence of an active class I HLA gene, HLA-A3, has been determined. This sequence, together with that obtained for the HLA-CW3 gene, represents the first complete nucleotide sequence to be determined for functional class I HLA genes. The gene organisation of HLA-A3 closely resembles that of class I H-2 genes in mouse: it shows a signal exon, three exons encoding the three extracellular domains, one exon encoding the transmembrane region and three exons encoding the cytoplasmic domain. The complete nucleotide sequences of the active HLA genes, HLA-A3 and HLA-CW3, now permit a meaningful comparison of the nucleotide sequences of class I HLA genes by alignment with the sequence established for an HLA-B7-specific cDNA clone and the sequences of two HLA class I pseudogenes, HLA 12.4 and LN-11A. The comparisons show that there is a non-random pattern of nucleotide differences in both exonic and intronic regions featuring segmental homologies over short regions, which is indicative of a gene conversion mechanism. In addition, analysis of the frequency of nucleotide substitution at the three base positions within the codons of the functional genes HLA-A3, HLA-B7 and HLA-CW3 shows that the pattern of nucleotide substitution in the exon coding for the 3rd extracellular domain is consistent with strong selection pressure to conserve the sequence. The distribution of nucleotide variation in the other exons specifying the mature protein is nearly random with respect to the frequencies of substitution at the three nucleotide positions of their codons. The evolutionary implications of these findings are discussed.

19.
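A preliminary study of coding sequences lacking 3-base periodicity   Total citations: 4, self-citations: 0, citations by others: 4
Analysis of the Fourier spectra of 120 relatively short coding sequences (<1,200 bp) shows that 3-base periodicity is not always present in short coding sequences. Statistical analysis suggests that whether a coding sequence has 3-base periodicity is related to its base composition and distribution, to the choice and order of amino acids in the encoded protein, and to synonymous codon usage. In general, A+U content is higher than G+C content in non-period-3 sequences, whereas the opposite holds in period-3 sequences; bases are distributed more evenly across the three codon positions in non-period-3 sequences than in period-3 sequences; and codon and amino acid usage bias is weaker in non-period-3 sequences than in period-3 sequences. These phenomena should be taken into account when Fourier analysis is used to predict genes and exons in DNA sequences.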

20.
Shah K  Krishnamachari A 《Bio Systems》2012,107(3):142-144
Genomes of almost all organisms have been found to exhibit several periodicities, the most prominent being the three-base periodicity. It is more pronounced in the gene-coding regions and has been exploited to identify the segments of a genome that code for a protein. The reason for this three-base periodicity in the gene-coding region has been attributed to inhomogeneous nucleotide compositions in the three codon positions. However, this reason cannot explain the three-base periodicity present at the level of the whole genome, where the codon concept is not applicable. Even though the distribution of each nucleotide is uniform at the positions 0 (mod 3), 1 (mod 3) and 2 (mod 3) when the whole-genome data is considered, our analysis reveals that the three-base periodicity arises because of higher correlations among the nucleotides separated by three bases.
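
The idea of correlations among nucleotides separated by three bases can be illustrated with a simple identity-match statistic. This is an assumed, simplified measure and not necessarily the correlation statistic used by the authors: it compares the observed fraction of positions where the bases at i and i + lag are identical with the fraction expected if bases were drawn independently from the sequence's overall composition. The function name `match_excess` is hypothetical.

```python
from collections import Counter

def match_excess(seq, lag):
    """Observed minus expected fraction of identical bases at distance lag."""
    pairs = len(seq) - lag
    if lag <= 0 or pairs <= 0:
        raise ValueError("lag must be positive and shorter than the sequence")
    observed = sum(seq[i] == seq[i + lag] for i in range(pairs)) / pairs
    # Expected match rate under independence: sum of squared base frequencies.
    composition = Counter(seq)
    expected = sum((count / len(seq)) ** 2 for count in composition.values())
    return observed - expected
```

Evaluating the statistic for a range of lags and looking for peaks at multiples of three gives a rough picture of period-3 correlation that does not depend on knowing a reading frame.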


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号