Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
MOTIVATION: Accurate prediction of genes in genomes has always been a challenging task for bioinformaticians and computational biologists. The discovery of distinct scaling relations in coding and non-coding sequences has led to new perspectives in the understanding of DNA sequences. This has motivated us to exploit the differences in the local singularity distributions for characterization and classification of coding and non-coding sequences. RESULTS: The local singularity density distribution in the coding and non-coding sequences of four genomes was first estimated using the wavelet transform modulus maxima methodology. A support vector machine classifier was then trained with the extracted features. The trained classifier provides an average test accuracy of 97.7%. The local singularity features in a DNA sequence can be exploited for successful identification of coding and non-coding sequences. CONTACT: Available on request from bd.kulkarni@ncl.res.in.

2.
KaKs_Calculator 3.0 is an updated toolkit that is capable of calculating selective pressure on both coding and non-coding sequences. Similar to the nonsynonymous/synonymous substitution rate ratio for coding sequences, selection on non-coding sequences can be quantified as the ratio of the non-coding nucleotide substitution rate to the synonymous substitution rate of adjacent coding sequences. As tested on empirical data, KaKs_Calculator 3.0 shows effectiveness to detect the strength and mode of sele...

3.
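3-tuple distribution characteristics of coding and non-coding sequences   Total citations: 2, self-citations: 0, citations by others: 2
Fu Qiang, Qian Minping, Chen Liangbiao, Zhu Yuxian. Acta Genetica Sinica (《遗传学报》), 2005, 32(10): 1018-1026
The origin of non-coding sequences, especially introns, is an important unresolved question. By computing the frequency distribution of 3-tuples in the different reading frames of coding and non-coding sequences of model organisms, we first found that the different reading frames of coding regions have very different 3-tuple distributions, whereas in non-coding regions the 3-tuple distributions of the different reading frames are nearly identical, and that this property is not species-dependent. To describe the degree of difference between distributions, we introduced a variable, the symmetric relative entropy (SRE). Comparing prokaryotes and eukaryotes, we found that prokaryotes have higher SRE values than eukaryotes in both coding and non-coding regions. Further study showed that the SRE value of an organism correlates with the percentage of its whole genome occupied by coding regions (correlation coefficient 0.86). Computer-simulated evolution experiments showed that 2% mutation is enough to turn the high SRE value typical of prokaryotic coding regions into the low SRE value characteristic of eukaryotic intron regions. Alignment of annotated intron and coding-region sequences in databases confirmed that some intron sequences are indeed highly homologous to coding regions. The results indicate that at least some eukaryotic introns may have originated from coding sequences, and also suggest that SRE may be used to study the evolution of genome sequences.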
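
The frame-wise comparison described in this abstract can be sketched as follows. The abstract does not give the exact SRE formula, so this sketch assumes the common symmetrised Kullback-Leibler form, KL(p||q) + KL(q||p), averaged over the three pairs of reading frames. The function names and the pseudocount are illustrative choices, not the authors' implementation.

```python
from collections import Counter
from itertools import product
from math import log2

# All 64 possible 3-tuples over the DNA alphabet.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]

def frame_distribution(seq, frame, pseudocount=1.0):
    """Relative frequency of each 3-tuple read in one reading frame (0, 1 or 2).

    The pseudocount (an assumption of this sketch) keeps every probability
    non-zero so that the logarithms below are always defined.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    total = sum(counts[t] for t in TRIPLETS) + pseudocount * len(TRIPLETS)
    return {t: (counts[t] + pseudocount) / total for t in TRIPLETS}

def symmetric_relative_entropy(p, q):
    """Assumed definition: KL(p||q) + KL(q||p), in bits; zero when p equals q."""
    return sum((p[t] - q[t]) * log2(p[t] / q[t]) for t in TRIPLETS)

def mean_sre(seq):
    """Average the assumed SRE over the three pairs of reading frames."""
    dists = [frame_distribution(seq, f) for f in range(3)]
    pairs = [(0, 1), (0, 2), (1, 2)]
    return sum(symmetric_relative_entropy(dists[a], dists[b]) for a, b in pairs) / len(pairs)
```

Under this definition a sequence with strong frame-specific triplet usage, such as a repeated codon, gives a large value, while a sequence whose frames look alike gives a value near zero, which matches the coding versus non-coding contrast the abstract reports.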

4.
We study the length distribution functions for the 16 possible distinct dimeric tandem repeats in DNA sequences of diverse taxonomic partitions of GenBank (known human and mouse genomes, and complete genomes of Caenorhabditis elegans and yeast). For coding DNA, we find that all 16 distribution functions are exponential. For non-coding DNA, the distribution functions for most of the dimeric repeats have surprisingly long tails, which fit a power-law function. We hypothesize that: (i) the exponential distributions of dimeric repeats in protein-coding sequences indicate strong evolutionary pressure against tandem repeat expansion in coding DNA sequences; and (ii) long tails in the distributions of dimers in non-coding DNA may be a result of various mutational mechanisms. These long, non-exponential tails in the distribution of dimeric repeats in non-coding DNA are hypothesized to be due to the higher tolerance of non-coding DNA to mutations. By comparing genomes of various phylogenetic types of organisms, we find that the shapes of the distributions are not universal, but rather depend on the specific class of species and the type of dimer.
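
A length distribution of this kind can be tabulated with a short script. The sketch below is only an illustration: it counts maximal tandem runs of one dimer with a regular expression, and the function name and the left-to-right, non-overlapping matching rule are assumptions rather than the procedure used in the study.

```python
import re
from collections import Counter

def dimer_repeat_lengths(seq, dimer):
    """Histogram of tandem-repeat lengths (in repeat units) for one dimer.

    Runs are found left to right without overlap, so in "TTTTT" the dimer
    "TT" yields a single run of 2 units and the last base is left over.
    """
    dimer = dimer.upper()
    pattern = re.compile("(?:%s)+" % re.escape(dimer))
    return Counter(len(m.group(0)) // len(dimer) for m in pattern.finditer(seq.upper()))

# Hypothetical toy sequence: (CA)3, then (CA)1, then (CA)2.
histogram = dimer_repeat_lengths("CACACATTCAGGCACA", "CA")
```

Plotting such a histogram on semi-log and log-log axes is one simple way to see whether the tail looks exponential or power-law.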

5.
6.
We estimate DNA sequence error rates in GenBank records containing protein-coding and non-coding DNA sequences by comparing sequences of the inbred mouse strain C57BL/6J, sequenced as part of the mouse genome project and independently by other laboratories. C57BL/6J was produced by more than 100 generations of brother-sister mating, and can be assumed to be virtually free of residual polymorphism and mutational variation, so differences between independent sequences can be attributed to error. The estimated single nucleotide error rate for coding DNA is 0.10% (SE 0.012%), which is substantially lower than previous estimates for error rates in GenBank accessions. The estimated single nucleotide error rate for intronic DNA sequences (0.22%; SE 0.051%) is significantly higher than the rate for coding DNA. Since error rates for the mouse genome sequence are very low, the vast majority of the errors we detected are likely to be in individual GenBank accessions. The frequency of insertion-deletion (indel) errors in non-coding DNA approaches that of single nucleotide errors in non-coding DNA, whereas indel errors are uncommon in coding sequences.

7.
A statistical analysis of the occurrence of particular nucleotide runs (1-10 nucleotides long) in DNA sequences of different species has been carried out. There are considerable differences in run distributions in DNA sequences of prokaryotes, invertebrates and vertebrates. The distribution of various types of runs has been found to be different in coding and non-coding sequences. There is an abundance of short runs 1-2 nucleotides long in coding sequences, and there is a deficiency of such runs in the non-coding regions. However, some interesting exceptions to this rule exist: for the run distribution of adenine in prokaryotes and for the distribution of purine-pyrimidine runs in eukaryotes. This may be explained by the fact that the distributions of runs are predetermined by structural peculiarities of the entire DNA molecule. Runs of guanine or cytosine three to six nucleotides long occur predominantly in the non-coding DNA regions in eukaryotes, especially in vertebrates.
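
Run statistics like these are straightforward to tabulate. The following is a minimal sketch, not the authors' method: it counts maximal runs of identical symbols up to a chosen length, and the optional purine/pyrimidine recoding (R/Y) is an illustrative assumption about how the purine-pyrimidine runs might be defined.

```python
from collections import Counter
from itertools import groupby

# Recode bases as purine (R) or pyrimidine (Y); apply before counting runs.
PURINE_PYRIMIDINE = str.maketrans("AGCT", "RRYY")

def run_length_counts(seq, max_len=10):
    """Map each symbol to a Counter of {run length: number of maximal runs}.

    Runs longer than max_len are ignored, mirroring the 1-10 nucleotide
    window mentioned in the abstract.
    """
    counts = {}
    for symbol, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length <= max_len:
            counts.setdefault(symbol, Counter())[length] += 1
    return counts

# Hypothetical toy sequence, counted per base and per purine/pyrimidine class.
per_base = run_length_counts("AAACGGTA")
per_class = run_length_counts("AAACGGTA".translate(PURINE_PYRIMIDINE))
```

Comparing these counts between annotated coding and non-coding segments, normalised by segment length, would give the kind of excess or deficit of short runs the abstract describes.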

8.
Interpolated Markov chains for eukaryotic promoter recognition.   Total citations: 9, self-citations: 0, citations by others: 9
MOTIVATION: We describe a new content-based approach for the detection of promoter regions of eukaryotic protein-encoding genes. Our system is based on three interpolated Markov chains (IMCs) of different order which are trained on coding, non-coding and promoter sequences. It was recently shown that the interpolation of Markov chains leads to stable parameters and improves on the results in microbial gene finding (Salzberg et al., Nucleic Acids Res., 26, 544-548, 1998). Here, we present new methods for an automated estimation of optimal interpolation parameters and show how the IMCs can be applied to detect promoters in contiguous DNA sequences. Our interpolation approach can also be employed to obtain a reliable scoring function for human coding DNA regions, and the trained models can easily be incorporated in the general framework for gene recognition systems. RESULTS: A 5-fold cross-validation evaluation of our IMC approach on a representative sequence set yielded a mean correlation coefficient of 0.84 (promoter versus coding sequences) and 0.53 (promoter versus non-coding sequences). Applied to the task of eukaryotic promoter region identification in genomic DNA sequences, our classifier identifies 50% of the promoter regions in the sequences used in the most recent review and comparison by Fickett and Hatzigeorgiou (Genome Res., 7, 861-878, 1997), while having a false-positive rate of 1/849 bp.

9.
MOTIVATION: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and to their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA: http://bioinformatics.psb.ugent.be/.

10.
EUGENE'HOM is gene prediction software for eukaryotic organisms based on comparative analysis. EUGENE'HOM is able to take into account multiple homologous sequences from more or less closely related organisms. It integrates the results of TBLASTX analysis, splice site and start codon prediction and a robust coding/non-coding probabilistic model which allows EUGENE'HOM to handle sequences from a variety of organisms. The current target of EUGENE'HOM is plant sequences. The EUGENE'HOM web site is available at http://genopole.toulouse.inra.fr/bioinfo/eugene/EuGeneHom/cgi-bin/EuGeneHom.pl.

11.
12.
The melting of the coding and non-coding classes of natural DNA sequences was investigated using a program, MELTSIM, which simulates DNA melting based upon an empirically parameterized nearest neighbor thermodynamic model. We calculated T(m) results of 8144 natural sequences from 28 eukaryotic organisms of varying F(GC) (mole fraction of G and C) and of 3775 coding and 3297 non-coding sequences derived from those natural sequences. These data demonstrated that the T(m) vs. F(GC) relationships in coding and non-coding DNAs are both linear but have a statistically significant difference (6.6%) in their slopes. These relationships are significantly different from the T(m) vs. F(GC) relationship embodied in the classical Marmur-Schildkraut-Doty (MSD) equation for the intact long natural sequences. By analyzing the simulation results from various base shufflings of the original DNAs and the average nearest neighbor frequencies of those natural sequences across the F(GC) range, we showed that these differences in the T(m) vs. F(GC) relationships are largely a direct result of systematic F(GC)-dependent biases in nearest neighbor frequencies for those two different DNA classes. Those differences in the T(m) vs. F(GC) relationships and biases in nearest neighbor frequencies also appear between the sequences from multicellular and unicellular organisms in the same coding or non-coding classes, albeit with smaller but still significant magnitudes.

13.
The mature mRNA always carries nucleotide sequences that faithfully mirror the protein product according to the rules of the genetic code. However, in the chromosome, the nucleotide sequence that represents a certain protein is interrupted by additional sequences. Therefore, most eukaryotic genes are longer than their final mRNA products. The human genome project revealed that only a tiny portion of sequences serves as protein-coding region and almost one quarter of the genome is occupied by non-coding intervening sequences. The elimination of these non-coding regions from the precursor RNA in a process termed splicing must be extremely precise, because even a single nucleotide mistake may cause a fatal error. At present, two types of intervening sequences have been identified in protein-coding genes. One of them, the U2-dependent or major class, is prevalent and represents 99% of known sequences. The other one, the so-called U12-dependent or minor class of introns, is much less abundant in the genome. The basic problems of nuclear splicing concern (i) the molecular mechanisms which ensure that the coding regions are correctly recognized and spliced together; (ii) the principles and mechanisms that guarantee the high fidelity of the splicing system; and (iii) the differences in the excision mechanisms of the two classes of introns. We are going to present models explaining how intervening sequences are accurately removed and the coding regions correctly juxtaposed. The two splicing mechanisms will also be compared.

14.
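Characteristics and recognition of protein-coding and non-coding regions   Total citations: 2, self-citations: 0, citations by others: 2
In nucleotide sequence analysis, an important problem is how to identify the coding and non-coding regions in an unknown sequence. Several methods have been proposed in recent years, but none performs satisfactorily. Based on extensive statistical analysis of the features of coding and non-coding regions, this paper proposes a weighted distance discriminant method and evaluates its accuracy.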

15.
A fractal method to distinguish coding and non-coding sequences in a complete genome is proposed, based on different statistical behaviors between these two kinds of sequences. We first propose a number sequence representation of DNA sequences. Multifractal analysis is then performed on the measure representation of the obtained number sequence. The three exponents C(-1), C1 and C2 are selected from the result of multifractal analysis. Each DNA sequence may be represented by a point in the three-dimensional space generated by these three-component vectors. It is shown that points corresponding to coding and non-coding sequences in the complete genome of many prokaryotes are roughly distributed in different regions. Fisher's discriminant algorithm can be used to separate these two regions in the spanned space. If the point (C(-1), C1, C2) for a DNA sequence is situated in the region corresponding to coding sequences, the sequence is classified as a coding sequence; otherwise, the sequence is classified as a non-coding one. For all 51 prokaryotes we considered, the average discriminant accuracies pc, pnc, qc and qnc reach 72.28%, 84.65%, 72.53% and 84.18%, respectively.
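
The final classification step can be illustrated with a small pure-Python version of Fisher's linear discriminant. This is a sketch under stated assumptions: the multifractal exponents are taken as already computed, the threshold is placed at the midpoint between the projected class means (the abstract does not say how the boundary is set), and all function names are hypothetical.

```python
def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def scatter(points, mu):
    """Scatter matrix: sum of outer products of deviations from the class mean."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for p in points:
        diff = [p[i] - mu[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular; no check is made here.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_fisher(coding, noncoding):
    """Return (w, threshold) with w = Sw^-1 (mean_coding - mean_noncoding)."""
    mc, mn = mean(coding), mean(noncoding)
    sc, sn = scatter(coding, mc), scatter(noncoding, mn)
    d = len(mc)
    sw = [[sc[i][j] + sn[i][j] for j in range(d)] for i in range(d)]
    w = solve(sw, [mc[i] - mn[i] for i in range(d)])
    # Assumed decision boundary: midpoint of the two projected class means.
    threshold = sum(w[i] * (mc[i] + mn[i]) / 2 for i in range(d))
    return w, threshold

def is_coding(point, w, threshold):
    """Classify a (C(-1), C1, C2) point by its projection onto w."""
    return sum(wi * xi for wi, xi in zip(w, point)) > threshold
```

In use, `fit_fisher` would be given the exponent triples of annotated coding and non-coding sequences from one genome, and `is_coding` would then be applied to the triple of an unlabelled sequence.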

16.
17.
Identification of coding regions in DNA sequences remains challenging. Various methods have been proposed, but these are limited by species-dependence and the need for adequate training sets. The elements in DNA coding regions are known to be distributed in a quasi-random way, while those in non-coding regions typically have similar structures. For short sequences, these statistical characteristics cannot be extracted correctly and may not even be detectable. This paper introduces a new way to solve the problem: balanced estimation of diffusion entropy (BEDE).

18.
Summary: Coding sequences of eucaryotic nuclear DNA were characterized by an excess of short runs and a deficit of long runs of weak and of strong hydrogen bonding bases; non-coding sequences by a deficit of short runs and an excess of long runs, in the same of purines and of pyrimidines. The conservation of these attributes across DNA sequences coding for proteins of widely different function, across widely different eucaryotic species for the same protein and across related genes that diverged a long time ago and that now show large differences in base and, if coding, amino acid sequence suggested that these attributes have survival value. It was concluded that these attributes constitute probabilistic constraints on the primary structure (base sequence) of both coding and non-coding DNA.

19.
Prediction of splice junctions in mRNA sequences.   Total citations: 8, self-citations: 6, citations by others: 2
K Nakata, M Kanehisa, C DeLisi. Nucleic Acids Research, 1985, 13(14): 5327-5340
A general method based on the statistical technique of discriminant analysis is developed to distinguish boundaries of coding and non-coding regions in nucleic acid sequences. In particular, the method is applied to the prediction of splicing sites in messenger RNA precursors. Information used for discrimination includes consensus sequence patterns around splice junctions, free energy of snRNA and mRNA base pairing, and statistical differences between coding and non-coding regions such as periodic appearance of specific bases in coding regions reflecting the non-random usage of degenerate codons. Given the reading frame of an exon (but not the exon/intron boundaries), the method will predict the following exon, namely, the intron to be excised. When applied to human sequences in the GenBank database, the method correctly identified 80% of true splice junctions.

20.
The complete nucleotide sequences of three cDNAs coding for the C-terminal part of mouse histocompatibility (H-2) antigens, and for the 3' non-coding regions of these clones, have been determined. Comparison of the sequences indicates extensive homology throughout the coding and non-coding regions and suggests the existence of a genetic mechanism which homogenizes nucleotide sequences among genes of the H-2 multigene family.
