Similar articles
 20 similar articles found (search time: 31 ms)
1.
With the rapid growth of DNA databases for human and other eukaryotic model organisms, a large number of genes need to be identified in these databases. Exact recognition of translation initiation sites (TISs) of eukaryotic genes is very important for understanding the translation initiation process, predicting the detailed structure of eukaryotic genes, and annotating uncharacterized sequences. The problem has not been solved satisfactorily, especially for recognizing TISs of eukaryotic genes with shorter first exons. Extracting new features and finding more powerful algorithms for recognizing TISs of eukaryotic genes is therefore an important task. In this paper, the important characteristics of shorter flanking fragments around TISs are extracted and an expectation-maximization (EM) algorithm based on incomplete data is used to recognize TISs of eukaryotic genes. The accuracy is up to 87.8% in a six-fold cross-validation test. The result shows that the identification variables are effectively extracted and that the EM algorithm is a powerful tool for predicting the TISs of eukaryotic genes. The algorithm can also be applied to other classification or clustering tasks in bioinformatics.

2.
Recognition of protein-coding genes, a classical bioinformatics problem, is an essential step in annotating newly sequenced genomes. The Z-curve algorithm, one of the most effective methods for this task, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. These four tools can be used for genome annotation or re-annotation, either independently or combined with other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation.

3.
Gudlaugsdottir S, Boswell DR, Wood GR, Ma J. Genetica, 2007, 131(3): 299-306
Since it was first recognised that eukaryotic genes are fragmented into coding segments (exons) separated by non-coding segments (introns), the reason for this phenomenon has been debated. There are two dominant theories: that the piecewise arrangement of genes allows functional protein domains, represented by exons, to recombine by shuffling to form novel proteins with combinations of functions; or that introns represent parasitic DNA that can infest the eukaryotic genome because it does not interfere grossly with the fitness of its host. Differing distributions of exon lengths are predicted by these two theories. In this paper we examine distributions of exon lengths for six different organisms and find that they offer empirical evidence that both theories may in part be correct.

4.
The prediction of translation initiation sites (TISs) in eukaryotic mRNAs has been a challenging problem in computational molecular biology. In this paper, we present a new algorithm to recognize TISs with a very high accuracy. Our algorithm includes two novel ideas. First, we introduce a class of new sequence-similarity kernels based on string editing, called edit kernels, for use with support vector machines (SVMs) in a discriminative approach to predict TISs. The edit kernels are simple and have significant biological and probabilistic interpretations. Although the edit kernels are not positive definite, it is easy to make the kernel matrix positive definite by adjusting the parameters. Second, we convert the region of an input mRNA sequence downstream of a putative TIS into an amino acid sequence before applying SVMs, to avoid the high redundancy in the genetic code. The algorithm has been implemented and tested on previously published data. Our experimental results on real mRNA data show that both ideas improve the prediction accuracy greatly and that our method performs significantly better than those based on neural networks and SVMs with polynomial kernels or Salzberg kernels.

5.
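Prediction and analysis of CpG islands across the human genome   Cited by: 1 (self-citations: 0; by others: 1)
Methylation of CpG islands is an important mechanism of gene expression regulation in epigenetics. Although several criteria already exist for identifying CpG islands from DNA sequence, how to choose appropriate parameters within those criteria remains a focus of research. By analyzing and comparing two classical CpG island criteria and three prediction methods, this article proposes an improved CpG island prediction method, CpGISeeker. Using this method with 13 parameter sets combined from the three basic parameters of the criteria, CpG islands were predicted across the whole human genome, and the repeat-sequence composition of the CpG islands and their positional distribution relative to gene transcription start sites were analyzed statistically. The results show that CpGISeeker identifies CpG islands more precisely; they also suggest that, as the criteria become more stringent, the repeat content of CpG islands decreases and their association with gene transcription start sites increases. Using a parameter combination of a minimum CpG island size of 500 bp, a GC content of 60%, and a CpG occurrence ratio of 0.65 as the criterion is currently the best way to predict CpG islands.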
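The three parameters named in this abstract (minimum length, GC content, CpG ratio) can be checked for a single candidate window roughly as follows. This is only a minimal sketch and not CpGISeeker itself: the function names are hypothetical, and the CpG ratio is assumed to be the commonly used observed/expected form, (number of CpG × length) / (number of C × number of G).

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the window."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio (assumed definition, see lead-in)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg * len(seq) / (c * g)


def passes_cpg_criteria(seq: str, min_len: int = 500,
                        min_gc: float = 0.60, min_oe: float = 0.65) -> bool:
    """Apply the parameter combination quoted in the abstract to one window."""
    return (len(seq) >= min_len
            and gc_content(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_oe)
```

A genome-wide scan would slide or extend such windows along each chromosome; how CpGISeeker does that is not described in the abstract and is not modelled here.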

6.
7.
Combinatorial control of exon recognition   Cited by: 3 (self-citations: 0; by others: 3)
Pre-mRNA splicing is a fundamental process required for the expression of most metazoan genes. It is carried out by the spliceosome, which catalyzes the removal of noncoding intronic sequences to assemble exons into mature mRNAs prior to export and translation. Given the complexity of higher eukaryotic genes and the relatively low level of splice site conservation, the precision of the splicing machinery in recognizing and pairing splice sites is impressive. Introns ranging in size from <100 up to 100,000 bases are removed efficiently. At the same time, a large number of alternative splicing events are observed between different cell types, during development, or during other biological processes. This extensive alternative splicing implies a significant flexibility of the spliceosome to identify and process exons within a given pre-mRNA. To reach this flexibility, splice site selection in higher eukaryotes has evolved to depend on multiple parameters such as splice site strength, the presence or absence of splicing regulators, RNA secondary structures, the exon/intron architecture, and the process of pre-mRNA synthesis itself. The relative contributions of each of these parameters control how efficiently splice sites are recognized and flanking introns are removed.

8.
Although ribosome-profiling and translation initiation sequencing (TI-seq) analyses have identified many noncanonical initiation codons, the precise detection of translation initiation sites (TISs) remains a challenge, mainly because of experimental artifacts of such analyses. Here, we describe a new method, TISCA (TIS detection by translation Complex Analysis), for the accurate identification of TISs. TISCA proved to be more reliable for TIS detection compared with existing tools, and it identified a substantial number of near-cognate codons in Kozak-like sequence contexts. Analysis of proteomics data revealed the presence of methionine at the NH2-terminus of most proteins derived from near-cognate initiation codons. Although eukaryotic initiation factor 2 (eIF2), eIF2A and eIF2D have previously been shown to contribute to translation initiation at near-cognate codons, we found that most noncanonical initiation events are most probably dependent on eIF2, consistent with the initial amino acid being methionine. Comprehensive identification of TISs by TISCA should facilitate characterization of the mechanism of noncanonical initiation.

9.
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such a highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream of which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
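The "store and update" idea in this abstract can be sketched as a double sweep over the exons, ordered once by acceptor (start) and once by donor (end). The sketch below is a deliberately simplified illustration, not the authors' implementation: it assumes exons are plain `(start, end, score)` tuples with `start <= end`, uses an additive score, and ignores frame compatibility, strands and the Gene Model. The names `Exon` and `best_gene` are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (acceptor/start, donor/end, score)


def best_gene(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Highest scoring chain of non-overlapping exons (additive score)."""
    n = len(exons)
    if n == 0:
        return 0.0, []
    by_start = sorted(range(n), key=lambda i: exons[i][0])  # acceptor order
    by_end = sorted(range(n), key=lambda i: exons[i][1])    # donor order
    best = [0.0] * n   # best[i]: score of the best gene ending in exon i
    prev = [-1] * n    # back-pointer to the preceding exon, -1 if none
    run_score, run_idx = 0.0, -1  # best gene ending before the current acceptor
    j = 0
    for i in by_start:
        start, _end, score = exons[i]
        # Advance the donor-ordered pointer; each exon is visited only once,
        # so the sweep is linear once the two orderings are available.
        while j < n and exons[by_end[j]][1] < start:
            k = by_end[j]
            if best[k] > run_score:
                run_score, run_idx = best[k], k
            j += 1
        best[i] = score + run_score
        prev[i] = run_idx
    last = max(range(n), key=lambda i: best[i])
    chain = []
    k = last
    while k != -1:
        chain.append(exons[k])
        k = prev[k]
    chain.reverse()
    return best[last], chain
```

For example, with exons `(1, 10, 5.0)`, `(5, 20, 8.0)`, `(15, 30, 4.0)` and `(25, 40, 6.0)`, the sweep picks the second and fourth exons for a total score of 14.0. The two sorts cost O(n log n) here; the linear bound applies to the sweep itself when exons already arrive in sorted order.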

10.
11.
12.
Can Codon Usage Bias Explain Intron Phase Distributions and Exon Symmetry?   Cited by: 1 (self-citations: 0; by others: 1)
More introns exist between codons (phase 0) than between the first and the second bases (phase 1) or between the second and the third base (phase 2) within the codon. Many explanations have been suggested for this excess of phase 0. It has, for example, been argued to reflect an ancient utility for introns in separating exons that code for separate protein modules. There may, however, be a simple, alternative explanation. Introns typically require, for correct splicing, particular nucleotides immediately 5′ in exons (typically a G) and immediately 3′ in the following exon (also often a G). Introns therefore tend to be found between particular nucleotide pairs (e.g., G|G pairs) in the coding sequence. If, owing to bias in usage of different codons, these pairs are especially common at phase 0, then intron phase biases may have a trivial explanation. Here we take codon usage frequencies for a variety of eukaryotes and use these to generate random sequences. We then ask about the phase of putative intron insertion sites. Importantly, in all simulated data sets intron phase distribution is biased in favor of phase 0. In many cases the bias is of the magnitude observed in real data and can be attributed to codon usage bias. It is also known that exons may carry either the same phase (symmetric) or different phases (asymmetric) at the opposite ends. We simulated a distribution of different types of exons using frequencies of introns observed in real genes assuming random combination of intron phases at the opposite sides of exons. Surprisingly the simulated pattern was quite similar to that observed. In the simulants we typically observe a prevalence of symmetric exons carrying phase 0 at both ends, which is common for eukaryotic genes. However, at least in some species, the extent of the bias in favor of symmetric (0,0) exons is not as great in simulants as in real genes. These results emphasize the need to construct a biologically relevant null model of successful intron insertion.
Reviewing Editor: Dr. Manyuan Long
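The simulation outlined in this abstract (draw random coding sequences from codon usage frequencies, then ask which phase each putative G|G insertion site falls in) can be illustrated with a short sketch. It is a toy under stated assumptions, not the authors' procedure: `random_cds` and `gg_phase_counts` are hypothetical names, the codon frequency table passed in is supplied by the caller, and only G|G pairs are treated as insertion sites.

```python
import random
from collections import Counter
from typing import Dict


def random_cds(codon_freqs: Dict[str, float], n_codons: int, seed: int = 0) -> str:
    """Sample an in-frame coding sequence from a codon usage table."""
    rng = random.Random(seed)
    codons = list(codon_freqs)
    weights = [codon_freqs[c] for c in codons]
    return "".join(rng.choices(codons, weights=weights, k=n_codons))


def gg_phase_counts(cds: str) -> Counter:
    """Count G|G sites by intron phase in an in-frame coding sequence.

    A site between bases i and i+1 (0-based) lies after i+1 bases, so its
    phase is (i + 1) % 3: 0 between codons, 1 after the first base of a
    codon, 2 after the second base.
    """
    counts = Counter()
    for i in range(len(cds) - 1):
        if cds[i] == "G" and cds[i + 1] == "G":
            counts[(i + 1) % 3] += 1
    return counts
```

Calling `gg_phase_counts(random_cds(table, 100000))` with a real codon usage table for a species would give the phase distribution expected from codon bias alone, which is the kind of null expectation the abstract compares with observed intron phases.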

13.
14.
Intron-exon structures of eukaryotic model organisms.   Cited by: 27 (self-citations: 1; by others: 27)
To investigate the distribution of intron-exon structures of eukaryotic genes, we have constructed a general exon database comprising all available intron-containing genes and exon databases from 10 eukaryotic model organisms: Homo sapiens, Mus musculus, Gallus gallus, Rattus norvegicus, Arabidopsis thaliana, Zea mays, Schizosaccharomyces pombe, Aspergillus, Caenorhabditis elegans and Drosophila. We purged redundant genes to avoid the possible bias brought about by redundancy in the databases. After discarding those questionable introns that do not contain correct splice sites, the final database contained 17 102 introns, 21 019 exons and 2903 independent or quasi-independent genes. On average, a eukaryotic gene contains 3.7 introns per kb protein coding region. The exon distribution peaks around 30-40 residues and most introns are 40-125 nt long. The variable intron-exon structures of the 10 model organisms reveal two interesting statistical phenomena, which cast light on some previous speculations. (i) Genome size seems to be correlated with total intron length per gene. For example, invertebrate introns are smaller than those of human genes, while yeast introns are shorter than invertebrate introns. However, this correlation is weak, suggesting that other factors besides genome size may also affect intron size. (ii) Introns smaller than 50 nt are significantly less frequent than longer introns, possibly resulting from a minimum intron size requirement for intron splicing.

15.
16.
There is a pressing need to align the growing set of expressed sequence tags (ESTs) with the newly sequenced human genome. However, the problem is complicated by the exon/intron structure of eukaryotic genes, misread nucleotides in ESTs, and the millions of repetitive sequences in genomic sequences. To solve this problem, algorithms that use dynamic programming (DP) have been proposed. In reality, however, these algorithms require an enormous amount of processing time. In an effort to improve the computational efficiency of these classical DP algorithms, we developed software that fully utilizes lookup-tables to detect the start- and endpoints of an EST within a given DNA sequence efficiently, and subsequently promptly identify exons and introns. In addition, the locations of all splice sites must be calculated correctly with high sensitivity and accuracy, while retaining high computational efficiency. This goal is hard to accomplish in practice, due to misread nucleotides in ESTs and repetitive sequences in the genome. Nevertheless, we present two heuristics that effectively settle this issue. Experimental results confirm that our technique improves the overall computation time by orders of magnitude compared with common tools, such as SIM4 and BLAT, and simultaneously attains high sensitivity and accuracy against a clean dataset of documented genes.

17.
18.
MOTIVATION: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and to their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA: http://bioinformatics.psb.ugent.be/.

19.
A database called the Eukaryotic Intron Database (EID) was developed based on data from GenBank. Studies of the statistical characteristics of EID show that it contained 103,848 genes, 478,484 introns, and 582,332 exons, with an average of 4.61 introns and 5.61 exons per gene. Introns of 40-120 nt in length were abundant in the database. Statistical analysis of the data from nine model species showed that, in eukaryotes, higher species do not necessarily have more introns or exons per gene than lower species. Furthermore, characteristics of EID such as intron phase, the distribution of different splice sites, and the relationship between genome size and intron proportion or intron density have been studied.

20.
Prediction of exact boundaries of exons   Cited by: 3 (self-citations: 0; by others: 3)
It is known that while the programs used to predict genes are good at determining coding nucleotides, there are considerable inaccuracies in the determination of the gene structural elements. Among them, the most notable is that of the exact boundaries of exons. In order to assess this, we had earlier reviewed various programs that predict potential splice sites and exons. The results led to the following two observations: (i) a high proportion of false positive splice sites from computational predictions occur in the vicinity of real splice sites; and (ii) current algorithms are misled to predict wrong splice sites more often when the coding potential ends within +/-25 nucleotides from real sites than when it ends at farther positions. In this report, we review decision tree models for human splice sites and the resultant software tool, namely SpliceProximalCheck, that discriminates such 'proximal' false positives from real splice sites. Further presented is an integrated system (MZEF-SPC) with SpliceProximalCheck (SPC) as a front-end tool operating on the results of Michael Zhang's exon finder program. Examination of the output of the integrated program on an illustrative gene set revealed that as many as 61 of 93 MZEF-predicted false positive exons could be eliminated by SPC for a loss of only 3 out of 33 MZEF-predicted true positive exons.
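The closing figures of this abstract can be turned into a quick before/after comparison. The snippet below is only an arithmetic illustration: it assumes that 93 and 33 are the complete counts of MZEF-predicted false positive and true positive exons on the illustrative gene set, and "precision" (true positives divided by all predicted exons) is a derived quantity that the abstract itself does not report.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of predicted exons that are real."""
    return true_pos / (true_pos + false_pos)


# Counts quoted in the abstract (MZEF alone).
mzef_tp, mzef_fp = 33, 93
# After SPC filtering: 61 false positives removed, 3 true positives lost.
spc_tp, spc_fp = mzef_tp - 3, mzef_fp - 61

before = precision(mzef_tp, mzef_fp)  # 33 / 126
after = precision(spc_tp, spc_fp)     # 30 / 62
print(f"precision before SPC: {before:.3f}, after SPC: {after:.3f}")
```

Under these assumptions the share of correct exon predictions rises from about 26% to about 48%, at the cost of roughly 9% of the true positives (3 of 33).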


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)   ICP registration: 京ICP备09084417号