Similar articles
20 similar articles found
1.
2.

Background  

The number of sequenced eukaryotic genomes is rapidly increasing. This means that over time it will be hard to keep supplying customised gene finders for each genome. This calls for procedures to automatically generate species-specific gene finders and to re-train them as the quantity and quality of reliable gene annotation grows.

3.
Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs based on common machine learning methods have emerged, prediction accuracy remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because no training set of abundant validated genes is available. Here we present a new gene-finding program, SCGPred, which improves prediction accuracy by combining multiple sources of evidence. SCGPred can run in a supervised mode on previously well-studied genomes and in an unsupervised mode on novel genomes. In tests on datasets composed of large DNA sequences from human and from the novel genome of Ustilago maydis, SCGPred shows a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, an approach that is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.

4.
Development of joint application strategies for two microbial gene finders   Cited by: 2 (self: 0, others: 2)
MOTIVATION: As a starting point in the annotation of bacterial genomes, gene finding programs are used for the prediction of functional elements in the DNA sequence. Due to the faster pace and increasing number of genome projects currently underway, it is becoming especially important to have high-performing methods for this task. RESULTS: This study describes the development of joint application strategies that combine the strengths of two microbial gene finders to improve the overall gene finding performance. Critica is very specific in the detection of similarity-supported genes as it uses a comparative sequence analysis-based approach. Glimmer employs a very sophisticated model of genomic sequence properties and is also sensitive in the detection of organism-specific genes. Based on a data set of 113 microbial genome sequences, we optimized a combined application approach using different parameters with relevance to the gene finding problem. This results in a significant improvement in specificity while sensitivity remains similar to that of Glimmer. The improvement is especially pronounced for GC-rich genomes. The method is currently being applied for the annotation of several microbial genomes. AVAILABILITY: The methods described have been implemented within the gene prediction component of the GenDB genome annotation system.

5.
GeneMark.hmm: new solutions for gene finding.   Cited by: 35 (self: 0, others: 35)
The number of completely sequenced bacterial genomes has been growing fast. Computer methods for finding genes are available, yet there is still a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.
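
Editorial illustration (not part of the abstract): the idea of modeling gene boundaries as transitions between hidden states can be sketched with a toy two-state HMM decoded by the Viterbi algorithm. This is a minimal sketch, not the GeneMark.hmm algorithm: the state names, the order-zero emission tables, and all probabilities below are hypothetical values chosen only to make the example run.

```python
import math

# Hypothetical toy parameters (NOT taken from GeneMark.hmm).
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
# Order-zero emissions: the toy "coding" state prefers G/C, "noncoding" prefers A/T.
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable hidden-state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path):
    """Positions (0-based) at which the decoded state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

With these toy parameters, a GC-rich run of ten bases flanked by AT-rich sequence (`"AT" * 5 + "GC" * 5 + "AT" * 5`) decodes as coding, and `boundaries` reports the state changes at positions 10 and 20. A real gene finder would use higher-order models, codon structure, and both strands.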

6.
Motivation: A growing number of genomes are sequenced. The differences in evolutionary pattern between functional regions can thus be observed genome-wide in a whole set of organisms. The diverse evolutionary pattern of different functional regions can be exploited in the process of genomic annotation. The modelling of evolution by the existing comparative gene finders leaves room for improvement. Results: A probabilistic model of both genome structure and evolution is designed. This type of model is called an Evolutionary Hidden Markov Model (EHMM), being composed of an HMM and a set of region-specific evolutionary models based on a phylogenetic tree. All parameters can be estimated by maximum likelihood, including the phylogenetic tree. It can handle any number of aligned genomes, using their phylogenetic tree to model the evolutionary correlations. The time complexity of all algorithms used for handling the model is linear in alignment length and genome number. The model is applied to the problem of gene finding. The benefit of modelling sequence evolution is demonstrated both in a range of simulations and on a set of orthologous human/mouse gene pairs. AVAILABILITY: Free availability over the Internet on www server: http://www.birc.dk/Software/evogene.

7.
MOTIVATION: Computational gene prediction methods are an important component of whole genome analyses. While ab initio gene finders have demonstrated major improvements in accuracy, the most reliable methods are evidence-based gene predictors. These algorithms can rely on several different sources of evidence including predictions from multiple ab initio gene finders, matches to known proteins, sequence conservation and partial cDNAs to predict the final product. Despite the success of these algorithms, prediction of complete gene structures, especially for alternatively spliced products, remains a difficult task. RESULTS: LOCUS (Length Optimized Characterization of Unknown Spliceforms) is a new evidence-based gene finding algorithm which integrates a length-constraint into a dynamic programming-based framework for prediction of gene products. On a Caenorhabditis elegans test set of alternatively spliced internal exons, its performance exceeds that of current ab initio gene finders and in most cases can accurately predict the correct form of all the alternative products. As the length information used by the algorithm can be obtained in a high-throughput fashion, we propose that integration of such information into a gene-prediction pipeline is feasible and doing so may improve our ability to fully characterize the complete set of mRNAs for a genome. AVAILABILITY: LOCUS is available from http://ural.wustl.edu/software.html

8.

Background  

Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes.

9.

Background

Despite the continuous production of genome sequence for a number of organisms, reliable, comprehensive, and cost effective gene prediction remains problematic. This is particularly true for genomes for which there is not a large collection of known gene sequences, such as the recently published chicken genome. We used the chicken sequence to test comparative and homology-based gene-finding methods followed by experimental validation as an effective genome annotation method.

Results

We performed experimental evaluation by RT-PCR of three different computational gene finders, Ensembl, SGP2 and TWINSCAN, applied to the chicken genome. A Venn diagram was computed and each component of it was evaluated. The results showed that de novo comparative methods can identify up to about 700 chicken genes with no previous evidence of expression, and can correctly extend about 40% of homology-based predictions at the 5' end.

Conclusions

De novo comparative gene prediction followed by experimental verification is effective at enhancing the annotation of the newly sequenced genomes provided by standard homology-based methods.

10.
Operon prediction without a training set   Cited by: 5 (self: 0, others: 5)

11.
12.
JIGSAW: integration of multiple sources of evidence for gene prediction   Cited by: 3 (self: 0, others: 3)
MOTIVATION: Computational gene finding systems play an important role in finding new human genes, although no systems are yet accurate enough to predict all or even most protein-coding regions perfectly. Ab initio programs can be augmented by evidence such as expression data or protein sequence homology, which improves their performance. The amount of such evidence continues to grow, but computational methods continue to have difficulty predicting genes when the evidence is conflicting or incomplete. Genome annotation pipelines collect a variety of types of evidence about gene structure and synthesize the results, which can then be refined further through manual, expert curation of gene models. RESULTS: JIGSAW is a new gene finding system designed to automate the process of predicting gene structure from multiple sources of evidence, with results that often match the performance of human curators. JIGSAW computes the relative weight of different lines of evidence using statistics generated from a training set, and then combines the evidence using dynamic programming. Our results show that JIGSAW's performance is superior to ab initio gene finding methods and to other pipelines such as Ensembl. Even without evidence from alignment to known genes, JIGSAW can substantially improve gene prediction accuracy as compared with existing methods. AVAILABILITY: JIGSAW is available as an open source software package at http://cbcb.umd.edu/software/jigsaw.

13.
As the pace of genome sequencing has accelerated, the need for highly accurate gene prediction systems has grown. Computational systems for identifying genes in prokaryotic genomes have sensitivities of 98-99% or higher (Delcher et al., Nucleic Acids Res., 27, 4636-4641, 1999). These accuracy figures are calculated by comparing the locations of verified stop codons to the predictions. Determining the accuracy of start codon prediction is more problematic, however, due to the relatively small number of start sites that have been confirmed by independent, non-computational methods. Nonetheless, the accuracy of gene finders at predicting the exact gene boundaries at both the 5' and 3' ends of genes is of critical importance for microbial genome annotation, especially in light of the important signaling information that is sometimes found on the 5' end of a protein coding region. In this paper we propose a probabilistic method to improve the accuracy of gene identification systems at finding precise translation start sites. The new system, RBSfinder, is tested on a validated set of genes from Escherichia coli, for which it improves the accuracy of start site locations predicted by computational gene finding systems from the range 67-77% to 90% correct.
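
Editorial illustration (not part of the abstract): the two evaluation measures mentioned above, sensitivity judged by matching stop codons and start-site accuracy judged on the genes that were found, can be sketched as follows. The `(start, stop, strand)` tuple representation and the choice to compute start accuracy only over genes whose stop codon was matched are assumptions made for this example; the paper may define its percentages differently.

```python
def evaluate(predicted, verified):
    """Compare predicted genes with verified genes.

    Each gene is a hypothetical (start, stop, strand) tuple. A verified gene
    counts as "found" when a prediction has the same stop coordinate on the
    same strand; its start is "exact" when the start coordinate also matches.
    """
    predicted_by_stop = {(stop, strand): start for start, stop, strand in predicted}
    found = 0        # verified genes whose stop codon was predicted
    exact_start = 0  # found genes whose start codon also matches
    for start, stop, strand in verified:
        key = (stop, strand)
        if key in predicted_by_stop:
            found += 1
            if predicted_by_stop[key] == start:
                exact_start += 1
    sensitivity = found / len(verified) if verified else 0.0
    start_accuracy = exact_start / found if found else 0.0
    return sensitivity, start_accuracy
```

For example, with four verified genes of which three have a matching predicted stop codon and two of those three also have the correct start, `evaluate` returns a sensitivity of 0.75 and a start accuracy of 2/3.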

14.

Background

The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity.

Results

As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN.

Conclusions

In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.

15.
Interpolated Markov models for eukaryotic gene finding.   Cited by: 21 (self: 0, others: 21)
Computational gene finding research has emphasized the development of gene finders for bacterial and human DNA. This has left genome projects for some small eukaryotes without a system that addresses their needs. This paper reports on a new system, GlimmerM, that was developed to find genes in the malaria parasite Plasmodium falciparum. Because the gene density in P. falciparum is relatively high, the system design was based on a successful bacterial gene finder, Glimmer. The system was augmented with specially trained modules to find splice sites and was trained on all available data from the P. falciparum genome. Although a precise evaluation of its accuracy is impossible at this time, laboratory tests (using RT-PCR) on a small selection of predicted genes confirmed all of those predictions. With the rapid progress in sequencing the genome of P. falciparum, the availability of this new gene finder will greatly facilitate the annotation process.

16.
We have developed a novel method for estimating the parameters of hidden Markov models for gene finding in newly sequenced species. Our approach does not rely on curated training data sets, but instead uses extrinsic evidence (including paired-end ditags that have not been used in gene finding previously) and iterative training. This new method is particularly suitable for annotation of species with large evolutionary distance to the closest annotated species. We have used our approach to produce an initial annotation of more than 16 000 genes in the newly sequenced Schistosoma japonicum draft genome. We established the high quality of our predictions by comparison to full-length cDNAs (withdrawn from the extrinsic evidence) and to CEGMA core genes. We also evaluated the effectiveness of the new training procedure on the Caenorhabditis elegans genome. ExonHunter and the newest parametric files for the S. japonicum genome are available for download at www.bioinformatics.uwaterloo.ca/downloads/exonhunter

17.
The Ensembl genome database project   Cited by: 45 (self: 4, others: 45)
The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of the human genome sequence, with confirmed gene predictions that have been integrated with external data sources, and is available as either an interactive web site or as flat files. It is also an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements from sequence analysis to data storage and visualisation. The Ensembl site is one of the leading sources of human genome sequence annotation and provided much of the analysis for publication by the international human genome project of the draft genome. The Ensembl system is being installed around the world in both companies and academic sites on machines ranging from supercomputers to laptops.

18.
TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders   Cited by: 5 (self: 0, others: 5)
We describe two new Generalized Hidden Markov Model implementations for ab initio eukaryotic gene prediction. The C/C++ source code for both is available as open source and is highly reusable due to their modular and extensible architectures. Unlike most of the currently available gene-finders, the programs are re-trainable by the end user. They are also re-configurable and include several types of probabilistic submodels which can be independently combined, such as Maximal Dependence Decomposition trees and interpolated Markov models. Both programs have been used at TIGR for the annotation of the Aspergillus fumigatus and Toxoplasma gondii genomes. AVAILABILITY: Source code and documentation are available under the open source Artistic License from http://www.tigr.org/software/pirate

19.
Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have so far escaped capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that exons predicted by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers.

20.
Your Gene structure Annotation Tool for Eukaryotes (yrGATE) provides an Annotation Tool and Community Utilities for worldwide web-based community genome and gene annotation. Annotators can evaluate gene structure evidence derived from multiple sources to create gene structure annotations. Administrators regulate the acceptance of annotations into published gene sets. yrGATE is designed to facilitate rapid and accurate annotation of emerging genomes as well as to confirm, refine, or correct currently published annotations. yrGATE is highly portable and supports different standard input and output formats. The yrGATE software and usage cases are available at .
