
1.

Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge

Krallinger M Morgan A Smith L Leitner F Tanabe L Wilbur J Hirschman L Valencia A 《Genome biology》2008,9(Z2):S1

Background:

Genome sciences have experienced an increasing demand for efficient text-processing tools that can extract biologically relevant information from the growing amount of published literature. In response, a range of text-mining and information-extraction tools have recently been developed specifically for the biological domain. Such tools are only useful if they are designed to meet real-life tasks and if their performance can be estimated and compared. The BioCreative challenge (Critical Assessment of Information Extraction in Biology) consists of a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state-of-the-art of text-mining systems applied to biologically relevant problems.

Results:

The Second BioCreative assessment (2006 to 2007) attracted 44 teams from 13 countries worldwide, with the aim of evaluating current information-extraction/text-mining technologies developed for one or more of the three tasks defined for this challenge evaluation. These tasks included the recognition of gene mentions in abstracts (gene mention task); the extraction of a list of unique identifiers for human genes mentioned in abstracts (gene normalization task); and finally the extraction of physical protein-protein interaction annotation-relevant information (protein-protein interaction task). The 'gold standard' data used for evaluating submissions for the third task was provided by the interaction databases MINT (Molecular Interaction Database) and IntAct.

Conclusion:

The Second BioCreative assessment almost doubled the number of participants for each individual task when compared with the first BioCreative assessment. An overall improvement in terms of balanced precision and recall was observed for the best submissions for the gene mention task (F score 0.87); for the gene normalization task, the best results (F score 0.81) were comparable to those obtained for similar tasks posed at the first BioCreative challenge. In the case of the protein-protein interaction task, the importance and difficulties of extracting experimentally confirmed annotations from full-text articles were explored, yielding different results depending on the step of the annotation extraction workflow. A common characteristic observed in all three tasks was that the combination of system outputs could yield better results than any single system. Finally, the development of the first text-mining meta-server was promoted within the context of this community challenge.


2.

Mining physical protein-protein interactions from the literature

Huang M Ding S Wang H Zhu X 《Genome biology》2008,9(Z2):S12

Background:

Deciphering physical protein-protein interactions is fundamental to elucidating both the functions of proteins and biological processes. The development of high-throughput experimental technologies such as the yeast two-hybrid screening has produced an explosion in data relating to interactions. Since manual curation is intensive in terms of time and cost, there is an urgent need for text-mining tools to facilitate the extraction of such information. The BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge evaluation provided common standards and shared evaluation criteria to enable comparisons among different approaches.

Results:

During the benchmark evaluation of BioCreative 2006, all of our results ranked in the top three places. In the task of filtering articles irrelevant to physical protein interactions, our method achieves a precision of 75.07%, a recall of 81.07%, and an AUC (area under the receiver operating characteristic curve) of 0.847. In the task of identifying protein mentions and normalizing mentions to molecule identifiers, our method is competitive among runs submitted, with a precision of 34.83%, a recall of 24.10%, and an F₁ score of 28.5%. In extracting protein interaction pairs, our profile-based method was competitive on the SwissProt-only subset (precision = 36.95%, recall = 32.68%, and F₁ score = 30.40%) and on the entire dataset (30.96%, 29.35%, and 26.20%, respectively). From the biologist's point of view, however, these findings are far from satisfactory. The error analysis presented in this report provides insight into how performance could be improved: three-quarters of false negatives were due to protein normalization problems (532/698), and about one-quarter were due to problems with correctly extracting interactions for this system.
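As a quick check on how the scores above relate to each other, the F₁ score is the harmonic mean of precision and recall. The sketch below recomputes it for the protein mention normalization figures; the function name `f1_score` is ours and is not taken from the paper. The abstract does not say how the interaction-pair scores were averaged, so the sketch is not applied to them: a score averaged per article need not equal the harmonic mean of the averaged precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Protein mention normalization figures quoted in the abstract above.
norm_f1 = f1_score(0.3483, 0.2410)
print(f"{norm_f1:.3f}")  # rounds to 0.285, i.e. the reported 28.5%
```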

Conclusion:

We present a text-mining framework to extract physical protein-protein interactions from the literature. Three key issues are addressed, namely filtering irrelevant articles, identifying protein names and normalizing them to molecule identifiers, and extracting protein-protein interactions. Our system is among the top three performers in the benchmark evaluation of BioCreative 2006. The tool will be helpful for manual interaction curation and can greatly facilitate the process of extracting protein-protein interactions.


3.

Overview of the protein-protein interaction annotation extraction task of BioCreative II

Krallinger M Leitner F Rodriguez-Penagos C Valencia A 《Genome biology》2008,9(Z2):S4

Background:

The biomedical literature is the primary information source for manual protein-protein interaction annotations. Text-mining systems have been implemented to extract binary protein interactions from articles, but a comprehensive comparison between the different techniques as well as with manual curation was missing.

Results:

We designed a community challenge, the BioCreative II protein-protein interaction (PPI) task, based on the main steps of a manual protein interaction annotation workflow. It was structured into four distinct subtasks related to: (a) detection of protein interaction-relevant articles; (b) extraction and normalization of protein interaction pairs; (c) retrieval of the interaction detection methods used; and (d) retrieval of actual text passages that provide evidence for protein interactions. A total of 26 teams submitted runs for at least one of the proposed subtasks. In the interaction article detection subtask, the top scoring team reached an F-score of 0.78. In the interaction pair extraction and mapping to SwissProt, a precision of 0.37 (with recall of 0.33) was obtained. For associating articles with an experimental interaction detection method, an F-score of 0.65 was achieved. As for the retrieval of the PPI passages best summarizing a given protein interaction in full-text articles, 19% of the submissions returned by one of the runs corresponded to curator-selected sentences. Curators extracted only the passages that best summarized a given interaction, implying that many of the automatically extracted ones could contain interaction information but did not correspond to the most informative sentences.

Conclusion:

The BioCreative II PPI task is the first attempt to compare the performance of text-mining tools specific for each of the basic steps of the PPI extraction pipeline. The challenges identified range from problems in full-text format conversion of articles to difficulties in detecting interactor protein pairs and then linking them to their database records. Some limitations were also encountered when using a single (and possibly incomplete) reference database for protein normalization or when limiting search for interactor proteins to co-occurrence within a single sentence, when a mention might span neighboring sentences. Finally, distinguishing between novel, experimentally verified interactions (annotation relevant) and previously known interactions adds additional complexity to these tasks.
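The single-sentence co-occurrence limitation mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the naive sentence splitter, the dictionary lookup, the function name `cooccurring_pairs`, and the placeholder protein names; none of it reproduces a participating system.

```python
import re
from itertools import combinations


def cooccurring_pairs(text, protein_names):
    """Return sorted protein pairs that are mentioned within one sentence."""
    # Naive split after '.', '!' or '?' followed by whitespace (an assumption;
    # real full-text articles need a proper sentence segmenter).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = set()
    for sentence in sentences:
        found = sorted(
            name
            for name in protein_names
            if re.search(rf"\b{re.escape(name)}\b", sentence)
        )
        pairs.update(combinations(found, 2))
    return sorted(pairs)


# Toy text with placeholder names. The ProtC/ProtA relation is spread over
# two neighboring sentences, so sentence-level co-occurrence misses it.
text = "ProtA binds ProtB in vitro. ProtC was also detected. It interacts with ProtA."
pairs = cooccurring_pairs(text, {"ProtA", "ProtB", "ProtC"})
print(pairs)  # [('ProtA', 'ProtB')]
```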


4.

OntoGene in BioCreative II

Rinaldi F Kappeler T Kaljurand K Schneider G Klenner M Clematide S Hess M von Allmen JM Parisot P Romacker M Vachon T 《Genome biology》2008,9(Z2):S13

Background:

Research scientists and companies working in the domains of biomedicine and genomics are increasingly faced with the problem of efficiently locating, within the vast body of published scientific findings, the critical pieces of information that are needed to direct current and future research investment.

Results:

In this report we describe approaches taken within the scope of the second BioCreative competition in order to solve two aspects of this problem: detection of novel protein interactions reported in scientific articles, and detection of the experimental method that was used to confirm the interaction. Our approach to the former problem is based on a high-recall protein annotation step, followed by two strict disambiguation steps. The remaining proteins are then combined according to a number of lexico-syntactic filters, which deliver high-precision results while maintaining reasonable recall. The detection of the experimental methods is tackled by a pattern matching approach, which has delivered the best results in the official BioCreative evaluation.

Conclusion:

Although the results of BioCreative clearly show that no tool is sufficiently reliable for fully automated annotations, a few of the proposed approaches (including our own) already perform at a competitive level. This makes them interesting either as standalone tools for preliminary document inspection, or as modules within an environment aimed at supporting the process of curation of biomedical literature.


5.

Gene mention normalization and interaction extraction with context models and sentence motifs

Hakenberg J Plake C Royer L Strobelt H Leser U Schroeder M 《Genome biology》2008,9(Z2):S14

Background:

The goal of text mining is to make the information conveyed in scientific publications accessible to structured search and automatic analysis. Two important subtasks of text mining are entity mention normalization - to identify biomedical objects in text - and extraction of qualified relationships between those objects. We describe a method for identifying genes and relationships between proteins.

Results:

We present solutions to gene mention normalization and extraction of protein-protein interactions. For the first task, we identify genes by using background knowledge on each gene, namely annotations related to function, location, disease, and so on. Our approach currently achieves an f-measure of 86.4% on the BioCreative II gene normalization data. For the extraction of protein-protein interactions, we pursue an approach that builds on classical sequence analysis: motifs derived from multiple sequence alignments. The method achieves an f-measure of 24.4% (micro-average) in the BioCreative II interaction pair subtask.

Conclusion:

For gene mention normalization, our approach outperforms strategies that utilize only the matching of genes names against dictionaries, without invoking further knowledge on each gene. Motifs derived from alignments of sentences are successful at identifying protein interactions in text; the approach we present in this report is fully automated and performs similarly to systems that require human intervention at one or more stages.

Availability:

Our methods for gene, protein, and species identification, and for extraction of protein-protein interactions, are available as part of the BioCreative Meta Services (BCMS); see http://bcms.bioinfo.cnio.es/.


6.

Automating curation using a natural language processing pipeline

Alex B Grover C Haddow B Kabadjov M Klein E Matthews M Tobin R Wang X 《Genome biology》2008,9(Z2):S10

Background:

The tasks in BioCreative II were designed to approximate some of the laborious work involved in curating biomedical research papers. The approach to these tasks taken by the University of Edinburgh team was to adapt and extend the existing natural language processing (NLP) system that we have developed as part of a commercial curation assistant. Although this paper concentrates on using NLP to assist with curation, the system can be equally employed to extract types of information from the literature that is immediately relevant to biologists in general.

Results:

Our system was among the highest performing on the interaction subtasks, and competitive performance on the gene mention task was achieved with minimal development effort. For the gene normalization task, a string matching technique that can be quickly applied to new domains was shown to perform close to average.

Conclusion:

The technologies being developed were shown to be readily adapted to the BioCreative II tasks. Although high performance may be obtained on individual tasks such as gene mention recognition and normalization, and document classification, tasks in which a number of components must be combined, such as detection and normalization of interacting protein pairs, are still challenging for NLP systems.


7.

Concept recognition for extracting protein interaction relations from biomedical text

Baumgartner WA Lu Z Johnson HL Caporaso JG Paquette J Lindemann A White EK Medvedeva O Cohen KB Hunter L 《Genome biology》2008,9(Z2):S9

Background:

Reliable information extraction applications have been a long sought goal of the biomedical text mining community, a goal that if reached would provide valuable tools to benchside biologists in their increasingly difficult task of assimilating the knowledge contained in the biomedical literature. We present an integrated approach to concept recognition in biomedical text. Concept recognition provides key information that has been largely missing from previous biomedical information extraction efforts, namely direct links to well defined knowledge resources that explicitly cement the concept's semantics. The BioCreative II tasks discussed in this special issue have provided a unique opportunity to demonstrate the effectiveness of concept recognition in the field of biomedical language processing.

Results:

Through the modular construction of a protein interaction relation extraction system, we present several use cases of concept recognition in biomedical text, and relate these use cases to potential uses by the benchside biologist.

Conclusion:

Current information extraction technologies are approaching performance standards at which concept recognition can begin to deliver high quality data to the benchside biologist. Our system is available as part of the BioCreative Meta-Server project and on the internet at http://bionlp.sourceforge.net.


8.

Automatic recognition of topic-classified relations between prostate cancer and genes using MEDLINE abstracts

Chun HW Tsuruoka Y Kim JD Shiba R Nagata N Hishiki T Tsujii J 《BMC bioinformatics》2006,7(Z3):S4

Background

Automatic recognition of relations between a specific disease term and its relevant genes or protein terms is an important practice of bioinformatics. Considering the utility of the results of this approach, we identified prostate cancer and gene terms with the ID tags of public biomedical databases. Moreover, considering that genetics experts will use our results, we classified them based on six topics that can be used to analyze the type of prostate cancers, genes, and their relations.

Methods

We developed a maximum entropy-based named entity recognizer and a relation recognizer and applied them to a corpus-based approach. We collected prostate cancer-related abstracts from MEDLINE, and constructed an annotated corpus of gene and prostate cancer relations based on six topics by biologists. We used it to train the maximum entropy-based named entity recognizer and relation recognizer.

Results

Topic-classified relation recognition achieved 92.1% precision for the relation (an increase of 11.0% from that obtained in a baseline experiment). For all topics, the precision was between 67.6 and 88.1%.

Conclusion

A series of experimental results revealed two important findings: a carefully designed relation recognition system using named entity recognition can improve the performance of relation recognition, and topic-classified relation recognition can be effectively addressed through a corpus-based approach using manual annotation and machine learning techniques.


9.

Recon 2.2: from reconstruction to model of human metabolism

Neil Swainston Kieran Smallbone Hooman Hefzi Paul D. Dobson Judy Brewer Michael Hanscho Daniel C. Zielinski Kok Siong Ang Natalie J. Gardiner Jahir M. Gutierrez Sarantos Kyriakopoulos Meiyappan Lakshmanan Shangzhong Li Joanne K. Liu Veronica S. Martínez Camila A. Orellana Lake-Ee Quek Alex Thomas Juergen Zanghellini Nicole Borth Dong-Yup Lee Lars K. Nielsen Douglas B. Kell Nathan E. Lewis Pedro Mendes 《Metabolomics : Official journal of the Metabolomic Society》2016,12(7):109


10.

Regular expressions of MS/MS spectra for partial annotation of metabolite features

Fumio Matsuda 《Metabolomics : Official journal of the Metabolomic Society》2016,12(7):113


11.

Overview of BioCreative II gene normalization

Morgan AA Lu Z Wang X Cohen AM Fluck J Ruch P Divoli A Fundel K Leaman R Hakenberg J Sun C Liu HH Torres R Krauthammer M Lau WW Liu H Hsu CN Schuemie M Cohen KB Hirschman L 《Genome biology》2008,9(Z2):S3

Background:

The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%.

Results:

Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers.
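The two ways of combining system outputs described above, simple voting and pooling every response for maximum recall, can be sketched as follows. The identifiers, the three toy "systems", and the helper names `majority_vote` and `precision_recall` are hypothetical; the actual composite systems in the evaluation also used trained classifiers, which this sketch does not attempt.

```python
from collections import Counter


def majority_vote(system_outputs, min_votes):
    """Keep identifiers proposed by at least `min_votes` systems."""
    votes = Counter(
        gene_id for output in system_outputs for gene_id in set(output)
    )
    return {gene_id for gene_id, count in votes.items() if count >= min_votes}


def precision_recall(predicted, gold):
    """Precision and recall of a predicted identifier set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data with made-up identifiers (not Entrez Gene IDs).
gold = {"G1", "G2", "G3", "G4"}
systems = [
    {"G1", "G2", "G9"},
    {"G1", "G3", "G8"},
    {"G1", "G2", "G3", "G4"},
]

voted = majority_vote(systems, min_votes=2)  # favors precision
pooled = set().union(*systems)               # 'maximum recall' pool

print(precision_recall(voted, gold))   # (1.0, 0.75)
print(precision_recall(pooled, gold))  # recall 1.0, lower precision
```

The toy numbers mirror the trade-off reported above: pooling all responses raises recall at the cost of precision, while voting filters out identifiers that only one system proposes.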

Conclusion:

Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.


12.

Identification of protein complexes from multi-relationship protein interaction networks

Xueyong Li Jianxin Wang Bihai Zhao Fang-Xiang Wu Yi Pan 《Human genomics》2016,10(2):17

Background

Protein complexes play an important role in biological processes. Recent developments in experiments have resulted in the publication of many high-quality, large-scale protein-protein interaction (PPI) datasets, which provide abundant data for computational approaches to the prediction of protein complexes. However, the precision of protein complex prediction still needs to be improved because PPI networks are incomplete and noisy.

Results

Integrating multiple sources of biological information reveals complex and diverse relationships among proteins. Because different types of interactions do not carry the same weight in protein complex prediction, we construct a multi-relationship protein interaction network (MPIN) by integrating PPI network topology with gene ontology annotation information. We then design a novel algorithm named MINE (identifying protein complexes based on Multi-relationship protein Interaction NEtwork) to predict protein complexes with high cohesion and low coupling from the MPIN.

Conclusions

The experiments on yeast data show that MINE outperforms the current methods in terms of both accuracy and statistical significance.


13.

Production, purification, characterization, immobilization, and application of Serrapeptase: a review

Selvarajan Ethiraj Shreya Gopinath 《Frontiers in Biology》2017,12(5):333-348

Background

Serrapeptase is a proteolytic enzyme with many favorable biological properties like anti-inflammatory, analgesic, anti-bacterial, fibrinolytic properties and hence, is widely used in clinical practice for the treatment of many diseases. Although Serrapeptase is widely used, there are very few published papers and the information available about the enzyme is very meagre. Hence this review article compiles all the information about this important enzyme Serrapeptase.

Methods

A literature search against various databases and search engines like PubMed, SpringerLink, Scopus etc. was performed.

Results

We gathered and highlighted all the published information regarding the molecular aspects, properties, sources, production, purification, detection, yield optimization, immobilization, clinical studies, pharmacology, interaction studies, formulation, dosage and safety of the enzyme Serrapeptase.

Conclusion

Serrapeptase is used in many clinical studies against various diseases for its anti-inflammatory, fibrinolytic and analgesic effects. There is insufficient data regarding the safety of the enzyme as a health supplement. Data about the antiatherosclerotic activity, safety, tolerability, efficacy and mechanism of action of the Serrapeptase are still required.


14.

A bedr way of genomic interval processing

Syed Haider Daryl Waggott Emilie Lalonde Clement Fung Fei-Fei Liu Paul C. Boutros 《Source code for biology and medicine》2016,11(1):14

Background

Next-generation sequencing is making it critical to robustly and rapidly handle genomic ranges within standard pipelines. Standard use-cases include annotating sequence ranges with gene or other genomic annotation, merging multiple experiments together and subsequently quantifying and visualizing the overlap. The most widely-used tools for these tasks work at the command-line (e.g. BEDTools) and the small number of available R packages are either slow or have distinct semantics and features from command-line interfaces.

Results

To provide a robust R-based interface to standard command-line tools for genomic coordinate manipulation, we created bedr. This open-source R package can use either BEDTools or BEDOPS as a back-end and performs data-manipulation extremely quickly, creating R data structures that can be readily interfaced with existing computational pipelines. It includes data-visualization capabilities and a number of data-access functions that interface with standard databases like UCSC and COSMIC.
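To make the kind of genomic coordinate manipulation discussed above concrete, here is a minimal Python sketch of one standard operation, merging overlapping ranges. It is not bedr's API and does not call BEDTools or BEDOPS; the function name `merge_intervals`, the tuple layout, and the half-open coordinate convention are assumptions chosen for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals.

    Coordinates are assumed to be half-open, as in BED files, so an interval
    ending at 200 and one starting at 200 are merged.
    """
    merged = []
    # Plain tuple sort: chromosomes are ordered lexically ('chr10' < 'chr2'),
    # which is fine for merging but is not a natural chromosome order.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            _, last_start, last_end = merged[-1]
            merged[-1] = (chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


regions = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr2", 50, 80),
    ("chr1", 400, 500),
]
merged = merge_intervals(regions)
print(merged)
# [('chr1', 100, 300), ('chr1', 400, 500), ('chr2', 50, 80)]
```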

Conclusions

The bedr package provides an open-source solution for manipulating and restructuring genomic interval data in the R programming language, which is commonly used in bioinformatics, and should therefore be useful to bioinformaticians and genomic researchers.


15.

A lost opportunity for science: journals promote data sharing in metabolomics but do not enforce it

Rachel A. Spicer Christoph Steinbeck 《Metabolomics : Official journal of the Metabolomic Society》2018,14(1):16

Introduction

Data sharing is being increasingly required by journals and has been heralded as a solution to the ‘replication crisis’.

Objectives

(i) Review data sharing policies of journals publishing the most metabolomics papers associated with open data and (ii) compare these journals’ policies to those that publish the most metabolomics papers.

Methods

A PubMed search was used to identify metabolomics papers. Metabolomics data repositories were manually searched for linked publications.

Results

Journals that support data sharing are not necessarily those with the most papers associated with open metabolomics data.

Conclusion

Further efforts are required to improve data sharing in metabolomics.


16.

Dynamic cross-talk analysis among TNF-R, TLR-4 and IL-1R signalings in TNFα-induced inflammatory responses

Shih-Kuang Yang Yu-Chao Wang Chun-Cheih Chao Yung-Jen Chuang Chung-Yu Lan Bor-Sen Chen 《BMC medical genomics》2010,3(1):19

Background

Development in systems biology research has accelerated in recent years, and the reconstructions for molecular networks can provide a global view to enable in-depth investigation on numerous system properties in biology. However, we still lack a systematic approach to reconstruct the dynamic protein-protein association networks at different time stages from high-throughput data to further analyze the possible cross-talks among different signaling/regulatory pathways.

Methods

In this study we integrated protein-protein interactions from different databases to construct the rough protein-protein association networks (PPANs) during TNFα-induced inflammation. Next, the gene expression profiles of TNFα-induced HUVEC and a stochastic dynamic model were used to rebuild the significant PPANs at different time stages, reflecting the development and progression of endothelium inflammatory responses. A new cross-talk ranking method was used to evaluate the potential core elements in the related signaling pathways of toll-like receptor 4 (TLR-4) as well as receptors for tumor necrosis factor (TNF-R) and interleukin-1 (IL-1R).

Results

The highly ranked cross-talks which are functionally relevant to the TNFα pathway were identified. A bow-tie structure was extracted from these cross-talk pathways, suggesting the robustness of network structure, the coordination of signal transduction and feedback control for efficient inflammatory responses to different stimuli. Further, several characteristics of signal transduction and feedback control were analyzed.

Conclusions

A systematic approach based on a stochastic dynamic model is proposed to generate insight into the underlying defense mechanisms of inflammation via the construction of corresponding signaling networks upon specific stimuli. In addition, this systematic approach can be applied to other signaling networks under different conditions in different species. The algorithm and method proposed in this study could expedite prospective systems biology research when better experimental techniques for protein expression detection and microarray data with multiple sampling points become available in the future.


17.

Model based dynamics analysis in live cell microtubule images

Altinok A Kiris E Peck AJ Feinstein SC Wilson L Manjunath BS Rose K 《BMC cell biology》2007,8(Z1):S4

Background

The dynamic growing and shortening behaviors of microtubules are central to the fundamental roles played by microtubules in essentially all eukaryotic cells. Traditionally, microtubule behavior is quantified by manually tracking individual microtubules in time-lapse images under various experimental conditions. Manual analysis is laborious, approximate, and often offers limited analytical capability in extracting potentially valuable information from the data.

Results

In this work, we present computer vision and machine-learning based methods for extracting novel dynamics information from time-lapse images. Using actual microtubule data, we estimate statistical models of microtubule behavior that are highly effective in identifying common and distinct characteristics of microtubule dynamic behavior.

Conclusion

Computational methods provide powerful analytical capabilities in addition to traditional analysis methods for studying microtubule dynamic behavior. Novel capabilities, such as building and querying microtubule image databases, are introduced to quantify and analyze microtubule dynamic behavior.


18.

Discovery of A-type procyanidin dimers in yellow raspberries by untargeted metabolomics and correlation based data analysis (total citations: 1; self-citations: 0; citations by others: 1)

Elisabete Carvalho Pietro Franceschi Antje Feller Lorena Herrera Luisa Palmieri Panagiotis Arapitsas Samantha Riccadonna Stefan Martens 《Metabolomics : Official journal of the Metabolomic Society》2016,12(9):144

Introduction

Raspberries are becoming increasingly popular due to their reported health beneficial properties. Despite the presence of only trace amounts of anthocyanins, yellow varieties seem to show similar or better effects in comparison to conventional raspberries.

Objectives

The aim of this work is to characterize the metabolic differences between red and yellow berries, focussing on the compounds showing a higher concentration in yellow varieties.

Methods

The metabolomic profile of 13 red and 12 yellow raspberries (of different varieties, locations and collection dates) was determined by UPLC–TOF-MS. A novel approach based on Pearson correlation on the extracted ion chromatograms was implemented to extract the pseudospectra of the most relevant biomarkers from high energy LC–MS runs. The raw data will be made publicly available on MetaboLights (MTBLS333).

Results

Among the metabolites showing higher concentration in yellow raspberries it was possible to identify a series of compounds showing a pseudospectrum similar to that of A-type procyanidin polymers. The annotation of this group of compounds was confirmed by specific MS/MS experiments and performing standard injections.

Conclusions

In berries lacking anthocyanins the polyphenol metabolism might be shifted to the formation of a novel class of A-type procyanidin polymers.


19.

GeneXplorer: an interactive web application for microarray data visualization and analysis

Christian A Rees Janos Demeter John C Matese David Botstein Gavin Sherlock 《BMC bioinformatics》2004,5(1):141

Background

When publishing large-scale microarray datasets, it is of great value to create supplemental websites where either the full data, or selected subsets corresponding to figures within the paper, can be browsed. We set out to create a CGI application containing many of the features of some of the existing standalone software for the visualization of clustered microarray data.

Results

We present GeneXplorer, a web application for interactive microarray data visualization and analysis in a web environment. GeneXplorer allows users to browse a microarray dataset in an intuitive fashion. It provides simple access to microarray data over the Internet and uses only HTML and JavaScript to display graphic and annotation information. It provides radar and zoom views of the data, allows display of the nearest neighbors to a gene expression vector based on their Pearson correlations and provides the ability to search gene annotation fields.
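The nearest-neighbor display described above ranks genes by the Pearson correlation of their expression vectors. A minimal sketch of that ranking follows; it is written in Python for illustration and is not GeneXplorer's own code. The gene names, expression values, and function names are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors.

    Assumes neither vector is constant (zero variance would divide by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def nearest_neighbors(query, profiles, k):
    """Return the k genes whose profiles correlate best with `query`."""
    scores = [
        (pearson(profiles[query], vector), gene)
        for gene, vector in profiles.items()
        if gene != query
    ]
    scores.sort(reverse=True)
    return [gene for _, gene in scores[:k]]


# Hypothetical expression vectors across four arrays.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same shape as geneA, scaled
    "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite trend
    "geneD": [1.0, 3.0, 2.0, 4.0],  # similar trend, noisier
}
print(nearest_neighbors("geneA", profiles, k=2))  # ['geneB', 'geneD']
```

Because Pearson correlation ignores scale and offset, geneB ranks first even though its absolute values differ from those of geneA.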

Conclusions

The software is released under the permissive MIT Open Source license, and the complete documentation and the entire source code are freely available for download from CPAN at http://search.cpan.org/dist/Microarray-GeneXplorer/.


20.

AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome

Stanke M Tzvetkova A Morgenstern B 《Genome biology》2006,7(Z1):S11.1-S11.8

Background

A large number of gene prediction programs for the human genome exist. These annotation tools use a variety of methods and data sources. In the recent ENCODE genome annotation assessment project (EGASP), some of the most commonly used and recently developed gene-prediction programs were systematically evaluated and compared on test data from the human genome. AUGUSTUS was among the tools that were tested in this project.

Results

AUGUSTUS can be used as an ab initio program, that is, as a program that uses only one single genomic sequence as input information. In addition, it is able to combine information from the genomic sequence under study with external hints from various sources of information. For EGASP, we used genomic sequence alignments as well as alignments to expressed sequence tags (ESTs) and protein sequences as additional sources of information. Within the category of ab initio programs AUGUSTUS predicted significantly more genes correctly than any other ab initio program. At the same time it predicted the smallest number of false positive genes and the smallest number of false positive exons among all ab initio programs. The accuracy of AUGUSTUS could be further improved when additional extrinsic data, such as alignments to EST, protein and/or genomic sequences, was taken into account.

Conclusion

AUGUSTUS turned out to be the most accurate ab initio gene finder among the tested tools. Moreover it is very flexible because it can take information from several sources simultaneously into consideration.
