A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools
Authors: Karin Verspoor, Kevin Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner Jr, Michael Bada, Martha Palmer, Lawrence E Hunter
Affiliations: 1. Computational Bioscience Program, U. Colorado School of Medicine, 12801 E 17th Ave, MS 8303, Aurora, CO 80045, USA
2. Department of Linguistics, University of Colorado Boulder, 290 Hellems, Boulder, CO 80309, USA
3. Institute of Cognitive Science, University of Colorado Boulder, MUEN PSYCH Building D414, Boulder, CO 80309, USA
4. Department of Computer Science, Brandeis University, MS 018, Waltham, MA 02454, USA
Abstract: BACKGROUND: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. RESULTS: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. CONCLUSIONS: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.
This article is indexed in PubMed, SpringerLink, and other databases.