首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
MOTIVATION: With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Since automatic functional annotations of genes are quite useful for interpreting large amounts of high-throughput data efficiently, the demand for automatic extraction of information related to gene functions from text has been increasing. RESULTS: We have developed a method for automatically extracting the biological process functions of genes/protein/families based on Gene Ontology (GO) from text using a shallow parser and sentence structure analysis techniques. When the gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the gene/protein/family name dictionaries developed by our group. To achieve wide recognition of the gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques. A preliminary experiment demonstrated that our method has an estimated recall of 54-64% with a precision of 91-94% for actually described functions in abstracts. When applied to the PUBMED, it extracted over 190 000 gene-GO relationships and 150 000 family-GO relationships for major eukaryotes.  相似文献   

2.
Bell L  Chowdhary R  Liu JS  Niu X  Zhang J 《PloS one》2011,6(6):e21474
A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of such information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction of unstructured text in scientific literature. The relationship information we integrated in this study includes protein-protein interactions, protein/gene regulations, protein-small molecule interactions, protein-GO relationships, protein-pathway relationships, and pathway-disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses--the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs.  相似文献   

3.
Connecting the dots between PubMed abstracts   总被引:1,自引:0,他引:1  
  相似文献   

4.
MOTIVATION: The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This remains a challenging task due to the irregularities and ambiguities in gene and protein nomenclature. We propose to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation. RESULTS: We present a method for tagging gene and protein names in biomedical text using a combination of statistical and knowledge-based strategies. This method incorporates automatically generated rules from a transformation-based part-of-speech tagger, and manually generated rules from morphological clues, low frequency trigrams, indicator terms, suffixes and part-of-speech information. Results of an experiment on a test corpus of 56K MEDLINE documents demonstrate that our method to extract gene and protein names can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets. AVAILABILITY: The programs are available on request from the authors.  相似文献   

5.
MOTIVATION: As research into disease pathology and cellular function continues to generate vast amounts of data pertaining to protein, gene and small molecule (PGSM) interactions, there exists a critical need to capture these results in structured formats allowing for computational analysis. Although many efforts have been made to create databases that store this information in computer readable form, populating these sources largely requires a manual process of interpreting and extracting interaction relationships from the biological research literature. Being able to efficiently and accurately automate the extraction of interactions from unstructured text, would greatly improve the content of these databases and provide a method for managing the continued growth of new literature being published. RESULTS: In this paper, we describe a system for extracting PGSM interactions from unstructured text. By utilizing a lexical analyzer and context free grammar (CFG), we demonstrate that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision. Our results show that this technique achieved a recall rate of 83.5% and a precision rate of 93.1% for recognizing PGSM names and a recall rate of 63.9% and a precision rate of 70.2% for extracting interactions between these entities. In contrast to other published techniques, the use of a CFG significantly reduces the complexities of natural language processing by focusing on domain specific structure as opposed to analyzing the semantics of a given language. Additionally, our approach provides a level of abstraction for adding new rules for extracting other types of biological relationships beyond PGSM relationships. AVAILABILITY: The program and corpus are available by request from the authors.  相似文献   

6.
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.  相似文献   

7.
An architecture for biological information extraction and representation   总被引:1,自引:0,他引:1  
Motivations: Technological advances in biomedical research are generating a plethora of heterogeneous data at a high rate. There is a critical need for extraction, integration and management tools for information discovery and synthesis from these heterogeneous data. RESULTS: In this paper, we present a general architecture, called ALFA, for information extraction and representation from diverse biological data. The ALFA architecture consists of: (i) a networked, hierarchical, hyper-graph object model for representing information from heterogeneous data sources in a standardized, structured format; and (ii) a suite of integrated, interactive software tools for information extraction and representation from diverse biological data sources. As part of our research efforts to explore this space, we have currently prototyped the ALFA object model and a set of interactive software tools for searching, filtering, and extracting information from scientific text. In particular, we describe BioFerret, a meta-search tool for searching and filtering relevant information from the web, and ALFA Text Viewer, an interactive tool for user-guided extraction, disambiguation, and representation of information from scientific text. We further demonstrate the potential of our tools in integrating the extracted information with experimental data and diagrammatic biological models via the common underlying ALFA representation. CONTACT: aditya_vailaya@agilent.com.  相似文献   

8.
MOTIVATION: Since their initial development, integration and construction of databases for molecular-level data have progressed. Though biological molecules are related to each other and form a complex system, the information is stored in the vast archives of the literature or in diverse databases. There is no unified naming convention for biological object, and biological terms may be ambiguous or polysemic. This makes the integration and interaction of databases difficult. In order to eliminate these problems, machine-readable natural language resources appear to be quite promising. We have developed a workbench for protein name abbreviation dictionary (PNAD) building. RESULTS: We have developed PNAD Construction Support System (PNAD-CSS), which offers various convenient facilities to decrease the construction costs of a protein name abbreviation dictionary of which entries are collected from abstracts in biomedical papers. The system allows the users to concentrate on higher level interpretation by removing some troublesome tasks, e.g. management of abstracts, extracting protein names and their abbreviations, and so on. To extract a pair of protein names and abbreviations, we have developed a hybrid system composed of the PROPER System and the PNAD System. The PNAD System can extract the pairs from parenthetical-paraphrases involved in protein names, the PROPER System identified these paris, with 98.95% precision, 95.56% recall and 97.58% complete precision. AVAILABILITY: PROPER System is freely available from http://www.hgc.inc.u-tokyo.ac.jp/service/tooldoc /KeX/intro.html. The other software are also available on request. Contact the authors. CONTACT: mikio@ims.u-tokyo.ac.jp  相似文献   

9.
MOTIVATION: Each protein performs its functions within some specific locations in a cell. This subcellular location is important for understanding protein function and for facilitating its purification. There are now many computational techniques for predicting location based on sequence analysis and database information from homologs. A few recent techniques use text from biological abstracts: our goal is to improve the prediction accuracy of such text-based techniques. We identify three techniques for improving text-based prediction: a rule for ambiguous abstract removal, a mechanism for using synonyms from the Gene Ontology (GO) and a mechanism for using the GO hierarchy to generalize terms. We show that these three techniques can significantly improve the accuracy of protein subcellular location predictors that use text extracted from PubMed abstracts whose references are recorded in Swiss-Prot.  相似文献   

10.
DIP: the database of interacting proteins   总被引:24,自引:3,他引:21  
The Database of Interacting Proteins (DIP; http://dip.doe-mbi.ucla.edu) is a database that documents experimentally determined protein-protein interactions. This database is intended to provide the scientific community with a comprehensive and integrated tool for browsing and efficiently extracting information about protein interactions and interaction networks in biological processes. Beyond cataloging details of protein-protein interactions, the DIP is useful for understanding protein function and protein-protein relationships, studying the properties of networks of interacting proteins, benchmarking predictions of protein-protein interactions, and studying the evolution of protein-protein interactions.  相似文献   

11.
12.

Background  

The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature.  相似文献   

13.
Genes that are indispensable for survival are termed essential genes. The analysis and identification of essential genes are very important for understanding the minimal requirements of cellular survival and for practical purposes. Proteins do not exert their function in isolation of one another but rather interact together in PPI networks. A global analysis of protein interaction networks provides an effective way to elucidate the relationships between proteins. With the recent large-scale identifications of essential genes and the production of large amounts of PPIs in humans, we are able to investigate the topological properties and biological properties of essential genes. However, until recently, no one has ever investigated human essential genes using topological and biological properties. In this study, for the first time, 28 topological properties and 22 biological properties were used to investigate the characteristics of essential and non-essential genes in humans. Most of the properties were statistically discriminative between essential and non-essential genes. The F-score was used to estimate the essentiality of each property. The GO-enrichment analysis was performed to investigate the functions of the essential and non-essential genes. Finally, based on the topological features and the biological characteristics, a machine-learning classifier was constructed to predict the essential genes. The results of the jackknife test and 10-fold cross validation test are encouraging, indicating that our classifier is an effective human essential gene discovery method.  相似文献   

14.
MOTIVATION: Full-text documents potentially hold more information than their abstracts, but require more resources for processing. We investigated the added value of full text over abstracts in terms of information content and occurrences of gene symbol--gene name combinations that can resolve gene-symbol ambiguity. RESULTS: We analyzed a set of 3902 biomedical full-text articles. Different keyword measures indicate that information density is highest in abstracts, but that the information coverage in full texts is much greater than in abstracts. Analysis of five different standard sections of articles shows that the highest information coverage is located in the results section. Still, 30-40% of the information mentioned in each section is unique to that section. Only 30% of the gene symbols in the abstract are accompanied by their corresponding names, and a further 8% of the gene names are found in the full text. In the full text, only 18% of the gene symbols are accompanied by their gene names.  相似文献   

15.
A web-based version of the RLIMS-P literature mining system was developed for online mining of protein phosphorylation information from MEDLINE abstracts. The online tool presents extracted phosphorylation objects (phosphorylated proteins, phosphorylation sites and protein kinases) in summary tables and full reports with evidence-tagged abstracts. The tool further allows mapping of phosphorylated proteins to protein entries in the UniProt Knowledgebase based on PubMed ID and/or protein name. The literature mining, coupled with database association, allows retrieval of rich biological information for the phosphorylated proteins and facilitates database annotation of phosphorylation features.  相似文献   

16.
RelEx--relation extraction using dependency parse trees   总被引:4,自引:0,他引:4  
MOTIVATION: The discovery of regulatory pathways, signal cascades, metabolic processes or disease models requires knowledge on individual relations like e.g. physical or regulatory interactions between genes and proteins. Most interactions mentioned in the free text of biomedical publications are not yet contained in structured databases. RESULTS: We developed RelEx, an approach for relation extraction from free text. It is based on natural language preprocessing producing dependency parse trees and applying a small number of simple rules to these trees. We applied RelEx on a comprehensive set of one million MEDLINE abstracts dealing with gene and protein relations and extracted approximately 150,000 relations with an estimated performance of both 80% precision and 80% recall. AVAILABILITY: The used natural language preprocessing tools are free for use for academic research. Test sets and relation term lists are available from our website (http://www.bio.ifi.lmu.de/publications/RelEx/).  相似文献   

17.
MOTIVATION: A large volume of experimental data on protein phosphorylation is buried in the fast-growing PubMed literature. While of great value, such information is limited in databases owing to the laborious process of literature-based curation. Computational literature mining holds promise to facilitate database curation. RESULTS: A rule-based system, RLIMS-P (Rule-based LIterature Mining System for Protein Phosphorylation), was used to extract protein phosphorylation information from MEDLINE abstracts. An annotation-tagged literature corpus developed at PIR was used to evaluate the system for finding phosphorylation papers and extracting phosphorylation objects (kinases, substrates and sites) from abstracts. RLIMS-P achieved a precision and recall of 91.4 and 96.4% for paper retrieval, and of 97.9 and 88.0% for extraction of substrates and sites. Coupling the high recall for paper retrieval and high precision for information extraction, RLIMS-P facilitates literature mining and database annotation of protein phosphorylation.  相似文献   

18.
The central problems of vision are often divided into object identification and localization. Object identification, at least at fine levels of discrimination, may require the application of top-down knowledge to resolve ambiguous image information. Utilizing top-down knowledge, however, may require the initial rapid access of abstract object categories based on low-level image cues. Does object localization require a different set of operating principles than object identification or is category determination also part of the perception of depth and spatial layout? Three-dimensional graphics movies of objects and their cast shadows are used to argue that identifying perceptual categories is important for determining the relative depths of objects. Processes that can identify the causal class (e.g. the kind of material) that generates the image data can provide information to determine the spatial relationships between surfaces. Changes in the blurriness of an edge may be characteristically associated with shadows caused by relative motion between two surfaces. The early identification of abstract events such as moving object/shadow pairs may also be important for depth from shadows. Knowledge of how correlated motion in the image relates to an object and its shadow may provide a reliable cue to access such event categories.  相似文献   

19.
Extraction of regulatory gene/protein networks from Medline   总被引:2,自引:0,他引:2  
MOTIVATION: We have previously developed a rule-based approach for extracting information on the regulation of gene expression in yeast. The biomedical literature, however, contains information on several other equally important regulatory mechanisms, in particular phosphorylation, which we now expanded for our rule-based system also to extract. RESULTS: This paper presents new results for extraction of relational information from biomedical text. We have improved our system, STRING-IE, to capture both new types of linguistic constructs as well as new types of biological information [i.e. (de-)phosphorylation]. The precision remains stable with a slight increase in recall. From almost one million PubMed abstracts related to four model organisms, we manage to extract regulatory networks and binary phosphorylations comprising 3,319 relation chunks. The accuracy is 83-90% and 86-95% for gene expression and (de-)phosphorylation relations, respectively. To achieve this, we made use of an organism-specific resource of gene/protein names considerably larger than those used in most other biology related information extraction approaches. These names were included in the lexicon when retraining the part-of-speech (POS) tagger on the GENIA corpus. For the domain in question, an accuracy of 96.4% was attained on POS tags. It should be noted that the rules were developed for yeast and successfully applied to both abstracts and full-text articles related to other organisms with comparable accuracy. AVAILABILITY: The revised GENIA corpus, the POS tagger, the extraction rules and the full sets of extracted relations are available from http://www.bork.embl.de/Docu/STRING-IE  相似文献   

20.
Protein–protein interaction extraction through biological literature curation is widely employed for proteome analysis. There is a strong need for a tool that can assist researchers in extracting comprehensive PPI information through literature curation, which is critical in research on protein, for example, construction of protein interaction network, identification of protein signaling pathway, and discovery of meaningful protein interaction. However, most of current tools can only extract PPI relations. None of them are capable of extracting other important PPI information, such as interaction directions, effects, and functional annotations. To address these issues, this paper proposes PPICurator, a novel tool for extracting comprehensive PPI information with a variety of logic and syntax features based on a new support vector machine classifier. PPICurator provides a friendly web‐based user interface. It is a platform that automates the extraction of comprehensive PPI information through literature, including PPI relations, as well as their confidential scores, interaction directions, effects, and functional annotations. Thus, PPICurator is more comprehensive than state‐of‐the‐art tools. Moreover, it outperforms state‐of‐the‐art tools in the accuracy of PPI relation extraction measured by F‐score and recall on the widely used open datasets. PPICurator is available at https://ppicurator.hupo.org.cn .  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号