Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
RelEx--relation extraction using dependency parse trees   Cited by: 4 (self-citations: 0, citations by others: 4)
MOTIVATION: The discovery of regulatory pathways, signal cascades, metabolic processes or disease models requires knowledge of individual relations, e.g. physical or regulatory interactions between genes and proteins. Most interactions mentioned in the free text of biomedical publications are not yet contained in structured databases. RESULTS: We developed RelEx, an approach for relation extraction from free text. It is based on natural language preprocessing that produces dependency parse trees, to which a small number of simple rules are applied. We applied RelEx to a comprehensive set of one million MEDLINE abstracts dealing with gene and protein relations and extracted approximately 150,000 relations with an estimated performance of both 80% precision and 80% recall. AVAILABILITY: The natural language preprocessing tools used are free for academic research. Test sets and relation term lists are available from our website (http://www.bio.ifi.lmu.de/publications/RelEx/).
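
The idea of applying a simple rule to a dependency parse tree can be illustrated with a minimal sketch. It is not RelEx's actual rule set: the edge format, the `nsubj`/`dobj` labels, the function name `extract_relations`, and the entity names `ProtA` and `ProtB` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# A dependency edge as (head word, dependency label, dependent word).
# The parse below is hand-written; a real system would obtain it from a parser.
Edge = Tuple[str, str, str]


def extract_relations(
    edges: List[Edge], entities: Set[str], relation_terms: Set[str]
) -> List[Tuple[str, str, str]]:
    """Apply one rule: a relation verb whose subject and object are both entities."""
    subjects: Dict[str, List[str]] = {}
    objects: Dict[str, List[str]] = {}
    for head, label, dependent in edges:
        # Only verbs from the relation term list and known entities are considered.
        if head in relation_terms and dependent in entities:
            if label == "nsubj":
                subjects.setdefault(head, []).append(dependent)
            elif label == "dobj":
                objects.setdefault(head, []).append(dependent)
    return [
        (subject, verb, obj)
        for verb in subjects
        for subject in subjects[verb]
        for obj in objects.get(verb, [])
    ]


# Hypothetical sentence: "ProtA directly activates ProtB."
edges = [
    ("activates", "nsubj", "ProtA"),
    ("activates", "dobj", "ProtB"),
    ("activates", "advmod", "directly"),
]
relations = extract_relations(edges, {"ProtA", "ProtB"}, {"activates", "inhibits"})
print(relations)
```

The adverb edge is ignored because its dependent is not an entity, so only the subject-verb-object triple is returned.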

2.
To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist focusing specifically on the detection of a limited set of relation types. For systems biology, generic approaches for the detection of a multitude of relation types which in addition are able to process large text corpora are needed but the number of systems meeting both requirements is very limited. We introduce the use of SENNA (“Semantic Extraction using a Neural Network Architecture”), a fast and accurate neural network based Semantic Role Labeling (SRL) program, for the large scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactical parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100 node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as processing speed of the proposed semantic relation extraction approach is sufficient for its large scale application on biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. 
The presented approach bridges the gap between fast, co-occurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.
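
The abstract reports precision and recall but no combined score. As a small worked example, the sketch below computes the standard F1 score (the harmonic mean of precision and recall) for the reported values 0.71 and 0.43. The function name `f1_score` is an assumption, and the resulting figure is derived here rather than taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall values as reported in the abstract above.
f1 = f1_score(0.71, 0.43)
print(f"F1 = {f1:.3f}")
```

With these inputs the harmonic mean comes out at about 0.54, noticeably closer to the recall than to the precision.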

3.
The syntax-first model and the parallel/interactive models make different predictions regarding whether syntactic category processing has a temporal and functional primacy over semantic processing. To further resolve this issue, an event-related potential experiment was conducted on 24 Chinese speakers reading Chinese passive sentences with the passive marker BEI (NP1 + BEI + NP2 + Verb). This construction was selected because it is the most commonly used Chinese passive and closely resembles German passives, upon which the syntax-first hypothesis was primarily based. We manipulated semantic consistency (consistent vs. inconsistent) and syntactic category (noun vs. verb) of the critical verb, yielding four conditions: CORRECT (correct sentences), SEMANTIC (semantic anomaly), SYNTACTIC (syntactic category anomaly), and COMBINED (combined anomalies). Results showed both N400 and P600 effects for sentences with semantic anomaly, with syntactic category anomaly, or with combined anomalies. Converging with recent findings of Chinese ERP studies on various constructions, our study provides further evidence that syntactic category processing does not precede semantic processing in reading Chinese.

4.
The KEGG pathway maps are widely used as a reference data set for inferring high-level functions of the organism or the ecosystem from its genome or metagenome sequence data. The KEGG modules, which are tighter functional units often corresponding to subpathways in the KEGG pathway maps, are designed for better automation of genome interpretation. Each KEGG module is represented by a simple Boolean expression of KEGG Orthology (KO) identifiers (K numbers), enabling automatic evaluation of the completeness of genes in the genome. Here we focus on metabolic functions and introduce reaction modules for improving annotation and signature modules for inferring metabolic capacity. We also describe how genome annotation is performed in KEGG using the manually created KO database and the computationally generated SSDB database. The resulting KEGG GENES database with KO (K number) annotation is a reference sequence database to be compared for automated annotation and interpretation of newly determined genomes.
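
Evaluating a Boolean expression of K numbers against the K numbers annotated in a genome can be sketched as follows. This is a simplified illustration, not KEGG's own notation or software: the nested-tuple encoding, the function name `is_complete`, and the identifiers `K90001` to `K90004` are placeholders chosen for the example and are not meant to refer to real KO entries.

```python
from typing import Set, Tuple, Union

# An expression is either a K number or ("and" | "or", sub-expression, ...).
Expr = Union[str, Tuple]


def is_complete(expr: Expr, genome_kos: Set[str]) -> bool:
    """Return True if the genome's K numbers satisfy the Boolean expression."""
    if isinstance(expr, str):
        return expr in genome_kos
    op, *args = expr
    results = (is_complete(arg, genome_kos) for arg in args)
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical module: step 1, then either of two alternatives, then step 3.
module = ("and", "K90001", ("or", "K90002", "K90003"), "K90004")

complete = is_complete(module, {"K90001", "K90003", "K90004"})
incomplete = is_complete(module, {"K90001", "K90002"})
print(complete, incomplete)
```

The second genome lacks the final step, so the module is judged incomplete even though one of the alternatives in the middle step is present.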

5.
Many computational problems and methods have been proposed for the analysis of biological pathways. Among them, this paper focuses on the extraction of mapping rules of atoms from enzymatic reaction data, which is useful for drug design, simulation of tracer experiments, and consistency checking of pathway databases. Most existing methods for this problem are based on maximal common subgraph algorithms. In this paper, we propose a novel approach based on graph partition and graph isomorphism. We show that this problem is NP-hard in general, but can be solved in polynomial time for wide classes of enzymatic reactions. We also present an O(n^1.5) time algorithm for a special but fundamental class of reactions, where n is the maximum size of compounds appearing in a reaction. We develop practical polynomial-time algorithms in which the Morgan algorithm is used for computing the normal form of a graph; the Morgan algorithm is known to work correctly for most chemical structures. Computational experiments are performed for these practical algorithms using the chemical reaction data stored in the KEGG/LIGAND database. The results of the computational experiments suggest that the practical algorithms are useful in many cases.
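
The core step of the Morgan algorithm, the extended-connectivity refinement, can be sketched briefly. The code below covers only that step, under the usual textbook formulation (start from atom degrees, repeatedly sum neighbor values, stop when the number of distinct values no longer grows). It does not reproduce the paper's normal-form computation, and the function name `morgan_labels` and the adjacency-list input are assumptions.

```python
from typing import Dict, List


def morgan_labels(adjacency: Dict[int, List[int]]) -> Dict[int, int]:
    """Extended-connectivity values: equal values mark atoms not yet distinguished."""
    # Initial value of each atom is its degree (number of neighbors).
    labels = {atom: len(neighbors) for atom, neighbors in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # New value of an atom is the sum of its neighbors' current values.
        refined = {
            atom: sum(labels[nbr] for nbr in adjacency[atom]) for atom in adjacency
        }
        refined_classes = len(set(refined.values()))
        # Stop once another round no longer separates more atoms.
        if refined_classes <= n_classes:
            return labels
        labels, n_classes = refined, refined_classes


# Carbon skeleton of a five-atom chain: 0-1-2-3-4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = morgan_labels(chain)
print(labels)
```

For the chain, the two end atoms share one value, the two inner neighbors share another, and the central atom gets its own, which matches the mirror symmetry of the structure.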

6.

Background

Text mining is increasingly used in the biomedical domain because of its ability to automatically gather information from large numbers of scientific articles. One important task in biomedical text mining is relation extraction, which aims to identify designated relations among biological entities reported in the literature. A relation extraction system achieving high performance is expensive to develop because of the substantial time and effort required for its design and implementation. Here, we report a novel framework to facilitate the development of a pattern-based biomedical relation extraction system. It has several unique design features: (1) leveraging syntactic variations possible in a language and automatically generating extraction patterns in a systematic manner, (2) applying sentence simplification to improve the coverage of extraction patterns, and (3) identifying referential relations between a syntactic argument of a predicate and the actual target expected in the relation extraction task.

Results

A relation extraction system derived using the proposed framework achieved overall F-scores of 72.66% for the Simple events and 55.57% for the Binding events on the BioNLP-ST 2011 GE test set, comparing favorably with the top performing systems that participated in the BioNLP-ST 2011 GE task. We obtained similar results on the BioNLP-ST 2013 GE test set (80.07% and 60.58%, respectively). We conducted additional experiments on the training and development sets to provide a more detailed analysis of the system and its individual modules. This analysis indicates that without increasing the number of patterns, simplification and referential relation linking play a key role in the effective extraction of biomedical relations.

Conclusions

In this paper, we present a novel framework for fast development of relation extraction systems. The framework requires only a list of triggers as input, and does not need information from an annotated corpus. Thus, we reduce the involvement of domain experts, who would otherwise have to provide manual annotations and help with the design of hand-crafted patterns. We demonstrate how our framework is used to develop a system which achieves state-of-the-art performance on a public benchmark corpus.

7.
8.
We introduce a novel computer implementation of the Unification-Space parser (Vosse and Kempen in Cognition 75:105–143, 2000) in the form of a localist neural network whose dynamics are based on interactive activation and inhibition. The wiring of the network is determined by Performance Grammar (Kempen and Harbusch in Verb constructions in German and Dutch. Benjamins, Amsterdam, 2003), a lexicalist formalism with feature unification as binding operation. While the network is processing input word strings incrementally, the evolving shape of parse trees is represented in the form of changing patterns of activation in nodes that code for syntactic properties of words and phrases, and for the grammatical functions they fulfill. The system is capable, at least qualitatively and rudimentarily, of simulating several important dynamic aspects of human syntactic parsing, including garden-path phenomena and reanalysis, effects of complexity (various types of clause embeddings), fault-tolerance in case of unification failures and unknown words, and predictive parsing (expectation-based analysis, surprisal effects). English is the target language of the parser described.

9.
MOTIVATION: Neurodegenerative disorders (NDDs) are progressive and fatal disorders, which are commonly characterized by the intracellular or extracellular presence of abnormal protein aggregates. The identification and verification of proteins interacting with causative gene products are effective ways to understand their physiological and pathological functions. The objective of this research is to better understand common molecular pathogenic mechanisms in NDDs by employing protein-protein interaction networks, the domain characteristics commonly identified in NDDs and correlation among NDDs based on domain information. RESULTS: By reviewing the published literature in PubMed, we created pathway maps in Kyoto Encyclopedia of Genes and Genomes (KEGG) for the protein-protein interactions in six NDDs: Alzheimer's disease (AD), Parkinson's disease (PD), amyotrophic lateral sclerosis (ALS), Huntington's disease (HD), dentatorubral-pallidoluysian atrophy (DRPLA) and prion disease (PRION). We also collected data on 201 interacting proteins and 13 compounds with 282 interactions from the literature. We found 19 proteins common to these six NDDs. These common proteins were mainly involved in the apoptosis and MAPK signaling pathways. We expanded the interaction network by adding protein interaction data from the Human Protein Reference Database and gene expression data from the Human Gene Expression Index Database. We then carried out domain analysis on the extended network and found the characteristic domains, such as 14-3-3 protein, phosphotyrosine interaction domain and caspase domain, for the common proteins. Moreover, we found a relatively high correlation between AD, PD, HD and PRION, but not ALS or DRPLA, in terms of the protein domain distributions. AVAILABILITY: http://www.genome.jp/kegg/pathway/hsa/hsa01510.html (KEGG pathway maps for NDDs).

10.
Word Sense Disambiguation (WSD) is the task of determining which sense of an ambiguous word (a word with multiple meanings) is intended in a particular use of that word, by considering its context. A sentence is considered ambiguous if it contains one or more ambiguous words. In practice, a sentence classified as ambiguous usually has multiple interpretations, only one of which is correct. We propose an unsupervised method that exploits knowledge-based approaches for word sense disambiguation using the Harmony Search Algorithm (HSA) based on a Stanford dependencies generator (HSDG). The role of the dependency generator is to parse sentences to obtain their dependency relations, whereas the goal of using the HSA is to maximize the overall semantic similarity of the set of parsed words. HSA invokes a combination of semantic similarity and relatedness measurements, i.e., Jiang and Conrath (jcn) and an adapted Lesk algorithm, to compute the HSA fitness function. Our proposed method was evaluated on benchmark datasets and yielded results comparable to state-of-the-art WSD methods. In order to evaluate the effectiveness of the dependency generator, we apply the same methodology without the parser, but with a window of words. The empirical results demonstrate that the proposed method is able to produce effective solutions for most instances of the datasets used.

11.
Kernel approaches for genic interaction extraction   Cited by: 2 (self-citations: 0, citations by others: 2)

12.

Background

A crucial question for understanding sentence comprehension is the openness of syntactic and semantic processes for other sources of information. Using event-related potentials in a dual task paradigm, we had previously found that sentence processing takes into consideration task relevant sentence-external semantic but not syntactic information. In that study, internal and external information both varied within the same linguistic domain—either semantic or syntactic. Here we investigated whether across-domain sentence-external information would impact within-sentence processing.

Methodology

In one condition, adjectives within visually presented sentences of the structure [Det]-[Noun]-[Adjective]-[Verb] were semantically correct or incorrect. Simultaneously with the noun, auditory adjectives were presented that morphosyntactically matched or mismatched the visual adjectives with respect to gender.

Findings

As expected, semantic violations within the sentence elicited N400 and P600 components in the ERP. However, these components were not modulated by syntactic matching of the sentence-external auditory adjective. In a second condition, syntactic within-sentence correctness-variations were combined with semantic matching variations between the auditory and the visual adjective. Here, syntactic within-sentence violations elicited a LAN and a P600 that did not interact with semantic matching of the auditory adjective. However, semantic mismatching of the latter elicited a frontocentral positivity, presumably related to an increase in discourse level complexity.

Conclusion

The current findings underscore the open versus algorithmic nature of semantic and syntactic processing, respectively, during sentence comprehension.

13.
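If a drug is known to treat a given disease, it may also be effective against other diseases with similar phenotypes. Large-scale computation of disease phenotype similarity can therefore help discover new treatments for diseases. We downloaded phenotype information for 3,742 diseases from OMIM and 13,721 annotation terms related to anatomy and disease symptoms from the MeSH thesaurus. We searched for each of these MeSH terms in the phenotype descriptions of the 3,742 diseases to obtain the list of MeSH terms associated with each disease, and then used a method based on semantic analysis to systematically compute the pairwise phenotype similarity matrix. We found that the biological pathways associated with the most diseases include cancer pathways, the insulin signaling pathway, the hypertrophic cardiomyopathy pathway and the cell adhesion pathway. As the phenotype similarity of a disease pair increases, so does the probability that the two diseases involve the same KEGG pathways, which supports the reliability of the method. Disease phenotype similarity can complement gene-level disease similarity and may offer a new route for drug discovery research.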
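
The pipeline of matching vocabulary terms against phenotype text and then computing pairwise similarities can be sketched as below. Two simplifications are assumed and should not be read as the authors' method: terms are matched by naive case-insensitive substring search, and Jaccard overlap stands in for the semantic similarity measure, which the abstract does not specify. The disease names, descriptions and vocabulary are invented for illustration.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def find_terms(text: str, vocabulary: Set[str]) -> Set[str]:
    """Vocabulary terms occurring in the text (naive substring match)."""
    lowered = text.lower()
    return {term for term in vocabulary if term.lower() in lowered}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(
    term_sets: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], float]:
    """Similarity for every unordered pair of diseases, keyed in sorted order."""
    return {
        (d1, d2): jaccard(term_sets[d1], term_sets[d2])
        for d1, d2 in combinations(sorted(term_sets), 2)
    }


# Hypothetical vocabulary and phenotype descriptions.
vocabulary = {"heart", "muscle", "liver"}
descriptions = {
    "disease_A": "Weakness of heart muscle",
    "disease_B": "Enlarged heart and liver",
    "disease_C": "No listed findings",
}
term_sets = {name: find_terms(text, vocabulary) for name, text in descriptions.items()}
sim = similarity_matrix(term_sets)
print(sim)
```

Substring matching is the weakest part of this sketch, since a short term can match inside an unrelated longer word; a real pipeline would tokenize the text and match whole terms.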

14.
To elucidate the relationships between syntactic and semantic processes, one interesting question is how syntactic structures are constructed by the argument structure of a verb, where each argument corresponds to a semantic role of each noun phrase (NP). Here we examined the effects of possessivity [sentences with or without a possessor] and canonicity [canonical or noncanonical word orders] using Japanese ditransitive sentences. During a syntactic decision task, the syntactic structure of each sentence would be constructed in an incremental manner based on the predicted argument structure of the ditransitive verb in a verb-final construction. Using magnetoencephalography, we found a significant canonicity effect on the current density in the left inferior frontal gyrus (IFG) at 530-550 ms after the verb onset. This effect was selective to canonical sentences, and significant even when the preceding NP was physically identical. We suggest that the predictive effects associated with syntactic processing became larger for canonical sentences, where the NPs and verb were merged with a minimum structural distance, leading to the left IFG activations. For monotransitive and intransitive verbs, in which structural computation of the sentences was simpler than that of ditransitive sentences, we observed a significant effect selective to noncanonical sentences in the temporoparietal regions during 480-670 ms. This effect probably reflects difficulty in semantic processing of noncanonical sentences. These results demonstrate that the left IFG plays a predictive role in syntactic processing, which depends on the canonicity determined by argument structures, whereas other temporoparietal regions would subserve more semantic aspects of sentence processing.

15.
MOTIVATION: We seek to determine the accuracy of computational methods for predicting metabolic pathways in sequenced genomes, and to understand the contributions of both the prediction algorithms, and the reference pathway databases used by those algorithms, to the prediction accuracy. RESULTS: The comparisons we performed were as follows. (1) We compared two predictions of the pathway complements of Helicobacter pylori that were computed by an early version of our pathway-prediction algorithm: prediction A used the EcoCyc E. coli pathway DB as the reference database (DB) for prediction, and prediction B used the MetaCyc pathway DB (a superset of EcoCyc) as the reference pathway DB. The MetaCyc-based prediction contained 75% more pathway predictions, but we believe a significant number of those predictions were false positives. (2) We compared two predictions of the pathway complement of H. pylori that used MetaCyc as the reference pathway DB, but that used different algorithms: the original PathoLogic algorithm, and an enhanced version of the algorithm designed to eliminate false-positive pathway predictions. The improved algorithm predicted 30% fewer metabolic pathways than the original algorithm; all of the eliminated pathways are believed to be false-positive predictions. (3) We compared the 98 pathways predicted by the enhanced algorithm with the results of a manual analysis of the pathways of H. pylori. Results: 40 of the computationally predicted pathways were consistent with the manual analysis, 13 pathways are considered false-positive predictions, and four pathways had partially overlapping topologies. Twenty-six predicted pathways were not mentioned in the manual analysis; we believe these are correct predictions by PathoLogic that were not found by the manual analysis. Five pathways from the manual analysis were not found computationally. 
Agreement between the computational and manual predictions was good overall, with the computational analysis inferring many pathways that the manual analysis did not identify. Ultimately the manual analysis is also partially speculative, and therefore is not an absolute measure of correctness. The algorithm is designed to err on the side of more false positives to bring more potential pathways to the user's attention. The resulting H. pylori pathway DB is freely available at http://ecocyc.org:1555/HPY/organism-summary?object=HPY. AVAILABILITY: The Pathway Tools software is freely available to academic users, and is available to commercial users for a fee. Contact pkarp@ai.sri.com for information on obtaining the software.

16.
Shang Y, Li Y, Lin H, Yang Z. PLoS ONE, 2011, 6(8): e23862
Automatic text summarization for a biomedical concept can help researchers efficiently get the key points of a topic from a large amount of biomedical literature. In this paper, we present a method for generating a text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate a text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance and that our results are better than those of the MEAD system, a well-known tool for text summarization.

17.
SUMMARY: LinkinPath is a pathway mapping and analysis tool that enables users to explore and visualize the list of gene/protein sequences through various Flash-driven interactive web interfaces including KEGG pathway maps, functional composition maps (TreeMaps), molecular interaction/reaction networks and pathway-to-pathway networks. Users can submit single or multiple datasets of gene/protein sequences to LinkinPath to (i) determine the co-occurrence and co-absence of genes/proteins on animated KEGG pathway maps; (ii) compare functional compositions within and among the datasets using TreeMaps; (iii) analyze the statistically enriched pathways across the datasets; (iv) build the pathway-to-pathway networks for each dataset; (v) explore potential interaction/reaction paths between pathways; and (vi) identify common pathway-to-pathway networks across the datasets. AVAILABILITY: LinkinPath is freely available to all interested users at http://www.biotec.or.th/isl/linkinpath/.
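
The abstract does not say which statistic LinkinPath uses for pathway enrichment. A common choice for this kind of analysis is the one-sided hypergeometric test, sketched below purely as an illustration; the function name `enrichment_p_value`, its parameter names and the example numbers are all assumptions.

```python
from math import comb


def enrichment_p_value(
    population: int, pathway_size: int, sample_size: int, hits: int
) -> float:
    """P(X >= hits) when sample_size genes are drawn from population genes,
    of which pathway_size belong to the pathway (hypergeometric upper tail)."""
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    favorable = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 10 genes overall, 3 in the pathway,
# 3 genes submitted, 2 of which fall in the pathway.
p_value = enrichment_p_value(population=10, pathway_size=3, sample_size=3, hits=2)
print(f"p = {p_value:.4f}")
```

In practice such p-values are computed for many pathways at once, so a multiple-testing correction would normally follow; that step is left out here.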

18.
Gene regulatory networks are a crucial aspect of systems biology in describing molecular mechanisms of the cell. Various computational models rely on random gene selection to infer such networks from microarray data. While incorporation of prior knowledge into data analysis has been deemed important, in practice, it has generally been limited to referencing genes in probe sets and using curated knowledge bases. We investigate the impact of augmenting microarray data with semantic relations automatically extracted from the literature, with the view that relations encoding gene/protein interactions eliminate the need for random selection of components in non-exhaustive approaches, producing a more accurate model of cellular behavior. A genetic algorithm is then used to optimize the strength of interactions using microarray data and an artificial neural network fitness function. The result is a directed and weighted network providing the individual contribution of each gene to its target. For testing, we used invasive ductal carcinoma of the breast to query the literature and a microarray set containing gene expression changes in these cells over several time points. Our model demonstrates significantly better fitness than the state-of-the-art model, which relies on an initial random selection of genes. Comparison to the component pathways of the KEGG Pathways in Cancer map reveals that the resulting networks contain both known and novel relationships. The p53 pathway results were manually validated in the literature. 60% of non-KEGG relationships were supported (74% for highly weighted interactions). The method was then applied to yeast data and our model again outperformed the comparison model. Our results demonstrate the advantage of combining gene interactions extracted from the literature in the form of semantic relations with microarray analysis in generating contribution-weighted gene regulatory networks. 
This methodology can make a significant contribution to understanding the complex interactions involved in cellular behavior and molecular physiology.

19.

Background  

Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria.
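
Scoring a parser on dependency graphs typically comes down to comparing sets of labeled edges. The sketch below shows that comparison in its simplest form and is not the paper's evaluation procedure: the edge format, the labels, the function name `score_dependencies` and the example sentence are assumptions.

```python
from typing import Set, Tuple

# A labeled dependency edge as (head word, dependency label, dependent word).
Edge = Tuple[str, str, str]


def score_dependencies(gold: Set[Edge], predicted: Set[Edge]) -> Tuple[float, float]:
    """Precision and recall of predicted edges against gold-standard edges."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sentence: "ProtA directly binds ProtB."
gold = {
    ("binds", "nsubj", "ProtA"),
    ("binds", "dobj", "ProtB"),
    ("binds", "advmod", "directly"),
}
predicted = {
    ("binds", "nsubj", "ProtA"),
    ("binds", "dobj", "ProtB"),
    ("ProtB", "amod", "directly"),  # adverb attached to the wrong head
}
precision, recall = score_dependencies(gold, predicted)
print(precision, recall)
```

Restricting both sets to the labels that matter for a given task, for example subjects and objects only, is one simple way to tailor such a score to application-driven criteria.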

20.
KEGG Mapper for inferring cellular functions from protein sequences   Cited by: 1 (self-citations: 0, citations by others: 1)
KEGG is a reference knowledge base for biological interpretation of large-scale molecular datasets, such as genome and metagenome sequences. It accumulates experimental knowledge about high-level functions of the cell and the organism represented in terms of KEGG molecular networks, including KEGG pathway maps, BRITE hierarchies, and KEGG modules. By the process called KEGG mapping, a set of protein coding genes in the genome, for example, can be converted to KEGG molecular networks enabling interpretation of cellular functions and other high-level features. Here we report a new version of KEGG Mapper, a suite of KEGG mapping tools available at the KEGG website ( https://www.kegg.jp/ or https://www.genome.jp/kegg/ ), together with the KOALA family tools for automatic assignment of KO (KEGG Orthology) identifiers used in the mapping.
