Similar literature
1.
2.
A high proportion of life science research is gene-oriented: scientists aim to investigate the roles that genes play in biological processes and their involvement in biological mechanisms. As a result, gene names and their related information are among the main objects of interest in the biomedical literature. Although gene mention recognition has made significant progress, its results are still insufficient for direct use because gene names are ambiguous. Gene normalization (GN) goes beyond the recognition task by linking a gene mention to a database ID. Unlike most previous work, we approach GN at the instance level and evaluate its overall performance on the recognition and normalization steps in abstracts and full texts. We release the first instance-level gene normalization (IGN) corpus in the BioC format, which includes annotations for the boundaries of all gene mentions and the corresponding IDs for human gene mentions. Species information, along with existing co-reference chains and full name/abbreviation pairs, is also provided for each gene mention. Using the released corpus, we have designed a collective instance-level GN approach that uses not only the contextual information of each individual instance, but also the relations among instances and the inherent characteristics of full-text sections. Our experimental results show that our collective approach can achieve an F-score of 0.743. Exploiting section characteristics in full-text articles can improve the F-scores of information-lacking sections by up to 1.8%. In addition, the proposed refinement process improved the F-score of gene mention recognition by 0.125 and that of GN by 0.03. Although current experimental results are limited to the human species, we intend to keep updating the annotations of the IGN corpus and to examine how the proposed approach can be extended to other species.

3.
Users seeking support and information on a wide range of medical conditions have written millions of public posts on medical message boards. It has been shown that these posts can be used to gain a greater understanding of patients' experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge. Extant entity recognition methods used for more structured text are not sufficient because message posts present additional challenges: they contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or to a particular message board, and mentions of the authors' personal lives. The main contribution of this paper is a system that automatically de-identifies the authors of message board posts, taking these challenges into account. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text such as operative reports, pathology reports, discharge summaries, or newswire.

4.
MOTIVATION: The scientific literature contains a wealth of information about biological systems. Manual curation lacks the scalability to extract this information because of the ever-increasing number of papers being published. The development and application of text mining technologies have been proposed as a way of dealing with this problem. However, the inter-species ambiguity of the genomic nomenclature makes mapping gene mentions identified in text to their corresponding Entrez gene identifiers an extremely difficult task. We propose a novel method that transforms a MEDLINE record into a mixture of adjacency matrices; by performing a random walk over the resulting graph, we can perform multi-class supervised classification that assigns taxonomy identifiers to individual gene mentions. Good performance at this task has a direct impact on the performance of normalizing gene mentions to Entrez gene identifiers. Such graph mixtures add flexibility and allow us to generate probabilistic classification schemes that naturally reflect the uncertainties inherent even in literature-derived data. RESULTS: Our method performs well in terms of both micro- and macro-averaged performance, achieving micro-F(1) of 0.76 and macro-F(1) of 0.36 on the publicly available DECA corpus. Re-curation of the DECA corpus was performed, with our method achieving 0.88 micro-F(1) and 0.51 macro-F(1). Our method improves over standard classification techniques [such as support vector machines (SVMs)] in a number of ways: flexibility, interpretability and resistance to the effects of class bias in the training data. Good performance is achieved without the need for computationally expensive parse tree generation or 'bag of words classification'.
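The abstract does not spell out the random-walk formulation. As a rough illustration only, the sketch below assumes a random walk with restart over a weighted sum of two adjacency matrices, with a taxonomy label chosen by the visiting probability that reaches labeled nodes. The function names, mixture weights, restart probability, and toy graph are all hypothetical and are not taken from the paper.

```python
def mix_adjacency(matrices, weights):
    """Weighted sum of equally sized adjacency matrices (lists of lists)."""
    n = len(matrices[0])
    mixed = [[0.0] * n for _ in range(n)]
    for mat, w in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                mixed[i][j] += w * mat[i][j]
    return mixed


def row_normalize(mat):
    """Turn each row into transition probabilities; all-zero rows stay zero."""
    out = []
    for row in mat:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def random_walk_with_restart(transition, seed, restart=0.3, iters=100):
    """Iterate p <- (1 - restart) * p P + restart * e_seed."""
    n = len(transition)
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    p = start[:]
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += p[i] * transition[i][j]
        p = [(1 - restart) * nxt[j] + restart * start[j] for j in range(n)]
    return p


def classify(transition, mention, labeled_nodes):
    """Pick the label whose nodes collect the most visiting probability."""
    p = random_walk_with_restart(transition, mention)
    scores = {}
    for node, label in labeled_nodes.items():
        scores[label] = scores.get(label, 0.0) + p[node]
    return max(scores, key=scores.get), scores


# Toy graph (hypothetical): node 0 = gene mention, 1 = "human" cue,
# 2 = "mouse" cue, 3 = another token.
same_sentence = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
same_abstract = [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
graph = row_normalize(mix_adjacency([same_sentence, same_abstract], [0.7, 0.3]))
label, scores = classify(graph, mention=0, labeled_nodes={1: "human", 2: "mouse"})
```

In this toy graph the mention shares a sentence with the "human" cue and only the abstract with the "mouse" cue, so the heavier sentence-level weight pushes the walk toward "human". The scores are unnormalized visiting probabilities, which is one simple way to obtain a soft, probabilistic assignment.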

5.
Opinion mining is a well-known problem in natural language processing that has attracted increasing attention in recent years. Existing approaches are mainly limited to identifying direct opinions and are mostly dedicated to explicit opinions. However, in some domains such as medicine, opinions about an entity are usually not expressed directly by opinion words; instead, they are expressed indirectly by describing the effect of that entity on other entities. Ignoring indirect opinions can therefore lead to the loss of valuable information and a noticeable decline in the overall accuracy of opinion mining systems. In this paper, we first introduce the task of indirect opinion mining. Then, we present a novel approach to constructing a knowledge base of indirect opinions, called OpinionKB, which aims to be a resource for automatically classifying people's opinions about drugs. Using our approach, we have extracted 896 quadruples of indirect opinions at a precision of 88.08 percent. Furthermore, experiments on drug reviews demonstrate that our approach can achieve 85.25 percent precision in the polarity detection task and outperforms state-of-the-art opinion mining methods. We also build a corpus of indirect opinions about drugs, which can be used as a basis for supervised indirect opinion mining. The proposed approach to corpus construction achieves a precision of 88.42 percent.

6.
The influence of genetic variations on diseases or cellular processes is the main focus of many investigations, and results of biomedical studies are often accessible only through scientific publications. Automatic extraction of this information requires recognition of gene names and the accompanying allelic variant information. Previous work presented the OSIRIS system, which detects allelic variation in text using a query expansion approach. The challenges associated with this system are its relatively low recall for variation mentions and for gene name recognition. To tackle these challenges, we integrate the ProMiner system, developed for the recognition and normalization of gene and protein names, with a conditional random field (CRF)-based recognition of variation terms in biomedical text. Using the newly developed normalization of variation entities, we can link textual entities to Single Nucleotide Polymorphism database (dbSNP) entries. The performance of this novel approach is evaluated, and improved results in comparison to state-of-the-art systems are reported.

7.
We developed a visualization approach for identifying protein isoforms, precursor/mature protein combinations, and fragments from LC-MS/MS analysis of multidimensional fractionation of serum and plasma proteins. We also describe a pattern recognition algorithm that automatically detects and flags potentially heterogeneous species of proteins in proteomic experiments that involve extensive fractionation and yield a large number of identified serum or plasma proteins. We give examples of proteins with known isoforms that validate our approach, and present a subset of precursor/mature protein pairs detected with it. Potential applications include identification of differentially expressed isoforms in disease states.

8.
Among the plethora of affinity biosensor systems based on biomolecular recognition and labeling assays, magnetic labeling and detection is emerging as a promising new approach. Magnetic labels can be non-invasively detected by a wide range of methods, are physically and chemically stable, relatively inexpensive to produce, and can be easily made biocompatible. Here we provide an overview of the various approaches developed for magnetic labeling and detection as applied to biosensing. We illustrate the challenges to integrating one such approach into a complete sensing system with a more detailed discussion of the compact Bead Array Sensor System developed at the U.S. Naval Research Laboratory, the first system to use magnetic labels and microchip-based detection.

9.
Conventional methods for point mutation detection are usually multi-stage and laborious, require radioactive isotopes or other hazardous materials, and often give only semi-quantitative results. In this work, a protocol for quantitative detection of H-ras point mutation was developed. An electrochemiluminescence (ECL) assay was coupled with restriction endonuclease digestion performed directly on PCR products. Only the wild-type amplicon, which contains the endonuclease's recognition site, is cleaved and therefore cannot be detected by the ECL assay. Using the PCR-ECL method, 30 bladder cancer samples were analyzed for possible point mutation at codon 12 of the H-ras oncogene. The results show that the detection limit for the H-ras amplicon is 100 fmol and that the linear range spans more than three orders of magnitude. The point mutation was found in 14 (46.7%) of the 30 bladder cancer samples. The experimental results demonstrate that the PCR-ECL method is a feasible quantitative approach for point mutation detection owing to its safety, high sensitivity, and simplicity.

10.
11.
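Gao YD, Huang JF. Zoological Research (《动物学研究》), 2011, 32(3): 262-266
Non-bonded interactions play a key role in molecular recognition and binding in biological systems. However, traditional methods cannot automatically compute non-bonded interactions in batch at the residue level. In recent years, several methods and tools have been developed for the computational analysis of non-bonded interactions. This paper develops a method that automatically computes non-bonded interactions between residues: a Perl script calls the non-bonded interaction protocol in the underlying modules of Discovery Studio 2.0 (DS 2.0, Accelrys Inc.), so that non-bonded interaction energies can be computed in batch directly from the command line, without going through the DS 2.0 graphical interface. The method extends the computational modules of DS 2.0 and has recently been applied to the study and analysis of complex structures.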

12.
Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task because of lexical and terminological variation and the ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to linking gene mentions in the literature to referent genomic databases, in which both the gene synonyms in the databases and the gene mentions in text are first pre-processed. The mapping method employs a cascaded approach that combines exact, exact-like and token-based approximate matching, using flexible representations of a gene synonym dictionary and of the gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. Experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieve highly encouraging results, with an F-measure of up to 81.20%.
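The cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the normalization rule, the Jaccard token overlap, the `min_overlap` threshold, and the two-entry dictionary are all hypothetical choices made for the example.

```python
import re


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def tokens(name):
    """Lowercase token set, splitting runs of letters from runs of digits."""
    return set(re.findall(r"[a-z]+|[0-9]+", name.lower()))


def link_mention(mention, dictionary, min_overlap=0.6):
    """Return (gene_id, stage) from the first stage that matches, else (None, None)."""
    # Stage 1: exact string match against the synonym dictionary.
    for gene_id, synonyms in dictionary.items():
        if mention in synonyms:
            return gene_id, "exact"
    # Stage 2: exact-like match after normalizing case and punctuation.
    norm = normalize(mention)
    for gene_id, synonyms in dictionary.items():
        if any(normalize(s) == norm for s in synonyms):
            return gene_id, "exact-like"
    # Stage 3: token-based approximate match (Jaccard overlap of token sets),
    # which also tolerates permuted name components.
    best_id, best_score = None, 0.0
    mention_tokens = tokens(mention)
    for gene_id, synonyms in dictionary.items():
        for s in synonyms:
            union = mention_tokens | tokens(s)
            score = len(mention_tokens & tokens(s)) / len(union) if union else 0.0
            if score > best_score:
                best_id, best_score = gene_id, score
    if best_score >= min_overlap:
        return best_id, "token"
    return None, None


# Toy synonym dictionary keyed by Entrez Gene ID (illustrative subset only).
toy_dictionary = {
    "7157": ["TP53", "tumor protein p53"],
    "672": ["BRCA1", "breast cancer 1"],
}
```

Running the stricter stages first favors precision, and the looser token stage recovers recall for variants such as reordered components. For example, `link_mention("p53 tumor protein", toy_dictionary)` falls through to the token stage, while `link_mention("Tp-53", toy_dictionary)` is caught by the exact-like stage.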

13.

Background  

To train chunkers to recognize noun phrases and verb phrases in biomedical text, an annotated corpus is required. The creation of gold standard corpora (GSCs), however, is expensive and time-consuming. GSCs therefore tend to be small and to focus on specific subdomains, which limits their usefulness. We investigated the use of a silver standard corpus (SSC) that is automatically generated by combining the outputs of multiple chunking systems. We explored two use scenarios: one in which chunkers are trained on an SSC in a new domain for which a GSC is not available, and one in which chunkers are trained on an available, albeit small, GSC supplemented with an SSC.
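The abstract does not say how the system outputs are combined. One simple possibility, assumed here purely for illustration, is exact-span voting: a chunk enters the silver standard only if at least `min_votes` systems propose the same span with the same type. The function name, the span representation, and the toy outputs are hypothetical.

```python
from collections import Counter


def build_silver_annotations(system_outputs, min_votes=2):
    """Keep (start, end, type) chunks that at least `min_votes` systems agree on."""
    votes = Counter()
    for chunks in system_outputs:
        votes.update(set(chunks))  # each system votes at most once per span
    return sorted(span for span, n in votes.items() if n >= min_votes)


# Toy outputs of three chunkers for "The p53 protein binds DNA"
# (token offsets, end-exclusive).
system_a = [(0, 3, "NP"), (3, 4, "VP"), (4, 5, "NP")]
system_b = [(0, 3, "NP"), (3, 5, "VP")]
system_c = [(1, 3, "NP"), (3, 4, "VP"), (4, 5, "NP")]
silver = build_silver_annotations([system_a, system_b, system_c])
```

Raising `min_votes` trades coverage for agreement: with three systems, a threshold of 3 would keep only unanimous chunks and, for this toy sentence, would discard everything.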

14.
Determining the binding specificity of SH3 domains, a family of peptide recognition modules (PRMs), is important for understanding their biological functions and reconstructing the SH3-mediated protein-protein interaction network. In the present study, the SH3-peptide interactions for both class I and class II SH3 domains were characterized by the intermolecular residue-residue interaction network. We developed generic MIEC-SVM models to infer SH3 domain-peptide recognition specificity, and these models achieved satisfactory prediction accuracy. By investigating the domain-peptide recognition mechanisms at the residue level, we found that class I and class II binding peptides have different binding modes even though they occupy the same binding site of SH3. Furthermore, we predicted the potential binding partners of SH3 domains in the yeast proteome and constructed the SH3-mediated protein-protein interaction network. Comparison with the experimentally determined interactions confirmed the effectiveness of our approach. This study showed that our computational approach not only provides a powerful platform for deciphering the protein recognition code at the molecular level but also allows identification of peptide-mediated protein interactions at a proteomic scale. We believe that the approach is general enough to be applicable to other domain-peptide interactions.

15.
Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored OSCAR, an established named entity recogniser in the chemistry domain, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-and-drop mechanism of the U-Compare graphical user interface. They also provide a platform for studying the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques leads to slightly better performance than others, in terms of named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors and thus lowers the overall performance. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84%, against 84.23% for OSCAR.

16.
Following the Evidence Based Medicine (EBM) practice, practitioners make use of the existing evidence to make therapeutic decisions. This evidence, in the form of scientific statements, is usually found in scholarly publications such as randomised controlled trials and systematic reviews. However, finding such information in the overwhelming amount of published material is particularly challenging. Approaches have been proposed to automatically extract scientific artefacts in EBM using standardised schemas. Our work takes this stream a step forward and looks into consolidating extracted artefacts, i.e., quantifying their degree of similarity based on the assumption that they carry the same rhetorical role. When key statements in the EBM literature are semantically connected, practitioners can not only find available evidence more easily but also track the effects of different treatments/outcomes across a number of related studies. We devise a regression model based on a varied set of features and evaluate it both on a general English corpus (the SICK corpus) and on an EBM corpus (the NICTA-PIBOSO corpus). Experimental results show that our approach performs on par with the state of the art on general English and achieves encouraging results on biomedical text when compared against human judgement.

17.
18.
We present a new fast approach for segmentation of thin branching structures, such as vascular trees, based on Fast-Marching (FM) and Level Set (LS) methods. FM allows segmentation of tubular structures by inflating a "long balloon" from a single user-given point. However, when the tubular shape is rather long, the front propagation may blow up through the boundary of the desired shape close to the starting point. Our contribution is a method that propagates only the useful part of the front while freezing the rest of it. We demonstrate its ability to segment tubular and tree-like structures quickly and accurately. We also develop a useful stopping criterion for the causal front propagation. Finally, we derive an efficient algorithm for extracting an underlying 1D skeleton of the branching objects with minimal path techniques. With each branch represented by its centerline, we automatically detect the bifurcations, leading to the "Minimal Tree" representation. This so-called "Minimal Tree" is very useful for visualization and quantification of the pathologies in our anatomical data sets. We illustrate our algorithms by applying them to several artery data sets.

19.
Biochemical processes in cells are governed by complex networks of many chemical species interacting stochastically in diverse ways and on different time scales. Constructing microscopically accurate models of such networks is often infeasible. Instead, here we propose a systematic framework for building phenomenological models of such networks from experimental data, focusing on accurately approximating the time it takes to complete the process, the First Passage (FP) time. Our phenomenological models are mixtures of Gamma distributions, which have a natural biophysical interpretation. The complexity of the models is adapted automatically to account for the amount of available data and its temporal resolution. The framework can be used for predicting behavior of FP systems under varying external conditions. To demonstrate the utility of the approach, we build models for the distribution of inter-spike intervals of a morphologically complex neuron, a Purkinje cell, from experimental and simulated data. We demonstrate that the developed models can not only fit the data, but also make nontrivial predictions. We demonstrate that our coarse-grained models provide constraints on more mechanistically accurate models of the involved phenomena.
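For reference, a mixture of Gamma distributions for a first-passage time $t$ has the generic form below. The symbols ($K$ components with weights $w_k$, shapes $\alpha_k$ and rates $\beta_k$) are notation chosen here for illustration, not the paper's, and the abstract does not state how $K$ is selected.

```latex
p(t) = \sum_{k=1}^{K} w_k \, \frac{\beta_k^{\alpha_k}}{\Gamma(\alpha_k)} \, t^{\alpha_k - 1} e^{-\beta_k t},
\qquad t > 0, \quad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
```

One standard reading, which may or may not be the interpretation the authors intend, is that a Gamma density with integer shape $\alpha_k$ and rate $\beta_k$ is the waiting time to complete $\alpha_k$ sequential exponential steps, so each component resembles a pathway made of several rate-limiting steps.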

20.
Recently, automated observation systems for animals using artificial intelligence have been proposed. In the wild, animals are difficult to detect and track automatically because of lamination and occlusions. Our study proposes a new approach to automatically detect and track wild Japanese macaques (Macaca fuscata) using deep learning and a particle filter algorithm. Macaque likelihood is derived through deep learning and used as an observation model in a particle filter to predict the macaques’ position and size in an image. By using deep learning as an observation model, it is possible to simplify the observation model and improve the accuracy of the classifier. We investigated whether the algorithm could find body regions of macaques in video recordings of free‐ranging groups at Katsuyama, Japan to evaluate our model. Experimental results showed that our method with deep learning as an observation model had higher tracking accuracy than a method that uses a support vector machine. More generally, our study will help researchers to develop automatic observation systems for animals in the wild.
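The tracking loop described above can be sketched as a generic bootstrap particle filter over (x, y, size) hypotheses. This is an assumption-laden illustration rather than the authors' code: the random-walk motion model, the parameter values, and in particular the Gaussian `likelihood` in `demo`, which merely stands in for the deep-learning macaque score, are all hypothetical.

```python
import math
import random


def particle_filter_step(particles, likelihood, motion_std, rng):
    """One predict / weight / resample cycle over (x, y, size) particles."""
    # Predict: diffuse each particle with Gaussian noise (random-walk motion).
    moved = [tuple(v + rng.gauss(0.0, motion_std) for v in p) for p in particles]
    # Weight: score each hypothesis with the observation model.
    weights = [likelihood(p) for p in moved]
    total = sum(weights)
    if total == 0:
        return moved, None  # no evidence in this frame; keep the prediction
    # Estimate: weighted mean of the particle states.
    estimate = tuple(
        sum(w * p[d] for w, p in zip(weights, moved)) / total for d in range(3)
    )
    # Resample: draw particles in proportion to their weights.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, estimate


def demo(seed=0, steps=10, n=500):
    """Track a fixed toy target; the likelihood is a stand-in for a classifier."""
    rng = random.Random(seed)
    target = (100.0, 50.0, 20.0)  # hypothetical true x, y, size

    def likelihood(p):
        d2 = sum((a - b) ** 2 for a, b in zip(p, target))
        return math.exp(-d2 / (2 * 10.0 ** 2))

    particles = [
        (rng.uniform(0, 200), rng.uniform(0, 100), rng.uniform(10, 30))
        for _ in range(n)
    ]
    estimate = None
    for _ in range(steps):
        particles, estimate = particle_filter_step(particles, likelihood, 2.0, rng)
    return estimate
```

In a real system the `likelihood` callable would crop the image region described by each particle and return the classifier's score for it; keeping it as a plain function argument is what lets the observation model be swapped, for example between a neural network and a support vector machine.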
