Similar Documents
Found 20 similar documents (search time: 7 ms)
1.
MOTIVATION: Wnt signaling is a very active area of research with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases describing signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction have been identified as of late 2003. In this work we describe a natural language processing (NLP) system that is able to identify references to biological interaction networks in free text and automatically assembles a protein association and interaction map. RESULTS: A 'gold standard' set of names and assertions was derived by manual scanning of the Wnt genes website (http://www.stanford.edu/~rnusse/wntwindow.html) including 53 interactions involved in Wnt signaling. This system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling including 3369 Pubmed and 1230 full text papers. Names for key Wnt-pathway associated proteins and biological entities are identified using a chi-squared analysis of noun phrases over-represented in the Wnt literature as compared to the general signal transduction literature. Interestingly, we identified several instances where generic terms were used on the website when more specific terms occur in the literature, and one typographic error on the Wnt canonical pathway. Using the named entity list and performing an exhaustive assertion extraction of the corpus, 34 of the 53 interactions in the 'gold standard' Wnt signaling set were successfully identified (64% recall). In addition, the automated extraction found several interactions involving key Wnt-related molecules which were missing or different from those in the canonical diagram, and these were confirmed by manual review of the text. 
These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool for assisting human annotation and maintenance of signal pathway databases. AVAILABILITY: The pipeline software components are freely available on request to the authors. CONTACT: dstates@umich.edu SUPPLEMENTARY INFORMATION: http://stateslab.bioinformatics.med.umich.edu/software.html
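The chi-squared analysis of over-represented noun phrases can be sketched as a 2x2 contingency comparison of a term's count in the focus (Wnt) corpus versus the background signal-transduction corpus. The counts below are invented for illustration, not taken from the paper:

```python
def chi2_over_representation(term_count, corpus_size, bg_count, bg_size):
    """2x2 chi-squared statistic for a term's over-representation in a
    focus corpus versus a background corpus (no Yates correction)."""
    a = term_count                # term occurrences in focus corpus
    b = corpus_size - term_count  # other tokens in focus corpus
    c = bg_count                  # term occurrences in background corpus
    d = bg_size - bg_count        # other tokens in background corpus
    n = a + b + c + d
    # Shortcut formula for 2x2 tables: n*(ad-bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# A term seen 40 times in 10,000 focus tokens but only 5 times in
# 10,000 background tokens scores far above the 3.84 critical value
# for p < 0.05 with one degree of freedom.
score = chi2_over_representation(40, 10_000, 5, 10_000)
print(round(score, 1))
```

Terms whose statistic exceeds the chosen critical value would be kept as candidate named entities.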

2.
Genomics and natural language processing

3.
We propose a model of memory reconsolidation that can output new sentences with additional meaning after refining information from input sentences and integrating them with related prior experience. Our model uses available technology to first disambiguate the meanings of words, then extracts information from the sentences into a structure that extends semantic networks. Within long-term memory we introduce an action-relationship database reminiscent of the way symbols are associated in the brain, and propose an adaptive mechanism for linking these actions with different scenarios. The model then fills in the implicit context of the input and predicts relevant activities that could occur in that context based on a statistical action-relationship database. The new data, both the more complete scenario and the statistical relationships among the activities, are reconsolidated into memory. Experiments show that our model improves upon the existing reasoning tool from the MIT Media Lab, ConceptNet.
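A statistical action-relationship store of the kind described can be sketched with a simple vote-counting prediction rule; the paper's adaptive linking mechanism is more sophisticated, and all action and scenario names here are illustrative:

```python
from collections import Counter, defaultdict

class ActionContextDB:
    """Toy action-relationship store: records which scenario each
    observed action occurred in, then predicts the most likely
    scenario for a new set of actions."""
    def __init__(self):
        self.counts = defaultdict(Counter)  # action -> Counter of scenarios

    def observe(self, action, scenario):
        self.counts[action][scenario] += 1

    def predict_scenario(self, actions):
        votes = Counter()
        for action in actions:
            votes.update(self.counts.get(action, Counter()))
        return votes.most_common(1)[0][0] if votes else None

db = ActionContextDB()
for action, scene in [("order", "restaurant"), ("eat", "restaurant"),
                      ("pay", "restaurant"), ("pay", "shop"),
                      ("board", "airport")]:
    db.observe(action, scene)

print(db.predict_scenario(["order", "pay"]))  # "restaurant" outvotes "shop"
```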

4.
BIOSILICO 2003, 1(2):69-80
The information age has made the electronic storage of large amounts of data effortless. The proliferation of documents available on the Internet, corporate intranets, news wires and elsewhere is overwhelming. Search engines only exacerbate this overload by making ever more documents available in a few keystrokes. This information overload also exists in the biomedical field, where scientific publications and other forms of text-based data are produced at an unprecedented rate. Text mining is the combined, automated process of analyzing unstructured, natural language text to discover information and knowledge that are typically difficult to retrieve. Here, we focus on text mining as applied to the biomedical literature, in particular on finding relationships among genes, proteins, drugs and diseases, to facilitate an understanding and prediction of complex biological processes. The LitMiner™ system, developed specifically for this purpose, is described in relation to the Knowledge Discovery and Data Mining Cup 2002, which serves as a formal evaluation of the system.
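Sentence-level co-occurrence counting is the simplest relationship-finding signal such literature-mining systems build on. This sketch is not LitMiner's actual method, and the entity lexicon is invented:

```python
import itertools
from collections import Counter

ENTITIES = {"BRCA1", "TP53", "tamoxifen", "breast cancer"}  # toy lexicon

def cooccurrences(sentences, entities):
    """Count sentence-level co-occurrence of known entity mentions,
    a crude but useful proxy for gene/drug/disease relationships."""
    pairs = Counter()
    for s in sentences:
        found = sorted(e for e in entities if e.lower() in s.lower())
        for a, b in itertools.combinations(found, 2):
            pairs[(a, b)] += 1
    return pairs

corpus = [
    "BRCA1 mutations are linked to breast cancer risk.",
    "Tamoxifen is used to treat breast cancer.",
    "TP53 interacts with BRCA1 in DNA repair.",
]
print(cooccurrences(corpus, ENTITIES).most_common(2))
```

Real systems replace substring matching with proper named-entity recognition and add syntactic filters to reduce spurious pairs.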

5.
Integrating healthcare records into a single application remains a challenging process. Additional issues arise when the data are heterogeneous and different classes of users need different views of them. Hence, we propose MEDSHARE, a web-based application that integrates data from various sources and lets patients access all their health records from a single point. Beyond data collection, the portal supports a diagnosis process based on natural language processing: a fuzzy-logic ruleset is generated using NLP packages, and the resulting information is passed to an SVM classifier that predicts diseases with 89% accuracy, the best result among the classifiers compared. Finally, the resulting observations are sent to the front-end application and to the user's mobile phone as a text message in their native language, using a translation package.
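A keyword-rule scorer gives the flavor of such a rule-driven diagnosis step. The diseases, keywords and scoring rule below are all hypothetical; the real system generates its fuzzy ruleset with NLP packages and feeds the result to an SVM classifier:

```python
# Hypothetical symptom-keyword rules, for illustration only.
RULES = {
    "diabetes": {"thirst", "fatigue", "frequent urination"},
    "influenza": {"fever", "cough", "fatigue"},
}

def score_diseases(note):
    """Return each candidate disease with the fraction of its rule
    keywords found in a free-text clinical note."""
    text = note.lower()
    return {d: sum(k in text for k in kws) / len(kws)
            for d, kws in RULES.items()}

scores = score_diseases("Patient reports fever, cough and fatigue.")
print(max(scores, key=scores.get))  # "influenza" matches all three keywords
```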

6.
MedScan, a natural language processing engine for MEDLINE abstracts
MOTIVATION: The importance of extracting biomedical information from scientific publications is well recognized. A number of information extraction systems for the biomedical domain have been reported, but none of them have become widely used in practical applications. Most proposals to date make rather simplistic assumptions about the syntactic aspect of natural language. There is an urgent need for a system that has broad coverage and performs well in real-text applications. RESULTS: We present a general biomedical domain-oriented NLP engine called MedScan that efficiently processes sentences from MEDLINE abstracts and produces a set of regularized logical structures representing the meaning of each sentence. The engine utilizes a specially developed context-free grammar and lexicon. Preliminary evaluation of the system's performance, accuracy, and coverage exhibited encouraging results. Further approaches for increasing the coverage and reducing parsing ambiguity of the engine, as well as its application for information extraction are discussed.
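MedScan's grammar itself is not given in this abstract, but the core idea of parsing with a context-free grammar can be sketched with a minimal CYK recognizer over a toy grammar in Chomsky normal form (grammar and lexicon below are invented):

```python
# Toy CNF grammar and lexicon; a real engine's grammar is far larger.
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP")],
    "NP": [("Det", "N")],
}
LEXICON = {"the": "Det", "protein": "N", "kinase": "N", "binds": "V"}

def cyk_accepts(tokens):
    """CYK chart parsing: table[i][j] holds the symbols that derive
    the span tokens[i : i+j+1]; the sentence parses iff S covers it all."""
    n = len(tokens)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0].add(LEXICON[tok])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for lhs, rules in GRAMMAR.items():
                    for b, c in rules:
                        if b in left and c in right:
                            table[i][span - 1].add(lhs)
    return "S" in table[0][n - 1]

print(cyk_accepts("the protein binds the kinase".split()))
```

A production engine would attach semantic actions to each rule to build the regularized logical structures the abstract describes.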

7.

Background

Precise identification of three-dimensional genome organization, especially enhancer-promoter interactions (EPIs), is important to deciphering gene regulation, cell differentiation and disease mechanisms. Currently, it is a challenging task to distinguish true interactions from other nearby non-interacting ones since the power of traditional experimental methods is limited due to low resolution or low throughput.

Results

We propose a novel computational framework, EP2vec, to assay three-dimensional genomic interactions. We first extract sequence embedding features, defined as fixed-length vector representations learned from variable-length sequences using an unsupervised deep learning method from natural language processing. Then, we train a classifier to predict EPIs using the learned representations in a supervised way. Experimental results demonstrate that EP2vec obtains F1 scores ranging from 0.841 to 0.933 on different datasets, outperforming existing methods. We demonstrate the robustness of sequence embedding features through sensitivity analysis. In addition, we identify motifs that represent cell line-specific information through analysis of the learned sequence embedding features using an attention mechanism. Last, we show that even better performance, with F1 scores of 0.889 to 0.940, can be achieved by combining sequence embedding features with experimental features.

Conclusions

EP2vec sheds light on feature extraction for DNA sequences of arbitrary lengths and provides a powerful approach for EPIs identification.
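The first step of such sequence-embedding methods is to tokenize a DNA sequence into overlapping k-mer "words"; a minimal sketch, where the k and stride values are typical choices rather than necessarily EP2vec's:

```python
def kmer_sentence(seq, k=6, stride=1):
    """Tokenize a DNA sequence into overlapping k-mer 'words', the
    preprocessing step that lets NLP embedding models treat a
    variable-length sequence as a sentence of tokens."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_sentence("ACGTACGTAC", k=6)
print(tokens)  # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']
```

The resulting token lists can then be fed to a paragraph-vector model (for example gensim's Doc2Vec) to learn the fixed-length embeddings described above.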

8.
From an information-processing perspective, many problems in bioinformatics closely resemble problems in natural language processing, so classical NLP methods can be applied to bioinformatics data. This paper surveys problems shared by natural language processing and bioinformatics, such as alignment, classification and prediction, together with methods for solving them. Analysis of these analogous problems in the two fields shows that mature NLP techniques can be used to solve bioinformatics problems, and that some natural language understanding techniques not yet applied in bioinformatics also have potential value there. Finally, a solution to a classification problem is presented, demonstrating how such algorithms can be applied experimentally to biological data.
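Alignment is the clearest shared problem: the same dynamic programme aligns biological sequences and token sequences alike. A minimal Needleman-Wunsch scoring sketch, with arbitrary match/mismatch/gap weights:

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score between two sequences.
    dp[i][j] is the best score aligning a[:i] against b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # align a's prefix against gaps
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, cols):          # align b's prefix against gaps
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

print(global_alignment_score("GATTACA", "GCATGCU"))
```

The identical recurrence scores word-level alignments in NLP tasks such as machine-translation evaluation; only the alphabet and the scoring weights change.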

9.
Air pollution remains a severe concern in European countries, especially in the Western Balkans, where air monitoring data point to harmful ambient pollution. Public concern with this issue becomes particularly critical during the fall and winter months, when the contamination is more visible, provoking a series of reactions directed principally at the government authorities as the entities responsible for regulating air pollution levels. Since citizen-contributed data are generally considered valuable additional information for assessing the impacts of air pollution, public contributions could act as a tool for increasing awareness of, and response to, air pollution. Consequently, this study focuses on public awareness of air pollution in the Western Balkans. The study assumes that citizens' reactions grow more intense during months with increased air pollution levels, principally due to winter heating. Therefore, Twitter activity and news articles related to air pollution were investigated for Macedonia, Serbia, Bosnia and Herzegovina and Montenegro, from November 2021 to March 2022. Natural language processing techniques such as sentiment analysis, topic modelling and cross-correlation statistical analysis were employed to determine the relationship between Twitter discussions and news and the actual PM10 levels measured by official air monitoring stations. The aim was to observe whether tweets and news teasers reflect the realistic air pollution situation. The results affirm that social media discussions, mainly with a negative connotation, can serve as a measure of public awareness of temporal changes in the PM10 concentration in the air and their negative consequences. The content of the resources reveals several topics of concern, contributing to better identification of public opinion and possibilities for tracking news trends.
Nevertheless, attention should be paid to news interpretation, since news items may sometimes offer a more neutral reading of the situation, failing to present the actual air conditions and possibly leading society to form an unrealistic opinion. Additionally, the public might not be able to obtain sufficient or accurate information about the primary sources of air pollution, emphasizing the need for more transparent communication and greater education regarding air pollution monitoring. Finally, the study provides deeper insights into the content of the data and helps detect the reasons for skepticism towards pro-environmental behavior in social media discussions. Explicitly, personal disappointment with air quality should be taken as an inflection point by responsible parties to intervene in improving citizens' quality of life.
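The cross-correlation step can be sketched as a Pearson correlation between daily PM10 levels and a daily reaction count shifted by a chosen lag; all numbers below are invented:

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def lagged_correlation(signal, response, lag):
    """Correlate PM10 levels with a reaction series shifted by `lag` days,
    a toy stand-in for the study's cross-correlation analysis."""
    if lag:
        signal, response = signal[:-lag], response[lag:]
    return pearson(signal, response)

pm10 = [20, 35, 80, 120, 90, 40, 25]    # daily PM10, invented
tweets = [3, 4, 9, 15, 11, 5, 3]        # daily negative tweets, invented
print(round(lagged_correlation(pm10, tweets, lag=0), 2))
```

Scanning `lag` over a small range and reporting the lag with the strongest correlation shows how quickly public reactions track pollution peaks.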

10.
Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a natural language processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic-model sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a machine learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach, using a search of single-locus metagenetic data, gathering papers connected to the datasets, de novo determination of topics, the number of topics, and assignment of articles to the topics, illustrates how such an analysis pipeline can harness large-scale datasets that are published and available but not necessarily fully analyzed, or whose metadata are not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations.
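As a simplified stand-in for the paper's LDA-plus-Random-Forest pipeline, topic assignment can be sketched as matching an article's term counts against topic-word weight vectors by cosine similarity; the topics and weights below are invented:

```python
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(c1[t] * c2.get(t, 0) for t in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def assign_topic(abstract, topics):
    """Assign an article to the topic whose keyword weights its term
    counts resemble most -- a toy substitute for Random Forest assignment."""
    words = Counter(abstract.lower().split())
    return max(topics, key=lambda t: cosine(words, topics[t]))

topics = {  # toy topic-word weights, as an LDA model might produce
    "soil_decay": Counter({"soil": 3, "wood": 2, "decomposing": 2}),
    "marine": Counter({"ocean": 3, "seawater": 2, "plankton": 2}),
}
print(assign_topic("fungal taxa in decomposing wood and forest soil", topics))
```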

11.
Background: Orphanet aims to provide rare disease information to healthcare professionals, patients, and their relatives. Objective: The objective of this work is to evaluate two methodologies (Unified Medical Language System [UMLS] and manual Orphanet-ICD-10 link-based mapping, versus string-based matching) used to map the Orphanet thesaurus to the MeSH thesaurus. Results: On a corpus of 375 mappings, string-based matching provides significantly better results than the UMLS and manual Orphanet-ICD-10 link-based mapping. Conclusion: String-based matching could be applied to any biomedical terminology in French not yet included in the UMLS.
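String-based matching of this kind typically normalizes accents, case and whitespace before comparing terms; a minimal sketch with two invented French terms standing in for MeSH entries:

```python
import unicodedata

def normalize(term):
    """Accent-stripping, case-folding, whitespace-collapsing normalization,
    the usual preprocessing for string-based terminology matching."""
    nfkd = unicodedata.normalize("NFKD", term)
    stripped = "".join(c for c in nfkd if not unicodedata.combining(c))
    return " ".join(stripped.casefold().split())

# Toy MeSH-side index: normalized form -> original preferred term.
mesh_index = {normalize(t): t for t in ["Maladie de Gaucher", "Hémophilie B"]}

def map_term(orphanet_term):
    """Return the matching MeSH term, or None if no exact match survives
    normalization."""
    return mesh_index.get(normalize(orphanet_term))

print(map_term("maladie de  GAUCHER"))  # exact match after normalization
```

Production mappings usually add further steps (stop-word removal, word-order permutation) on top of this exact-match core.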

12.
In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic "meaning" in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models that produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as "sentences" in which domain identifiers are tokens that may be considered "words." Using all InterPro (Finn et al. 2017) Pfam domain assignments, we observe that the embedding can be used to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown function.
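The skip-gram training pairs Word2vec learns from can be generated directly from a protein's ordered domain identifiers. A sketch, where the Pfam accessions are merely illustrative and a library such as gensim would normally consume the token lists directly:

```python
def skipgram_pairs(domain_sentence, window=2):
    """Generate (target, context) training pairs from a protein's ordered
    domain identifiers, the input to a skip-gram Word2vec model."""
    pairs = []
    for i, target in enumerate(domain_sentence):
        lo = max(0, i - window)
        hi = min(len(domain_sentence), i + window + 1)
        pairs.extend((target, domain_sentence[j])
                     for j in range(lo, hi) if j != i)
    return pairs

protein = ["PF00069", "PF07714", "PF00017"]  # illustrative multi-domain protein
print(skipgram_pairs(protein, window=1))
```

Embedding vectors trained on many such "sentences" place functionally related domains near one another, which is what lets nearest neighbors suggest putative GO terms.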

13.
lumi: a pipeline for processing Illumina microarray
The Illumina microarray is becoming a popular microarray platform. The BeadArray technology from Illumina makes its preprocessing and quality control different from other microarray technologies. Unfortunately, most analyses have not taken advantage of the unique properties of the BeadArray system and have simply applied preprocessing methods originally designed for Affymetrix microarrays. lumi is a Bioconductor package designed specifically to process Illumina microarray data. It includes data input, quality control, variance stabilization, normalization and gene annotation components. In particular, the lumi package includes a variance-stabilizing transformation (VST) algorithm that takes advantage of the technical replicates available on every Illumina microarray. Different normalization options and multiple quality control plots are provided in the package. To better annotate Illumina data, a vendor-independent nucleotide universal identifier (nuID) was devised to identify the probes of the Illumina microarray. The nuID annotation packages and the output of lumi-processed results can be easily integrated with other Bioconductor packages to construct a statistical data analysis pipeline for Illumina data. Availability: The lumi Bioconductor package, www.bioconductor.org
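The family of variance-stabilizing transforms lumi draws on can be illustrated with a generalized log transform. The offset parameter below is fixed for illustration, whereas lumi estimates its transform parameters from the technical replicates on each array:

```python
from math import log2, sqrt

def glog2(x, c=100.0):
    """Generalized log transform, glog2(x) = log2((x + sqrt(x^2 + c)) / 2).
    It behaves like log2(x) for large intensities but stays defined and
    numerically stable near zero, which stabilizes the variance of
    low-intensity probes."""
    return log2((x + sqrt(x * x + c)) / 2)

for intensity in (0.0, 10.0, 10000.0):
    print(round(glog2(intensity), 3))
```

For intensity 10000 the result is essentially log2(10000), while intensity 0 maps to a finite value instead of negative infinity.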

14.
NMPP: a user-customized NimbleGen microarray data processing pipeline
The NMPP package is a bundle of user-customized tools based on established algorithms and methods to process self-designed NimbleGen microarray data. It features a command-line-based integrative processing procedure comprising five major functional components: a raw microarray data parsing and integration module, an array spatial-effect smoothing and visualization module, a probe-level multi-array normalization module, a gene expression intensity summarization module and a gene expression status inference module. AVAILABILITY: http://plantgenomics.biology.yale.edu/nmpp
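One common probe-level multi-array normalization option, quantile normalization, can be sketched in a few lines; whether NMPP uses exactly this variant is not stated in the abstract:

```python
def quantile_normalize(arrays):
    """Quantile normalization across arrays: sort each array, average the
    values holding each rank across arrays, then give every probe the
    mean value for its rank. All arrays end up with identical
    distributions."""
    sorted_cols = [sorted(a) for a in arrays]
    rank_means = [sum(col[r] for col in sorted_cols) / len(arrays)
                  for r in range(len(arrays[0]))]
    result = []
    for a in arrays:
        order = sorted(range(len(a)), key=a.__getitem__)  # probe ranks
        out = [0.0] * len(a)
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result

print(quantile_normalize([[5, 2, 3], [4, 1, 9]]))
```

This toy version ignores ties; production implementations average tied ranks.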

15.
This paper explores the use of the resources in the National Library of Medicine's Unified Medical Language System (UMLS) for the construction of a lexicon useful for processing texts in the field of molecular biology. A lexicon is constructed from overlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurus to obtain both morphosyntactic and semantic information for terms, and the coverage of a domain corpus is assessed. Over 77% of tokens in the domain corpus are found in the constructed lexicon, validating the lexicon's coverage of the most frequent terms in the domain and indicating that the constructed lexicon is potentially an important resource for biological text processing.
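The coverage statistic itself is straightforward: the fraction of corpus tokens found in the constructed lexicon. A toy sketch with an invented lexicon and corpus:

```python
def lexicon_coverage(corpus_tokens, lexicon):
    """Fraction of corpus tokens (case-insensitive) found in a lexicon."""
    hits = sum(1 for t in corpus_tokens if t.lower() in lexicon)
    return hits / len(corpus_tokens)

lexicon = {"protein", "kinase", "binds", "the", "a"}   # toy lexicon
tokens = "the kinase binds a substrate protein".split()
print(f"{lexicon_coverage(tokens, lexicon):.0%}")
```

Measured over token occurrences rather than unique types, as here, the statistic naturally rewards coverage of the most frequent terms.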

16.
17.
Multimedia data held by natural history museums and universities are presently not readily accessible, even within the natural history community itself. The EU project OpenUp! is an effort to mobilise scientific biological multimedia resources and open them to a wider audience using the EUROPEANA data standards and portal. The connection between natural history and EUROPEANA is accomplished using the well-established BioCASe and GBIF technologies, complemented by a system for data quality control, data transformation and semantic enrichment. With this approach, OpenUp! will provide at least 1.1 million multimedia objects to EUROPEANA by 2014. Its lean infrastructure is sustainable within the natural history community and will remain functional and effective in the post-project phase.

18.
Inefficient coding and manipulation of pedigree data have often hindered the progress of genetic studies. In this paper we present the methodology for interfacing a data base management system (DBMS) called MEGADATS with a linkage analysis program called LIPED. Two families that segregate a dominant trait and one test marker were used in a simulated exercise to demonstrate how a DBMS can be used to automate tedious clerical steps and improve the efficiency of a genetic analysis. The merits of this approach to data management are discussed. We conclude that a standardized format for genetic analysis programs would greatly facilitate data analysis.

19.
Bilingual and multilingual language processing
This chapter addresses the neurolinguistics of bilingualism and the representation of language in the brain in bilingual and multilingual subjects. A fundamental issue is whether the cerebral representation of language in bi- and multilinguals differs from that of monolinguals, and if so, in which specific way. This is an interdisciplinary question that requires identifying and differentiating the levels involved in the neural representation of languages, such as the neuroanatomical, neurofunctional, biochemical, psychological and linguistic levels. Furthermore, specific factors such as age, manner of acquisition and environmental factors seem to affect the neural representation. We examined whether verbal memory processing in two unrelated languages is mediated by a common neural system or by distinct cortical areas. Subjects were Finnish-English adult multilinguals who had acquired the second language after the age of ten. They were PET-scanned whilst either encoding or retrieving word pairs in their mother tongue (Finnish) or in a foreign language (English). Within each language, subjects had to encode and retrieve four sets of 12 visually presented paired word associates which were not semantically related. Two sets consisted of highly imaginable words and the other two sets of abstract words. Presentation of pseudo-words served as a reference condition. An emission scan was recorded after each intravenous administration of O-15 water. Encoding was associated with prefrontal and hippocampal activation. During memory retrieval, the precuneus showed consistent activation in both languages and for both highly imaginable and abstract words. Differential activations were found in Broca's area and in the cerebellum, as well as in the angular/supramarginal gyri, according to the language used. The findings advance our understanding of the neural representation that underlies multiple language functions.
Further studies are needed to elucidate the neuronal mechanisms of bi-/multilingual language processing. A promising perspective for future bi-/multilingual research is an integrative approach combining brain imaging techniques with high spatial resolution, such as fMRI, with techniques with high temporal resolution, such as magnetoencephalography (MEG).

20.
MOTIVATION: A new method is developed to query a relational database in natural language (NL). RESULTS: The method, based on a semantic approach, maps grammatical and lexical units of a natural language onto concepts of the subject domain, which are given in a conceptual scheme. The conceptual scheme is mapped formally onto the logical scheme. We applied the method to query the FlyEx database in natural language. FlyEx contains information on the expression of segmentation genes in Drosophila melanogaster. The method allows queries to be formulated in several natural languages simultaneously, and is adaptive to changes in the knowledge domain and users' views. It provides optimal transformation of queries from natural language to SQL, as well as visualization of information as a hyperscheme. The method requires neither a specification of all possible language constructions nor strict grammatical accuracy in the formulation of NL queries.
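The concept-to-SQL step can be sketched as resolving recognized lexical units through a conceptual scheme that maps onto the logical (table) scheme. The tables, columns and concepts below are invented for illustration, not FlyEx's actual schema:

```python
# Illustrative conceptual scheme: NL lexical unit -> (table, column).
CONCEPTS = {
    "gene": ("expression", "gene"),
    "stage": ("expression", "stage"),
}

def nl_to_sql(entity, conditions):
    """Translate recognized concepts into SQL: the queried entity and each
    condition are resolved against the conceptual scheme, which in turn
    maps onto the logical scheme of tables and columns."""
    table, column = CONCEPTS[entity]
    where = " AND ".join(f"{CONCEPTS[k][1]} = '{v}'"
                         for k, v in conditions.items())
    return f"SELECT {column} FROM {table} WHERE {where}"

print(nl_to_sql("gene", {"stage": "4"}))
```

A real system would build parameterized queries rather than interpolating values, and would resolve synonyms and multilingual lexical units to the same concepts.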


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号