Similar documents

20 similar documents found.
1.

Background

Determining the semantic relatedness of two biomedical terms is an important task for many text-mining applications in the biomedical field. Previous studies, such as those using ontology-based and corpus-based approaches, measured semantic relatedness by using information from the structure of biomedical literature, but these methods are limited by the small size of training resources. To increase the size of training datasets, the outputs of search engines have been used extensively to analyze the lexical patterns of biomedical terms.

Methodology/Principal Findings

In this work, we propose the Mutually Reinforcing Lexical Pattern Ranking (ReLPR) algorithm for learning and exploring the lexical patterns of synonym pairs in biomedical text. ReLPR employs lexical patterns and their pattern containers to assess the semantic relatedness of biomedical terms. By combining sentence structures and the linking activities between containers and lexical patterns, our algorithm can explore the correlation between two biomedical terms.
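
The abstract does not give ReLPR's update rules, but the mutual reinforcement it describes between lexical patterns and their containers can be pictured as a HITS-style iteration. The sketch below is an illustrative assumption, not the published algorithm; the function name, link representation, and normalization are all invented.

```python
# Hypothetical sketch of a HITS-style mutual-reinforcement loop between
# lexical patterns and their containers (sentences/snippets). The actual
# ReLPR update rules are not given in the abstract.
from collections import defaultdict

def rank_patterns(links, iterations=20):
    """links: list of (pattern, container) pairs observed in text."""
    patterns = {p for p, _ in links}
    containers = {c for _, c in links}
    p_score = {p: 1.0 for p in patterns}
    c_score = {c: 1.0 for c in containers}
    for _ in range(iterations):
        # A container is good if it holds good patterns ...
        new_c = defaultdict(float)
        for p, c in links:
            new_c[c] += p_score[p]
        norm = sum(new_c.values()) or 1.0
        c_score = {c: s / norm for c, s in new_c.items()}
        # ... and a pattern is good if it occurs in good containers.
        new_p = defaultdict(float)
        for p, c in links:
            new_p[p] += c_score[c]
        norm = sum(new_p.values()) or 1.0
        p_score = {p: s / norm for p, s in new_p.items()}
    return p_score
```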

Conclusions/Significance

The average correlation coefficient of the ReLPR algorithm was 0.82 for various datasets. The results of the ReLPR algorithm were significantly superior to those of previous methods.

2.

Motivation

Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. Likewise, the complete set of biomedical and chemical terms, i.e. the biomedical “term space” (the “Lexeome”), is a key resource for fully integrating the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. Achieving this goal requires not only awareness of all existing terms, but would also profit from knowledge of all their senses and their semantic interpretation (ambiguity, nestedness).

Result

This study compiles a resource for lexical terms of biomedical interest in a standard format (called “LexEBI”), and determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants, chemical entities, and other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, the protein and gene entities as well as the chemical entities both comprise enzymes, leading to hierarchical polysemy, and a large portion of PGNs refer to a chemical entity. Altogether, according to our analysis of the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.

Conclusion

LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open source content, and fully interlinks terms across resources.

3.

Background  

Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.

4.

Objective

Word finding depends on the processing of semantic and lexical information, and it involves an intermediate level for mapping semantic-to-lexical information which also subserves lexical-to-semantic mapping during word comprehension. However, the brain regions implementing these components are still controversial and have not been clarified via a comprehensive lesion model encompassing the whole range of language-related cortices. Primary progressive aphasia (PPA), for which anomia is thought to be the most common sign, provides such a model, but the exploration of cortical areas impacting naming in its three main variants and the underlying processing mechanisms is still lacking.

Methods

We addressed this double issue, related to language structure and PPA, with thirty patients (11 semantic, 12 logopenic, 7 agrammatic variant) using a picture-naming task and voxel-based morphometry for anatomo-functional correlation. First, we analyzed correlations for each of the three variants to identify the regions impacting naming in PPA and to disentangle the core regions of word finding. We then combined the three variants and correlation analyses for naming (semantic-to-lexical mapping) and single-word comprehension (lexical-to-semantic mapping), predicting an overlap zone corresponding to a bidirectional lexical-semantic hub.

Results and Conclusions

Our results showed that superior portions of the left temporal pole and left posterior temporal cortices impact semantic and lexical naming mechanisms in semantic and logopenic PPA, respectively. In agrammatic PPA, naming deficits were rare and did not correlate with any cortical region. Combined analyses revealed a cortical overlap zone in superior/middle mid-temporal cortices, distinct from the two former regions, impacting the bidirectional binding of lexical and semantic information. Altogether, our findings indicate that lexical/semantic word processing depends on an anterior-posterior axis within lateral-temporal cortices, including an anatomically intermediate hub dedicated to lexical-semantic integration. Within this axis, our data reveal the underpinnings of anomia in the PPA variants, which is of relevance for both diagnosis and future therapy strategies.

5.

Purpose

In life cycle assessment (LCA), literature suggests accounting for land as a resource either by what it delivers (e.g., biomass content) or the time and space needed to produce biomass (land occupation), in order to avoid double-counting. This paper proposes and implements a new framework to calculate exergy-based spatial explicit characterization factors (CF) for land as a resource, which deals with both biomass and area occupied on the global scale.

Methods

We created a schematic overview of the Earth, dividing it into two systems (human-made and natural), making it possible to account for what is actually extracted from nature: biomass content was set as the elementary flow to be accounted for in natural systems, and land occupation (through the potential natural net primary production) as the elementary flow in human-made systems. Through exergy, we were able to create CF for land resources for these two different systems. The relevance of the new CF was tested for a number of biobased products.

Results and discussion

Site-generic CF were created for land as a resource for natural systems providing goods to humans, and site-generic and site-dependent CF (at grid, region, country, and continent level) were created for land as a resource within human-made systems. This framework differs from other methods in that it accounts for both land occupation and biomass content without double-counting. It is operational for LCA and accounts for land resources more completely, allowing spatial differentiation. When site-dependent CF were considered for land resources, the overall resource consumption of certain products increased by up to 77% compared with site-generic CF-based data.

Conclusions

This paper clearly distinguished the origin of the resource (natural or human-made systems), allowing consistent accounting for land as a resource. Site-dependent CF for human-made systems allowed spatial differentiation, which was not considered in other resource accounting life cycle impact assessment methods.

6.

Background

Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has been made to develop such an annotated resource.

Results

We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57).
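
As a rough illustration of how far the connective alone can carry coarse sense classification, a most-frequent-sense lookup per connective is already a working classifier of this kind. The training pairs and the fallback sense below are invented for illustration, not BioDRB data.

```python
# Coarse sense classification using only the connective string as the
# feature: memorize the majority sense seen for each connective.
from collections import Counter, defaultdict

train = [("because", "Causal"), ("because", "Causal"),
         ("but", "Contrast"), ("however", "Contrast"),
         ("since", "Causal"), ("since", "Temporal")]

by_connective = defaultdict(Counter)
for connective, sense in train:
    by_connective[connective.lower()][sense] += 1

def classify(connective):
    senses = by_connective.get(connective.lower())
    # Fallback sense for unseen connectives is an arbitrary assumption.
    return senses.most_common(1)[0][0] if senses else "Expansion"

print(classify("since"))  # -> 'Causal' (majority sense in the toy data)
```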

Conclusion

Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more annotated data. The poor performance of a classifier trained in the open domain and tested in the biomedical domain suggests significant differences in the semantic usage of connectives across these domains, and provides robust evidence for a biomedical sublanguage for discourse and the need to develop a specialized biomedical discourse annotated corpus. The results of our cross-domain experiments are consistent with related work on identifying connectives in BioDRB.

7.

Purpose

In life cycle assessment (LCA), resource availability is currently evaluated by means of models based on depletion time, surplus energy, etc. Economic aspects influencing the security of supply and affecting the availability of resources for human use are neglected. The aim of this work is the development of a new model for assessing resource provision capability from an economic angle, complementing existing LCA models. The inclusion of criteria affecting the economic system enables the identification of potential supply risks associated with resource use. Such an assessment, in step with actual practice, provides added value compared to conventional (environmental) resource assessment within LCA. Analysis of resource availability that includes economic information is of major importance for sustaining industrial production.

Methods

New impact categories and characterization models are developed for the assessment of economic resource availability based on existing LCA methodology and terminology. A single score result can be calculated providing information about the economic resource scarcity potential (ESP) of different resources. Based on a life cycle perspective, the supply risk associated with resource use can be assessed, and bottlenecks within the supply chain can be identified. The analysis can be conducted in connection with existing LCA procedures and in line with current resource assessment practice and facilitates easy implementation on an organizational level.

Results and discussion

A portfolio of 17 metals is assessed based on different impact categories. Different impact factors are calculated, enabling identification of high-risk metals. Furthermore, a comparison of ESP and abiotic depletion potential (ADP) is conducted. The availability of resources differs significantly when economic aspects are taken into account in addition to geologic availability. Resources deemed noncritical based on ADP results, such as rare earths, turn out to be associated with high supply risks.

Conclusions

The model developed in this work allows for a more realistic assessment of resource availability beyond geologic finiteness. The new impact categories provide organizations with a practical measure to identify supply risks associated with resources. The assessment delivers a basis for developing appropriate mitigation measures and for increasing resilience towards supply disruptions. By including an economic dimension into resource availability assessment, a contribution towards life cycle sustainability assessment (LCSA) is achieved.

8.

Background

The ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Despite ongoing work moving toward shared data formats and linked identifiers, significant problems persist in establishing shared identity and shared meaning across heterogeneous biomedical data sources.

Results

We present five processes for semantic data integration that, when applied collectively, solve seven key problems. These processes include making explicit the differences between biomedical concepts and database records, aggregating sets of identifiers denoting the same biomedical concepts across data sources, and using declaratively represented forward-chaining rules to take information that is variably represented in source databases and integrate it into a consistent biomedical representation. We demonstrate these processes and solutions by presenting KaBOB (the Knowledge Base Of Biomedicine), a knowledge base of semantically integrated data from 18 prominent biomedical databases using common representations grounded in Open Biomedical Ontologies. An instance of KaBOB with data about humans and seven major model organisms can be built using on the order of 500 million RDF triples. All source code for building KaBOB is available under an open-source license.
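
One of the processes named above, aggregating identifiers that denote the same biomedical concept across sources, amounts to computing connected components over cross-reference pairs. A minimal union-find sketch follows; the cross-reference pairs are shown for illustration only and are not drawn from KaBOB itself.

```python
# Sketch of one integration step: grouping identifiers that denote the
# same biomedical concept, given pairwise cross-references between
# databases. Uses a simple union-find with path compression.
def aggregate(xrefs):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in xrefs:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# e.g. UniProt, HGNC and Entrez Gene identifiers relating to TP53/p53
print(aggregate([("UniProt:P04637", "HGNC:11998"),
                 ("HGNC:11998", "EntrezGene:7157")]))
```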

Conclusions

KaBOB is an integrated knowledge base of biomedical data representationally based in prominent, actively maintained Open Biomedical Ontologies, thus enabling queries of the underlying data in terms of biomedical concepts (e.g., genes and gene products, interactions and processes) rather than features of source-specific data schemas or file formats. KaBOB resolves many of the issues that routinely plague biomedical researchers intending to work with data from multiple data sources and provides a platform for ongoing data integration and development and for formal reasoning over a wealth of integrated biomedical data.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0559-3) contains supplementary material, which is available to authorized users.

9.
10.

Background

Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data.

Results

In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products from PubMed article abstracts and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns" and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general.
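
GEANN's matching framework is not specified beyond its use of WordNet, but the core idea, accepting a pattern slot when a word is WordNet-similar rather than lexically identical, can be sketched as follows. The threshold, function name, and example words are assumptions for illustration.

```python
# Illustrative semantic (rather than exact) pattern matching with
# WordNet. Requires nltk with the wordnet corpus downloaded.
from nltk.corpus import wordnet as wn

def semantically_matches(word, pattern_word, threshold=0.5):
    """True if two words are lexically equal or WordNet-similar."""
    if word == pattern_word:
        return True
    best = 0.0
    for s1 in wn.synsets(word):
        for s2 in wn.synsets(pattern_word):
            sim = s1.path_similarity(s2) or 0.0  # None across POS -> 0
            best = max(best, sim)
    return best >= threshold

print(semantically_matches("regulates", "controls"))
```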

Conclusion

GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns constructed from existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics. The relatively low recall of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on annotation neighbourhoods in textual data, or by adjusting the statistical enrichment threshold to lower values for applications that put more value on achieving higher recall.

11.
12.

Background

The proliferation of the scientific literature in the field of biomedicine makes it difficult to keep abreast of current knowledge, even for domain experts. While general Web search engines and specialized information retrieval (IR) systems have made important strides in recent decades, the problem of accurate knowledge extraction from the biomedical literature is far from solved. Classical IR systems usually return a list of documents that have to be read by the user to extract relevant information. This tedious and time-consuming work can be lessened with automatic Question Answering (QA) systems, which aim to provide users with direct and precise answers to their questions. In this work we propose a novel methodology for QA based on semantic relations extracted from the biomedical literature.

Results

We extracted semantic relations with the SemRep natural language processing system from 122,421,765 sentences, which came from 21,014,382 MEDLINE citations (i.e., the complete MEDLINE distribution up to the end of 2012). A total of 58,879,300 semantic relation instances were extracted and organized in a relational database. The QA process is implemented as a search in this database, which is accessed through a Web-based application, called SemBT (available at http://sembt.mf.uni-lj.si). We conducted an extensive evaluation of the proposed methodology in order to estimate the accuracy of extracting a particular semantic relation from a particular sentence. Evaluation was performed by 80 domain experts. In total 7,510 semantic relation instances belonging to 2,675 distinct relations were evaluated 12,083 times. The instances were evaluated as correct 8,228 times (68%).
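
The abstract describes QA as a search over a relational database of extracted relation instances. A toy version of that lookup is sketched below; the table schema and the stored row are assumptions, not SemBT's actual layout (TREATS is a real SemRep predicate, used here as an example).

```python
# QA as a constrained search over a table of semantic relation
# instances (subject, predicate, object, source sentence).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE relation
              (subject TEXT, predicate TEXT, object TEXT, sentence TEXT)""")
db.execute("INSERT INTO relation VALUES (?,?,?,?)",
           ("aspirin", "TREATS", "headache",
            "Aspirin is commonly used to treat headache."))

# "What does aspirin treat?" becomes a lookup on (subject, predicate):
for obj, sent in db.execute(
        "SELECT object, sentence FROM relation "
        "WHERE subject = ? AND predicate = ?", ("aspirin", "TREATS")):
    print(obj, "--", sent)
```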

Conclusions

In this work we propose an innovative methodology for biomedical QA. The system is implemented as a Web-based application that is able to provide precise answers to a wide range of questions. A typical question is answered within a few seconds. The tool has some extensions that make it especially useful for interpretation of DNA microarray results.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0365-3) contains supplementary material, which is available to authorized users.

13.
14.
Sequencing and analysis of an Irish human genome

Background

Recent studies generating complete human sequences from Asian, African and European subgroups have revealed population-specific variation and disease susceptibility loci. Here we extend these studies by generating 11-fold coverage of the first Irish human genome sequence, choosing a DNA sample from a population of interest for its relative geographical isolation and its genetic impact on other populations.

Results

Using sequence data from a branch of the European ancestral tree as yet unsequenced, we identify variants that may be specific to this population. Through comparisons with HapMap and previous genetic association studies, we identified novel disease-associated variants, including a novel nonsense variant putatively associated with inflammatory bowel disease. We describe a novel method for improving SNP calling accuracy at low genome coverage using haplotype information. This analysis has implications for future re-sequencing studies and validates the imputation of Irish haplotypes using data from the current Human Genome Diversity Cell Line Panel (HGDP-CEPH). Finally, we identify gene duplication events as constituting significant targets of recent positive selection in the human lineage.

Conclusions

Our findings show that there remains utility in generating whole genome sequences to illustrate both general principles and reveal specific instances of human biology. With increasing access to low-cost sequencing, we predict that a number of similar initiatives geared towards answering specific biological questions will emerge, even from groups armed only with the resources of a small research team.

15.
Background

The semantic integration of biomedical resources is still a challenging issue, and one that is required for effective information processing and data analysis. The availability of comprehensive knowledge resources such as biomedical ontologies and integrated thesauri greatly facilitates this integration effort by means of semantic annotation, which allows disparate data formats and contents to be expressed under a common semantic space. In this paper, we propose a multidimensional representation for such a semantic space, where dimensions regard the different perspectives in biomedical research (e.g., population, disease, anatomy and proteins/genes).

Results

This paper presents a novel method for building multidimensional semantic spaces from semantically annotated biomedical data collections. This method consists of two main processes: knowledge and data normalization. The former arranges the concepts provided by a reference knowledge resource (e.g., biomedical ontologies and thesauri) into a set of hierarchical dimensions for analysis purposes. The latter reduces the annotation set associated with each collection item to a set of points in the multidimensional space. Additionally, we have developed a visual tool, called 3D-Browser, which implements OLAP-like operators over the generated multidimensional space. The method and the tool have been tested and evaluated in the context of the Health-e-Child (HeC) project. Automatic semantic annotation was applied to tag three collections of abstracts taken from PubMed, one for each target disease of the project, the Uniprot database, and the HeC patient record database. We adopted the UMLS Metathesaurus 2010AA as the reference knowledge resource.

Conclusions

Current knowledge resources and semantic-aware technology make the integration of biomedical resources possible. Such an integration is performed through semantic annotation of the intended biomedical data resources. This paper shows how these annotations can be exploited for integration, exploration, and analysis tasks. Results over a real scenario demonstrate the viability and usefulness of the approach, as well as the quality of the generated multidimensional semantic spaces.

16.

Background

Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, a multitude of different approaches for classifying sentences in biomedical publications according to their coarse semantic and rhetoric categories (e.g., Background, Methods, Results, Conclusions) have been devised, with recent state-of-the-art results reported for a complex deep learning model. Recent evidence showed that shallow and wide neural models such as fastText can provide results that are competitive or superior to complex deep learning models while requiring drastically lower training times and having better scalability. We analyze the efficacy of the fastText model in the classification of biomedical sentences in the PubMed 200k RCT benchmark, and introduce a simple pre-processing step that enables the application of fastText on sentence sequences. Furthermore, we explore the utility of two unsupervised pre-training approaches in scenarios where labeled training data are limited.

Results

Our fastText-based methodology yields a state-of-the-art F1 score of .917 on the PubMed 200k benchmark when sentence ordering is taken into account, with a training time of only 73 s on standard hardware. Applying fastText on single sentences, without taking sentence ordering into account, yielded an F1 score of .852 (training time 13 s). Unsupervised pre-training of N-gram vectors greatly improved the results for small training set sizes, with an increase of F1 score of .21 to .74 when trained on only 1000 randomly picked sentences without taking sentence ordering into account.
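
For readers unfamiliar with fastText's supervised mode, the single-sentence setting reported above corresponds to a short training call, sketched below. The hyperparameters, file name, and the position-token idea in the comments are illustrative assumptions, not the paper's exact configuration or pre-processing.

```python
# Supervised fastText classification of biomedical sentences.
# train.txt lines look like: "__label__RESULTS <tokens of the sentence>"
# (an ordering-aware variant might additionally prepend a coarse
# position token such as "pos_3of10" to each sentence).
import fasttext

model = fasttext.train_supervised(input="train.txt",
                                  wordNgrams=2, epoch=5, lr=0.5)
labels, probs = model.predict("the primary outcome improved significantly")
print(labels[0], probs[0])
```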

Conclusions

Because of its ease of use and performance, fastText should be among the first choices of tools when tackling biomedical text classification problems with large corpora. Unsupervised pre-training of N-gram vectors on domain-specific corpora also makes it possible to apply fastText when labeled training data are limited.

17.

Background:

Reliable information extraction applications have been a long-sought goal of the biomedical text mining community, a goal that, if reached, would provide valuable tools to benchside biologists in their increasingly difficult task of assimilating the knowledge contained in the biomedical literature. We present an integrated approach to concept recognition in biomedical text. Concept recognition provides key information that has been largely missing from previous biomedical information extraction efforts, namely direct links to well-defined knowledge resources that explicitly cement the concept's semantics. The BioCreative II tasks discussed in this special issue have provided a unique opportunity to demonstrate the effectiveness of concept recognition in the field of biomedical language processing.

Results:

Through the modular construction of a protein interaction relation extraction system, we present several use cases of concept recognition in biomedical text, and relate these use cases to potential uses by the benchside biologist.

Conclusion:

Current information extraction technologies are approaching performance standards at which concept recognition can begin to deliver high-quality data to the benchside biologist. Our system is available as part of the BioCreative Meta-Server project and on the web at http://bionlp.sourceforge.net.

18.

Purpose

Political interest in the future availability of natural resources has spiked recently, with new documents from the European Union, United Nations Environment Programme and the US National Research Council assessing the supply situation of key raw materials. As resource efficiency is considered a key element of sustainable development, suitable methods to address the sustainability of resource use are increasingly needed. Life cycle thinking and assessment may play a principal role here. Nonetheless, the extent to which current life cycle impact assessment methods are capable of answering resource sustainability challenges is widely debated. The aim of this paper is to present key elements of the ongoing discussion, contributing to the future development of more robust and comprehensive methods for evaluating resources in the life cycle assessment (LCA) context.

Methods

We systematically review current impact assessment methods dealing with resources, identifying areas of improvement. Three key issues for sustainability assessment of resources are examined: renewability, recyclability and criticality; this is complemented by a cross-comparison of methodological features and completeness of resource coverage.

Results and discussion

LCA's approach to resource depletion is characterised by a lack of consensus on methodology and on the relative ranking of resource depletion impacts, as a comparison of characterisation factors shows. The examined models yield vastly different characterisations of the impacts from resource depletion and show gaps in the number and types of resources covered.

Conclusions

Key areas of improvement are identified and discussed. Firstly, biotic resources and their renewal rates have so far received relatively little regard within LCA; secondly, the debate on critical raw materials and the opportunity of introducing criticality within LCA is controversial and requires further effort for a conciliating vision and indicators. We identify points where current methods can be expanded to accommodate these issues and cover a wider range of natural resources.

19.

Background  

Bioinformatics tools for automatic processing of biomedical literature are invaluable for both the design and interpretation of large-scale experiments. Many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have thus been developed for use in the biomedical field. A key IE task in this field is the extraction of biomedical relations, such as protein-protein and gene-disease interactions. However, most biomedical relation extraction systems usually ignore adverbial and prepositional phrases and words identifying location, manner, timing, and condition, which are essential for describing biomedical relations. Semantic role labeling (SRL) is a natural language processing technique that identifies the semantic roles of these words or phrases in sentences and expresses them as predicate-argument structures. We construct a biomedical SRL system called BIOSMILE that uses a maximum entropy (ME) machine-learning model to extract biomedical relations. BIOSMILE is trained on BioProp, our semi-automatic, annotated biomedical proposition bank. Currently, we are focusing on 30 biomedical verbs that are frequently used or considered important for describing molecular events.
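
A maximum entropy model is equivalent to multinomial logistic regression, so a minimal stand-in for an ME argument classifier of this kind can be sketched with off-the-shelf tools. The features, role labels, and training examples below are invented for illustration, not BioProp data or BIOSMILE's feature set.

```python
# Toy maximum-entropy (multinomial logistic regression) classifier
# assigning semantic roles to candidate argument phrases of a verb.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = [
    ({"head": "IL-2", "position": "before", "predicate": "activate"}, "Arg0"),
    ({"head": "cells", "position": "after", "predicate": "activate"}, "Arg1"),
    ({"head": "vitro", "position": "after", "predicate": "activate"}, "ArgM-LOC"),
]
X = [features for features, _ in train]
y = [role for _, role in train]

srl = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
srl.fit(X, y)
print(srl.predict([{"head": "cells", "position": "after",
                    "predicate": "activate"}]))  # -> ['Arg1']
```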

20.

Background

We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.

Methodology

We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques combined with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
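
Of the nine approaches, the tf-idf cosine variant with top-n filtering is the most compact to illustrate. The toy corpus and the value of n below are assumptions; the study operated on 2.15 million records.

```python
# tf-idf cosine similarity over titles/abstracts, thresholded to the
# top-n most similar neighbours per document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

docs = ["gene expression in tumour cells",
        "tumour suppressor gene mutations",
        "protein folding dynamics"]
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)
np.fill_diagonal(sim, 0.0)          # ignore self-similarity

top_n = 1                           # keep the single best neighbour
for i, row in enumerate(sim):
    j = int(np.argmax(row))
    print(f"doc {i} -> doc {j} (cosine {row[j]:.2f})")
```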

Conclusions

PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
