Similar Articles
1.
BIOSILICO 2003, 1(2): 69-80
The information age has made the electronic storage of large amounts of data effortless. The proliferation of documents available on the Internet, corporate intranets, news wires and elsewhere is overwhelming. Search engines only exacerbate this overload problem by making ever more documents available in a few keystrokes. The overload also exists in the biomedical field, where scientific publications and other forms of text-based data are produced at an unprecedented rate. Text mining is the combined, automated process of analyzing unstructured, natural-language text to discover information and knowledge that are typically difficult to retrieve. Here, we focus on text mining as applied to the biomedical literature, in particular on finding relationships among genes, proteins, drugs and diseases in order to facilitate the understanding and prediction of complex biological processes. The LitMiner™ system, developed specifically for this purpose, is described in relation to the Knowledge Discovery and Data Mining Cup 2002, which serves as a formal evaluation of the system.

2.
MOTIVATION: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no 'average biologist' client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD), database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective if they can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus fact extraction on those regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements and intended to support specific biomedical retrieval and extraction tasks. RESULTS: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss the issues involved in this task and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible and that this new framework for scientific sentence categorization is applicable in practice.
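A minimal sketch of this kind of sentence-fragment categorization, assuming scikit-learn and two illustrative category labels ('evidence' vs. 'fact'); the corpus, the annotation dimensions and the actual features used in the paper are not reproduced here:

    # Sketch: categorize sentence fragments along one hypothetical annotation
    # dimension. The two training fragments and labels are placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    fragments = [
        "The protein was detected by Western blot.",   # experimental evidence
        "p53 is a well-known tumor suppressor.",       # high-confidence fact
    ]
    labels = ["evidence", "fact"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(fragments, labels)
    print(clf.predict(["Expression was measured by qPCR."]))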

3.
Categorization of biomedical articles is a central task for supporting various curation efforts. It can also form the basis for effective biomedical text mining. Automatic text classification in the biomedical domain is thus an active research area. Contests organized by the KDD Cup (2002) and the TREC Genomics track (since 2003) defined several annotation tasks that involved document classification, and provided training and test data sets. So far, these efforts have focused on analyzing only the text content of documents. However, as was noted in the KDD'02 text mining contest, where figure captions proved to be an invaluable feature for identifying documents of interest, images often provide curators with critical information. We examine the possibility of using information derived directly from image data, and of integrating it with text-based classification, for biomedical document categorization. We present a method for obtaining features from images and for using them, both alone and in combination with text, to perform the triage task introduced in the TREC Genomics track 2004. The task was to determine which documents are relevant to a given annotation task performed by the Mouse Genome Database curators. We show preliminary results demonstrating that the method has strong potential to enhance and complement traditional text-based categorization methods.
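As a rough sketch of the fusion step, under the assumption (not stated in the abstract) that image and text features are simply concatenated before classification; the two toy feature extractors stand in for the paper's actual ones:

    # Sketch: fuse image-derived features with term-presence text features for
    # document triage. Both extractors are illustrative stand-ins.
    import numpy as np

    def text_features(doc_text, vocabulary):
        # term-presence vector over a fixed vocabulary
        words = set(doc_text.lower().split())
        return np.array([1.0 if w in words else 0.0 for w in vocabulary])

    def image_features(image_gray):
        # crude global statistics of a grayscale figure: mean, spread, edge density
        g0, g1 = np.gradient(image_gray.astype(float))
        edges = np.hypot(g0, g1)
        return np.array([image_gray.mean(), image_gray.std(), (edges > 32).mean()])

    vocab = ["expression", "mouse", "figure"]
    doc = "Expression patterns in mouse tissue are shown in the figure."
    fig = (np.random.rand(64, 64) * 255).astype(np.uint8)   # placeholder image

    combined = np.concatenate([text_features(doc, vocab), image_features(fig)])
    # 'combined' would then feed any standard classifier for the triage decision.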

4.
Shang Y, Li Y, Lin H, Yang Z. PLoS ONE 2011, 6(8): e23862
Automatic text summarization for a biomedical concept can help researchers to extract the key points of a given topic from a large amount of biomedical literature efficiently. In this paper, we present a method for generating a text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) we extract semantic relations in each sentence using the semantic knowledge representation tool SemRep; 2) we develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation; and 3) for relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate a text summary, using an information-retrieval-based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves performance, and our results are better than those of the MEAD system, a well-known text summarization tool.
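A toy sketch of the relation-level retrieval in stage 2, with relations shown as (subject, predicate, object) triples; the overlap-based relevance score is an illustrative assumption, not the paper's actual ranking method:

    # Sketch: keep only the extracted relations relevant to a query concept.
    query = "H1N1"
    relations = [
        ("H1N1", "CAUSES", "influenza"),
        ("oseltamivir", "TREATS", "H1N1"),
        ("aspirin", "TREATS", "headache"),
    ]

    def score(rel, concept):
        # crude relevance: how many arguments mention the concept
        return sum(concept.lower() in arg.lower() for arg in rel)

    relevant = [r for r in relations if score(r, query) > 0]
    print(relevant)   # the two H1N1 relations survive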

5.
MOTIVATION: Full-text documents potentially hold more information than their abstracts, but require more resources for processing. We investigated the added value of full text over abstracts in terms of information content and of occurrences of gene symbol-gene name combinations that can resolve gene-symbol ambiguity. RESULTS: We analyzed a set of 3902 biomedical full-text articles. Different keyword measures indicate that information density is highest in abstracts, but that information coverage in full text is much greater than in abstracts. Analysis of five standard article sections shows that the highest information coverage is located in the results section. Still, 30-40% of the information mentioned in each section is unique to that section. Only 30% of the gene symbols in the abstract are accompanied by their corresponding names, and a further 8% of the gene names are found in the full text. In the full text, only 18% of the gene symbols are accompanied by their gene names.
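A small sketch of the kind of symbol-name co-occurrence being counted, assuming the common "Long Name (SYMBOL)" surface pattern; real gene-symbol resolution is considerably more involved:

    # Sketch: which gene symbols appear together with a spelled-out name?
    import re

    text = ("Breast cancer 1 (BRCA1) is discussed, whereas TP53 appears "
            "without its name.")
    # "Long Name (SYMBOL)" pairs
    pair = re.compile(r"([A-Z][A-Za-z0-9 \-]+?)\s*\(([A-Z][A-Z0-9]{1,9})\)")
    paired = {sym for _, sym in pair.findall(text)}
    all_symbols = set(re.findall(r"\b[A-Z][A-Z0-9]{2,9}\b", text))
    print(paired, all_symbols - paired)   # {'BRCA1'} {'TP53'}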

6.
The choice of an efficient document preparation system is an important decision for any academic researcher. To assist the research community, we report a software usability study in which 40 researchers across different disciplines prepared scholarly texts with either Microsoft Word or LaTeX. The probe texts included simple continuous text, text with tables and subheadings, and complex text with several mathematical equations. We show that LaTeX users were slower than Word users, wrote less text in the same amount of time, and produced more typesetting, orthographic, grammatical and formatting errors. On most measures, expert LaTeX users performed even worse than novice Word users. LaTeX users do, however, more often report enjoying their software. We conclude that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems. Individuals, institutions and journals should carefully consider the ramifications of this finding when choosing document preparation strategies, or when requiring them of authors.

7.
Historical text archives constitute a rich and diverse source of information that is becoming increasingly accessible thanks to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data efficiently. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestion of synonyms of user-entered query terms, exploration of different concepts mentioned within search results, or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences in and evolution of vocabulary, terminology, language structure and style compared with more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and the relationships amongst them may be expressed. These resources were employed to support the development of a modular pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges, as the basis for the development of a publicly accessible, semantically oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.

8.

Background

Web-based, free-text documents on science and technology have been growing rapidly on the web. However, most of these documents are not immediately processable by computers, which slows down the acquisition of useful information. Computational ontologies might offer a solution by enabling semantically machine-readable data sets. However, the process of ontology creation, instantiation and maintenance is still based on manual methodologies and is thus time- and cost-intensive.

Method

We focused on a large corpus containing information on researchers, research fields and institutions. We based our strategy on traditional entity recognition, social computing and correlation. We devised a semi-automatic approach for the recognition, correlation and extraction of named entities and relations from textual documents, which are then used to create, instantiate and maintain an ontology.

Results

We present a prototype demonstrating the applicability of the proposed strategy, along with a case study describing how direct and indirect relations can be extracted from academic and professional activities registered in a database of curricula vitae in free-text format. We present evidence that this system can identify entities to assist in the process of knowledge extraction and representation in support of ontology maintenance. We also demonstrate the extraction of relationships among ontology classes and their instances.

Conclusion

We have demonstrated that our system can be used to convert research information in free-text format into a database with a semantic structure. Future studies should test this system on the growing amount of free-text information available at the institutional and national levels.
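As a toy illustration of the semi-automatic extraction described above, here is a single hand-written pattern that instantiates researcher, institution and affiliation entries; the pattern, the class names and the example sentence are all assumptions for illustration:

    # Sketch: populate an ontology-like structure from a free-text CV fragment.
    import re

    ontology = {"Researcher": [], "Institution": [], "affiliatedWith": []}

    cv = "Maria Silva is a professor at the University of Campinas."
    m = re.search(r"^(.+?) is a .*? at (the .+?)\.$", cv)
    if m:
        person, institution = m.group(1), m.group(2)
        ontology["Researcher"].append(person)
        ontology["Institution"].append(institution)
        ontology["affiliatedWith"].append((person, institution))

    print(ontology)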

9.
The electrical penetration graph (EPG) technique is a powerful tool for investigating the hidden feeding behavior of piercing-sucking insects, allowing researchers to link recorded EPG waveforms to stylet penetration and to complex feeding-related behaviors occurring within plant tissue. Calculating the numerous EPG parameters necessary to unravel these complex insect-plant interactions is very time-consuming, and few tools have been developed to automate it. EPG-Calc is a rich internet application intended to fill this gap, providing a fast and user-friendly web-based interface that takes as input analysis files from dedicated software (STYLET+) or database-compatible CSV text files containing waveform codes and cumulative times, and produces output files in database-compatible CSV text or Microsoft Excel® XLS format that are directly usable by various statistical analysis packages. EPG-Calc greatly reduces the time needed to calculate EPG parameters and supports more than 100 different parameters based on standardized definitions and calculation methods, avoiding the confusion caused by the varying definitions and calculations of individual authors.
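To make the input and output concrete, here is a sketch that computes one classic EPG parameter, total duration per waveform, from (waveform code, cumulative time) rows such as the CSV input is described as containing; the exact column layout and the convention that each row marks the end of its waveform period are assumptions:

    # Sketch: total duration per waveform code from a cumulative-time CSV.
    import csv, io
    from collections import defaultdict

    rows = io.StringIO("code,cum_time\nnp,12.5\nC,80.0\nE1,95.5\nE2,300.0\n")
    durations = defaultdict(float)
    prev_time = 0.0
    for rec in csv.DictReader(rows):
        t = float(rec["cum_time"])
        durations[rec["code"]] += t - prev_time   # row assumed to end its waveform
        prev_time = t

    print(dict(durations))   # {'np': 12.5, 'C': 67.5, 'E1': 15.5, 'E2': 204.5}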

10.
Topic models and neural networks can discover meaningful low-dimensional latent representations of text corpora; as such, they have become a key technology for document representation. However, such models presume all documents are non-discriminatory, making the latent representation depend on all other documents and unable to provide discriminative document representations. To address this problem, we propose a semi-supervised, manifold-inspired autoencoder to extract meaningful latent representations of documents, taking the local perspective that the latent representations of nearby documents should be correlated. We first determine the set of discriminative neighbors using Euclidean distance in the observation space. The autoencoder is then trained by jointly minimizing the Bernoulli cross-entropy error between input and output and the sum of squared errors between the neighbors of the input and the output. Results on two widely used corpora show that our method yields at least a 15% improvement in document clustering and nearly a 7% improvement in classification tasks compared with competing methods. The evidence demonstrates that our method readily captures more discriminative latent representations of new documents. Moreover, some meaningful combinations of words can be efficiently discovered by activating features, which promotes the comprehensibility of the latent representation.
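A NumPy sketch of one plausible reading of that joint objective: reconstruction cross-entropy plus a squared penalty pulling a document's reconstruction toward its neighbor's; the weighting factor lam and the exact pairing are assumptions, since the abstract leaves them unspecified:

    # Sketch: joint loss = Bernoulli cross-entropy + neighbor squared error.
    import numpy as np

    def joint_loss(x, x_hat, nbr_hat, lam=0.1):
        # x, x_hat, nbr_hat: document vectors in [0, 1] of equal length
        eps = 1e-9   # avoid log(0)
        recon = -(x * np.log(x_hat + eps)
                  + (1 - x) * np.log(1 - x_hat + eps)).sum()
        manifold = ((x_hat - nbr_hat) ** 2).sum()   # tie neighbors together
        return recon + lam * manifold

    x = np.array([1.0, 0.0, 1.0])
    print(joint_loss(x, np.array([0.9, 0.1, 0.8]), np.array([0.85, 0.1, 0.8])))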

11.
Biological diversity in the patent system is an enduring focus of controversy, but empirical analysis of the presence of biodiversity in the patent system has been limited. To address this problem, we text-mined 11 million patent documents for 6 million Latin species names from the Global Names Index (GNI) established by the Global Biodiversity Information Facility (GBIF) and the Encyclopedia of Life (EOL). We identified 76,274 full Latin species names from 23,882 genera in 767,955 patent documents; 25,595 species appeared in the claims section of 136,880 patent documents. This reveals that human innovative activity involving biodiversity in the patent system focuses on approximately 4% of taxonomically described species and between 0.8% and 1% of predicted global species. In this article we identify the major features of the patent landscape for biological diversity by focusing on key areas including pharmaceuticals, neglected diseases, traditional medicines, genetic engineering, foods, biocides, marine genetic resources and Antarctica. We conclude that the narrow focus of human innovative activity and ownership of genetic resources is unlikely to be in the long-term interest of humanity. We argue that a broader spectrum of biodiversity needs to be opened up to research and development based on the principles of equitable benefit-sharing, respect for the objectives of the Convention on Biological Diversity, human rights and ethics. Finally, we argue that alternative models of innovation, such as open source and commons models, are required to open up biodiversity for research that addresses actual and neglected areas of human need. The research aims to inform the implementation of the 2010 Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization, and international debates on the governance of genetic resources. Our research also aims to inform debates under the Intergovernmental Committee on Intellectual Property and Genetic Resources, Traditional Knowledge and Folklore at the World Intellectual Property Organization.
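The core text-mining step here reduces to dictionary matching at scale. A sketch with a two-name dictionary standing in for the 6 million GNI names; at the real scale one would compile the dictionary into a string-matching automaton (e.g., Aho-Corasick) rather than scan per name:

    # Sketch: find known Latin binomials in patent text by dictionary lookup.
    species = {"Taxus brevifolia", "Artemisia annua"}

    claim = ("A method of extracting paclitaxel from Taxus brevifolia bark "
             "is claimed.")
    hits = {name for name in species if name in claim}
    print(hits)   # {'Taxus brevifolia'}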

12.
Parallel corpora have become an essential resource for work in multilingual natural language processing. Sentence-aligned parallel corpora, however, are more useful than non-aligned ones for cross-language information retrieval and machine translation applications. In this paper, we present a new approach to aligning sentences in bilingual parallel corpora based on a feed-forward neural network classifier. A feature vector is extracted from each text pair under consideration, containing features such as length, a punctuation score and a cognate score. A manually prepared set of training data was used to train the feed-forward neural network, and another set was used for testing. Using this new approach, we achieved an error reduction of 60% over a length-based approach when applied to English-Arabic parallel documents. Moreover, the approach is valid for any language pair, and it is quite flexible, since the feature vector may contain more, fewer or different features than those used in our system, such as a lexical match feature.
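A small sketch of the kind of feature vector described, for one candidate sentence pair; the three scores are simplified stand-ins for the paper's definitions, and the resulting vector would then be fed to the feed-forward classifier:

    # Sketch: alignment features for a candidate (source, target) sentence pair.
    def features(src, tgt):
        length_ratio = len(src) / max(len(tgt), 1)
        punct = ".,;:!?"
        punct_score = sum(min(src.count(c), tgt.count(c)) for c in punct)
        cognates = set(src.split()) & set(tgt.split())   # shared tokens, e.g. numbers
        return [length_ratio, punct_score, len(cognates)]

    print(features("The treaty was signed in 1995.",
                   "تم توقيع المعاهدة في 1995."))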

13.

Background  

The availability of biomedical literature in electronic format has made it possible to implement automatic text-processing methods that expose implicit relationships among different documents and, more importantly, the functional relationships among the molecules and processes that those documents describe.

14.
SUMMARY: BioCaster is an ontology-based text mining system for detecting and tracking the distribution of infectious disease outbreaks from linguistic signals on the Web. The system continuously analyzes documents from over 1700 RSS feeds, classifies them for topical relevance and plots them onto a Google map using geocoded information. The background knowledge for bridging the gap between layman's terms and formal coding systems is contained in the freely available BioCaster ontology, which includes information in eight languages focused on the epidemiological role of pathogens as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and event recognition. Higher-order event analysis is used to detect more precisely specified warning signals, which can then be sent to registered users via email alerts. Evaluation of the system for topic recognition and entity identification was conducted on a gold-standard corpus of annotated news articles. AVAILABILITY: The BioCaster map and ontology are freely available via a web portal at http://www.biocaster.org.

15.
Jerne’s idiotypic network theory stresses the importance of antibody-to-antibody interactions and provides possible explanations for self-tolerance and increased diversity in the immune repertoire. In this paper, we use an immune network model to build a user profile for adaptive information filtering. In the profile's network, antibody-to-antibody interactions model correlations between words in text. The user profile must be able to represent a user's multiple interests and adapt to changes in them over time; this is a complex and dynamic engineering problem with clear analogies to the immune process of self-assertion. We present a series of experiments investigating the effect of term correlations on the profile's performance. The results show that term correlations can encode additional information, which has a positive effect on the profile's ability to assess the relevance of documents to the user's interests and to adapt to changes in those interests.

16.
We present results involving an approach to acridine orange staining of intact cells based on basic physicochemical considerations. We show by static microfluorometry of several in vitro and in vivo cell lines that the important parameters for such staining are the molar ratio (Formula: see text) and the molar concentration of acridine orange. Differential nuclear DNA and cytoplasmic RNA staining is controlled entirely by these two parameters, as we show with a physicochemical model of cell-dye interaction. Finally, we use the method to study the growth parameters of complex in vivo cell populations by automated multiparameter flow microfluorometry. We have also explored, with both static and flow systems, the effect on AO-cell staining of various cell pretreatments, such as Triton X-100 and chelating agents.

17.
MOTIVATION: Research on the roles of gene products in cells is accumulating and changing rapidly, but most of the results are still reported in text form and are not directly accessible by computers. To expedite the progress of functional bioinformatics, it is therefore important to process large amounts of biomedical literature efficiently and to transform the extracted knowledge into a structured format usable by biologists and medical researchers. Our aim was to develop an intelligent text-mining system that extracts from biomedical documents knowledge about the functions of gene products, thereby facilitating computing with function. RESULTS: We have developed an ontology-based text-mining system to efficiently extract from the biomedical literature knowledge about the functions of gene products. We also propose methods of sentence alignment and sentence classification to discover the functions of gene products discussed in digital texts. AVAILABILITY: http://ismp.csie.ncku.edu.tw/~yuhc/meke/

18.

Background

In recent years, high-throughput methods have led to a massive expansion of the free-text literature on molecular biology. Automated text mining has developed as an application technology for formalizing this wealth of published results into structured database entries. However, database curation is still largely done by hand, and although there have been many studies of automated approaches, it remains difficult to classify documents into top-level categories based on the type of organism being investigated. Here we present a comparative analysis of state-of-the-art supervised models used to classify both abstracts and full-text articles for three model organisms.

Results

Ablation experiments were conducted on a large gold-standard corpus of 10,000 abstracts and full papers containing data on three model organisms (fly, mouse and yeast). Among the eight learner models tested, the best achieved an F-score of 97.1% for fly, 88.6% for mouse and 85.5% for yeast, using a variety of features that included gene names, organism frequency, MeSH headings and term-species associations. We noted that term-species associations were particularly effective in improving classification performance. The benefit of using full-text articles over abstracts was consistently observed across all three organisms.

Conclusions

By comparing various learner algorithms and features, we present an optimized system that automatically detects the major focus organism in full-text articles for fly, mouse and yeast. We believe the method will be extensible to other organism types.

19.
The relationship between species richness and the prevalence of vector-borne disease has been widely studied, with a range of outcomes. Increasing the number of host species for a pathogen may decrease infection prevalence (dilution effect), increase it (amplification), or have no effect. We derive a general model, and a specific implementation, showing that when the number of vector feeding sites on each host is limiting, the effects of host population size on pathogen dynamics are more complex than previously thought. The model examines vector-borne disease in the presence of different host species that are either competent or incompetent (i.e. unable to transmit the pathogen to vectors) as reservoirs for the pathogen. With a single host species present, the basic reproduction ratio R0 is a non-monotonic function of the host population size H, i.e. a value [Formula: see text] exists that maximises R0. Surprisingly, if [Formula: see text], a reduction in host population size may actually increase R0. Extending this model to a two-host-species system, incompetent individuals from the second host species can alter the value of [Formula: see text], which may reverse the effect of host population reduction on pathogen prevalence. We argue that when vector feeding sites on hosts are limiting, the net effect of increasing host diversity might not be correctly predicted using simple frequency-dependent epidemiological models.
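The elided formulas cannot be reconstructed from the abstract, but a minimal toy form (an assumption, not the paper's model) shows how limited feeding sites make R0 non-monotonic in H: if the per-host biting rate saturates like 1/(a+H), then

    \[
      R_0(H) \propto \frac{H}{(a+H)^2},
      \qquad
      \frac{dR_0}{dH} \propto \frac{a-H}{(a+H)^3},
    \]

so R0 rises for H < a, peaks at H* = a, and falls for H > a; above the peak, reducing the host population actually increases R0, matching the qualitative behaviour described in the abstract.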

20.
We introduce and make publicly available a large corpus of digitized primary-source human rights documents that are published annually by monitoring agencies including Amnesty International, Human Rights Watch, the Lawyers Committee for Human Rights, and the United States Department of State. In addition to the digitized text, we also make available and describe document-term matrices: datasets that systematically organize the word counts of each unique document by each unique term within the corpus of human rights documents. To contextualize the importance of this corpus, we describe the development of coding procedures in the human rights community and several existing categorical indicators that have been created by human coding of the human rights documents contained in the corpus. We then discuss how the new human rights corpus and the existing human rights datasets can be used with a variety of statistical analyses and machine learning algorithms to help scholars understand how human rights practices and reporting have evolved over time. We close with a discussion of our plans for dataset maintenance, updating and availability.
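For readers unfamiliar with the format, a tiny sketch of how such a document-term matrix is built (the three "documents" here are placeholders):

    # Sketch: rows are documents, columns are unique terms, cells are counts.
    from collections import Counter

    docs = ["detained without trial", "trial was fair", "detained again"]
    vocab = sorted({w for d in docs for w in d.split()})
    dtm = [[Counter(d.split())[w] for w in vocab] for d in docs]

    print(vocab)
    for row in dtm:
        print(row)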
