Found 20 similar articles; search took 15 ms
1.
Textpresso: an ontology-based information retrieval and extraction system for biological literature
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool, as well as a search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
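The sentence-level search described in this abstract, where a query combines ontology categories with plain keywords, can be illustrated with a minimal sketch. This is not Textpresso's implementation: the two-category lexicon, the example sentences, and the function names `tag_sentence` and `search` are all assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon (category -> terms). The abstract reports about
# 14,500 lexicon entries in 33 categories; these few entries are made up.
LEXICON = {
    "gene": {"lin-12", "glp-1", "let-23"},
    "association": {"interacts", "binds", "associates"},
}

def tokenize(sentence):
    """Lower-case word tokens; hyphens are kept so gene names stay whole."""
    return set(re.findall(r"[\w-]+", sentence.lower()))

def tag_sentence(sentence):
    """Return the categories that have at least one term in the sentence."""
    tokens = tokenize(sentence)
    return {cat for cat, terms in LEXICON.items() if tokens & terms}

def search(sentences, categories=(), keywords=()):
    """Return sentences that contain every requested category and keyword."""
    hits = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        has_categories = set(categories) <= tag_sentence(sentence)
        has_keywords = all(k.lower() in tokens for k in keywords)
        if has_categories and has_keywords:
            hits.append(sentence)
    return hits

# Illustrative sentences only; they are not quotations from any article.
CORPUS = [
    "lin-12 interacts with glp-1 during development.",
    "The vulva forms in the third larval stage.",
    "let-23 is expressed in vulval precursor cells.",
]

print(search(CORPUS, categories=("gene", "association")))
```

Requiring both a gene term and an association term in the same sentence is what separates this kind of query from a keyword search: the user asks for a type of statement and does not have to list every gene name.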
2.
Palakal M Mukhopadhyay S Mostafa J Raje R N'Cho M Mishra S 《Bioinformatics (Oxford, England)》2002,18(10):1283-1288
MOTIVATION: As biomedical researchers are amassing a plethora of information in a variety of forms resulting from the advancements in biomedical research, there is a critical need for innovative information management and knowledge discovery tools to sift through these vast volumes of heterogeneous data and analysis tools. In this paper we present a general model for an information management system that is adaptable and scalable, followed by a detailed design and implementation of one component of the model. The prototype, called BioSifter, was applied to problems in the bioinformatics area. RESULTS: BioSifter was tested using 500 documents obtained from the PubMed database on two biological problems related to genetic polymorphism and extracorporeal shockwave lithotripsy. The results indicate that BioSifter is a powerful tool for biological researchers to automatically retrieve relevant text documents from biological literature based on their interest profile. The results also indicate that the first stage of the information management process, i.e. data to information transformation, significantly reduces the size of the information space. The filtered data obtained through BioSifter is relevant as well as much smaller in dimension compared to all the retrieved data. This would in turn significantly reduce the complexity associated with the next level transformation, i.e. information to knowledge.
3.
4.
Gathering data has always presented a considerable challenge for scientists, whether it is done in laboratories or through reading specialized journals. It becomes particularly difficult in fields that generate a lot of publications, such as marine microbiology. Professionals interested in the environments of seas and oceans, and the organisms inhabiting them, are faced with a flood of information on a daily basis. New papers on genetic samples and metagenomes of marine origin appear constantly, which is why the participants of the MetaFunctions project found it necessary to develop a new, efficient way of extracting, processing, and storing data. As a result, a new system called Poseidon was created, offering a convenient alternative to manual extraction of marine microbiology-related information.
5.
Protein structures and information extraction from biological texts: the PASTA system (total citations: 1, self-citations: 0, citations by others: 1)
MOTIVATION: The rapid increase in volume of protein structure literature means useful information may be hidden or lost in the published literature and the process of finding relevant material, sometimes the rate-determining factor in new research, may be arduous and slow. RESULTS: We describe the Protein Active Site Template Acquisition (PASTA) system, which addresses these problems by performing automatic extraction of information relating to the roles of specific amino acid residues in protein molecules from online scientific articles and abstracts. Both the terminology recognition and extraction capabilities of the system have been extensively evaluated against manually annotated data and the results compare favourably with state-of-the-art results obtained in less challenging domains. PASTA is the first information extraction (IE) system developed for the protein structure domain and one of the most thoroughly evaluated IE systems operating on biological scientific text to date. AVAILABILITY: PASTA makes its extraction results available via a browser-based front end: http://www.dcs.shef.ac.uk/nlp/pasta/. The evaluation resources (manually annotated corpora) are also available through the website: http://www.dcs.shef.ac.uk/nlp/pasta/results.html.
6.
Battail G 《Bio Systems》2004,76(1-3):279-290
We develop ideas on genome replication introduced in Battail [Europhys. Lett. 40 (1997) 343]. Starting with the hypothesis that the genome replication process uses error-correcting means, and the auxiliary one that nested codes are used to this end, we first review the concepts of redundancy and error-correcting codes. Then we show that these hypotheses imply that: distinct species exist with a hierarchical taxonomy, there is a trend of evolution towards complexity, and evolution proceeds by discrete jumps. At least the first two features above may be considered as biological facts so, in the absence of direct evidence, they provide an indirect proof in favour of the hypothesized error-correction system. The very high redundancy of genomes makes it possible. In order to explain how it is implemented, we suggest that soft codes and replication decoding, to be briefly described, are plausible candidates. Experimentally proven properties of long-range correlation of the DNA message substantiate this claim.
7.
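Object-oriented technique for extracting dominant tree species type information (total citations: 1, self-citations: 0, citations by others: 1)
Extracting information on the dominant tree species types of forest vegetation is a difficult problem in remote sensing image classification. Object-oriented classification is a new approach to accurate type-information extraction from high-spatial-resolution remote sensing data. Using 2013 Quickbird imagery as the base data and the Jiangle Forest Farm in Sanming City, Fujian Province, as the study area, this paper applied object-oriented multi-scale segmentation to extract type information for cropland, shrub-grassland, young afforested land, Masson pine, Chinese fir, and broadleaf trees. The classification features combined three kinds of feature information for vegetation: spectral, textural, and multiple vegetation indices. A class hierarchy was established, and membership functions and decision-tree classification rules were applied at the different levels to complete the classification, which was then compared with a method combining only textural and spectral features. The results showed that the object-oriented classification method fusing texture, spectra, and multiple vegetation indices extracted the dominant tree species type information of the study area with an accuracy of 91.3%, which was 5.7% higher than the method using only texture and spectra.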
8.
Automated extraction of information on protein-protein interactions from the biological literature (total citations: 6, self-citations: 0, citations by others: 6)
MOTIVATION: To understand biological processes, we must clarify how proteins interact with each other. However, since information about protein-protein interactions still exists primarily in the scientific literature, it is not accessible in a computer-readable format. Efficient processing of large numbers of interactions therefore needs an intelligent information extraction method. Our aim is to develop an efficient method for extracting information on protein-protein interactions from scientific literature. RESULTS: We present a method for extracting information on protein-protein interactions from the scientific literature. This method, which employs only a protein name dictionary, surface clues on word patterns and simple part-of-speech rules, achieved high recall and precision rates for yeast (recall = 86.8% and precision = 94.3%) and Escherichia coli (recall = 82.5% and precision = 93.5%). The result of extraction suggests that our method should be applicable to any species for which a protein name dictionary is constructed. AVAILABILITY: The program is available on request from the authors.
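A dictionary-plus-surface-pattern approach of the kind this abstract describes, together with the recall and precision measures it reports, can be sketched as follows. This is a toy illustration and not the authors' program: the protein dictionary, the list of interaction words, and the single "NAME verb NAME" pattern are assumptions, and the part-of-speech rules mentioned in the abstract are left out.

```python
import re

# Toy protein name dictionary and interaction-word list (both assumptions).
PROTEIN_NAMES = {"Cdc28", "Cln2", "Sic1", "Far1"}
INTERACTION_WORDS = ("binds", "phosphorylates", "interacts with")

def extract_interactions(sentence):
    """Find 'NAME <interaction word> NAME' patterns using the dictionary."""
    names = "|".join(re.escape(n) for n in sorted(PROTEIN_NAMES))
    verbs = "|".join(re.escape(v) for v in INTERACTION_WORDS)
    pattern = re.compile(rf"\b({names})\b\s+(?:{verbs})\s+\b({names})\b")
    return [(m.group(1), m.group(2)) for m in pattern.finditer(sentence)]

def precision_recall(predicted, gold):
    """Precision = TP / predicted pairs; recall = TP / gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Illustrative sentence and hand-made gold standard (not from the paper).
found = extract_interactions("Cdc28 phosphorylates Sic1 in late G1.")
gold = [("Cdc28", "Sic1"), ("Cln2", "Cdc28")]
print(found, precision_recall(found, gold))
```

In this toy run the single extracted pair is correct, so precision is 1.0, while one of the two gold pairs is missed, so recall is 0.5.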
9.
Information is represented and processed in neural systems in various ways. Rate coding, population coding, and temporal coding are typical examples of representation. Which kinds of coding are used in real neural systems is a hot issue in neuroscience. Different regions of the brain may resort to different coding strategies. Moreover, recent studies suggest the possibility of dual or multiple codes, in which different modes of information are embedded in one neural system. The present paper reviews various possibilities of neural codes, focusing on dual codes.
10.
The names used by biologists to label the observations they make are imprecise. This is an issue as workers increasingly seek to exploit data gathered from multiple, unrelated sources online. Even when the international codes of nomenclature are followed strictly, the resulting names (Taxon Names) do not uniquely identify the taxa (Taxon Concepts) that have been described by taxonomists but merely groups of type specimens. A standard data model for exchange of taxonomic information is described. It addresses this issue by facilitating explicit communication of information about Taxon Concepts and their associated names. A representation of this model as an XML Schema is introduced and the implications of the use of Globally Unique Identifiers are discussed.
11.
Masahiro Tanaka 《Journal of theoretical biology》1980,85(4):789-806
An evolving population of two alleles which generates another new allele is investigated within the framework of information theory. A communication system model of self-reproducing organisms, which reproduce their genetic information by self-reproduction, is proposed, and the concept of distortion when the genetic information is reproduced is introduced. What genetic information should be transmitted to the next generation? This question may be answered by so-called rate distortion theory. The theory is applied to our model and the exact solution is obtained. Numerical results show that in the low-distortion region, the new allele cannot survive, hence no evolution occurs; in the intermediate-distortion region, all alleles coexist; and in the high-distortion region, the minor allele cannot survive. Moreover, diversity of the progeny takes its maximum value in the intermediate-distortion region. This suggests that there may exist an optimal distortion for a population to evolve.
12.
13.
The need for extracting general biological interactions of arbitrary types from the rapidly growing volume of the biomedical literature is drawing increased attention, while the need for this much diversity also requires both a robust treatment of complex linguistic phenomena and a method to consistently characterize the results. We present a biomedical information extraction system, BioIE, to address both of these needs by utilizing a full-fledged English grammar formalism, or a combinatory categorial grammar, and by annotating the results with the terms of Gene Ontology, which provides a common and controlled vocabulary. BioIE deals with complex linguistic phenomena such as coordination, relative structures, acronyms, appositive structures, and anaphoric expressions. In order to deal with real-world syntactic variations of ontological terms, BioIE utilizes the syntactic dependencies between words in sentences as well, based on the observation that the component words in an ontological term usually appear in a sentence with known patterns of syntactic dependencies.
14.
15.
Leitner F Krallinger M Rodriguez-Penagos C Hakenberg J Plake C Kuo CJ Hsu CN Tsai RT Hung HC Lau WW Johnson CA Saetre R Yoshida K Chen YH Kim S Shin SY Zhang BT Baumgartner WA Hunter L Haddow B Matthews M Wang X Ruch P Ehrler F Ozgür A Erkan G Radev DR Krauthammer M Luong T Hoffmann R Sander C Valencia A 《Genome biology》2008,9(Z2):S6
We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; http://bcms.bioinfo.cnio.es/). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations.
16.
17.
Novel high-throughput measurement techniques in vivo are beginning to produce dense high-quality time series which can be used to investigate the structure and regulation of biochemical networks. We propose an automated information extraction procedure which takes advantage of the unique S-system structure and supports model building from time traces, curve fitting, model selection, and structure identification based on parameter estimation. The procedure comprises three modules: model Generation, parameter estimation or model Fitting, and model Selection (GFS algorithm). The GFS algorithm has been implemented in MATLAB and returns a list of candidate S-systems which adequately explain the data and guides the search to the most plausible model for the time series under study. By combining two strategies (namely decoupling and limiting connectivity) with methods of data smoothing, the proposed algorithm is scalable up to realistic situations of moderate size. We illustrate the proposed methodology with a didactic example.
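For readers unfamiliar with the S-system form this abstract relies on, each variable follows dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij, one power-law production term minus one power-law degradation term. The sketch below only evaluates and integrates such a system; it is not the GFS algorithm, it is written in Python, not the MATLAB of the original implementation, and the one-variable parameter values are assumptions chosen so that the steady state is easy to check.

```python
def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side of an S-system:
    dX_i/dt = alpha_i * prod_j X_j**g[i][j] - beta_i * prod_j X_j**h[i][j]
    """
    rates = []
    for i in range(len(x)):
        production, degradation = alpha[i], beta[i]
        for j, xj in enumerate(x):
            production *= xj ** g[i][j]
            degradation *= xj ** h[i][j]
        rates.append(production - degradation)
    return rates

def euler(x, alpha, beta, g, h, dt=0.01, steps=2000):
    """Plain forward-Euler integration (a simplification for illustration)."""
    for _ in range(steps):
        dx = s_system_rhs(x, alpha, beta, g, h)
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x

# Hypothetical one-variable system: dX/dt = 2 - X, with steady state X = 2.
alpha, beta = [2.0], [1.0]
g, h = [[0.0]], [[1.0]]
final_state = euler([0.5], alpha, beta, g, h)
print(final_state)
```

Fitting the exponents g and h and the rate constants alpha and beta to measured time series is the parameter-estimation step that the abstract's procedure automates.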
18.
With the increased interest in understanding biological networks, such as protein-protein interaction networks and gene regulatory networks, methods for representing and communicating such networks in both human- and machine-readable form have become increasingly important. Although there has been significant progress in machine-readable representation of networks, as exemplified by the Systems Biology Mark-up Language (SBML) (http://www.sbml.org), issues in human-readable representation have been largely ignored. This article discusses human-readable diagrammatic representations and proposes a set of notations that enhances the formality and richness of the information represented. The process diagram is a fully state transition-based diagram that can be translated into machine-readable forms such as SBML in a straightforward way. It is supported by CellDesigner, a diagrammatic network editing software (http://www.celldesigner.org/), and has been used to represent a variety of networks of various sizes (from only a few components to several hundred components).
19.
Summary. A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419–424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into digital codes, and using some optimal space-time evolvement rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed through its cellular automata image. With biological sequences entering into databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their fingerprint. It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246–255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
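The general idea of this abstract, encoding a symbolic sequence as binary digits and evolving it with a cellular automaton rule so that the stacked generations form an image, can be sketched as below. The 2-bit nucleotide encoding, the periodic boundary, and the default of elementary rule 90 are assumptions for illustration; the abstract does not state which encoding or evolution rule the authors used.

```python
# Assumed 2-bit encoding of nucleotides (not taken from the abstract).
CODES = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq):
    """Turn a nucleotide string into a flat list of 0/1 cells."""
    return [int(bit) for base in seq for bit in CODES[base]]

def step(cells, rule):
    """One generation of an elementary cellular automaton (Wolfram rule
    numbering) with periodic boundary conditions."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        neighbourhood = (left << 2) | (centre << 1) | right
        new_cells.append((rule >> neighbourhood) & 1)
    return new_cells

def ca_image(seq, rule=90, steps=8):
    """Rows of the image: the encoded sequence, then each later generation."""
    rows = [encode(seq)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows

# Print the image as text, one generation per line.
for row in ca_image("ACGT", steps=4):
    print("".join("#" if cell else "." for cell in row))
```

Each row is one time step, so the resulting grid can be rendered directly as a black-and-white image whose pattern depends on the input sequence.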
20.
Khrennikov A 《Journal of theoretical biology》2004,231(4):597-613
We present for mental processes the program of mathematical mapping which has been successfully realized for physical processes. We emphasize that our project is not about mathematical simulation of the brain's functioning as a complex physical system, i.e., mapping of physical and chemical processes in the brain on mathematical spaces. The project is about mapping of purely mental processes on mathematical spaces. We present various arguments (philosophic, mathematical, information, and neurophysiological) in favor of the p-adic model of mental space. p-adic spaces have structures of hierarchic trees and in our model such a tree hierarchy is considered as an image of neuronal hierarchy. Hierarchic neural pathways are considered as fundamental units of information processing. As neural pathways can go through the whole body, the mental space is produced by the whole neural system. Finally, we develop the probabilistic neural pathway model in which mental states are represented by probability distributions on mental space.