Similar documents
 Found 20 similar documents (search time: 31 ms)
1.
FACTA is a text search engine for MEDLINE abstracts, which is designed particularly to help users browse biomedical concepts (e.g. genes/proteins, diseases, enzymes and chemical compounds) appearing in the documents retrieved by the query. The concepts are presented to the user in a tabular format and ranked based on the co-occurrence statistics. Unlike existing systems that provide similar functionality, FACTA pre-indexes not only the words but also the concepts mentioned in the documents, which enables the user to issue a flexible query (e.g. free keywords or Boolean combinations of keywords/concepts) and receive the results immediately even when the number of the documents that match the query is very large. The user can also view snippets from MEDLINE to get textual evidence of associations between the query terms and the concepts. The concept IDs and their names/synonyms for building the indexes were collected from several biomedical databases and thesauri, such as UniProt, BioThesaurus, UMLS, KEGG and DrugBank. AVAILABILITY: The system is available at http://www.nactem.ac.uk/software/facta/
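The ranking idea in this abstract can be illustrated with a minimal sketch. It assumes each document has already been indexed as a set of words plus a set of concept IDs, and it ranks concepts by raw co-occurrence count with the query. The function name `rank_concepts`, the sample concept IDs and the choice of raw counts are all assumptions made for illustration; the abstract does not say which co-occurrence statistic FACTA actually uses.

```python
from collections import Counter

def rank_concepts(documents, query_terms):
    """Rank concepts by how many query-matching documents mention them.

    documents: list of (words, concept_ids) pairs, both sets (hypothetical format).
    query_terms: keywords that must all appear in a document (Boolean AND).
    """
    query = {term.lower() for term in query_terms}
    counts = Counter()
    for words, concept_ids in documents:
        if query <= words:               # every query keyword occurs in the document
            counts.update(concept_ids)   # one co-occurrence per matching document
    return counts.most_common()

# Toy pre-indexed collection; the concept IDs are made up for the example.
docs = [
    ({"p53", "apoptosis"}, {"GENE:TP53", "DISEASE:cancer"}),
    ({"p53", "tumor"}, {"GENE:TP53"}),
    ({"insulin"}, {"GENE:INS"}),
]
ranked = rank_concepts(docs, ["p53"])
print(ranked)
```

Because the concepts are stored alongside the words at indexing time, a query only has to look up and count, which is consistent with the abstract's point about fast responses on large result sets.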

2.
Perez-Iratxeta C  Keer HS  Bork P  Andrade MA 《BioTechniques》2002,32(6):1380-2, 1384-5
The increase of information in biology makes it difficult for researchers in any field to keep current with the literature. The MEDLINE database of scientific abstracts can be quickly scanned using electronic mechanisms. Potentially interesting abstracts can be selected by matching words joined by Boolean operators. However, this means of selecting documents is not optimal. Nonspecific queries have to be effected, resulting in large numbers of irrelevant abstracts that have to be manually scanned. To facilitate this analysis, we have developed a system that compiles a summary of subjects and related documents on the results of a MEDLINE query. For this, we have applied a fuzzy binary relation formalism that deduces relations between words present in a set of abstracts preprocessed with a standard grammatical tagger. Those relations are used to derive ensembles of related words and their associated subsets of abstracts. The algorithm can be used publicly at http://www.bork.embl-heidelberg.de/xplormed/.

3.
Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system that is able to automatically assign gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts is described. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. A validation was done of the performance for all 20,546 human genes with LLIDs. Of these, 7344 produced good quality models (F-measure >0.7, nearly 60% of which were >0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques.

4.
MOTIVATION: The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This remains a challenging task due to the irregularities and ambiguities in gene and protein nomenclature. We propose to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation. RESULTS: We present a method for tagging gene and protein names in biomedical text using a combination of statistical and knowledge-based strategies. This method incorporates automatically generated rules from a transformation-based part-of-speech tagger, and manually generated rules from morphological clues, low frequency trigrams, indicator terms, suffixes and part-of-speech information. Results of an experiment on a test corpus of 56K MEDLINE documents demonstrate that our method to extract gene and protein names can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets. AVAILABILITY: The programs are available on request from the authors.

5.
R Lawson 《BioTechniques》1990,8(6):680-683
PaperChase is a computer program which provides an efficient interface to the National Library of Medicine's MEDLINE database of references to the biomedical literature. The database includes references (citations) and abstracts compiled from Index Medicus, the International Nursing Index and the Index to Dental Literature. PaperChase may be accessed using any computer terminal or personal computer with modem. No special knowledge of computers or biomedical terms is necessary. Simple menus enable the novice to search the biomedical literature without training. A command language speeds searching for the experienced user. PaperChase does not require the user to know the database's indexing terminology, called Medical Subject Headings. Everyday language may be used and PaperChase will translate, or "map", the user's search term into the required Medical Subject Heading. PaperChase monitors a search in progress and suggests additional Medical Subject Headings which can be used to broaden or narrow a search. The searcher can order a full-text photocopy of any reference found in PaperChase. Support documentation and a subscriber newsletter are provided at no charge. Trained search specialists are available to offer assistance and to answer questions.

6.
7.
As the pace of life science discovery increases, so do the demands on researchers. To remain competitive in the life science industry, researchers must use every tool at their disposal to keep up with new products, protocols, news, and literature in their field. While there are now myriad Web sites that assist researchers with this problem, many suffer from confusing user interfaces, poorly designed search engines, and a narrow information focus. Here, we present LabVelocity, a user-friendly Web site that provides a free multidisciplinary information-gathering service for the life science research community. Using LabVelocity, a researcher can quickly find the products, protocols, technical references, news, MEDLINE abstracts, and interactive software tools necessary for an experiment. This aggregation of information can streamline experimental planning and is especially useful when researchers want to set up a new laboratory or to venture outside their field of expertise.

8.
A web-based version of the RLIMS-P literature mining system was developed for online mining of protein phosphorylation information from MEDLINE abstracts. The online tool presents extracted phosphorylation objects (phosphorylated proteins, phosphorylation sites and protein kinases) in summary tables and full reports with evidence-tagged abstracts. The tool further allows mapping of phosphorylated proteins to protein entries in the UniProt Knowledgebase based on PubMed ID and/or protein name. The literature mining, coupled with database association, allows retrieval of rich biological information for the phosphorylated proteins and facilitates database annotation of phosphorylation features.

9.
ADAM: another database of abbreviations in MEDLINE   Total citations: 1 (self-citations: 0; citations by others: 1)
MOTIVATION: Abbreviations are an important type of terminology in the biomedical domain. Although several groups have already created databases of biomedical abbreviations, these are either not public, or are not comprehensive, or focus exclusively on acronym-type abbreviations. We have created another abbreviation database, ADAM, which covers commonly used abbreviations and their definitions (or long-forms) within MEDLINE titles and abstracts, including both acronym and non-acronym abbreviations. RESULTS: A model of recognizing abbreviations and their long-forms from titles and abstracts of MEDLINE (2006 baseline) was employed. After grouping morphological variants, 59 405 abbreviation/long-form pairs were identified. ADAM shows high precision (97.4%) and includes most of the frequently used abbreviations contained in the Unified Medical Language System (UMLS) Lexicon and the Stanford Abbreviation Database. Conversely, one-third of abbreviations in ADAM are novel insofar as they are not included in either database. About 19% of the novel abbreviations are non-acronym-type and these cover at least seven different types of short-form/long-form pairs. AVAILABILITY: A free, public query interface to ADAM is available at http://arrowsmith.psych.uic.edu, and the entire database can be downloaded as a text file.

10.
The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which utilizes the co-occurrence information of a large amount of unlabeled MEDLINE abstracts to enhance feature representation of gene named entities. In the stage of entity mapping, we combine the strategies of exact match and approximate match to establish linkage between gene names in the context and the EntrezGene database. For the gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for NER. The system is able to adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 82.5%, recall: 83.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state-of-the-art.
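The reported F-measure can be checked against the reported precision and recall. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which is the usual reading but is not stated explicitly in the abstract; the function name `f_measure` is introduced here for illustration.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the abstract, written as fractions.
f1 = f_measure(0.825, 0.835)
print(round(f1, 3))
```

Under that assumption the result is about 0.83, which agrees with the 83% quoted in the abstract.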

11.
MOTIVATION: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-textmining. RESULTS: GENIA corpus version 3.0 consisting of 2000 MEDLINE abstracts has been released with more than 400,000 words and almost 100,000 annotations for biological terms.

12.
GenBank.   Total citations: 4 (self-citations: 1; citations by others: 3)
The GenBank sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from authors and from large-scale sequencing projects. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive coverage. GenBank continues to focus on quality control and annotation while expanding data coverage and retrieval services. An integrated retrieval system, known as Entrez, incorporates data from the major DNA and protein sequence databases, along with genome maps and protein structure information. MEDLINE abstracts from published articles describing the sequences are also included as an additional source of biological annotation. Sequence similarity searching is offered through the BLAST family of programs. All of NCBI's services are offered through the World Wide Web. In addition, there are specialized server/client versions as well as FTP and e-mail server access.

13.
14.
BACKGROUND: Most clinicians read only the abstract of papers in scientific journals. Therefore, it is very important that abstracts contain as much information as possible, to summarize the data succinctly. Our objectives were to evaluate the quality of information in abstracts reporting human fetal outcomes following drug exposure during pregnancy. METHODS: We developed quality criteria based on previous work, modifying them for use with pregnancy outcomes. Quality scores were calculated as present/absent for all of the equally weighted criteria, then expressed as percentages (present/[present + absent]). We examined a random sample of 100 abstracts obtained through searches of MEDLINE, EMBASE, and the Web of Science databases from 1990 to 2005. Average quality scores were compared across designs (cohort, case-control, meta-analysis, and mixed design) using Kruskal-Wallis ANOVA and structured/unstructured formats using Student's t test. RESULTS: The overall average quality was 59.2% +/- 14% (median, 61.5%; range, 15.4-83.3%). Quality was not significantly different across designs (P = .16) or between structured and unstructured abstracts (P = .44). Quality scores increased over time (Rho = 0.23, P = .02). Most frequently absent were baseline risk (94%), drug dose (91%), nonsignificant P values (72%), confounders (69%), significant P values (57%), and risk difference (48%). CONCLUSIONS: Abstracts provide insufficient information, particularly baseline risk values, for readers to make evidence-based decisions regarding drug use during pregnancy. Efforts need to be made to improve the quality of abstracts and include critical information such as baseline risk.
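The scoring rule described in METHODS, present/(present + absent) over equally weighted criteria, can be written as a short sketch. The function name `quality_score` and the criterion labels are hypothetical, and the choice of 13 criteria in the example is an assumption made only to produce a concrete number; the abstract does not list the criteria.

```python
def quality_score(criteria: dict) -> float:
    """Percentage of equally weighted criteria marked present.

    criteria maps a criterion label to True (present) or False (absent).
    Criteria that do not apply to an abstract are simply left out of the dict.
    """
    if not criteria:
        raise ValueError("at least one applicable criterion is required")
    present = sum(1 for is_present in criteria.values() if is_present)
    absent = len(criteria) - present
    return 100.0 * present / (present + absent)

# Hypothetical abstract meeting 8 of 13 assumed criteria.
example = {f"criterion_{i}": i <= 8 for i in range(1, 14)}
score = quality_score(example)
print(round(score, 1))
```

In this toy case the score is 8/13, or about 61.5%. Leaving non-applicable criteria out of the denominator is one possible reading of the formula, not something the abstract confirms.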

15.
TXTGate: profiling gene groups with text-based information   Total citations: 1 (self-citations: 1; citations by others: 0)
We implemented a framework called TXTGate that combines literature indices of selected public biological resources in a flexible text-mining system designed towards the analysis of groups of genes. By means of tailored vocabularies, term- as well as gene-centric views are offered on selected textual fields and MEDLINE abstracts used in LocusLink and the Saccharomyces Genome Database. Subclustering and links to external resources allow for in-depth analysis of the resulting term profiles.

16.
By exploring time-series data from MEDLINE abstracts, we observe that only a few genes have been quoted with increasing frequency during the past 25 years. This is probably the result of selective pressure by the scientific community. Over the years, this selection has produced an extreme power law distribution of the information available for individual genes. Interestingly, those genes that are successfully selected are not necessarily the most important genes to the cell. To stress the implication of this finding we show that there is no correlation between a gene's impact in the scientific literature and its centrality in protein-interaction networks.

17.

Background

We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.

Methodology

We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches were comprised of five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
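Three of the building blocks named in this methodology (tf-idf cosine similarity, keeping the top-n similarities per document, and the Jensen-Shannon divergence) can be sketched in plain Python. This is an illustrative toy, not the study's pipeline: the tf-idf weighting variant, the function names and the three-document corpus are all assumptions, and the clustering step itself is not shown.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {t: count * math.log(n_docs / doc_freq[t]) for t, count in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_n_neighbours(vectors, n):
    """For each document, keep only its n most similar other documents."""
    result = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(reverse=True)
        result.append([(j, s) for s, j in sims[:n]])
    return result

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two {term: probability} dicts."""
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    def kl(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy corpus standing in for tokenized titles and abstracts.
corpus = [
    ["gene", "protein", "kinase"],
    ["gene", "protein", "binding"],
    ["cluster", "document", "similarity"],
]
neighbours = top_n_neighbours(tfidf_vectors(corpus), n=1)
print(neighbours[0])
print(js_divergence({"a": 1.0}, {"b": 1.0}))
```

With base-2 logarithms the divergence ranges from 0 for identical distributions to 1 for distributions with no shared terms, so lower values between a document and its cluster would indicate a more textually coherent cluster. A dense pairwise loop like this would not scale to two million records; it only shows what each quantity measures.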

Conclusions

PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.

18.
GenBank.   Total citations: 5 (self-citations: 2; citations by others: 3)
The GenBank sequence database continues to expand its data coverage, quality control, annotation content and retrieval services. GenBank is comprised of DNA sequences submitted directly by authors as well as sequences from the other major public databases. An integrated retrieval system, known as Entrez, contains data from GenBank and from the major protein sequence and structural databases, as well as related MEDLINE abstracts. Users may access GenBank over the Internet through the World Wide Web and through special client-server programs for text and sequence similarity searching. FTP, CD-ROM and e-mail servers are alternate means of access.

19.
Entries in biological databases are usually linked to scientific references. To generate those links and to keep them up-to-date, database maintainers have to continuously scan the scientific literature to select references that are relevant for each single database entry. The continuous growth of both the corpus of scientific literature and the size of biological databases makes this task very hard. We present a protocol intended to assist the updating of an existing set of literature (abstract) links from a single database entry with new references. It consists of taking the set of MEDLINE neighbour references of the existing linked abstracts and evaluating their relevance according to the existing set of abstracts. To test the applicability of the algorithm, we did a simple benchmark of the system using the references associated with the entries of a protein domain database. Human experts found the references that the algorithm scored highly were more relevant to the database entry than those scored lowly, suggesting that the algorithm was useful.

20.
Building an abbreviation dictionary using a term recognition approach   Total citations: 1 (self-citations: 0; citations by others: 1)
MOTIVATION: Acronyms result from a highly productive type of term variation and trigger the need for an acronym dictionary to establish associations between acronyms and their expanded forms. RESULTS: We propose a novel method for recognizing acronym definitions in a text collection. Assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form, our method identifies acronym definitions in a similar manner to the statistical term recognition task. Applied to the whole MEDLINE (7 811 582 abstracts), the implemented system extracted 886 755 acronym candidates and recognized 300 954 expanded forms in reasonable time. Our method outperformed base-line systems, achieving 99% precision and 82-95% recall on our evaluation corpus that roughly emulates the whole MEDLINE. AVAILABILITY AND SUPPLEMENTARY INFORMATION: The implementations and supplementary information are available at our web site: http://www.chokkan.org/research/acromine/
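The core assumption in this abstract, that a word sequence frequently preceding a parenthetical expression is a potential expanded form, can be sketched as a candidate-counting step. The regular expression, the function names and the five-word window are assumptions chosen for illustration; the scoring that the authors borrow from statistical term recognition is not reproduced here.

```python
import re
from collections import Counter

# A short alphanumeric token in parentheses, e.g. "(NLP)"; this pattern is an assumption.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

def candidate_pairs(text, max_words=5):
    """Yield (short_form, candidate_long_form) for each parenthetical in text.

    Every suffix of up to max_words words before the parenthesis is a candidate.
    """
    for match in PAREN.finditer(text):
        short_form = match.group(1)
        preceding = re.findall(r"[A-Za-z0-9-]+", text[:match.start()])
        for k in range(1, min(max_words, len(preceding)) + 1):
            yield short_form, " ".join(preceding[-k:]).lower()

def count_candidates(texts, max_words=5):
    """Count how often each candidate long form co-occurs with each short form."""
    counts = Counter()
    for text in texts:
        counts.update(candidate_pairs(text, max_words))
    return counts

texts = [
    "Natural language processing (NLP) methods are useful.",
    "We apply natural language processing (NLP) to abstracts.",
]
counts = count_candidates(texts)
print(counts[("NLP", "natural language processing")])
print(counts[("NLP", "apply natural language processing")])
```

Raw counts alone cannot separate "natural language processing" from its shorter suffix "processing", since both occur twice in this toy input. Choosing the right boundary is the part that needs a term-recognition style score.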
