Similar documents
 20 similar documents found (search time: 125 ms)
1.
Recent years have seen a huge increase in the amount of biomedical information that is available in electronic format. Consequently, for biomedical researchers wishing to relate their experimental results to relevant data lurking somewhere within this expanding universe of on-line information, the ability to access and navigate biomedical information sources in an efficient manner has become increasingly important. Natural language and text processing techniques can facilitate this task by making the information contained in textual resources such as MEDLINE more readily accessible and amenable to computational processing. Names of biological entities such as genes and proteins provide critical links between different biomedical information sources and researchers' experimental data. Therefore, automatic identification and classification of these terms in text is an essential capability of any natural language processing system aimed at managing the wealth of biomedical information that is available electronically. To support term recognition in the biomedical domain, we have developed Termino, a large-scale terminological resource for text processing applications, which has two main components: first, a database into which very large numbers of terms can be loaded from resources such as UMLS, and stored together with various kinds of relevant information; second, a finite state recognizer, for fast and efficient identification and mark-up of terms within text. Since many biomedical applications require this functionality, we have made Termino available to the community as a web service, which allows for its integration into larger applications as a remotely located component, accessed through a standardized interface over the web.
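As an editorial illustration of the recognition step: the sketch below implements greedy longest-match tagging against a term dictionary, the basic behaviour a finite-state term recognizer provides. It is not Termino's implementation, and the example terms are invented; Termino itself loads its terms from resources such as UMLS.

```python
# Minimal sketch of dictionary-based term recognition: greedy longest match
# over a term dictionary, indexed by first token for fast lookup.

def build_index(terms):
    """Group known terms (as token lists) by their first token."""
    index = {}
    for term in terms:
        tokens = term.lower().split()
        index.setdefault(tokens[0], []).append(tokens)
    return index

def recognize(text, index):
    """Return (start_token, end_token, term) spans via greedy longest match."""
    tokens = text.lower().split()
    spans, i = [], 0
    while i < len(tokens):
        best = None
        for cand in index.get(tokens[i], []):
            if tokens[i:i + len(cand)] == cand:
                if best is None or len(cand) > len(best):
                    best = cand
        if best:
            spans.append((i, i + len(best), " ".join(best)))
            i += len(best)
        else:
            i += 1
    return spans

terms = ["tumor necrosis factor", "p53", "insulin receptor"]
print(recognize("Binding of p53 to the insulin receptor promoter", build_index(terms)))
```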

2.
SUMMARY: A number of freely available text mining tools have been put together to extract highly reliable Drosophila gene interaction data from text. The system has been tested with The Interactive Fly, showing low recall (27-34%) but very high precision (93-97%). AVAILABILITY: The extracted data and a web interface for submission of texts to GIFT analysis are available at http://gift.cryst.bbk.ac.uk/gift. CONTACT: n.domedel_puig@cryst.bbk.ac.uk SUPPLEMENTARY INFORMATION: Additional documentation, such as the dictionaries and the reference sets, is available at the GIFT website.
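For readers unused to the metrics, the tiny example below shows how a high-precision/low-recall profile like GIFT's arises from raw counts. The counts are invented, chosen only to echo the reported ranges.

```python
# Precision/recall from raw extraction counts (invented numbers).
tp, fp, fn = 93, 5, 250      # true positives, false positives, false negatives
precision = tp / (tp + fp)   # fraction of extracted interactions that are correct
recall = tp / (tp + fn)      # fraction of all true interactions that were found
print(f"precision={precision:.2f}, recall={recall:.2f}")  # high precision, low recall
```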

3.
Rich information on point mutation studies is scattered across heterogeneous data sources. This paper presents an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques, as well as for their subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations that are populated as instances of object and data type properties. mSTRAPviz is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling is developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations, tasks which are well known to be tedious, time-consuming, complex, and error-prone. The ontology and visualization tool are available at http://datam.i2r.a-star.edu.sg/mstrap.
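A hypothetical fragment of the extraction step: point-mutation mentions are often written in the compact wNm form (e.g. A123T) or spelled out (Gly12Val), and a regular expression captures many of them. This is a toy sketch, not mSTRAP's NLP pipeline.

```python
import re

# Match one-letter (A123T) and three-letter (Gly12Val) point-mutation mentions.
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
MUT = re.compile(rf"\b(?:[{AA1}]\d+[{AA1}]|(?:{AA3})\d+(?:{AA3}))\b")

text = "The A123T and Gly12Val substitutions both reduced catalytic activity."
print(MUT.findall(text))   # ['A123T', 'Gly12Val']
```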

4.
With the growing interest in the field of proteomics, the amount of publicly available proteome resources has also increased dramatically. This means that there are many useful resources available for almost all aspects of a proteomics experiment. However, it remains vital to use the right resource, for the right purpose, at the right time. This review is therefore meant to aid the reader in obtaining an overview of the available resources and their application, thus providing the necessary background to choose the appropriate resources for the experiment at hand. Many of the resources are also taking advantage of so-called crowdsourcing to maximize the potential of the resource. What this means and how this can improve future experiments will also be discussed. The text roughly follows the steps involved in a proteomics experiment, starting with the planning of the experiment, via the processing of the data and the analysis of the results, to the community-wide sharing of the produced data.

5.
SUMMARY: New methods are presented for processing and visualizing mass spectrometry-based molecular profile data, implemented as part of the recently introduced MZmine software. They include new features and extensions such as support for the mzXML data format, the capability to batch-process large numbers of files, support for parallel processing, new methods for calculating peak areas using a post-alignment peak picking algorithm, and implementation of Sammon's mapping and curvilinear distance analysis for data visualization and exploratory analysis. AVAILABILITY: MZmine is available under the GNU General Public License from http://mzmine.sourceforge.net/.
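Of the methods listed, Sammon's mapping is easy to sketch: it embeds samples in 2-D by minimizing the Sammon stress between input-space and output-space pairwise distances. The toy implementation below uses a generic optimizer on random data; it is a sketch of the technique, not MZmine's implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.optimize import minimize

def sammon_stress(flat_y, d_high, n):
    """Sammon stress between fixed high-dim distances and 2-D embedding."""
    d_low = pdist(flat_y.reshape(n, 2)) + 1e-12
    return np.sum((d_high - d_low) ** 2 / d_high) / d_high.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))        # toy data: 30 samples, 8 features
d_high = pdist(X) + 1e-12           # pairwise distances in input space
y0 = rng.normal(size=30 * 2)        # random initial 2-D layout
res = minimize(sammon_stress, y0, args=(d_high, 30), method="L-BFGS-B")
Y = res.x.reshape(30, 2)            # 2-D coordinates ready for plotting
print("final stress:", res.fun)
```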

6.
With the establishment of high-throughput (HT) screening methods, there is an increasing need for automated analysis. Here we present RReportGenerator, a user-friendly portal for automatic routine analysis using the statistical platform R and Bioconductor. RReportGenerator is designed to analyze data using predefined analysis scenarios via a graphical user interface (GUI). A report in pdf format combining text, figures and tables is automatically generated, and results may be exported. To demonstrate suitable analysis tasks we provide direct web access to a collection of analysis scenarios for summarizing data from transfected cell arrays (TCA), segmentation of CGH data, and microarray quality control and normalization. AVAILABILITY: RReportGenerator, a user manual and a collection of analysis scenarios are available under a GNU public license at http://www-bio3d-igbmc.u-strasbg.fr/~wraff
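RReportGenerator itself runs R/Bioconductor scenarios; purely as an illustration of the predefined-scenario-to-PDF-report pattern it implements, here is a minimal Python analogue on invented data (the function name, file name and plot choices are all hypothetical).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def quality_scenario(data, out_pdf):
    """A 'predefined scenario': emit a figure page and a table page to one PDF."""
    with PdfPages(out_pdf) as pdf:
        fig, ax = plt.subplots()
        ax.boxplot(data)                       # one box per array (column)
        ax.set_title("Per-array intensity distributions")
        pdf.savefig(fig); plt.close(fig)

        fig, ax = plt.subplots()
        ax.axis("off")
        stats = [[f"array {i+1}", f"{col.mean():.2f}"] for i, col in enumerate(data.T)]
        ax.table(cellText=stats, colLabels=["sample", "mean"], loc="center")
        pdf.savefig(fig); plt.close(fig)

quality_scenario(np.random.default_rng(1).normal(size=(100, 4)), "report.pdf")
```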

7.
SUMMARY: Data processing, analysis and visualization (datPAV) is an exploratory tool that allows experimentalists to quickly assess the general characteristics of their data. This platform-independent software is designed as a generic tool to process and visualize data matrices. It explores the organization of the data, detects errors and supports basic statistical analyses. Processed data can be reused, and different step-by-step data processing/analysis workflows can be created to carry out detailed investigations. The visualization option provides publication-ready graphics. Applications of this tool are demonstrated at the web site for three cases: metabolomics, environmental and hydrodynamic data analysis. AVAILABILITY: datPAV is available free for academic use at http://www.sdwa.nus.edu.sg/datPAV/.

8.
ABSTRACT: BACKGROUND: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. RESULTS: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely in their ability to build high-performing models from these data. CONCLUSIONS: The finding that some systems were able to train high-performing models on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work remains before natural language processing systems can handle full-text journal articles well. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.

9.
DExH/D proteins are essential for all aspects of cellular RNA metabolism and processing, in the replication of many viruses and in DNA replication. DExH/D proteins are the subject of current biological, biochemical and biophysical research, which provides a continuous wealth of data. The DExH/D protein family database compiles this information and makes it available over the WWW (http://www.columbia.edu/~ej67/dbhome.htm). The database can be fully searched by text-based queries, facilitating fast access to specific information about this important class of enzymes.

10.
create is a Windows program for creating new data input files, and converting existing ones, for 52 genetic data analysis software programs. Programs are grouped into the areas of sibship reconstruction, parentage assignment, genetic data analysis, and specialized applications. create is able to read data from text, Microsoft Excel and Access sources and allows the user to specify columns containing individual and population identifiers, birth and death data, sex data, relationship information, and spatial location data. create's only constraints on source data are that one individual is contained in one row and that the genotypic data are contiguous. create is available for download at http://www.lsc.usgs.gov/CAFL/Ecology/Software.html.
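create's two data constraints are easy to picture in code. A minimal pandas sketch, assuming an invented column layout: one individual per row, genotype columns contiguous.

```python
import io
import pandas as pd

# Invented table: one individual per row, contiguous genotype columns.
raw = io.StringIO(
    "id,pop,sex,loc1_a,loc1_b,loc2_a,loc2_b\n"
    "ind1,north,F,120,124,88,90\n"
    "ind2,south,M,118,124,88,88\n"
)
df = pd.read_csv(raw)
meta = df[["id", "pop", "sex"]]
genotypes = df.loc[:, "loc1_a":"loc2_b"]   # the contiguous genotype block
# e.g. convert locus 1 to GENEPOP-style concatenated allele pairs:
pairs = [f"{a}{b}" for a, b in zip(genotypes["loc1_a"], genotypes["loc1_b"])]
print(pairs)   # ['120124', '118124']
```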

11.
12.
13.
Text-mining systems are indispensable tools for reducing the increasing flux of information in the scientific literature to the topics pertinent to a particular interest. Most of the scientific literature is published as unstructured free text, complicating the development of data processing tools, which rely on structured information. To overcome the problems of free-text analysis, structured, hand-curated information derived from the literature is integrated into text-mining systems to improve precision and recall. In this paper several text-mining approaches are reviewed, and the next step in the development of text-mining systems, based on the concept of multiple lines of evidence, is described: results from literature analysis are combined with evidence from experiments and genome analysis to improve the accuracy of results and to generate additional knowledge beyond what is known from the literature alone.
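A minimal sketch of the multiple-lines-of-evidence idea, assuming independent sources combined as summed log likelihood ratios (a naive-Bayes-style combination; all numbers and source names invented):

```python
import math

# Evidence for a candidate gene-gene interaction from independent sources,
# each expressed as a likelihood ratio P(obs | true) / P(obs | false).
evidence = {
    "literature co-mention": 4.0,
    "co-expression": 2.5,
    "conserved gene neighbourhood": 3.0,
}
prior_odds = 0.01 / 0.99   # assumed prior probability of interaction: 1%
log_odds = math.log(prior_odds) + sum(math.log(lr) for lr in evidence.values())
posterior = 1 / (1 + math.exp(-log_odds))
print(f"posterior probability of interaction: {posterior:.2f}")
```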

14.
15.
Ever-evolving next-generation sequencing (NGS) technology calls for new and innovative ways of processing and visualizing data. Following a detailed survey of the current needs of researchers and service providers, the authors have developed GenoViewer: a highly user-friendly, easy-to-operate SAM/BAM viewer and aligner tool. GenoViewer enables fast and efficient NGS assembly browsing, analysis and read mapping. It is highly customizable, making it suitable for a wide range of NGS-related tasks. Due to its relatively simple architecture, it is easy to add specialised visualization functionality, facilitating further customised data analysis. The software's source code is freely available and open for project- and task-specific modifications. AVAILABILITY: GenoViewer is available free of charge at http://www.genoviewer.com/
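The same SAM/BAM data GenoViewer browses interactively can also be read in script form. Below is a short sketch using pysam, a third-party library unrelated to GenoViewer; the file name and region are placeholders, and the BAM is assumed to be coordinate-sorted and indexed.

```python
import pysam

# Iterate over reads overlapping a region of an indexed BAM file.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam.fetch("chr1", 10_000, 10_100):
        if read.is_unmapped:
            continue
        print(read.query_name, read.reference_start, read.cigarstring)
```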

16.
Tandem mass spectrometry-based proteomics experiments produce large amounts of raw data, and different database search engines are needed to reliably identify all the proteins from this data. Here, we present Compid, an easy-to-use software tool that can be used to integrate and compare protein identification results from two search engines, Mascot and Paragon. Additionally, Compid enables extraction of information from large Mascot result files that cannot be opened via the Web interface, and calculation of general statistical information about peptide and protein identifications in a data set. To demonstrate the usefulness of this tool, we used Compid to compare Mascot and Paragon database search results for a mitochondrial proteome sample of human keratinocytes. The reports generated by Compid can be exported and opened as Excel documents or as text files using configurable delimiters, allowing the analysis and further processing of Compid output with a multitude of programs. Compid is freely available and can be downloaded from http://users.utu.fi/lanatr/compid. It is released under an open source license (GPL), enabling modification of the source code. Its modular architecture allows for the creation of supplementary software components, e.g. to enable support for additional input formats and report categories.
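At its core, comparing two engines' identifications reduces to set operations over protein accessions, written out with a configurable delimiter as in Compid's reports. A sketch with invented accessions:

```python
import csv

# Invented identification results from two search engines.
mascot = {"P04637", "P38398", "Q9Y6K9"}
paragon = {"P04637", "Q9Y6K9", "O15111"}

rows = [("both", acc) for acc in sorted(mascot & paragon)] + \
       [("mascot_only", acc) for acc in sorted(mascot - paragon)] + \
       [("paragon_only", acc) for acc in sorted(paragon - mascot)]

# Configurable delimiter: tab here, but any csv.writer delimiter works.
with open("comparison.txt", "w", newline="") as fh:
    csv.writer(fh, delimiter="\t").writerows([("category", "accession")] + rows)
```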

17.
Schlamp K, Weinmann A, Krupp M, Maass T, Galle P, Teufel A. Gene, 2008, 427(1-2): 47-50.
With the availability of high-throughput gene expression analysis, multiple public expression databases emerged, mostly based on microarray expression data. Although these databases are of significant biomedical value, they have significant drawbacks, especially concerning the reliability of single gene expression profiles obtained from microarray data. At the same time, reliable data on an individual gene's expression are often published as single northern blots in individual publications, and these data were not previously available for high-throughput screening. To reduce the gap between high-throughput expression data and individual, highly reliable expression data, we designed BlotBase, a freely and easily accessible database currently containing approximately 700 published northern blots of human or mouse origin (http://www.medicalgenomics.org/Databases/BlotBase). As the database is open for public data submission, we expect it to quickly become a large expression profiling resource, eventually providing higher reliability in high-throughput gene expression analysis. To build BlotBase, PubMed was searched manually and with computer-based text mining methods to obtain publications containing northern blot results. Subsequently, northern blots were extracted and expression values for different tissues calculated using ImageJ. All data were made available through a user-friendly web front end. The data may be searched either by full-text search or by browsing the list of available northern blots for a specific tissue. Northern blot expression profiles are displayed as three expression states as well as a bar chart, allowing for automated evaluation. Furthermore, we integrated additional features, e.g. instant access to the corresponding RNA sequence or primer design tools, making further expression analysis more convenient. Finally, through a semiautomatic submission system this database was opened to the bioinformatics community.
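BlotBase's three-state display can be pictured with a small sketch: bin band intensities (invented here, as ImageJ would measure them) into low/medium/high relative to the strongest band. The thresholds are assumptions for illustration, not BlotBase's published ones.

```python
# Invented ImageJ-style band intensities per tissue.
intensities = {"liver": 12.0, "kidney": 56.0, "brain": 91.0}
top = max(intensities.values())

def state(value, top):
    """Map an intensity to one of three expression states (assumed cut-offs)."""
    frac = value / top
    return "high" if frac > 0.66 else "medium" if frac > 0.33 else "low"

for tissue, value in intensities.items():
    print(tissue, state(value, top))
```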

18.
MOTIVATION: The huge growth in gene expression data calls for the implementation of automatic tools for data processing and interpretation. RESULTS: We present a new and comprehensive machine learning data mining framework consisting of a non-linear PCA neural network for feature extraction and probabilistic principal surfaces combined with an agglomerative approach based on negentropy, aimed at clustering gene microarray data. The method, which provides a user-friendly visualization interface, can work on noisy data with missing points and represents an automatic procedure for determining, with no a priori assumptions, the number of clusters present in the data. Application to a cell-cycle dataset and a detailed analysis confirm the biological nature of the most significant clusters. AVAILABILITY: The software described here is a subpackage of the ASTRONEURAL package and is available upon request from the corresponding author. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
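Negentropy is rarely computed exactly; a common approximation (borrowed from ICA practice and assumed here purely for illustration, not taken from this paper) is J(y) ~ (E[G(y)] - E[G(v)])^2 with G(u) = log cosh(u) and v standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
G = lambda u: np.log(np.cosh(u))
gauss_ref = G(rng.normal(size=100_000)).mean()   # E[G(v)] for v ~ N(0, 1)

def negentropy(y):
    """Approximate negentropy of a 1-D sample (standardized first)."""
    y = (y - y.mean()) / y.std()
    return (G(y).mean() - gauss_ref) ** 2

print(negentropy(rng.normal(size=10_000)))    # ~0 for Gaussian data
print(negentropy(rng.laplace(size=10_000)))   # > 0 for non-Gaussian data
```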

19.
SUMMARY: Large volumes of microarray data are generated and deposited in public databases, mostly as tab-delimited text files or Excel spreadsheets. Combining data from several of these files to reanalyze the data sets is time-consuming. Microarray Data Assembler is specifically designed to simplify this task. The program can list files and data sources, convert selected text files into Excel files and assemble data across multiple Excel worksheets and workbooks. The program thus makes data assembly easy, saves time and helps avoid manual errors. AVAILABILITY: The program is freely available for non-profit use, via email request from the author, after signing a Material Transfer Agreement with Johns Hopkins University.
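The assembly task maps naturally onto a few lines of pandas; the sketch below (file names are placeholders, and the Assembler itself is a standalone program) converts tab-delimited files into sheets of one Excel workbook.

```python
import pandas as pd

# Convert each tab-delimited file into one sheet of a single workbook.
files = ["array_batch1.txt", "array_batch2.txt"]   # placeholder paths
with pd.ExcelWriter("assembled.xlsx") as writer:
    for path in files:
        df = pd.read_csv(path, sep="\t")
        # Excel sheet names are limited to 31 characters.
        df.to_excel(writer, sheet_name=path.rsplit(".", 1)[0][:31], index=False)
```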

20.
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolution in vocabulary, terminology, language structure and style compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible, semantically oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
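The synonym/variant functionality can be pictured as query expansion against a variant table. The table below is invented for illustration; the article's real resources are derived from corpus evidence across periods.

```python
# Invented mapping from a modern term to period variant forms.
VARIANTS = {
    "typhoid": ["typhoid", "typhoid fever", "enteric fever"],
    "tuberculosis": ["tuberculosis", "phthisis", "consumption"],
}

def expand(query):
    """Expand each query token to its known historical variants."""
    terms = []
    for token in query.lower().split():
        terms.extend(VARIANTS.get(token, [token]))
    return terms

print(expand("tuberculosis treatment"))
# ['tuberculosis', 'phthisis', 'consumption', 'treatment']
```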
