首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest.  相似文献   

2.
MOTIVATION: In the post-genomic era, biologists interested in systems biology often need to import data from public databases and construct their own system-specific or subject-oriented databases to support their complex analysis and knowledge discovery. To facilitate the analysis and data processing, customized and centralized databases are often created by extracting and integrating heterogeneous data retrieved from public databases. A generalized methodology for accessing, extracting, transforming and integrating the heterogeneous data is needed. RESULTS: This paper presents a new data integration approach named JXP4BIGI (Java XML Page for Biological Information Gathering and Integration). The approach provides a system-independent framework, which generalizes and streamlines the steps of accessing, extracting, transforming and integrating the data retrieved from heterogeneous data sources to build a customized data warehouse. It allows the data integrator of a biological database to define the desired bio-entities in XML templates (or Java XML pages), and use embedded extended SQL statements to extract structured, semi-structured and unstructured data from public databases. By running the templates in the JXP4BIGI framework and using a number of generalized wrappers, the required data from public databases can be efficiently extracted and integrated to construct the bio-entities in the XML format without having to hard-code the extraction logics for different data sources. The constructed XML bio-entities can then be imported into either a relational database system or a native XML database system to build a biological data warehouse. AVAILABILITY: JXP4BIGI has been integrated and tested in conjunction with the IKBAR system (http://www.ikbar.org/) in two integration efforts to collect and integrate data for about 200 human genes related to cell death from HUGO, Ensembl, and SWISS-PROT (Bairoch and Apweiler, 2000), and about 700 Drosophila genes from FlyBase (FlyBase Consortium, 2002). The integrated data has been used in comparative genomic analysis of x-ray induced cell death. Also, as explained later, JXP4BIGI is a middleware and framework to be integrated with biological database applications, and cannot run as a stand-alone software for end users. For demonstration purposes, a demonstration version is accessible at (http://www.ikbar.org/jxp4bigi/demo.html).  相似文献   

3.
XML, bioinformatics and data integration   总被引:15,自引:0,他引:15  
Motivation: The eXtensible Markup Language (XML) is an emerging standard for structuring documents, notably for the World Wide Web. In this paper, the authors present XML and examine its use as a data language for bioinformatics. In particular, XML is compared to other languages, and some of the potential uses of XML in bioinformatics applications are presented. The authors propose to adopt XML for data interchange between databases and other sources of data. Finally the discussion is illustrated by a test case of a pedigree data model in XML. Contact: Emmanuel.Barillot@infobiogen.fr  相似文献   

4.
Protein identification using MS is an important technique in proteomics as well as a major generator of proteomics data. We have designed the protein identification data object model (PDOM) and developed a parser based on this model to facilitate the analysis and storage of these data. The parser works with HTML or XML files saved or exported from MASCOT MS/MS ions search in peptide summary report or MASCOT PMF search in protein summary report. The program creates PDOM objects, eliminates redundancy in the input file, and has the capability to output any PDOM object to a relational database. This program facilitates additional analysis of MASCOT search results and aids the storage of protein identification information. The implementation is extensible and can serve as a template to develop parsers for other search engines. The parser can be used as a stand-alone application or can be driven by other Java programs. It is currently being used as the front end for a system that loads HTML and XML result files of MASCOT searches into a relational database. The source code is freely available at http://www.ccbm.jhu.edu and the program uses only free and open-source Java libraries.  相似文献   

5.
Our team developed a metadata editing and management system employing state of the art XML technologies initially aimed at the environmental sciences but with the potential to be useful across multiple domains. We chose a modular and distributed design for scalability, flexibility, options for customizations, and the possibility to add more functionality at a later stage. The system consists of a desktop design tool that generates code for the actual online editor, a native XML database, and an online user access management application. A Java Swing application that reads an XML schema, the design tool provides the designer with options to combine input fields into online forms with user-friendly tags and determine the flow of input forms. Based on design decisions, the tool generates XForm code for the online metadata editor which is based on the Orbeon XForms engine. The design tool fulfills two requirements: First data entry forms based on a schema are customized at design time and second the tool can generate data entry applications for any valid XML schema without relying on custom information in the schema. A configuration file in the design tool saves custom information generated at design time. Future developments will add functionality to the design tool to integrate help text, tool tips, project specific keyword lists, and thesaurus services.Cascading style sheets customize the look-and-feel of the finished editor. The editor produces XML files in compliance with the original schema, however, a user may save the input into a native XML database at any time independent of validity. The system uses the open source XML database eXist for storage and uses a MySQL relational database and a simple Java Server Faces user interface for file and access management. We chose three levels to distribute administrative responsibilities and handle the common situation of an information manager entering the bulk of the metadata but leave specifics to the actual data provider.  相似文献   

6.
Dendrochronological data formats in general offer limited space for recording associated metadata. Such information is often recorded separately from the actual time series, and often only on paper. TRiDaBASE has been developed to improve metadata administration. It is a relational Microsoft Access database that allows users to register digital metadata according to TRiDaS, to generate TRiDaS XML for uploading to TRiDaS-based analytical systems and repositories, and to ingest TRiDaS XML created elsewhere for local querying and analyses.  相似文献   

7.
Whereas genomic data are universally machine-readable, data from imaging, multiplex biochemistry, flow cytometry and other cell- and tissue-based assays usually reside in loosely organized files of poorly documented provenance. This arises because the relational databases used in genomic research are difficult to adapt to rapidly evolving experimental designs, data formats and analytic algorithms. Here we describe an adaptive approach to managing experimental data based on semantically typed data hypercubes (SDCubes) that combine hierarchical data format 5 (HDF5) and extensible markup language (XML) file types. We demonstrate the application of SDCube-based storage using ImageRail, a software package for high-throughput microscopy. Experimental design and its day-to-day evolution, not rigid standards, determine how ImageRail data are organized in SDCubes. We applied ImageRail to collect and analyze drug dose-response landscapes in human cell lines at single-cell resolution.  相似文献   

8.
MOTIVATION: The National Cancer Institute's Center for Bioinformatics (NCICB) has developed a Java based data management and information system called caCORE. One component of this software suite is the object oriented API (caBIO) used to access the rich biological datasets collected at the NCI. This API can access the data using native Java classes, SOAP requests or HTTP calls. Non-Java based clients wanting to use this API have to use the SOAP or HTTP interfaces with the data being returned from the NCI servers as an XML data stream. Although the XML can be read and manipulated using DOM or SAX parsers, one loses the convenience and usability of an object oriented programming paradigm. caBIONet is a set of .NET wrapper classes (managers, genes, chromosomes, sequences, etc.) capable of serializing the XML data stream into local .NET objects. The software is able to search NCICB databases and provide local objects representing the data that can be manipulated and used by other .NET programs. The software was written in C# and compiled as a .NET DLL.  相似文献   

9.
10.
The creation of cell models from annotated genome information, as well as additional data from other databases, requires both a format and medium for its distribution. Standards are described for the representation of the data in the form of Document Type Definitions (DTDs) for XML files. Separate DTDs are detailed for genetic, metabolic and gene product-interaction networks, which can be used to hold information on individual subsystems, or which may be combined to create a whole cell DTD. In the execution of this work, a fifth DTD was also created for a metabolite thesaurus, which allows incorporation of metabolite synonyms and generic nomenclature data into the models. A gene-regulation classification scheme was also created, to facilitate incorporation of gene regulatory information in an efficient manner. The work is described with particular reference to the metabolic network of Escherichia coli, which contains 808 individual enzymes. The assignment of confidence levels to these data, through the use of Gene Ontology evidence codes, is highlighted. In silico investigations may now be performed using the mathematical simulation workbench, DBsolve, which incorporates the facility to introduce data directly from XML.  相似文献   

11.
MOTIVATION: The information model chosen to store biological data affects the types of queries possible, database performance, and difficulty in updating that information model. Genetic sequence data for pharmacogenetics studies can be complex, and the best information model to use may change over time. As experimental and analytical methods change, and as biological knowledge advances, the data storage requirements and types of queries needed may also change. RESULTS: We developed a model for genetic sequence and polymorphism data, and used XML Schema to specify the elements and attributes required for this model. We implemented this model as an ontology in a frame-based representation and as a relational model in a database system. We collected genetic data from two pharmacogenetics resequencing studies, and formulated queries useful for analysing these data. We compared the ontology and relational models in terms of query complexity, performance, and difficulty in changing the information model. Our results demonstrate benefits of evolving the schema for storing pharmacogenetics data: ontologies perform well in early design stages as the information model changes rapidly and simplify query formulation, while relational models offer improved query speed once the information model and types of queries needed stabilize.  相似文献   

12.
13.
An initiative to increase biopharmaceutical research productivity by capturing, sharing and computationally integrating proprietary scientific discoveries with public knowledge is described. This initiative involves both organisational process change and multiple interoperating software systems. The software components rely on mutually supporting integration techniques. These include a richly structured ontology, statistical analysis of experimental data against stored conclusions, natural language processing of public literature, secure document repositories with lightweight metadata, web services integration, enterprise web portals and relational databases. This approach has already begun to increase scientific productivity in our enterprise by creating an organisational memory (OM) of internal research findings, accessible on the web. Through bringing together these components it has also been possible to construct a very large and expanding repository of biological pathway information linked to this repository of findings which is extremely useful in analysis of DNA microarray data. This repository, in turn, enables our research paradigm to be shifted towards more comprehensive systems-based understandings of drug action.  相似文献   

14.
15.
MOTIVATION: As more genomic data becomes available there is increased attention on understanding the mechanisms encoded in the genome. New XML dialects like CellML and Systems Biology Markup Language (SBML) are being developed to describe biological networks of all types. In the absence of detailed kinetic information for these networks, stoichiometric data is an especially valuable source of information. Network databases are the next logical step beyond storing purely genomic information. Just as comparison of entries in genomic databases has been a vital algorithmic problem through the course of the sequencing project, comparison of networks in network databases will be a crucial problem as we seek to integrate higher-order network knowledge. RESULTS: We show that comparing the stoichiometric structure of two reactions systems is equivalent to the graph isomorphism problem. This is encouraging because graph isomorphism is, in practice, a tractable problem using heuristics. The analogous problem of searching for a subsystem of a reaction system is NP-complete. We also discuss heuristic issues in implementations for practical comparison of stoichiometric matrices.  相似文献   

16.
17.
MOTIVATION: Information about a particular protein or protein family is usually distributed among multiple databases and often in more than one entry in each database. Retrieval and organization of this information can be a laborious task. This task is complicated even further by the existence of alternative terms for the same concept. RESULTS: The PDB, SWISS-PROT, ENZYME, and CATH databases have been imported into a combined relational database, BIOMOLQUEST: A powerful search engine has been built using this database as a back end. The search engine achieves significant improvements in query performance by automatically utilizing cross-references between the legacy databases. The results of the queries are presented in an organized, hierarchical way.  相似文献   

18.
A relational database of transcription factors   总被引:29,自引:4,他引:25       下载免费PDF全文
  相似文献   

19.
We have created databases and software applications for the analysis of DNA mutations in the human p53 gene, the human hprt gene and the rodent transgenic lacZ locus. The databases themselves are stand-alone dBase files and the software for analysis of the databases runs on IBM- compatible computers. The software created for these databases permits filtering, ordering, report generation and display of information in the database. In addition, a significant number of routines have been developed for the analysis of single base substitutions. One method of obtaining the databases and software is via the World Wide Web (WWW). Open home page http://sunsite.unc.edu/dnam/mainpage.ht ml with a WWW browser. Alternatively, the databases and programs are available via public ftp from anonymous@sunsite.unc.edu. There is no password required to enter the system. The databases and software are found in subdirectory pub/academic/biology/dna-mutations. Two other programs are available at the WWW site, a program for comparison of mutational spectra and a program for entry of mutational data into a relational database.  相似文献   

20.
The Gene Expression Database (GXD) is a community resource that stores and integrates expression information for the laboratory mouse, with a particular emphasis on mouse development, and makes these data freely available in formats appropriate for comprehensive analysis. GXD is implemented as a relational database and integrated with the Mouse Genome Database (MGD) to enable global analysis of genotype, expression and phenotype information. Interconnections with sequence databases and with databases from other species further extend GXD's utility for the analysis of gene expression data. GXD is available through the Mouse Genome Informatics Web Site at http://www.informatics.jax.org/  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号