Similar Literature
20 similar documents found.
1.
A major challenge facing biodiversity informatics is integrating data stored in widely distributed databases. Initial efforts have relied on taxonomic names as the shared identifier linking records in different databases. However, taxonomic names have limitations as identifiers, being neither stable nor globally unique, and the pace of molecular taxonomic and phylogenetic research means that much of the information in public sequence databases is not linked to formal taxonomic names. This review explores the use of other identifiers, such as specimen codes and GenBank accession numbers, to link otherwise disconnected facts in different databases. The structure of these links can also be exploited using the PageRank algorithm to rank the results of searches on biodiversity databases. The key to rich integration is a commitment to deploy and reuse globally unique, shared identifiers [such as Digital Object Identifiers (DOIs) and Life Science Identifiers (LSIDs)], and the implementation of services that link those identifiers.
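The link-ranking idea carries over directly from web search: treat identifiers (DOIs, LSIDs, GenBank accessions, specimen codes) as nodes and the cross-database links between them as edges, then rank nodes by PageRank. The following is a minimal Python sketch, assuming the links have already been harvested into an adjacency list; the identifiers are invented for illustration.

    # Power-iteration PageRank over a graph of linked identifiers.
    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each node to the list of nodes it links to."""
        nodes = set(links) | {t for targets in links.values() for t in targets}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for source, targets in links.items():
                if targets:  # distribute this node's rank over its out-links
                    share = damping * rank[source] / len(targets)
                    for target in targets:
                        new_rank[target] += share
                else:  # dangling node: spread its rank uniformly
                    for n in nodes:
                        new_rank[n] += damping * rank[source] / len(nodes)
            rank = new_rank
        return rank

    # Hypothetical link structure: a paper (DOI) cites a sequence record
    # (GenBank accession), which points to its voucher specimen.
    links = {
        "doi:10.1000/example": ["gb:AY123456"],
        "gb:AY123456": ["specimen:AMNH-12345"],
        "specimen:AMNH-12345": [],
    }
    print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))

Heavily linked records (here, the specimen that the other records ultimately point to) accumulate rank, which is what lets link structure reorder search results.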

2.

Background

The ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Despite ongoing work toward shared data formats and linked identifiers, significant obstacles to semantic data integration persist, particularly in establishing shared identity and shared meaning across heterogeneous biomedical data sources.

Results

We present five processes for semantic data integration that, when applied collectively, solve seven key problems. These processes include making explicit the differences between biomedical concepts and database records, aggregating sets of identifiers denoting the same biomedical concepts across data sources, and using declaratively represented forward-chaining rules to take information that is variably represented in source databases and integrate it into a consistent biomedical representation. We demonstrate these processes and solutions by presenting KaBOB (the Knowledge Base Of Biomedicine), a knowledge base of semantically integrated data from 18 prominent biomedical databases, using common representations grounded in Open Biomedical Ontologies. An instance of KaBOB with data about humans and seven major model organisms can be built using on the order of 500 million RDF triples. All source code for building KaBOB is available under an open-source license.
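As a rough illustration of the forward-chaining step only (not KaBOB's actual rule language, schema, or ontologies), the sketch below uses rdflib to materialize one consistent predicate from two source-specific ones; the namespace, predicates, and rule are assumptions.

    # Toy forward-chaining over RDF: normalize variably represented facts.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()

    # Two source databases state the same kind of fact differently.
    g.add((EX.gene1, EX.sourceA_encodes, EX.proteinX))
    g.add((EX.gene2, EX.sourceB_codes_for, EX.proteinY))

    # Declaratively represented rules: (?s antecedent ?o) => (?s consequent ?o).
    rules = [
        (EX.sourceA_encodes, EX.encodes),
        (EX.sourceB_codes_for, EX.encodes),
    ]

    # Forward-chain until no rule adds a new triple.
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            for s, _, o in list(g.triples((None, antecedent, None))):
                if (s, consequent, o) not in g:
                    g.add((s, consequent, o))
                    changed = True

    # Downstream queries can now use the single consistent predicate.
    for s, _, o in g.triples((None, EX.encodes, None)):
        print(s, "encodes", o)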

Conclusions

KaBOB is an integrated knowledge base of biomedical data whose representations are grounded in prominent, actively maintained Open Biomedical Ontologies, thus enabling queries of the underlying data in terms of biomedical concepts (e.g., genes and gene products, interactions and processes) rather than features of source-specific data schemas or file formats. KaBOB resolves many of the issues that routinely plague biomedical researchers intending to work with data from multiple sources, and it provides a platform for ongoing data integration and development and for formal reasoning over a wealth of integrated biomedical data.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0559-3) contains supplementary material, which is available to authorized users.

3.
biomaRt is a new Bioconductor package that integrates BioMart data resources with data analysis software in Bioconductor. It can annotate a wide range of gene or gene product identifiers (e.g., Entrez Gene and Affymetrix probe identifiers) with information such as gene symbol, chromosomal coordinates, Gene Ontology, and OMIM annotation. Furthermore, biomaRt enables retrieval of genomic sequences and single nucleotide polymorphism information, which can be used in data analysis. Fast and up-to-date data retrieval is possible because the package executes direct SQL queries to the BioMart databases (e.g., Ensembl). The biomaRt package provides tight integration of large public or locally installed BioMart databases with data analysis in Bioconductor, creating a powerful environment for biological data mining.

4.
RT Schuh. ZooKeys, 2012, (209): 255-267.
Arguments are presented for the merit of integrating specimen databases into the practice of revisionary systematics. Work flows, data connections, data outputs, and data standardization are enumerated as critical aspects of such integration. Background information is provided on the use of "barcodes" as unique specimen identifiers and on methods for efficient data capture. Examples are provided on how to achieve efficient workflows and data standardization, as well as data outputs and data integration.

5.
The pilot phase of the HUPO Plasma Proteome Project (PPP) is an international collaboration to catalog the protein composition of human blood plasma and serum by analyzing standardized aliquots of reference serum and plasma specimens using a variety of experimental techniques. Data management for this project included collection, integration, analysis, and dissemination of findings from participating organizations worldwide. Accomplishing this task required a communication and coordination infrastructure specific enough to support meaningful integration of results from all participants, but flexible enough to react to changing requirements and new insights gained during the course of the project and to allow participants with varying informatics capabilities to contribute. Challenges included integrating heterogeneous data, reducing redundant information to minimal identification sets, and data annotation. Our data integration workflow assembles a minimal and representative set of protein identifications, which accounts for the contributed data. It accommodates incomplete concordance of results from different laboratories, ambiguity and redundancy in contributed identifications, and redundancy in the protein sequence databases. Recommendations of the PPP for future large-scale proteomics endeavors are described.
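One standard way to assemble such a minimal, representative identification set is a greedy set cover over peptide-to-protein evidence: repeatedly keep the protein that explains the most still-unexplained peptides. The sketch below shows that general parsimony idea, not the PPP's actual workflow; the accessions and peptides are invented.

    # Greedy set cover: reduce redundant protein identifications to a
    # minimal set that still accounts for all observed peptides.
    def minimal_protein_set(protein_to_peptides):
        uncovered = set().union(*protein_to_peptides.values())
        chosen = []
        while uncovered:
            best = max(protein_to_peptides,
                       key=lambda p: len(protein_to_peptides[p] & uncovered))
            gained = protein_to_peptides[best] & uncovered
            if not gained:
                break  # no protein explains the remaining peptides
            chosen.append(best)
            uncovered -= gained
        return chosen

    evidence = {
        "P001": {"PEPTIDEA", "PEPTIDEB"},
        "P002": {"PEPTIDEB"},          # redundant with P001
        "P003": {"PEPTIDEC"},
    }
    print(minimal_protein_set(evidence))  # ['P001', 'P003']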

6.
Babnigg G, Giometti CS. Proteomics, 2006, 6(16): 4514-4522.
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
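A SEGUID can be computed from the primary sequence alone, which is what makes it stable across databases and annotation releases. The sketch below follows the commonly cited definition, the base64-encoded SHA-1 digest of the uppercased sequence with trailing "=" padding stripped; treat the exact canonicalization rules as an assumption and consult the SEGUID Database for the authoritative specification.

    # SEGUID-style fingerprint from a primary protein sequence.
    import base64
    import hashlib

    def seguid(sequence: str) -> str:
        canonical = "".join(sequence.split()).upper()  # drop whitespace, uppercase
        digest = hashlib.sha1(canonical.encode("ascii")).digest()
        return base64.b64encode(digest).decode("ascii").rstrip("=")

    # The same sequence yields the same identifier no matter which
    # database (or which annotation version) it came from.
    print(seguid("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))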

7.
8.
euGenes is a genome information system and database that provides a common summary of eukaryote genes and genomes, at http://iubio.bio.indiana.edu/eugenes/. Seven popular genomes are included: human, mouse, fruitfly, the Caenorhabditis elegans worm, Saccharomyces yeast, the Arabidopsis mustard weed, and zebrafish, with more planned. This information, automatically extracted and updated from several source databases, offers bioscientists looking for gene relationships across organisms features not readily available through other genome databases. The database describes 150,000 known, predicted, and orphan genes, using consistent gene names along with their homologies and associations with a standard vocabulary of molecular functions, cell locations, and biological processes. Usable whole-genome maps including features, chromosome locations, and molecular data integration are available, as are options to retrieve sequences from these genomes. Search and retrieval methods for these data are easy to use and efficient, allowing one to ask combined questions about sequence features, protein functions, and other gene attributes, and to fetch results as reports, computable tabular outputs, or bulk database forms. These summarized data are useful for integration in other projects, such as gene expression databases. euGenes provides an extensible, flexible genome information system for many organisms.

9.
MOTIVATION: At present, mapping of sequence identifiers across databases is a daunting, time-consuming, and computationally expensive process, usually achieved by sequence similarity searches with strict threshold values. SUMMARY: We present a rapid and efficient method to map sequence identifiers across databases. The method uses the MD5 message-digest algorithm to generate sequence fingerprints and uses these fingerprints as hash strings to map sequences across databases. The program, called MagicMatch, is able to cross-link any of the major sequence databases within a few seconds on a modest desktop computer.
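The fingerprinting idea is simple to reproduce. A minimal Python sketch, independent of MagicMatch's actual implementation, with invented records:

    # MD5 sequence fingerprints as hash keys for cross-database mapping.
    import hashlib
    from collections import defaultdict

    def fingerprint(sequence: str) -> str:
        canonical = "".join(sequence.split()).upper()  # normalize before hashing
        return hashlib.md5(canonical.encode("ascii")).hexdigest()

    def index_by_fingerprint(records):
        """records: iterable of (identifier, sequence) pairs from one database."""
        index = defaultdict(list)
        for identifier, sequence in records:
            index[fingerprint(sequence)].append(identifier)
        return index

    db_a = [("sp|P12345", "MKWVTFISLLLLFSSAYS")]
    db_b = [("NP_000001", "MKWVTFISLLLLFSSAYS"), ("NP_999999", "MSSPATLD")]

    idx_a, idx_b = index_by_fingerprint(db_a), index_by_fingerprint(db_b)
    for fp in idx_a.keys() & idx_b.keys():  # identical sequences cross-link
        print(idx_a[fp], "<->", idx_b[fp])

Because hashing is a single pass over each sequence, the whole mapping is linear in the size of the databases, which is what makes this approach so much faster than similarity searching.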

10.
Our current biological knowledge is spread over many independent bioinformatics databases, where many different types of gene and protein identifiers are used. The heterogeneous and redundant nature of these identifiers limits data analysis across different bioinformatics resources. This bottleneck becomes even more serious for larger datasets, such as gene lists derived from microarray and proteomic experiments. The DAVID Gene ID Conversion Tool (DICT), a web-based application, is able to convert users' input gene or gene product identifiers from one type to another in a comprehensive and high-throughput manner, using a uniquely enhanced ID-ID mapping database.
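Conceptually, such a conversion reduces to lookups in a typed ID-ID mapping table, with ambiguous and unmapped identifiers surfaced rather than silently dropped. A toy sketch in that spirit (not DICT's actual database or interface; the probe-to-gene pairs are illustrative):

    # Identifier conversion via an ID-ID mapping table.
    from collections import defaultdict

    mapping = [  # (source_type, source_id, target_type, target_id)
        ("AFFYMETRIX", "1007_s_at", "ENTREZ_GENE", "780"),
        ("AFFYMETRIX", "1053_at", "ENTREZ_GENE", "5982"),
    ]

    index = defaultdict(list)
    for src_type, src_id, tgt_type, tgt_id in mapping:
        index[(src_type, src_id, tgt_type)].append(tgt_id)

    def convert(ids, src_type, tgt_type):
        # Unmapped IDs return an empty list so gaps stay visible to the user.
        return {i: index.get((src_type, i, tgt_type), []) for i in ids}

    print(convert(["1007_s_at", "unknown_probe"], "AFFYMETRIX", "ENTREZ_GENE"))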

11.
The flood of data acquired from the increasing number of publicly available genomes has led to new demands on bioinformatics software. With the growing amount of information resulting from high-throughput experiments, new questions arise that often focus on the comparison of genes, genomes, and their expression profiles. Inferring new knowledge by combining different kinds of "post-genomics" data necessitates the development of new approaches that allow the integration of variable data sources into a flexible framework. In this paper, we describe our concept for the integration of heterogeneous data into a platform for systems biology. We have implemented a Bioinformatics Resource for the Integration of heterogeneous Data from Genomic Explorations (BRIDGE) and illustrate the usability of our approach as a platform for systems biology with two sample applications.

12.
Data integration is key to functional and comparative genomics because integration allows diverse data types to be evaluated in new contexts. To achieve data integration in a scalable and sensible way, semantic standards are needed, both for naming things (standardized nomenclatures, use of key words) and also for knowledge representation. The Mouse Genome Informatics database and other model organism databases help to close the gap between information and understanding of biological processes because these resources enforce well-defined nomenclature and knowledge representation standards. Model organism databases have a critical role to play in ensuring that diverse kinds of data, especially genome-scale data sets and information, remain useful to the biological community in the long-term. The efforts of model organism database groups ensure not only that organism-specific data are integrated, curated and accessible but also that the information is structured in such a way that comparison of biological knowledge across model organisms is facilitated.

13.
14.
15.
Wang C, Marshall A, Zhang D, Wilson ZA. Plant Physiology, 2012, 158(4): 1523-1533.
Protein interactions are fundamental to the molecular processes occurring within an organism and can be utilized in network biology to help organize, simplify, and understand biological complexity. Currently, there are more than 10 publicly available Arabidopsis (Arabidopsis thaliana) protein interaction databases. However, there are limitations with these databases, including different types of interaction evidence, a lack of defined standards for protein identifiers, differing levels of information, and, critically, a lack of integration between them. In this paper, we present an interactive bioinformatics Web tool, ANAP (Arabidopsis Network Analysis Pipeline), which serves to effectively integrate the different data sets and maximize access to available data. ANAP has been developed for Arabidopsis protein interaction integration and network-based study to facilitate functional protein network analysis. ANAP integrates 11 Arabidopsis protein interaction databases, comprising 201,699 unique protein interaction pairs, 15,208 identifiers (including 11,931 The Arabidopsis Information Resource Arabidopsis Genome Initiative codes), 89 interaction detection methods, 73 species that interact with Arabidopsis, and 6,161 references. ANAP can be used as a knowledge base for constructing protein interaction networks based on user input and supports both direct and indirect interaction analysis. It has an intuitive graphical interface allowing easy network visualization and provides extensive detailed evidence for each interaction. In addition, ANAP displays the gene and protein annotation in the generated interactive network with links to The Arabidopsis Information Resource, the AtGenExpress Visualization Tool, the Arabidopsis 1,001 Genomes GBrowse, the Protein Knowledgebase, the Kyoto Encyclopedia of Genes and Genomes, and the Ensembl Genome Browser to significantly aid functional network analysis. The tool is available open access at http://gmdd.shgmo.org/Computational-Biology/ANAP.

16.
Biodiversity data is being digitized and made available online at a rapidly increasing rate, but current practices typically do not preserve linkages between these data, which impedes interoperation, provenance tracking, and assembly of larger datasets. For data associated with biocollections, the biodiversity community has long recognized that an essential part of establishing and preserving linkages is to apply globally unique identifiers at the point when data are generated in the field and to persist these identifiers downstream, but this is seldom implemented in practice. There has been neither coalescence around a single identifier solution (as in some other domains) nor even a set of recommended best practices and standards to support multiple identifier schemes sharing consistent responses. In order to make progress towards a broader community consensus, a group of biocollections and informatics experts assembled in Stockholm in October 2014 to discuss next steps for overcoming current roadblocks. The workshop participants divided into four groups focusing on: identifier practice in current field biocollections; identifier application for legacy biocollections; identifiers as applied to biodiversity data records as they are published and made available in semantically marked-up publications; and cross-cutting identifier solutions that bridge across these domains. The main outcome was consensus on key issues, including recognition of the differences between legacy and new biocollection processes, the need for identifier metadata profiles that can report on an identifier's persistence mission, and the unambiguous indication of the type of object associated with the identifier. Current identifier characteristics are also summarized, and an overview of available schemes and practices is provided.

17.
Menda N, Buels RM, Tecle I, Mueller LA. Plant Physiology, 2008, 147(4): 1788-1799.
The amount of biological data available in the public domain is growing exponentially, and there is an increasing need for infrastructural and human resources to organize, store, and present the data in a proper context. Model organism databases (MODs) invest great efforts to functionally annotate genomes and phenomes by in-house curators. The SOL Genomics Network (SGN; http://www.sgn.cornell.edu) is a clade-oriented database (COD), which provides a more scalable and comparative framework for biological information. SGN has recently spearheaded a new approach by developing community annotation tools to expand its curational capacity. These tools effectively allow some curation to be delegated to qualified researchers, while, at the same time, preserving the in-house curators' full editorial control. Here we describe the background, features, implementation, results, and development road map of SGN's community annotation tools for curating genotypes and phenotypes. Since the inception of this project in late 2006, interest and participation from the Solanaceae research community has been strong and growing continuously to the extent that we plan to expand the framework to accommodate more plant taxa. All data, tools, and code developed at SGN are freely available to download and adapt.

18.
With several plant genomes sequenced, the power of comparative genome analysis can now be applied. However, genome-scale cross-species analyses are limited by the effort for data integration. To develop an integrated cross-species plant genome resource, we maintain comprehensive databases for model plant genomes, including Arabidopsis (Arabidopsis thaliana), maize (Zea mays), Medicago truncatula, and rice (Oryza sativa). Integration of data and resources is emphasized, both in house as well as with external partners and databases. Manual curation and state-of-the-art bioinformatic analysis are combined to achieve quality data. Easy access to the data is provided through Web interfaces and visualization tools, bulk downloads, and Web services for application-level access. This allows a consistent view of the model plant genomes for comparative and evolutionary studies, the transfer of knowledge between species, and the integration with functional genomics data.

19.
CRITICA: coding region identification tool invoking comparative analysis.
Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein-coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data, such that CRITICA is not dependent on the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared with the DNA sequence annotations and with the predictions of GenMark. CRITICA proved to be more accurate than GenMark, and moreover, many of its predictions that would seem to be errors instead reflect problems in the sequence databases. The source code of CRITICA is freely available by anonymous FTP (rdp.life.uiuc.edu, in /pub/critica) and on the World Wide Web (http://rdpwww.life.uiuc.edu).
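CRITICA's comparative criterion can be illustrated with a toy calculation: compare the amino-acid identity of a candidate reading frame against what the observed nucleotide identity alone would predict. The sketch below uses a deliberately crude null expectation (all three codon positions matching independently, i.e. p**3) and a truncated codon table; CRITICA's actual statistics are more careful.

    # Toy version of the comparative evidence test for coding regions.
    CODON_TABLE = {  # partial genetic code, enough for the example below
        "ATG": "M", "AAA": "K", "AAG": "K", "GCT": "A", "GCC": "A",
        "CGT": "R", "CGC": "R",
    }

    def identity(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def translate(dna):
        return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

    # Invented aligned pair whose differences are mostly synonymous.
    seq1 = "ATGAAAGCTCGT"
    seq2 = "ATGAAGGCCCGC"

    p_nt = identity(seq1, seq2)                        # 0.75
    p_aa = identity(translate(seq1), translate(seq2))  # 1.00
    expected_aa = p_nt ** 3                            # ~0.42, crude null model

    print(f"nucleotide identity {p_nt:.2f}; amino-acid identity {p_aa:.2f} "
          f"(null expectation {expected_aa:.2f})")
    # Amino-acid identity far above the expectation is evidence for coding.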

20.

Background  

Transposable elements (TEs) are mobile sequences found in nearly all eukaryotic genomes. They have the ability to move and replicate within a genome, often influencing genome evolution and gene expression. The identification of TEs is an important part of every genome project. The number of sequenced genomes is rapidly rising, and the need to identify TEs within them is growing accordingly. The ability to do this automatically and effectively, in a manner similar to the methods used for genes, is of increasing importance. There are many difficulties in identifying TEs, including their tendency to degrade over time and the fact that many do not adhere to a conserved structure. In this work, we describe a homology-based approach for the automatic identification of high-quality consensus TEs, aimed for use in the analysis of newly sequenced genomes.
