Similar Literature
20 similar documents retrieved (search time: 15 ms)
1.
In ecological sciences, the role of metadata (i.e. key information about a dataset) in making existing datasets visible and discoverable has become increasingly important. Within the EU-funded WISER project (Water bodies in Europe: Integrative Systems to assess Ecological status and Recovery), we designed a metadatabase to allow scientists to find the optimal data for their analyses. An online questionnaire helped to collect metadata from the data providers and an online query tool (http://www.wiser.eu/results/meta-database/) facilitated data evaluation. The WISER metadatabase currently holds information on 114 datasets (22 river, 71 lake, 1 general freshwater and 20 coastal/transitional datasets), which can also be accessed by external scientists. We evaluate whether widely used metadata standards (e.g. Darwin Core, ISO 19115, CSDGM, EML) are suitable for purposes as specific as WISER's and suggest, at a minimum, linkage with standard metadata fields. Furthermore, we discuss whether simple metadata documentation is enough for others to reuse a dataset and why there is still reluctance to publish both metadata and primary research data (e.g. time and financial constraints, misuse of data, abandoning intellectual property rights). We emphasise that metadata publication has major advantages, as it makes datasets detectable by other scientists and generally makes a scientist's work more visible.

2.
Author-level metrics are a widely used measure of scientific success. The h-index and its variants measure publication output (number of publications) and research impact (number of citations). They are often used to influence decisions, such as allocating funding or jobs. Here, we argue that the emphasis on publication output and impact hinders scientific progress in the fields of ecology and evolution because it disincentivizes two fundamental practices: generating impactful (and therefore often long-term) datasets and sharing data. We describe a new author-level metric, the data-index, which values both dataset output (number of datasets) and impact (number of data-index citations), and so promotes generating and sharing data. We discuss how it could be implemented and provide user guidelines. The data-index is designed to complement other metrics of scientific success, as scientific contributions are diverse and our value system should reflect that, both for the benefit of scientific progress and to create a value system that is more equitable, diverse, and inclusive. Future work should focus on promoting other scientific contributions, such as communicating science, informing policy, mentoring other scientists, and providing open-access code and tools.
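The abstract does not give the data-index formula; if it mirrors the h-index (the largest d such that d of an author's datasets each have at least d citations), a minimal sketch in R looks like this:

    # Hypothetical data-index, assuming a direct h-index analogy:
    # the largest d such that d datasets each have >= d data-index citations.
    data_index <- function(citations) {
      citations <- sort(citations, decreasing = TRUE)
      sum(citations >= seq_along(citations))
    }

    data_index(c(25, 10, 8, 3, 1))  # 3: three datasets cited at least 3 times each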

3.
With the growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing sizes are making it extremely hard to store, manage, disseminate, analyze, and visualize these datasets, especially as neither the memory capacity of parallel machines, nor memory access speeds, nor disk bandwidths are increasing at the same rate as computing power. Sampling can be an effective technique to address these challenges, but it is extremely important to ensure that dataset characteristics are preserved and that the loss of accuracy stays within acceptable levels. In this paper, we address the data explosion problem by developing a novel sampling approach and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, bitmap indexing can not only effectively support subsetting over scientific datasets, but can also help create samples that preserve both value and spatial distributions. We have developed algorithms for using bitmap indices to sample datasets, and we have shown how a small amount of additional metadata stored with the bitvectors can help assess the loss of accuracy at a particular subsampling level. Other properties of this novel approach include: (1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and (2) no data reorganization is needed once bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications, and demonstrated the effectiveness of our approach.
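The paper's algorithms operate on a repository's bitmap indices directly; as a rough illustration of the underlying idea, this R sketch builds one logical bitvector per value bin and samples within each bin, so the sample tracks the full dataset's value distribution (the bin count and 10% rate are illustrative, not from the paper):

    # One bitmap (logical vector) per value bin; sample per bin to
    # preserve the value distribution of the original dataset.
    set.seed(42)
    data <- rnorm(1e5)
    bins <- cut(data, breaks = 8)
    bitmaps <- lapply(levels(bins), function(b) bins == b)
    sample_idx <- unlist(lapply(bitmaps, function(bv) {
      idx <- which(bv)
      idx[sample.int(length(idx), ceiling(0.10 * length(idx)))]  # 10% per bin
    }))
    summary(data); summary(data[sample_idx])  # distributions stay close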

4.

Background

The 1980s marked the occasion when Geographical Information System (GIS) technology was broadly introduced into the geo-spatial community through the establishment of a strong GIS industry. This technology quickly disseminated across many countries, and has now become established as an important research, planning and commercial tool for a wider community that includes organisations in the public and private health sectors. The broad acceptance of GIS technology and the nature of its functionality have meant that numerous datasets have been created over the past three decades. Most of these datasets have been created independently, without any structured documentation systems in place. However, search and retrieval systems can only work if there is a mechanism for datasets' existence to be discovered, and this is where proper metadata creation and management can greatly help. This situation must be addressed through support mechanisms such as Web-based portal technologies, metadata editor tools, automation, metadata standards and guidelines, and collaborative efforts with relevant individuals and organisations. Engagement with data developers or administrators should also include a strategy of identifying the benefits associated with metadata creation and publication.

Findings

The establishment of numerous Spatial Data Infrastructures (SDIs), and other Internet resources, is a testament to the recognition of the importance of supporting good data management and sharing practices across the geographic information community. These resources extend to health informatics in support of research, public services and teaching and learning. This paper identifies many of these resources available to the UK academic health informatics community. It also reveals the reluctance of many spatial data creators across the wider UK academic community to use these resources to create and publish metadata, or deposit their data in repositories for sharing. The Go-Geo! service is introduced as an SDI developed to provide UK academia with the necessary resources to address the concerns surrounding metadata creation and data sharing. The Go-Geo! portal, Geodoc metadata editor tool, ShareGeo spatial data repository, and a range of other support resources, are described in detail.

Conclusions

This paper describes a variety of resources available for the health research and public health sector to use for managing and sharing their data. The Go-Geo! service is one resource which offers an SDI for the eclectic range of disciplines using GIS in UK academia, including health informatics. The benefits of data management and sharing are immense, and in these times of cost constraints, these resources can be seen as ways to find cost savings which can be reinvested in more research.

5.
6.
Background

Research in Bioinformatics generates tools and datasets at a very fast rate. Meanwhile, a lot of effort is going into making these resources findable and reusable, to improve resource discovery by researchers in the course of their work.

Purpose

This paper proposes a semi-automated tool to assess a resource against the Findability, Accessibility, Interoperability and Reusability (FAIR) criteria. The aim is to create a portal that presents the assessment score together with a report that researchers can use to gauge a resource.

Method

Our system uses internet searches to automate the process of generating FAIR scores. The process is semi-automated in that, if a particular property of the FAIR score has not been captured by AutoFAIR, the user can amend and supply the information to complete the assessment.

Results

We compare our results against FAIRshake, which was used as the benchmark tool for comparing the assessments. The results show that AutoFAIR was able to match the FAIR criteria in FAIRshake with minimal intervention from the user.

Conclusions

We show that AutoFAIR can be a good repository for storing metadata about tools and datasets, together with comprehensive reports detailing the assessments of the resources. Moreover, AutoFAIR is also able to score workflows, giving an overall indication of the FAIRness of the resources used in a scientific study.
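AutoFAIR's actual rubric is not given in the abstract; a toy R sketch of rule-based scoring with a manual-completion hook follows, where the four checks and the equal weighting are assumptions for illustration:

    # Each check inspects one metadata property; missing properties are
    # reported so a user can supply them, as in the semi-automated workflow.
    fair_score <- function(meta) {
      checks <- c(
        findable      = !is.null(meta$identifier),  # e.g. a DOI
        accessible    = !is.null(meta$url),         # resolvable location
        interoperable = !is.null(meta$format),      # standard format declared
        reusable      = !is.null(meta$license)      # licence stated
      )
      list(score = mean(checks), missing = names(checks)[!checks])
    }

    fair_score(list(identifier = "doi:10.1000/xyz",
                    url = "https://example.org", format = "CSV"))
    # score 0.75; "reusable" is flagged for the user to complete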

7.
Data support knowledge development and theory advances in ecology and evolution. We are increasingly reusing data within our teams and projects and through the global, openly archived datasets of others. Metadata can be challenging to write and interpret, but it is always crucial for reuse. The value of metadata cannot be overstated, even as a relatively independent research object, because it describes the work that has been done in a structured format. We advance a new perspective and classify methods for metadata curation and development using tables. Tables with templates can effectively capture all components of an experiment or project in a single, easy-to-read file familiar to most scientists. If coupled with the R programming language, metadata from tables can then be rapidly and reproducibly converted to publication formats, including extensible markup language (XML) files suitable for data repositories. Tables can also be used to summarize existing metadata and to store metadata across many datasets. A case study is provided, and the added benefits of preparing tables for metadata a priori are developed to ensure a more streamlined publishing process for many data repositories used in ecology, evolution, and the environmental sciences. In ecology and evolution, researchers are often highly tabular thinkers, shaped by experimental data collection in the lab and/or field, and representations of metadata as a table will provide novel research and reuse insights.
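A minimal sketch of the table-to-XML step in R, using the xml2 package; the flat attribute/value table and the element names are illustrative placeholders, not a valid repository schema such as EML:

    library(xml2)

    meta_table <- data.frame(
      attribute = c("title", "creator", "coverage", "units"),
      value     = c("Seedling survival 2019-2021", "J. Doe",
                    "Example study region", "count per plot")
    )

    doc <- xml_new_root("metadata")
    for (i in seq_len(nrow(meta_table))) {
      # one XML element per table row: <title>...</title>, etc.
      xml_add_child(doc, meta_table$attribute[i], meta_table$value[i])
    }
    write_xml(doc, "metadata.xml")  # deposit alongside the dataset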

8.
In order for society to make effective policy decisions on complex and far-reaching subjects, such as appropriate responses to global climate change, scientists must effectively communicate complex results to the non-scientifically specialized public. However, there are few ways to transform highly complicated scientific data into formats that are engaging to the general community. Taking inspiration from patterns observed in nature and from some of the principles of jazz bebop improvisation, we have generated Microbial Bebop, a method by which microbial environmental data are transformed into music. Microbial Bebop uses meter, pitch, duration, and harmony to highlight the relationships between multiple data types in complex biological datasets. We use a comprehensive microbial ecology time-course dataset collected at the L4 marine monitoring station in the Western English Channel as an example of microbial ecological data that can be transformed into music. Four compositions (www.bio.anl.gov/MicrobialBebop.htm) were generated from L4 Station data using Microbial Bebop. Each composition, though deriving from the same dataset, is created to highlight different relationships between environmental conditions and microbial community structure. The approach presented here can be applied to a wide variety of complex biological datasets.
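The paper maps data to meter, pitch, duration, and harmony; as a toy illustration of the pitch dimension alone, this R sketch bins an abundance series onto a fixed note set so the data's shape becomes a melodic contour (the series and note set are invented):

    abundance   <- c(0.12, 0.30, 0.55, 0.41, 0.90, 0.20)  # e.g. relative taxon abundance
    scale_notes <- c("C4", "D4", "E4", "G4", "A4")         # illustrative pitch set
    idx    <- cut(abundance, breaks = length(scale_notes), labels = FALSE)
    melody <- scale_notes[idx]
    melody  # "C4" "D4" "E4" "D4" "A4" "C4": peaks in the data become high notes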

9.
CellDepot, which contains over 270 datasets from 8 species and many tissues, serves as an integrated web application that empowers scientists to explore single-cell RNA-seq (scRNA-seq) datasets and compare datasets among various studies through a user-friendly interface with advanced visualization and analytical capabilities. To begin with, it provides an efficient data management system with which users can upload single-cell datasets and query the database by multiple attributes, such as species and cell types. In addition, the graphical multi-logic, multi-condition query builder and convenient filtering tool, backed by a MySQL database system, allow users to quickly find the datasets of interest and compare the expression of gene(s) across them. Moreover, by embedding the cellxgene VIP tool, CellDepot enables fast, interactive, and scalable exploration of individual datasets to gain refined insights such as cell composition, gene expression profiles, and differentially expressed genes among cell types, leveraging more than 20 frequently applied plotting functions and high-level analysis methods in single-cell research. In summary, the web portal, available at http://celldepot.bxgenomics.com, promotes large-scale single-cell data sharing, facilitates meta-analysis and visualization, and encourages scientists to contribute to the single-cell community in a tractable and collaborative way. Finally, CellDepot is released as open-source software under the MIT license to motivate crowd contribution, broad adoption, and local deployment for private datasets.
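CellDepot's schema is not described in the abstract; as a hedged sketch of the multi-condition query idea, this R snippet assembles user-chosen filters into one parameterized MySQL statement via DBI (the table, columns, and connection details are invented for illustration):

    library(DBI)

    build_query <- function(filters) {
      clauses <- paste(names(filters), "= ?", collapse = " AND ")
      paste("SELECT dataset_id, title FROM datasets WHERE", clauses)
    }

    filters <- list(species = "mouse", tissue = "brain")
    con <- dbConnect(RMariaDB::MariaDB(), dbname = "celldepot")  # assumed connection
    res <- dbGetQuery(con, build_query(filters), params = unname(filters))
    dbDisconnect(con)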

10.
The ecoinformatics community recognizes that ecological synthesis across studies, space, and time will require new informatics tools and infrastructure. Recent advances have been encouraging, but many problems still face ecologists who manage their own datasets, prepare data for archiving, and search data stores for synthetic research. In this paper, we describe how work by the Canopy Database Project (CDP) might enable use of database technology by field ecologists: increasing the quality of database design, improving data validation, and providing structural and semantic metadata, all of which might improve the quality of data archives and thereby help drive ecological synthesis.

The CDP has experimented with conceptual components for database design, called templates, to address information technology issues facing ecologists. Templates represent forest structures and observational measurements on these structures. Using our software, researchers select templates to represent their study's data and can generate normalized relational databases. Information hidden in those databases is used by ancillary tools, including data intake forms and simple data validation, data visualization, and metadata export. The primary question we address in this paper is: which templates are the right templates?

We argue for defining simple templates (with relatively few attributes) that describe the domain's major entities, and for coupling those with focused and flexible observation templates. We present a conceptual model for the observation data type, and show how we have implemented the model as an observation entity in the DataBank database designer and generator. We show how our visualization tool CanopyView exploits metadata made explicit by DataBank to help scientists with analysis and synthesis. We conclude by presenting future plans for tools to conduct statistical calculations common to forest ecology and to enhance data mining with DataBank databases.

DataBank could be extended to another domain by replacing our forest-ecology-specific templates with those for the new domain. This work extends the basic computer science idea of abstract data types and user-defined types to ecology-specific database design tools for individual users, and applies to ecoinformatics the software engineering innovations of domain-specific languages, software patterns, components, refactoring, and end-user programming.
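A minimal sketch of the entity-plus-observation pattern such templates generate, written in R with RSQLite; the table and column names are illustrative, not the DataBank schema:

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    # A simple entity template with relatively few attributes...
    dbExecute(con, "CREATE TABLE tree (
      tree_id INTEGER PRIMARY KEY,
      species TEXT)")
    # ...coupled with a focused, flexible observation template.
    dbExecute(con, "CREATE TABLE observation (
      obs_id      INTEGER PRIMARY KEY,
      tree_id     INTEGER REFERENCES tree(tree_id),
      attribute   TEXT,   -- e.g. 'dbh_cm' or 'height_m'
      value       REAL,
      observed_on TEXT)")
    dbDisconnect(con)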

11.
Among the many effects of climate change is its influence on the phenology of biota. In marine and coastal ecosystems, phenological shifts have been documented for multiple life forms; however, biological data related to marine species' phenology remain difficult to access and are under-used. We conducted an assessment of potential sources of biological data for marine species and their availability for use in phenological analyses and assessments. Our evaluations showed that data potentially related to understanding marine species' phenology are available through online resources of governmental, academic, and non-governmental organizations, but appropriate datasets are often difficult to discover and access, presenting opportunities for improving scientific infrastructure. The developing Federal Marine Data Architecture, when fully implemented, will improve data flow and standardization for marine data within major federal repositories and provide an archival repository for collaborating academic and public data contributors. Another opportunity, largely untapped, is the engagement of citizen scientists in standardized collection of marine phenology data and the contribution of these data to established data flows. Use of metadata with marine-phenology-related keywords could improve discovery of and access to appropriate datasets. When data originators choose to self-publish, publication of research datasets with a digital object identifier, linked to metadata, will also improve subsequent discovery and access. Phenological changes in the marine environment will affect human economics, food systems, and recreation. No one source of data will be sufficient to understand these changes. The collective attention of marine data collectors, whether with an agency, an educational institution, or a citizen scientist group, is needed toward adopting the data management processes and standards that ensure the availability of sufficient and usable marine data to understand marine phenology.

12.
Many communities use standard, structured, machine-readable documentation, i.e. metadata, to make discovery, access, use, and understanding of scientific datasets possible. Organizations and communities have also developed recommendations for metadata content that is required or suggested for their data developers and users. These recommendations are typically specific to the metadata representations (dialects) used by the community. By considering the conceptual content of the recommendations, quantitative analysis and comparison of the completeness of multiple metadata dialects becomes possible. This is a study of the completeness of EML and CSDGM metadata records from DataONE in terms of the LTER recommendation for completeness. The goal of the study is to quantitatively measure the completeness of metadata records and to determine whether metadata developed by LTER is more complete with respect to the recommendation than other collections in EML and in CSDGM. We conclude that the LTER records are broadly more complete than the other EML collections, but similar in completeness to the CSDGM collections.
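Completeness in this sense is the share of a recommendation's conceptual fields that a record actually fills; a minimal R sketch follows, with an illustrative field list rather than the LTER recommendation itself:

    completeness <- function(record, recommended) {
      filled <- names(record)[!vapply(record, is.null, logical(1))]
      mean(recommended %in% filled)
    }

    rec <- list(title = "Stream chemistry", abstract = NULL, keywords = "nitrate")
    completeness(rec, c("title", "abstract", "keywords", "contact"))
    # 0.5: half of the recommended fields are populated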

13.
This paper describes a prototype grid infrastructure, called the “eMinerals minigrid”, for molecular simulation scientists, which is based on an integration of shared compute and data resources. We describe the key components, namely the use of Condor pools; Linux/Unix clusters with PBS and IBM's LoadLeveler job handling tools; Globus for security handling; Condor-G tools for wrapping Globus job submission commands; Condor's DAGMan tool for handling workflow; the Storage Resource Broker for handling data; and the CCLRC dataportal and associated tools for both archiving data with metadata and making data available to other workers.
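For a flavour of the DAGMan workflow component: a DAG file lists Condor jobs and their ordering constraints, and DAGMan releases each job only when its parents finish. The job names and submit files here are illustrative, not from the eMinerals setup:

    # workflow.dag -- run with: condor_submit_dag workflow.dag
    JOB fetch    fetch.sub      # stage input data
    JOB simulate simulate.sub   # run the molecular simulation
    JOB archive  archive.sub    # archive results with metadata
    PARENT fetch    CHILD simulate
    PARENT simulate CHILD archive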

14.
Previous studies have reported that some important loci are missed in single-locus genome-wide association studies (GWAS), especially because of the large phenotypic error in field experiments. To address this issue, multi-locus GWAS methods have been recommended. However, only a few software packages for multi-locus GWAS are available. Therefore, we developed an R package named mrMLM v4.0.2. This software integrates the mrMLM, FASTmrMLM, FASTmrEMMA, pLARmEB, pKWmEB, and ISIS EM-BLASSO methods developed by our lab. There are four components in mrMLM v4.0.2: dataset input, parameter setting, software running, and result output. The fread function in data.table is used to quickly read datasets, especially big datasets, and the doParallel package is used to conduct parallel computation on multiple CPUs. In addition, the graphical user interface software mrMLM.GUI v4.0.2, built upon Shiny, is also available. To confirm the correctness of the aforementioned programs, all the methods in mrMLM v4.0.2 and three widely used methods were applied to real and simulated datasets. The results confirm the superior performance of mrMLM v4.0.2 over other currently available methods. False positive rates are effectively controlled, albeit with a less stringent significance threshold. mrMLM v4.0.2 is publicly available at BioCode (https://bigd.big.ac.cn/biocode/tools/BT007077) or CRAN (https://cran.r-project.org/web/packages/mrMLM.GUI/index.html) as open-source software.
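The mrMLM API itself is not shown in the abstract; the fast-input and parallel-computation pattern it describes (data.table::fread plus doParallel) looks like this generic R sketch, with the file name and per-chromosome split invented for illustration:

    library(data.table)
    library(doParallel)

    geno <- fread("genotypes.csv")   # fast read of a large dataset
    cl <- makeCluster(4)             # e.g. four CPU cores
    registerDoParallel(cl)
    scan <- foreach(chr = unique(geno$chrom), .packages = "data.table") %dopar% {
      sub <- geno[chrom == chr]
      # ... a per-chromosome association scan would run here ...
      nrow(sub)
    }
    stopCluster(cl)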

15.
Many important questions in biology are, fundamentally, comparative, and this extends to our analysis of a growing number of sequenced genomes. Existing genomic analysis tools are often organized around literal views of genomes as linear strings. Even when information is highly condensed, these views grow cumbersome as larger numbers of genomes are added. Data aggregation and summarization methods from the field of visual analytics can provide abstracted comparative views, suitable for sifting large multi-genome datasets to identify critical similarities and differences. We introduce a software system for visual analysis of comparative genomics data. The system automates the process of data integration, and provides the analysis platform to identify and explore features of interest within these large datasets. GenoSets borrows techniques from business intelligence and visual analytics to provide a rich interface of interactive visualizations supported by a multi-dimensional data warehouse. In GenoSets, visual analytic approaches are used to enable querying based on orthology, functional assignment, and taxonomic or user-defined groupings of genomes. GenoSets links this information together with coordinated, interactive visualizations for both detailed and high-level categorical analysis of summarized data. GenoSets has been designed to simplify the exploration of multiple genome datasets and to facilitate reasoning about genomic comparisons. Case examples are included showing the use of this system in the analysis of 12 Brucella genomes. GenoSets software and the case study dataset are freely available at http://genosets.uncc.edu. We demonstrate that the integration of genomic data using a coordinated multiple view approach can simplify the exploration of large comparative genomic datasets, and facilitate reasoning about comparisons and features of interest.

16.
Microarray technology has become one of the elementary tools for researchers to study the genomes of organisms. As the complexity and heterogeneity of cancer are increasingly appreciated through genomic analysis, cancer classification is an emerging and important trend. Significant directed random walk is proposed as a cancer classification approach with higher sensitivity in risk-gene prediction and higher accuracy in cancer classification. In this paper, the methodology and materials used for the experiments are presented. A tuning-parameter selection method and weights as parameters are applied in the proposed approach. A gene expression dataset is used as the input dataset, while a pathway dataset is used as the reference dataset to build a directed graph and complete the biasing process in the random walk approach. In addition, we demonstrate that our approach can improve the sensitivity of prediction with higher accuracy and biologically meaningful classification results. A comparison between significant directed random walk and directed random walk shows the improvement in terms of prediction sensitivity and classification accuracy.
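The paper's exact weighting scheme is not given in the abstract; a generic R sketch of a directed random walk with restart over a small pathway graph, smoothing seed gene scores, follows (the graph, seed vector, and restart probability are illustrative):

    random_walk <- function(A, p0, r = 0.7, tol = 1e-8) {
      W <- sweep(A, 2, pmax(colSums(A), 1), "/")   # column-normalize the directed graph
      p <- p0
      repeat {
        p_new <- (1 - r) * (W %*% p) + r * p0      # walk step plus restart to seeds
        if (sum(abs(p_new - p)) < tol) return(as.vector(p_new))
        p <- p_new
      }
    }

    A  <- matrix(c(0, 1, 0,
                   0, 0, 1,
                   1, 0, 0), nrow = 3, byrow = TRUE)  # toy 3-gene directed cycle
    p0 <- c(1, 0, 0)                                  # seed scores from expression data
    random_walk(A, p0)  # stationary scores rank genes by pathway proximity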

17.

Purpose

Life cycle assessment (LCA) in Quebec (Canada) is increasingly important. Yet, studies often still need to rely on foreign life cycle inventory (LCI) data. The Quebec government invested in the creation of a Quebec LCI database. The approach is to work as an ecoinvent “National Database Initiative” (NDI), whereby the Quebec database initiative uses and contributes to the ecoinvent database. The paper clarifies the relationship between ecoinvent and the Quebec NDI and provides details on prioritization and data collection.

Methods

The first steps were to select a partner database provider and to work out the modalities of the partnership. The main criterion for partner selection was database transparency, i.e., availability of unit process data (gate-to-gate), necessary for database adaptation. This and other criteria, such as free access to external reviewers, conservation of dataset copyright, seamless embedding of datasets, and overall database sophistication, pointed to ecoinvent. Once started, the NDI project proceeded as follows: (1) data collection was prioritized based on several criteria; (2) some datasets were “recontextualized,” i.e., existing datasets were duplicated and relocated in Quebec and linked to datasets representing regional suppliers, where relevant; (3) new datasets were created; and (4) Canadian environmentally extended supply-use tables were created for the ecoinvent IO repository.

Results and discussion

Prioritization identified 500 candidate datasets for recontextualization, based on the relative contribution of direct electricity consumption to cradle-to-gate impacts, as well as 12 key sectors from which about 450 data adaptation or collection projects were singled out. Data collection and private sector solicitation are underway. Private sector participation is highly variable; a number of communication tools have been developed and a solicitation team formed to overcome this obstacle. The new ecoinvent database protocol (Weidema et al. 2011) increases the amount of information that is required to create a dataset, which can lengthen or, in extreme cases, impede dataset creation. However, this new information is required for the new database functionalities (e.g., providing multiple system models based on the same unit process data, and regionalized LCA).

Conclusions

Being an NDI is advantageous for the Quebec LCI database project on multiple levels. By conserving dataset copyright, the NDI remains free to spawn or support other LCI databases. Embedding datasets in ecoinvent enables the generation of LCI results from “day 1.” The costs of IT infrastructure and data review are nil. For these reasons, and because every NDI improves the global representativeness of ecoinvent, we recommend that other regional or national database projects work as NDIs.

18.
Guo J, Wu X, Zhang DY, Lin K. Nucleic Acids Research 2008, 36(6):2002–2011
High-throughput studies of protein interactions may have produced, experimentally and computationally, the most comprehensive protein–protein interaction datasets in the completely sequenced genomes. This provides us an opportunity, on a proteome scale, to discover the underlying protein interaction patterns. Here, we propose an approach to discovering motif pairs at interaction sites (often 3–8 residues) that are essential for understanding protein functions and helpful for the rational design of protein engineering and folding experiments. A gold standard positive (interacting) dataset and a gold standard negative (non-interacting) dataset were mined to infer the interacting motif pairs that are significantly overrepresented in the positive dataset compared to the negative dataset. Four negative datasets assembled by different strategies were evaluated, and the one with the best performance was used as the gold standard negatives for further analysis. Meanwhile, to assess the efficiency of our method in detecting potential interacting motif pairs, other approaches developed previously were compared, and we found that our method achieved the highest prediction accuracy. In addition, many uncharacterized motif pairs of interest were found to be functional, with experimental evidence in other species. This investigation demonstrates the important effect of a high-quality negative dataset on the performance of such statistical inference.
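The abstract does not specify the overrepresentation test; one standard way to score a motif pair against gold standard positive and negative sets is Fisher's exact test on a 2x2 contingency table, sketched in R with invented counts:

    pos_with <- 40;  pos_total <- 500   # positive pairs containing the motif pair
    neg_with <- 10;  neg_total <- 800   # negative pairs containing it
    tab <- matrix(c(pos_with, pos_total - pos_with,
                    neg_with, neg_total - neg_with),
                  nrow = 2, byrow = TRUE)
    fisher.test(tab, alternative = "greater")$p.value
    # a small p-value marks the pair as overrepresented among interactions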

19.
Tangherlini M, Miralto M, Colantuono C, Sangiovanni M, Dell'Anno A, Corinaldesi C, Danovaro R, Chiusano ML. BMC Bioinformatics 2018, 19(15):443

Background

Environmental metagenomics is a challenging approach that is spreading exponentially through the scientific community for investigating taxonomic diversity and the possible functions of biological components. The massive amount of sequence data produced, often endowed with rich environmental metadata, needs suitable computational tools to fully explore the embedded information. Bioinformatics plays a key role in providing methodologies to manage, process and mine molecular data, integrated with environmental metagenomics collections. One relevant example is the Tara Ocean Project.

Results

We considered the Tara 16S miTAGs released by the consortium, representing raw sequences from a shotgun metagenomics approach with similarities to 16S rRNA genes. We generated assembled 16S rDNA sequences, which were classified according to their lengths, the possible presence of chimeric reads, and their putative taxonomic affiliation. The dataset was included in GLOSSary (the GLobal Ocean 16S Subunit web accessible resource), a bioinformatics platform to organize environmental metagenomics data. The aims of this work were: i) to present alternative computational approaches to manage challenging metagenomics data; ii) to set up user-friendly web-based platforms allowing the integration of environmental metagenomics sequences and the associated metadata; iii) to implement an appropriate bioinformatics platform supporting the analysis of 16S rDNA sequences exploiting reference datasets, such as the SILVA database. We organized the data in a next-generation NoSQL “schema-less” database, allowing flexible organization of large amounts of data and supporting native geospatial queries. A web interface was developed to permit interactive exploration and visual geographical localization of the data, either raw miTAG reads or 16S contigs, from our processing pipeline. Information on unassembled sequences is also available. The taxonomic affiliations of contigs and miTAGs, and the spatial distribution of the sampling sites and their associated sequence libraries, as contained in the Tara metadata, can be explored through a query interface, which allows both textual and visual investigations. In addition, all the sequence data were made available for a dedicated BLAST-based web application alongside the SILVA collection.
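The abstract does not name the document store; as a hedged sketch of the kind of native geospatial query such a schema-less database supports, here is an R snippet using the mongolite package against a MongoDB-style collection (database, collection, field names, and coordinates are all invented; a 2dsphere index on the location field is assumed):

    library(mongolite)

    m <- mongo(collection = "samples", db = "glossary",
               url = "mongodb://localhost")
    # sampling sites within 50 km of a point of interest
    near_site <- m$find(query = '{
      "location": { "$near": {
        "$geometry": { "type": "Point", "coordinates": [-16.5, 28.0] },
        "$maxDistance": 50000 } } }')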

Conclusions

GLOSSary provides an expandable bioinformatics environment, able to support the scientific community in current and forthcoming environmental metagenomics analyses.

20.
Although a vast amount of life sciences data is generated in the form of images, most scientists still store images on extremely diverse and often incompatible storage media, without any type of metadata structure, and thus with no standard facility with which to conduct searches or analyses. Here we present a solution to unlock the value of scientific images. The Global Image Database (GID) is a web-based (http://www.gwer.ch/qv/gid/gid.htm) structured central repository for annotated scientific images. The GID was designed to manage images from a wide spectrum of imaging domains, ranging from microscopy to automated screening. The annotations in the GID define the source experiment of the images by describing who the authors of the experiment are, when the images were created, the biological origin of the experimental sample, and how the sample was processed for visualization. A collection of experimental imaging protocols provides details of the sample preparation and labeling or visualization procedures. In addition, the entries in the GID reference these imaging protocols with the probe sequences or antibody names used in labeling experiments. The GID annotations are searchable by field or globally. The query results are first shown as image thumbnail previews, enabling quick browsing prior to retrieval of the original-sized annotated images. Development of the GID continues, aiming at facilitating the management and exchange of image data in the scientific community, and at creating new query tools for mining image data.
