Similar Articles
20 similar articles found (search time: 15 ms)
1.
The validity of material flow analyses (MFAs) depends on the available information base, that is, the quality and quantity of available data. MFA data are cross-disciplinary, can have varying formats and qualities, and originate from heterogeneous sources, such as official statistics, scientific models, or expert estimations. Statistical methods for data evaluation are most often inadequate, because MFA data are typically isolated values rather than extensive data sets. In consideration of the properties of MFA data, a data characterization framework for MFA is presented. It consists of an MFA data terminology, a data characterization matrix, and a procedure for database analysis. The framework facilitates systematic data characterization by cell-level tagging of data with data attributes. Data attributes represent data characteristics and meta-information regarding the statistical properties, meaning, origination, and application of the data. The data characterization framework is illustrated in a case study of a national phosphorus budget. This work furthers understanding of the information basis of material flow systems, promotes the transparent documentation and precise communication of MFA input data, and can serve as the foundation for better data interpretation and comprehensive data quality evaluation.
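A minimal sketch of the cell-level tagging idea described above, where each datum carries attribute tags for its statistical properties, meaning, origination, and application. The class and attribute names are illustrative assumptions, not the paper's terminology:

```python
# Minimal sketch of cell-level tagging for MFA data: each datum carries
# attributes recording its statistical properties, meaning, origin and
# application. Attribute names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class MFADatum:
    value: float
    unit: str
    source: str                   # e.g. official statistics, expert estimate
    attributes: dict = field(default_factory=dict)

flow = MFADatum(
    value=12.4, unit="kt P/yr", source="national statistics",
    attributes={"age_of_data": 2010, "geographic_fit": "national",
                "estimation_method": "direct measurement"},
)
print(flow)
```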

2.
Data Quality     
A methodology is presented to develop and analyze vectors of data quality attribute scores. Each data quality vector component represents the quality of the data element for a specific attribute (e.g., age of data). Several methods for aggregating the components of data quality vectors to derive one data quality indicator (DQI) that represents the total quality associated with the input data element are presented with illustrative examples. The methods are compared and it is proven that the measure of central tendency, or arithmetic average, of the data quality vector components as a percentage of the total quality range attainable is an equivalent measure for the aggregate DQI. In addition, the methodology is applied and compared to real-world LCA data pedigree matrices. Finally, a method for aggregating weighted data quality vector attributes is developed and an illustrative example is presented. This methodology provides LCA practitioners with an approach to increase the precision of input data uncertainty assessments by selecting any number of data quality attributes with which to score the LCA inventory model input data. The resultant vector of data quality attributes can then be analyzed to develop one aggregate DQI for each input data element for use in stochastic LCA modeling.
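A minimal sketch of the aggregation rule shown equivalent above (the arithmetic average of the vector components expressed as a fraction of the attainable quality range), including the weighted variant. The 1-5 pedigree-style scale and the example scores are assumptions for illustration:

```python
# Aggregate a data quality vector into a single DQI by (weighted) arithmetic
# average, expressed as a fraction of the total attainable quality range.

def aggregate_dqi(scores, scale_min=1.0, scale_max=5.0, weights=None):
    """Return the aggregate DQI as a fraction of the attainable range."""
    if weights is None:
        weights = [1.0] * len(scores)
    # Weighted mean of the attribute scores.
    mean_score = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    # Express the mean as a percentage of the total attainable range.
    return (mean_score - scale_min) / (scale_max - scale_min)

# Example: quality vector for one input datum (age, geography, technology, ...).
vector = [4, 3, 5, 2]                               # hypothetical scores
print(aggregate_dqi(vector))                        # unweighted DQI in [0, 1]
print(aggregate_dqi(vector, weights=[2, 1, 1, 1]))  # weighted variant
```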

3.
E Haston, R Cubey, M Pullan, H Atkins, DJ Harris. ZooKeys, 2012, (209): 93-102
Digitisation programmes in many institutes frequently involve disparate and irregular funding, diverse selection criteria and scope, with different members of staff managing and operating the processes. These factors have influenced the decision at the Royal Botanic Garden Edinburgh to develop an integrated workflow for the digitisation of herbarium specimens which is modular and scalable, enabling a single overall workflow to be used for all digitisation projects. This integrated workflow comprises three principal elements: a specimen workflow, a data workflow and an image workflow. The specimen workflow is strongly linked to curatorial processes which affect the prioritisation, selection and preparation of the specimens. The importance of including a conservation element within the digitisation workflow is highlighted. The data workflow includes the concept of three main categories of collection data: label data, curatorial data and supplementary data. It is shown that each category of data has its own properties which influence the timing of data capture within the workflow. Software has been developed for the rapid capture of curatorial data, and optical character recognition (OCR) software is being used to increase the efficiency of capturing label data and supplementary data. The large number and size of the images have necessitated the inclusion of automated systems within the image workflow.
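An illustrative sketch of the OCR label-capture step using the open-source Tesseract engine; the Garden's actual software and file layout are not described in the abstract, so the path and downstream parsing are hypothetical:

```python
# OCR a herbarium specimen image to extract label text for later parsing
# into label, curatorial and supplementary data fields.
from PIL import Image
import pytesseract

image = Image.open("specimens/E00123456.jpg")   # hypothetical specimen scan
label_text = pytesseract.image_to_string(image)
# The captured text would then be parsed into database fields.
print(label_text)
```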

4.
GDPC: connecting researchers with multiple integrated data sources
The goal of this project is to simplify access to genomic diversity and phenotype data, thereby encouraging reuse of this data. The Genomic Diversity and Phenotype Connection (GDPC) accomplishes this by retrieving data from one or more data sources and by allowing researchers to analyze integrated data in a standard format. GDPC is written in Java and provides (1) data sources available as web services that transfer XML-formatted data via the SOAP protocol; (2) a Java API for programmatic access to data sources; and (3) a front-end application that allows users to manage data sources, retrieve data based on filters, sort/group data based on property values, and save/open the data as XML files. AVAILABILITY: The source code, compiled code, documentation and GDPC Browser are freely available at www.maizegenetics.net/gdpc/index.html. The current release of GDPC is version 1.0, with updated releases planned for the future. Comments are welcome.

5.
Hospital pharmaceutical operations are complex, data integration and analysis are incomplete, and comprehensive, integrated data warehouse systems are lacking. This study examines hospital pharmaceutical data and applies the Business Dimensional Lifecycle method to the design, development, and deployment of a data warehouse project. The problems addressed include: creating the data warehouse bus architecture, establishing subject models, using dimensional modeling for logical modeling, the physical design of data storage, and data transfer and development. The overall logical structure model is clearly designed, the construction method is novel, and a sound analysis model for a hospital pharmaceutical data warehouse is presented.
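A minimal sketch of a dimensional (star-schema) model in the spirit of the bus-architecture approach described above. The table and column names are illustrative assumptions, not the paper's schema:

```python
# Toy star schema for hospital drug data: shared dimensions plus one fact
# table keyed to them, built in an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_drug (drug_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_ward (ward_key INTEGER PRIMARY KEY, ward_name TEXT);
-- Fact table: one row per dispensing event, keyed to the shared dimensions.
CREATE TABLE fact_dispense (
    drug_key INTEGER REFERENCES dim_drug(drug_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    ward_key INTEGER REFERENCES dim_ward(ward_key),
    quantity REAL, cost REAL);
""")
```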

6.
The Self-Organizing Map (SOM) is an efficient tool for visualizing high-dimensional data. In this paper, an intuitive and effective SOM projection method is proposed for mapping high-dimensional data onto the two-dimensional grid structure with a growing self-organizing mechanism. In the learning phase, a growing SOM is trained and the growing cell structure is used as the baseline framework. In the ordination phase, the new projection method maps each input vector onto the structure of the SOM without having to plot the weight values, making the data easy to visualize. The projection method is demonstrated on four different data sets, including a data set of 118 patents and a data set of 399 chemical abstracts related to polymer cements, with promising results and a significantly reduced network size.
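A minimal sketch of the underlying SOM training step on a fixed 2-D grid; the paper's growing cell structure and projection phase are not reproduced, and the data and grid size here are arbitrary assumptions:

```python
# Classic SOM update rule on a fixed 8x8 grid with decaying learning rate
# and neighborhood radius.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10))          # toy high-dimensional input
rows, cols = 8, 8
weights = rng.normal(size=(rows, cols, data.shape[1]))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

for t, x in enumerate(data):
    lr = 0.5 * np.exp(-t / len(data))      # decaying learning rate
    sigma = 3.0 * np.exp(-t / len(data))   # decaying neighborhood radius
    # Best-matching unit: grid node whose weight vector is closest to x.
    bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), (rows, cols))
    dist2 = ((grid - np.array(bmu)) ** 2).sum(-1)
    h = np.exp(-dist2 / (2 * sigma ** 2))  # neighborhood function
    weights += lr * h[..., None] * (x - weights)
```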

7.
SUMMARY: Large volumes of microarray data are generated and deposited in public databases. Most of these data are in the form of tab-delimited text files or Excel spreadsheets. Combining data from several of these files to reanalyze these data sets is time-consuming. Microarray Data Assembler is specifically designed to simplify this task. The program can list files and data sources, convert selected text files into Excel files, and assemble data across multiple Excel worksheets and workbooks. The program thus makes data assembly easy, saves time, and helps avoid manual error. AVAILABILITY: The program is freely available for non-profit use, via email request from the author, after signing a Material Transfer Agreement with Johns Hopkins University.
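A small sketch of the kind of assembly step the tool automates, as a generic pandas recipe rather than the Microarray Data Assembler itself; the data directory is a hypothetical placeholder:

```python
# Collect tab-delimited microarray files into one Excel workbook,
# one sheet per input file.
from pathlib import Path
import pandas as pd

files = sorted(Path("arrays").glob("*.txt"))   # hypothetical data directory
with pd.ExcelWriter("assembled.xlsx") as writer:
    for f in files:
        df = pd.read_csv(f, sep="\t")
        # Excel sheet names are limited to 31 characters.
        df.to_excel(writer, sheet_name=f.stem[:31], index=False)
```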

8.
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear discriminant analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable for undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called uncorrelated linear discriminant analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the generalized singular value decomposition method to handle undersampled data, and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.
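As a baseline sketch of LDA-based dimension reduction on an undersampled toy problem, using scikit-learn's SVD solver; the paper's ULDA (generalized SVD with uncorrelated features) is not available in scikit-learn and is not implemented here, and the synthetic data are assumptions:

```python
# Standard LDA as a dimension-reduction step before a simple classifier,
# on a toy undersampled problem (60 samples, 2000 features).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))            # 60 samples, 2000 "genes"
y = rng.integers(0, 3, size=60)            # 3 tissue classes
X[y == 1, :50] += 1.0                      # inject some class signal

# LDA projects to at most (n_classes - 1) dimensions; KNN classifies there.
clf = make_pipeline(LinearDiscriminantAnalysis(solver="svd"), KNeighborsClassifier(3))
print(cross_val_score(clf, X, y, cv=5).mean())
```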

9.
10.
11.
We examined three parallel data sets with respect to qualities relevant to phylogenetic analysis of 20 exemplar monocotyledons and related dicotyledons. The three data sets represent restriction-site variation in the inverted repeat region of the chloroplast genome, and nucleotide sequence variation in the chloroplast-encoded gene rbcL and in the mitochondrion-encoded gene atpA, the latter of which encodes the alpha-subunit of mitochondrial ATP synthase. The plant mitochondrial genome has been little used in plant systematics, in part because nucleotide sequence evolution in enzyme-encoding genes of this genome is relatively slow. The three data sets were examined in separate and combined analyses, with a focus on patterns of congruence, homoplasy, and data decisiveness. Data decisiveness (described by P. Goloboff) is a measure of robustness of support for most parsimonious trees by a data set in terms of the degree to which those trees are shorter than the average length of all possible trees. Because indecisive data sets require relatively fewer additional steps than decisive ones to be optimized on nonparsimonious trees, they will have a lesser tendency to be incongruent with other data sets. One consequence of this relationship between decisiveness and character incongruence is that if incongruence is used as a criterion of noncombinability, decisive data sets, which provide robust support for relationships, are more likely to be assessed as noncombinable with other data sets than are indecisive data sets, which provide weak support for relationships. For the sampling of taxa in this study, the atpA data set has about half as many cladistically informative nucleotides as the rbcL data set per site examined, and is less homoplastic and more decisive. The rbcL data set, which is the least decisive of the three, exhibits the lowest levels of character incongruence. Whatever the molecular evolutionary cause of this phenomenon, it seems likely that the poorer performance of rbcL relative to atpA, in terms of data decisiveness, is due to both its higher overall level of homoplasy and the fact that it is performing especially poorly at nonsynonymous sites.
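Goloboff's decisiveness measure is commonly formalized as below; the notation is a reconstruction from the verbal description above (most-parsimonious-tree length versus the mean over all possible trees), not quoted from the paper:

```latex
% Data decisiveness (after Goloboff): how much shorter the most
% parsimonious trees are than the average of all possible trees.
\[
  DD = \frac{\bar{S} - S}{\bar{S} - M}
\]
% S       length of the most parsimonious tree(s)
% \bar{S} mean length over all possible trees
% M       minimum conceivable length of the data (no homoplasy)
```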

12.
A study of representatives of the bacterial genus Pseudomonas, analysing a combined data set of four molecular sequences with completely different properties and evolutionary constraints, is reported. The best evolutionary model to describe each data set and the combined data set was obtained with a hierarchical hypothesis-testing program, and the combined data set is presented and analysed under the likelihood criterion. The resolution among Pseudomonas lineages increased in the combined analysis due to a synergistic effect of the individual data sets. The unresolved fluorescens lineage, as well as other weakly supported lineages in the single-data-set trees, should be revised in detail at the biochemical and molecular level. The taxonomic status of biovars of P. putida is discussed.
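A sketch of the hierarchical likelihood-ratio test that model-selection programs of this kind perform between nested substitution models; the model names, degrees of freedom, and log-likelihood values below are placeholders, not results from the paper:

```python
# Likelihood-ratio test between nested models: 2*(lnL_complex - lnL_simple)
# is compared against a chi-square with df = number of extra parameters.
from scipy.stats import chi2

def lrt(lnL_simple, lnL_complex, extra_params):
    """Return the p-value for rejecting the simpler nested model."""
    stat = 2.0 * (lnL_complex - lnL_simple)
    return chi2.sf(stat, df=extra_params)

# Example: JC69 vs HKY85 (4 extra free parameters), hypothetical likelihoods.
print(lrt(lnL_simple=-5234.2, lnL_complex=-5210.7, extra_params=4))
```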

13.
Incorrect statistical methods are often used for the analysis of ordinal response data. Such data are frequently summarized into mean scores for comparisons, a fallacious practice because ordinal data are inherently not equidistant. The ubiquitous Pearson chi-square test is invalid because it ignores the ranking of ordinal data. Although some of the non-parametric statistical methods take into account the ordering of ordinal data, these methods do not accommodate statistical adjustment of confounding or assessment of effect modification, two overriding analytic goals in virtually all etiologic inference in biology and medicine. The cumulative logit model is eminently suitable for the analysis of ordinal response data. This multivariate method not only considers the ranked order inherent in ordinal response data, but it also allows adjustment of confounding and assessment of effect modification based on modest sample size. A non-technical account of the cumulative logit model is given and its applications are illustrated by two research examples. The SAS programs for the data analysis of the research examples are available from the author.
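A minimal sketch of fitting a cumulative logit (proportional odds) model, of the form logit P(Y <= j | x) = alpha_j - beta*x (sign conventions vary). The paper's SAS programs are not available here, so statsmodels is used instead, and the data are synthetic:

```python
# Fit a cumulative logit model to a synthetic ordinal outcome with one
# covariate, using statsmodels' OrderedModel.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 300
exposure = rng.normal(size=n)              # hypothetical covariate
latent = 1.2 * exposure + rng.logistic(size=n)
y = pd.Series(pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
                     labels=["mild", "moderate", "severe"]))

model = OrderedModel(y, exposure[:, None], distr="logit")
res = model.fit(method="bfgs", disp=False)
print(res.summary())                       # cutpoints and the exposure effect
```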

14.
The cancer classification problem is one of the most challenging problems in bioinformatics. The data provided by the Netherlands Cancer Institute consist of 295 breast cancer patients: 101 with distant metastases and 194 without. We investigate kernel-based combinations of feature sets to classify patients with or without distant metastases. A single data set is compared with three data integration strategies, as well as weighted data integration strategies, all based on kernel methods. Least Squares Support Vector Machine (LS-SVM) is chosen as the classifier because it can handle very high-dimensional features such as microarray data. The experimental results show that weighted late integration and using the microarray data alone perform almost identically; data integration is therefore not always better than using a single data set in this case. Classification performance depends strongly on the features used to represent the object.
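A sketch of weighted kernel combination for integration: one RBF kernel per data source, mixed with nonnegative weights. The sources, weights, and data are illustrative assumptions, and the paper's LS-SVM solver is replaced here by scikit-learn's SVC with a precomputed kernel:

```python
# Combine per-source RBF kernels with weights, then classify with an SVM
# on the precomputed combined kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 100
X_expr = rng.normal(size=(n, 500))         # toy microarray features
X_clin = rng.normal(size=(n, 10))          # toy clinical features
y = rng.integers(0, 2, size=n)

w = [0.7, 0.3]                             # hypothetical source weights
K = w[0] * rbf_kernel(X_expr) + w[1] * rbf_kernel(X_clin)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))                     # training accuracy on the toy data
```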

15.
Introducing a new (freeware) tool for palynology
We present a multiple-access key and searchable database of Neotropical pollen that is available as freeware. The database is built on FileMaker 5 and contains c. 6000 images of >1000 taxa. All pollen images are of acetolysed grains collected from vouchered herbarium specimens. The selection of taxa for inclusion in the database is predicated upon their probable occurrence in lake sedimentary records, which in turn was based on their flower structure, sexual mechanisms and ecology. The multiple-access key is a forgiving format, as it can be used with incomplete data or where the researcher cannot decide between the choices offered. The database is downloadable and is compatible with both Mac and PC platforms.

16.
The organization and structure of data masses containing the results of scientific research are presented on the basis of the morphometric method. The data massif is implemented on an ESER-1056 mainframe computer. Currently, all results of the general-purpose scientific program "Statist", designed for the mathematical and statistical processing of morphometric data, are collected in this data massif. A personal computer is linked to the mainframe by cable for data transfer by telecommunication, and the whole system accomplishes distributed data processing. This enables the scientist to use the data massif directly from his working site.

17.
We present a gridded 8 km-resolution data product of the estimated composition of tree taxa at the time of Euro-American settlement of the northeastern United States and the statistical methodology used to produce the product from trees recorded by land surveyors. Composition is defined as the proportion of stems larger than approximately 20 cm diameter at breast height for 22 tree taxa, generally at the genus level. The data come from settlement-era public survey records that are transcribed and then aggregated spatially, giving count data. The domain is divided into two regions, eastern (Maine to Ohio) and midwestern (Indiana to Minnesota). Public Land Survey point data in the midwestern region (ca. 0.8-km resolution) are aggregated to a regular 8 km grid, while data in the eastern region, from Town Proprietor Surveys, are aggregated at the township level in irregularly-shaped local administrative units. The product is based on a Bayesian statistical model fit to the count data that estimates composition on the 8 km grid across the entire domain. The statistical model is designed to handle data from both the regular grid and the irregularly-shaped townships and allows us to estimate composition at locations with no data and to smooth over noise caused by limited counts in locations with data. Critically, the model also allows us to quantify uncertainty in our composition estimates, making the product suitable for applications employing data assimilation. We expect this data product to be useful for understanding the state of vegetation in the northeastern United States prior to large-scale Euro-American settlement. In addition to specific regional questions, the data product can also serve as a baseline against which to investigate how forests and ecosystems change after intensive settlement. The data product is being made available at the NIS data portal as version 1.0.
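A toy sketch of the Bayesian idea behind the product: estimating per-cell taxon composition from tree counts, with a Dirichlet prior smoothing cells with few trees and posterior draws quantifying uncertainty. The taxa, counts, and prior are illustrative assumptions; the full model's spatial smoothing and township-level aggregation are not reproduced:

```python
# Dirichlet-multinomial posterior for composition in one grid cell.
import numpy as np

taxa = ["oak", "pine", "maple", "beech"]   # illustrative subset of the 22 taxa
alpha = np.ones(len(taxa))                 # symmetric Dirichlet prior
counts = np.array([12, 3, 0, 1])           # settlement-era tree tally (toy)

posterior_mean = (counts + alpha) / (counts + alpha).sum()
# Posterior draws quantify the uncertainty the data product reports.
draws = np.random.default_rng(0).dirichlet(counts + alpha, size=1000)
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
for t, m, l, h in zip(taxa, posterior_mean, lo, hi):
    print(f"{t:6s} {m:.2f} ({l:.2f}-{h:.2f})")
```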

18.
19.
The current study investigates existing infrastructure, its technical solutions, and implemented standards for data repositories related to integrative biodiversity research. The storage and reuse of complex biodiversity data in central databases are becoming increasingly important, particularly in attempts to cope with the impacts of environmental change on biodiversity and ecosystems. From the data side, the main challenge for biodiversity repositories is dealing with the highly interdisciplinary and heterogeneous character of standardized and unstandardized data and metadata covering information from genes to ecosystems. Furthermore, technical improvements in data acquisition techniques produce ever larger data volumes, which represent a challenge for database structure and proper data exchange. The current study is based on comprehensive in-depth interviews and an online survey addressing IT specialists involved in database development and operation. The results show that metadata standards are already well established, but that non-metadata content is still largely unstandardized across the various scientific communities. For example, only a third of all repositories in our investigation use internationally unified semantic standard checklists for taxonomy. The study also showed that database developers are mostly occupied with implementing state-of-the-art technology and solving operational problems, leaving no time to implement users' requirements. One of the main reasons for this unsatisfactory situation is the undersized and unreliable funding of most repositories, as reflected by the marginally small number of permanent IT staff members. We conclude that a sustainable data management system that fosters the future use and reuse of these valuable data resources requires the development of fewer, but more permanent, data repositories using commonly accepted standards for their long-term data. This can only be accomplished through the consolidation of hitherto widely scattered small and non-permanent repositories.

20.
The collection of data on physical parameters of body segments is a preliminary critical step in studying the biomechanics of locomotion. Little data on nonhuman body segment parameters has been published. The lack of standardization of techniques for data collection and presentation has made the comparative use of these data difficult and at times impossible. This study offers an approach for collecting data on center of gravity and moments of inertia for standardized body segments. The double swing pendulum approach is proposed as a solution for difficulties previously encountered in calculating moments of inertia for body segments. A format for prompting a computer to perform these calculations is offered, and the resulting segment mass data for Lemur fulvus is presented.
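A worked sketch of the physical-pendulum relation underlying swing tests for segment moments of inertia: from T = 2*pi*sqrt(I_pivot/(m*g*d)) one gets I_pivot = m*g*d*T^2/(4*pi^2), and the parallel-axis theorem gives the value about the center of gravity. The paper's double-swing correction is not reproduced, and the numbers are illustrative:

```python
# Moment of inertia of a body segment from a measured swing period.
import math

m = 0.35          # segment mass (kg), hypothetical
d = 0.12          # pivot-to-center-of-gravity distance (m), hypothetical
T = 0.85          # measured swing period (s), hypothetical
g = 9.81          # gravitational acceleration (m/s^2)

I_pivot = m * g * d * T**2 / (4 * math.pi**2)
I_cg = I_pivot - m * d**2                  # parallel-axis theorem
print(f"I_pivot = {I_pivot:.5f} kg*m^2, I_cg = {I_cg:.5f} kg*m^2")
```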
