Similar Articles
20 similar articles found (search time: 31 ms)
1.
Data Quality     
A methodology is presented to develop and analyze vectors of data quality attribute scores. Each data quality vector component represents the quality of the data element for a specific attribute (e.g., age of data). Several methods for aggregating the components of data quality vectors to derive one data quality indicator (DQI) that represents the total quality associated with the input data element are presented with illustrative examples. The methods are compared, and it is proven that the measure of central tendency, or arithmetic average, of the data quality vector components, expressed as a percentage of the total attainable quality range, is an equivalent measure for the aggregate DQI. In addition, the methodology is applied and compared to real-world LCA data pedigree matrices. Finally, a method for aggregating weighted data quality vector attributes is developed and an illustrative example is presented. This methodology provides LCA practitioners with an approach to increase the precision of input data uncertainty assessments by selecting any number of data quality attributes with which to score the LCA inventory model input data. The resultant vector of data quality attributes can then be analyzed to develop one aggregate DQI for each input data element for use in stochastic LCA modeling.
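A minimal sketch of the aggregation described above, assuming a 1-5 pedigree-style score range and illustrative attribute names (neither is specified in the abstract): the aggregate DQI is the (optionally weighted) mean score expressed as a fraction of the attainable quality range.

```python
import numpy as np

def aggregate_dqi(scores, score_min=1.0, score_max=5.0, weights=None):
    """Aggregate a vector of data quality attribute scores into one DQI:
    the (weighted) mean score as a fraction of the attainable range.
    The 1-5 range mirrors common LCA pedigree matrices; it is an
    assumption, not taken from the paper."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.ones_like(scores)
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    mean_score = float(np.dot(weights, scores))
    return (mean_score - score_min) / (score_max - score_min)

# Example: four attributes (e.g., age, geography, technology, completeness)
print(aggregate_dqi([2, 4, 3, 5]))                        # unweighted
print(aggregate_dqi([2, 4, 3, 5], weights=[2, 1, 1, 1]))  # weighted variant
```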

2.
Sensory evaluation data are sometimes collected using trinary category scales. Nonparametric analysis for such data is discussed, and a homogeneity statistic for trinary data is proposed that is simple enough to calculate with a pocket calculator. A statistic for identifying market segmentation in trinary data is also suggested.
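The abstract does not reproduce the proposed statistic; as a hedged stand-in, a standard chi-square test of homogeneity on trinary response counts illustrates the kind of calculation involved (the counts are hypothetical):

```python
from scipy.stats import chi2_contingency

# Rows: two products; columns: trinary response counts
# (e.g., "dislike", "neutral", "like"). Hypothetical data.
table = [[12, 30, 18],
         [25, 20, 15]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
```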

3.
Background: Single-cell RNA sequencing (scRNA-seq) is an emerging technology that enables high-resolution detection of heterogeneity between cells. One important application of scRNA-seq data is to detect differential expression (DE) of genes. Some researchers still apply DE analysis methods developed for bulk RNA-seq data to single-cell data, while new methods designed for scRNA-seq data have also been developed. Because bulk and single-cell RNA-seq data have different characteristics, a systematic evaluation of the two types of methods on scRNA-seq data is needed. Results: In this study, we conducted a series of experiments on scRNA-seq data to quantitatively evaluate 14 popular DE analysis methods, including both traditional methods developed for bulk RNA-seq data and new methods specifically designed for scRNA-seq data. We derived observations and recommendations for the methods under different situations. Conclusions: DE analysis methods should be chosen for scRNA-seq data with great caution, with regard to the situation of the data at hand. Different strategies should be taken for data with different sample sizes and/or different strengths of the expected signals. Several methods for scRNA-seq data show advantages in some aspects, and DEGSeq tends to outperform other methods with respect to consistency, reproducibility and accuracy of predictions on scRNA-seq data.
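For illustration only, a per-gene Wilcoxon rank-sum test — one simple nonparametric baseline of the kind such benchmarks include — on simulated counts; this is not the paper's evaluation pipeline:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical UMI-like counts: 200 genes x 50 cells per group
group_a = rng.poisson(2.0, size=(200, 50))
group_b = rng.poisson(2.0, size=(200, 50))
group_b[:20] = rng.poisson(6.0, size=(20, 50))  # 20 truly DE genes

pvals = np.array([
    mannwhitneyu(group_a[g], group_b[g], alternative="two-sided").pvalue
    for g in range(group_a.shape[0])
])
print("genes with p < 0.01:", int(np.sum(pvals < 0.01)))
```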

4.
5.
Geoscience observations and model simulations generate vast amounts of multi-dimensional data. Effectively analyzing these data is essential for geoscience studies. However, the task is challenging because processing such massive data volumes is both compute- and data-intensive, and the analytics require complex procedures and multiple tools. To tackle these challenges, a scientific workflow framework is proposed for big geoscience data analytics. The framework leverages cloud computing, MapReduce, and Service Oriented Architecture (SOA). Specifically, HBase is adopted for storing and managing big geoscience data across distributed computers; a MapReduce-based algorithm framework is developed to support parallel processing of geoscience data; and a service-oriented workflow architecture is built to support on-demand complex data analytics in the cloud environment. A proof-of-concept prototype tests the performance of the framework. Results show that this framework significantly improves the efficiency of big geoscience data analytics by reducing data processing time and simplifying analytical procedures for geoscientists.
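A minimal local sketch of the MapReduce pattern the framework builds on, computing a per-grid-cell mean; the record layout and cell IDs are assumptions, and the real framework runs on HBase in the cloud rather than in one process:

```python
from collections import defaultdict

# Hypothetical records: (grid_cell_id, temperature) observations
records = [("cell_01", 14.2), ("cell_02", 15.1), ("cell_01", 13.8),
           ("cell_02", 15.5), ("cell_01", 14.0)]

def map_phase(record):
    cell, value = record
    yield cell, (value, 1)          # emit partial sum and count

def reduce_phase(cell, partials):
    total = sum(v for v, _ in partials)
    count = sum(c for _, c in partials)
    return cell, total / count      # per-cell mean

grouped = defaultdict(list)         # shuffle: group values by key
for rec in records:
    for key, val in map_phase(rec):
        grouped[key].append(val)

for cell in sorted(grouped):
    print(reduce_phase(cell, grouped[cell]))
```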

6.
The classification of tissue samples based on gene expression data is an important problem in medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with the performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear discriminant analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable to undersampled data due to singularity problems associated with the matrices in the underlying representation. This paper presents a dimension reduction and feature extraction scheme, called uncorrelated linear discriminant analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the generalized singular value decomposition method to handle undersampled data, and the features it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously, and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.
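The GSVD-based ULDA algorithm itself is not reproduced here; the sketch below uses a common simplification for the undersampled case — PCA to remove the null space followed by classical LDA — on simulated expression data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Hypothetical undersampled data: 60 tissue samples x 2000 genes, 3 classes
X = rng.normal(size=(60, 2000))
y = np.repeat([0, 1, 2], 20)
X[y == 1, :10] += 1.5     # plant class-specific signal in a few genes
X[y == 2, 10:20] += 1.5

# PCA first removes the singularity that blocks classical LDA when the
# dimension exceeds the sample count; this two-stage shortcut stands in
# for the paper's GSVD-based ULDA.
clf = make_pipeline(PCA(n_components=30), LinearDiscriminantAnalysis())
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```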

7.
create is a Windows program for the creation of new and conversion of existing data input files for 52 genetic data analysis software programs. Programs are grouped into areas of sibship reconstruction, parentage assignment, genetic data analysis, and specialized applications. create is able to read in data from text, Microsoft Excel and Access sources and allows the user to specify columns containing individual and population identifiers, birth and death data, sex data, relationship information, and spatial location data. create's only constraints on source data are that one individual is contained in one row, and the genotypic data is contiguous. create is available for download at http://www.lsc.usgs.gov/CAFL/Ecology/Software.html.

8.
Ecosystem assessment based on ecosystem services is an important foundation for identifying eco-environmental problems, carrying out ecosystem restoration and biodiversity conservation, and establishing ecological compensation mechanisms; it is also a key step in safeguarding national ecological security and advancing ecological civilization. Ecosystem assessment involves many aspects of ecosystems and requires multi-element, multi-type, multi-scale observation data as support. Ground observations and remote sensing are the two main data sources for ecosystem assessment, but their use often suffers from inconsistent observation standards, incomplete coverage of observed variables, insufficient temporal continuity, and scale mismatches, all of which introduce great uncertainty into assessments. How to fuse observation data across scales to quantify ecosystem services is therefore the key to accurate ecosystem assessment. Starting from observation scale, this paper describes the characteristics of ground observation data, near-surface remote sensing data, airborne remote sensing data, and satellite remote sensing data and the problems in their use; reviews the common methods for fusing these data sources; and, taking productivity, carbon sequestration capacity, and biodiversity as examples of key ecological parameters, introduces the multi-source data fusion framework of the project "Ecosystem Assessment Techniques Based on Multi-source Data Fusion and Their Application". Finally, it summarizes the multi-source data fusion framework for ecosystem assessment and points out future research directions.

9.
10.
Gao Fan, Yan Zhenglong, Huang Qiang. Acta Ecologica Sinica (生态学报), 2011, 31(21): 6363-6370
Building a basin-scale database of massive eco-environmental data is fundamental to precision eco-environmental research. Taking the construction of the eco-environmental database for the Tarim River basin as an example, this paper discusses the key techniques involved: seamless mosaicking of data across map zones, database standards design, feature code design, spatial index design, feature display tables, and "one-click import". To address the cross-zone seam problem, starting from the sources of the gaps, the physical data layer was separated from the logical data layer and vector data were distinguished from raster data, achieving seamless cross-zone mosaicking of massive data within a unified multi-scale spatial framework. Database naming standards and feature code design are key steps before data import; in light of the basin's conditions, standardized English letters were used for the database naming convention and map scale of the graphic data for the code standard. Under the ArcSDE framework, grid indexes were designed for vector data and multi-level pyramid structures for raster data, speeding up data retrieval and browsing. Feature display tables and a "one-click import" strategy improved system responsiveness and data import efficiency. The resulting basin-scale database system achieves effective storage and management of multi-source, multi-type, cross-zone massive eco-environmental data, providing basic data support for integrated basin management and eco-environmental research.
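A minimal sketch of a grid spatial index of the kind described for the vector data, with an assumed 0.5-degree cell size and hypothetical feature names:

```python
from collections import defaultdict

CELL_SIZE = 0.5  # degrees; the grid resolution is an assumption

def cell_of(lon, lat):
    return (int(lon // CELL_SIZE), int(lat // CELL_SIZE))

index = defaultdict(list)

def insert(feature_id, lon, lat):
    index[cell_of(lon, lat)].append(feature_id)

def query(lon, lat):
    """Return candidate features in the cell containing (lon, lat)
    and its eight neighbouring cells."""
    cx, cy = cell_of(lon, lat)
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            hits.extend(index.get((cx + dx, cy + dy), []))
    return hits

insert("oasis_A", 82.1, 41.3)   # hypothetical Tarim-basin features
insert("river_B", 82.4, 41.2)
print(query(82.2, 41.25))
```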

11.
Collecting natural data at regular, fine scales is an onerous and often costly procedure. However, there is a basic need for fine-scale data when applying inductive methods such as neural networks or genetic algorithms to the development of ecological models. This paper addresses the issues involved in interpolating data for use in machine learning methods by considering how to determine whether a downscaling of the data is valid. The approach is based on a multi-scale estimate of errors. The resulting function has properties similar to a time series variogram; however, the comparison at different scales is based on the variance introduced by rescaling from the original sequence. This approach has a number of useful properties, including the ability to detect frequencies in the data below the current sampling rate, an estimate of the probable average error introduced when a sampled variable is downscaled, and a method for visualising the sequences of a time series that are most susceptible to error due to sampling. The described approach is well suited to supporting the ongoing sampling of ecological data and to assessing the impact of using interpolated data for building inductive models of ecological response.
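One plausible reading of the multi-scale error estimate, sketched under the assumption that rescaling means block-averaging the sequence and expanding it back to the original resolution; the variance of the difference from the original series is reported per scale:

```python
import numpy as np

def rescaling_error(series, scales):
    """For each scale, average the series over blocks of that length,
    expand it back to the original resolution, and report the variance
    of the difference from the original sequence."""
    series = np.asarray(series, dtype=float)
    errors = {}
    for s in scales:
        n = (len(series) // s) * s        # trim to a multiple of s
        trimmed = series[:n]
        coarse = trimmed.reshape(-1, s).mean(axis=1)
        rebuilt = np.repeat(coarse, s)
        errors[s] = float(np.var(trimmed - rebuilt))
    return errors

t = np.arange(1000)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=1000)
print(rescaling_error(signal, scales=[2, 5, 10, 25, 50]))
```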

12.
Proteomics is a rapidly expanding field encompassing a multitude of complex techniques and data types. To date, much effort has been devoted to achieving the highest possible coverage of proteomes, with the aim of informing future developments in basic biology as well as in clinical settings. As a result, growing amounts of data have been deposited in publicly available proteomics databases. These data are in turn increasingly reused for orthogonal downstream purposes such as data mining and machine learning. These downstream uses, however, need ways to validate a posteriori whether a particular data set is suitable for the envisioned purpose. Furthermore, the (semi-)automatic curation of repository data depends on analyses that can highlight misannotation and edge conditions in data sets. Such curation is an important prerequisite for efficient proteomics data reuse in the life sciences in general. We therefore present here a selection of quality control metrics and approaches for the a posteriori detection of potential issues encountered in typical proteomics data sets. We illustrate our metrics using publicly available data from the Proteomics Identifications Database (PRIDE), and simultaneously show the usefulness of the large body of PRIDE data as a means to derive empirical background distributions for relevant metrics.
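A generic sketch of deriving an empirical background distribution for a QC metric and flagging outlier data sets; the metric, its simulated values, and the percentile cutoffs are all assumptions, not PRIDE specifics:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical QC metric (e.g., fraction of spectra yielding a confident
# identification) computed across 500 public data sets; values simulated.
background = rng.beta(8, 2, size=500)

def flag_dataset(metric_value, background, lower_pct=2.5, upper_pct=97.5):
    """Flag a data set whose metric falls outside the central band of the
    empirical background distribution."""
    lo, hi = np.percentile(background, [lower_pct, upper_pct])
    return not (lo <= metric_value <= hi)

print(flag_dataset(0.35, background))  # far below typical -> likely True
print(flag_dataset(0.80, background))  # typical -> False
```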

13.
Proposed standard for image cytometry data files
P Dean, L Mascio, D Ow, D Sudar, J Mullikin. Cytometry, 1990, 11(5): 561-569
A number of different types of computers running a variety of operating systems are presently used for the collection and analysis of image cytometry data. In order to facilitate the development of sharable data analysis programs, to allow for the transport of image cytometry data from one installation to another, and to provide a uniform and controlled means for including textual information in data files, this document describes a data storage format that is proposed as a standard for use in image cytometry. In this standard, data from an image measurement are stored in a minimum of two files. One file is written in ASCII to include information about the way the image data are written and, optionally, information about the sample, experiment, equipment, etc. The image data are written separately into a binary file. This standard is proposed with the intention that it will be used internationally for the storage and handling of biomedical image cytometry data. The method of data storage described in this paper is similar to those methods published in American Association of Physicists in Medicine (AAPM) Report Number 10 and in ACR-NEMA Standards Publication Number 300-1985.
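A minimal sketch of the two-file idea — an ASCII header describing the binary layout plus a separate binary image file; the field names are illustrative, not the actual fields of the proposed standard:

```python
import numpy as np

def write_pair(basename, image, sample="", experiment=""):
    """Write an ASCII header file describing the binary layout, plus a
    separate raw binary image file. Field names are illustrative only."""
    image = np.ascontiguousarray(image, dtype=np.uint16)
    with open(basename + ".hdr", "w") as f:
        f.write(f"rows={image.shape[0]}\n")
        f.write(f"cols={image.shape[1]}\n")
        f.write("dtype=uint16\nbyte_order=little\n")
        f.write(f"sample={sample}\nexperiment={experiment}\n")
    image.tofile(basename + ".img")

def read_pair(basename):
    with open(basename + ".hdr") as f:
        header = dict(line.strip().split("=", 1) for line in f)
    data = np.fromfile(basename + ".img", dtype=header["dtype"])
    return header, data.reshape(int(header["rows"]), int(header["cols"]))

write_pair("cell01", np.arange(12).reshape(3, 4), sample="demo")
print(read_pair("cell01")[1])
```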

14.
Among the many effects of climate change is its influence on the phenology of biota. In marine and coastal ecosystems, phenological shifts have been documented for multiple life forms; however, biological data related to marine species' phenology remain difficult to access and are under-used. We conducted an assessment of potential sources of biological data for marine species and their availability for use in phenological analyses and assessments. Our evaluation showed that data potentially relevant to marine species' phenology are available through online resources of governmental, academic, and non-governmental organizations, but appropriate datasets are often difficult to discover and access, presenting opportunities for improving scientific infrastructure. The developing Federal Marine Data Architecture, when fully implemented, will improve data flow and standardization for marine data within major federal repositories and provide an archival repository for collaborating academic and public data contributors. Another opportunity, largely untapped, is the engagement of citizen scientists in the standardized collection of marine phenology data and the contribution of these data to established data flows. Use of metadata with marine-phenology-related keywords could improve discovery of and access to appropriate datasets. When data originators choose to self-publish, publication of research datasets with a digital object identifier, linked to metadata, will also improve subsequent discovery and access. Phenological changes in the marine environment will affect human economics, food systems, and recreation. No one source of data will be sufficient to understand these changes. The collective attention of marine data collectors, whether with an agency, an educational institution, or a citizen scientist group, is needed toward adopting the data management processes and standards that will ensure the availability of sufficient and useable marine data to understand marine phenology.

15.
Many animal health, welfare and food safety databases include data on clinical and test-based disease diagnoses. However, the circumstances and constraints under which the diagnoses are established vary considerably among databases. Results based on different databases are therefore difficult to compare, and compiling data for meta-analysis is almost impossible. Nevertheless, diagnostic information collected either routinely or in research projects is valuable for cross-comparisons between databases, but there is a need for improved transparency and documentation of the data and of the performance characteristics of the tests used to establish diagnoses. The objective of this paper is to outline the circumstances and constraints for recording disease diagnoses in different types of databases, and to discuss these in the context of using disease diagnoses for additional purposes, including research. Finally, some limitations and recommendations for the use of data and for the recording of diagnostic information in the future are given. It is concluded that many research questions have such specific objectives that investigators need to collect their own data. However, there are also examples where a minimal amount of extra information or continued validation could improve secondary data sufficiently for other purposes. Regardless, researchers should always carefully evaluate the opportunities and constraints when they decide to use secondary data. If the data in existing databases are not sufficiently valid, researchers may have to collect their own data, but improved recording of diagnostic data may improve the usefulness of secondary diagnostic data in the future.

16.
Empirical Bayes models have been shown to be powerful tools for identifying differentially expressed genes from gene expression microarray data. An example is the WAME model, in which a global covariance matrix accounts for array-to-array correlations as well as differing variances between arrays. However, the existing method for estimating the covariance matrix is very computationally intensive, and the estimator is biased when the data contain many regulated genes. In this paper, two new methods for estimating the covariance matrix are proposed. The first is a direct application of the EM algorithm for fitting the multivariate t-distribution of the WAME model. In the second, a prior distribution for the log fold-change is added to the WAME model, and a discrete approximation is used for this prior. Both methods are evaluated using simulated and real data. The first method performs comparably to the existing method in terms of bias and variability but is superior in terms of computer time. For large data sets (>15 arrays), the second method also shows superior run time. Moreover, for simulated data with regulated genes, the second method greatly reduces the bias. With the proposed methods it is possible to apply the WAME model to large data sets with reasonable run times. The second method shows a small bias for simulated data, but appears to have a larger bias for real data with many regulated genes.
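A sketch of the generic EM algorithm for a multivariate t-distribution with fixed degrees of freedom — the ingredient that the first proposed method applies to the WAME model; this is the textbook update, not the WAME-specific fit:

```python
import numpy as np

def em_multivariate_t(X, nu=4.0, n_iter=50):
    """EM estimation of the location vector and scatter matrix of a
    multivariate t-distribution with fixed degrees of freedom nu.
    Generic textbook EM, not the WAME-specific model."""
    n, p = X.shape
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        diff = X - mu
        # E-step: Mahalanobis distances and latent precision weights
        delta = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
        u = (p + nu) / (nu + delta)
        # M-step: weighted updates of location and scatter
        mu = (u[:, None] * X).sum(axis=0) / u.sum()
        diff = X - mu
        sigma = (u[:, None] * diff).T @ diff / n
    return mu, sigma

rng = np.random.default_rng(0)
X = rng.standard_t(df=4, size=(300, 3)) @ np.diag([1.0, 2.0, 0.5])
mu_hat, sigma_hat = em_multivariate_t(X)
print(np.round(sigma_hat, 2))
```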

17.
18.
Data sensitivity can pose a formidable barrier to data sharing. Knowledge of species' current distributions gained through data sharing is critical for the creation of watch lists, for an early warning/rapid response system, and for modeling the spread of invasive species. We have created an on-line system to synthesize disparate datasets of non-native species locations that includes a mechanism to account for data sensitivity. Data contributors are able to mark their data as sensitive. Such data are then 'fuzzed' to quarter-quadrangle grid cells in mapping applications and downloaded files, while the actual locations remain available for analyses. We propose that this system overcomes the hurdles to data sharing posed by sensitive data.
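A minimal sketch of the fuzzing step, snapping a sensitive point to the centre of its grid cell; the 3.75-arc-minute cell size approximating a quarter quadrangle is an assumption:

```python
import math

def fuzz_location(lon, lat, cell_deg=0.0625):
    """Snap a point to the centre of its grid cell so that only the cell,
    not the exact location, is exposed in maps and downloads. The
    3.75-arc-minute cell size is an assumed quarter-quadrangle width."""
    snap = lambda v: (math.floor(v / cell_deg) + 0.5) * cell_deg
    return snap(lon), snap(lat)

exact = (-105.27183, 40.01499)       # hypothetical sensitive record
print(exact, "->", fuzz_location(*exact))
```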

19.
The number of methods for pre-processing and analysis of gene expression data continues to increase, often making it difficult to select the most appropriate approach. We present a simple procedure for the comparative evaluation of a variety of methods for microarray data pre-processing and analysis. Our approach is based on the use of real microarray data into which controlled fold changes are introduced for 20% of the data, providing a metric for comparison with the unmodified data. The modifications can easily be applied to raw data measured with any technological platform, and they retain all the complex structures and statistical characteristics of real-world data. The power of the method is illustrated by its application to the quantitative comparison of different methods of normalization and analysis of microarray data. Our results demonstrate that the method of controlled modifications of real experimental data provides a simple tool for assessing the performance of data pre-processing and analysis methods.
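A minimal sketch of the spike-in idea, assuming simulated expression values, a fold-change set of {0.5, 2, 4}, and modification of half the arrays; none of these specifics are from the paper beyond the 20% figure:

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.lognormal(mean=5, sigma=1, size=(1000, 8))  # stand-in for real data

n_genes = expr.shape[0]
modified = rng.choice(n_genes, size=int(0.2 * n_genes), replace=False)
fold = rng.choice([0.5, 2.0, 4.0], size=modified.size)  # assumed fold set

spiked = expr.copy()
spiked[modified, 4:] *= fold[:, None]   # alter the last four arrays only

truth = np.zeros(n_genes, dtype=bool)
truth[modified] = True   # known ground truth for benchmarking analyses
print(int(truth.sum()), "genes carry controlled fold changes")
```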

20.
In the era of Big Data, Cloud Computing, and the Internet of Things, the amount of data collected in databases far exceeds our ability to reduce and analyze it without automated analysis techniques such as data mining. As the importance of data mining has grown, a critical issue has been how to scale data mining techniques to larger and more complex databases; this is particularly imperative for computationally intensive tasks such as identifying natural clusters of instances. In this paper, we propose an optimized combinatorial clustering algorithm that is robust to the noisy performance inherent in random sampling of large data sets. The algorithm outperforms conventional approaches on numerical and qualitative criteria, including the mean and standard deviation of accuracy and computation speed.
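The paper's combinatorial clustering algorithm is not specified in the abstract; as a hedged illustration of the evaluation idea — clustering repeated random subsamples and summarizing accuracy and speed by their mean and standard deviation — here is plain k-means on subsamples:

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=20000, centers=5, random_state=0)
rng = np.random.default_rng(0)

scores, times = [], []
for _ in range(10):                       # repeated random subsamples
    idx = rng.choice(len(X), size=2000, replace=False)
    t0 = time.perf_counter()
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X[idx])
    times.append(time.perf_counter() - t0)
    scores.append(silhouette_score(X[idx], labels))  # quality proxy

print(f"silhouette mean={np.mean(scores):.3f} sd={np.std(scores):.3f}")
print(f"runtime    mean={np.mean(times):.3f}s sd={np.std(times):.3f}s")
```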
