Similar Documents
20 similar documents found.
1.
Feature selection addresses the development of accurate classification models in data mining. Aggregating data from a distributed environment for feature selection raises the problem of accessing the relevant inputs of individual data records, and preserving the privacy of individual data is often a critical issue in distributed data mining. This paper proposes privacy preservation of individual data for both feature and sub-feature selection based on data mining techniques and fuzzy probabilities. For privacy, each party maintains its own data, following the data miner's instructions, with fuzzy probabilities used as alias values. Techniques are developed for the data miner's own database in the distributed network using a fuzzy system, and the evaluation of sub-feature values is included in the data mining task. Feature selection is carried out with an existing data mining criterion, the gain ratio, using fuzzy optimization: the gain ratio over the relevant inputs is estimated within the expected upper and lower bounds of the fuzzy data set. The paper focuses mainly on sub-feature selection with a privacy algorithm using fuzzy random variables among different parties in a distributed environment; sub-features are uniquely identified for better class prediction. The algorithm selects sub-features using fuzzy probabilities and fuzzy frequency data from the data miner's database. Experimental results show the performance of the approach on a real-world data set.
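For reference, the gain-ratio criterion the abstract relies on is standard; the following is a minimal sketch for crisp data only, without the paper's fuzzy bounds or privacy layer (all names are illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    """Information gain of `feature`, normalized by its split information."""
    total = entropy(labels)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    split_info = -np.sum(weights * np.log2(weights))
    if split_info == 0.0:          # feature takes a single value
        return 0.0
    return (total - cond) / split_info

# Rank features by gain ratio and keep the best-scoring ones.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
scores = [gain_ratio(X[:, j], y) for j in range(X.shape[1])]   # -> [1.0, 0.0]
```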

2.
The validity of material flow analyses (MFAs) depends on the available information base, that is, the quality and quantity of available data. MFA data are cross‐disciplinary, can have varying formats and qualities, and originate from heterogeneous sources, such as official statistics, scientific models, or expert estimations. Statistical methods for data evaluation are most often inadequate, because MFA data are typically isolated values rather than extensive data sets. In consideration of the properties of MFA data, a data characterization framework for MFA is presented. It consists of an MFA data terminology, a data characterization matrix, and a procedure for database analysis. The framework facilitates systematic data characterization by cell‐level tagging of data with data attributes. Data attributes represent data characteristics and metainformation regarding statistical properties, meaning, origination, and application of the data. The data characterization framework is illustrated in a case study of a national phosphorus budget. This work furthers understanding of the information basis of material flow systems, promotes the transparent documentation and precise communication of MFA input data, and can be the foundation for better data interpretation and comprehensive data quality evaluation.
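As an illustration of cell-level tagging, here is a minimal sketch of how an MFA value might carry data attributes as metainformation; the class and attribute names are hypothetical, not the framework's own:

```python
from dataclasses import dataclass, field

@dataclass
class MFAValue:
    """One MFA data cell: a value plus attribute tags (metainformation)."""
    value: float
    unit: str
    attributes: dict = field(default_factory=dict)

flow = MFAValue(
    value=12.4,
    unit="kt P/yr",
    attributes={
        "origin": "official statistics",   # where the number comes from
        "type": "isolated value",          # not part of an extensive data set
        "uncertainty": "expert estimate",  # how its reliability was judged
    },
)

# A database analysis can then filter cells by attribute, e.g. all expert estimates:
def cells_by_attribute(cells, key, value):
    return [c for c in cells if c.attributes.get(key) == value]
```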

3.
MOTIVATION: The methods for analyzing overlap data are distinct from those for analyzing probe data, making integration of the two forms awkward. Conversion of overlap data to probe-like data elements would facilitate comparison and uniform integration of overlap data and probe data using software developed for analysis of STS data. RESULTS: We show that overlap data can be effectively converted to probe-like data elements by extracting maximal sets of mutually overlapping clones. We call these sets virtual probes, since each set determines a site in the genome corresponding to the region which is common among the clones of the set. Finding the virtual probes is equivalent to finding the maximal cliques of a graph. We modify a known maximal-clique algorithm such that it finds all virtual probes in a large dataset within minutes. We illustrate the algorithm by converting fingerprint and Alu-PCR overlap data to virtual probes. The virtual probes are then analyzed using double-linkage intersection graphs and structure graphs to show that methods designed for STS data are also applicable to overlap data represented as virtual probes. Next we show that virtual probes can produce a uniform integration of different kinds of mapping data, in particular STS probe data and fingerprint and Alu-PCR overlap data. The integrated virtual probes produce longer double-linkage contigs than STS probes alone, and in conjunction with structure graphs they facilitate the identification and elimination of anomalies. Thus, the virtual-probe technique provides: (i) a new way to examine overlap data; (ii) a basis on which to compare overlap data and probe data using the same systems and standards; and (iii) a unique and useful way to uniformly integrate overlap data with probe data.
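The clique-based construction is easy to reproduce with a stock maximal-clique enumerator; the paper modifies its own algorithm for speed, whereas this sketch simply uses networkx:

```python
import networkx as nx

# Overlap data as a graph: nodes are clones, edges join overlapping clones.
G = nx.Graph()
G.add_edges_from([
    ("cloneA", "cloneB"), ("cloneB", "cloneC"), ("cloneA", "cloneC"),  # mutual overlap
    ("cloneC", "cloneD"),
])

# Each maximal clique is a "virtual probe": a maximal set of mutually
# overlapping clones, marking the genomic region they all share.
virtual_probes = list(nx.find_cliques(G))   # Bron-Kerbosch style enumeration
# e.g. [['cloneA', 'cloneB', 'cloneC'], ['cloneC', 'cloneD']] (order may vary)
```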

4.
GDPC: connecting researchers with multiple integrated data sources
The goal of this project is to simplify access to genomic diversity and phenotype data, thereby encouraging reuse of this data. The Genomic Diversity and Phenotype Connection (GDPC) accomplishes this by retrieving data from one or more data sources and by allowing researchers to analyze integrated data in a standard format. GDPC is written in Java and provides (1) data sources available as web services that transfer XML-formatted data via the SOAP protocol; (2) a Java API for programmatic access to data sources; and (3) a front-end application that allows users to manage data sources, retrieve data based on filters, sort/group data based on property values, and save/open the data as XML files. AVAILABILITY: The source code, compiled code, documentation and GDPC Browser are freely available at www.maizegenetics.net/gdpc/index.html. The current release of GDPC is version 1.0, with updated releases planned for the future. Comments are welcome.
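As a generic illustration of the web-service pattern described (XML transferred via SOAP), here is a minimal sketch using only the Python standard library; the endpoint URL and envelope contents are placeholders, not GDPC's actual service definitions:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder SOAP envelope; a real data source defines its own operations.
envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body><getData xmlns="urn:example"><filter>maize</filter></getData></soap:Body>
</soap:Envelope>"""

req = urllib.request.Request(
    "http://example.org/datasource",        # placeholder data-source URL
    data=envelope.encode(),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)
with urllib.request.urlopen(req) as resp:
    root = ET.fromstring(resp.read())       # parse the returned XML payload
records = [elem for elem in root.iter()]    # walk elements for downstream analysis
```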

5.
申文明, 孙中平, 张雪, 初东, 李飞, 吕灿宾. 生态学报 2013, 33(24): 7846-7852
To meet the need for rapid, real-time, and effective collection and entry of the large-sample, multi-source data of ecological field surveys, this work draws on mobile GIS technology, smart mobile terminals, 3G, and other modern information technologies to propose a mobile data collection scheme based on ArcGIS for Mobile. Key mobile GIS technologies for nationwide ecological field surveys were addressed, including the system operating mechanism, data access modes, and mobile database technology, and a field-survey mobile data collection system was designed and developed. The system was implemented in C# and Java with SQL Server 2008 as the server-side database, using the Microsoft Visual Studio 2008 integrated development environment, and was applied in the ground verification of land cover types and the field observation of ecosystem parameters for the national project on remote-sensing survey and assessment of ten-year ecological environment change. Practical use shows that the system achieves digital collection, intelligent validation, real-time upload, and effective management of field survey data; it simplifies reporting procedures, standardizes reported content, improves efficiency, and provides informatization support for ecological survey data collection.
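As a hedged illustration of the validate-then-upload step such a system performs, here is a minimal sketch; the field names and endpoint are hypothetical, and the real system is built on ArcGIS for Mobile and SQL Server rather than this stand-in:

```python
import json
import urllib.request

REQUIRED = {"plot_id", "land_cover_class", "lat", "lon"}

def validate(record: dict) -> list:
    """Stand-in for 'intelligent validation': required fields present,
    coordinates within range. Returns a list of error messages."""
    errors = [f"missing field: {k}" for k in REQUIRED if k not in record]
    if not errors and not (-90 <= record["lat"] <= 90 and -180 <= record["lon"] <= 180):
        errors.append("coordinates out of range")
    return errors

def upload(record: dict, url: str) -> int:
    """Real-time upload of one validated survey record (placeholder endpoint)."""
    req = urllib.request.Request(
        url, data=json.dumps(record).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```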

6.
The objective of this paper is to give an overview of existing databases in Denmark and describe the most important of these in relation to the establishment of the Danish Veterinary and Food Administration's veterinary data warehouse. The purpose of the data warehouse and possible uses of the data are described. Finally, sharing of data and validity of data are discussed. There are databases in other countries describing animal husbandry and veterinary antimicrobial consumption, but Denmark will be the first country to relate all data concerning the husbandry, health, and welfare of Danish production animals to each other in a data warehouse. Moreover, creating access to these data for researchers and authorities will hopefully result in easier and more substantial risk-based control, risk management, and risk communication by the authorities, and in access to data for researchers for epidemiological studies in animal health and welfare.

7.
Biodiversity data generated in the context of research projects often lack a strategy for long-term preservation and availability, and are therefore at risk of becoming outdated and finally lost. The reBiND project aims to develop an efficient and well-documented workflow for rescuing such data sets. The workflow consists of phases for data transformation into contemporary standards, data validation, storage in a native XML database, and data publishing in international biodiversity networks. It has been developed and tested using the example of collection and observational data but is flexible enough to be transferred to other data types and domains.
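The validation phase of such a workflow can be illustrated with standard XML Schema validation; this is a sketch only, with hypothetical file names, not the actual reBiND pipeline:

```python
from lxml import etree

def validate_against_schema(xml_path: str, xsd_path: str) -> list:
    """Validate a transformed data set against an XML Schema;
    return a list of error messages (empty if the document is valid)."""
    schema = etree.XMLSchema(etree.parse(xsd_path))
    doc = etree.parse(xml_path)
    if schema.validate(doc):
        return []
    return [str(e) for e in schema.error_log]

# e.g. errors = validate_against_schema("dataset_rescued.xml", "standard_schema.xsd")
```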

8.
There is a concerted global effort to digitize biodiversity occurrence data from herbarium and museum collections that together offer an unparalleled archive of life on Earth over the past few centuries. The Global Biodiversity Information Facility provides the largest single gateway to these data. Since 2004 it has provided a single point of access to specimen data from databases of biological surveys and collections. Biologists now have rapid access to more than 120 million observations, for use in many biological analyses. We investigate the quality and coverage of data digitally available, from the perspective of a biologist seeking distribution data for spatial analysis on a global scale. We present an example of automatic verification of geographic data, using distributions from the International Legume Database and Information Service to test empirically issues of geographic coverage and accuracy. There are over half a million records covering 31% of all legume species, and 84% of these records pass geographic validation. These data are not yet a global biodiversity resource for all species or all countries. A user will encounter many biases and gaps in these data, which should be understood before the data are used or analyzed. The data are notably deficient in many of the world's biodiversity hotspots. The deficiencies in data coverage can be resolved by an increased application of resources to digitize and publish data throughout these most diverse regions. But in the push to provide ever more data online, we should not forget that consistent data quality is of paramount importance if the data are to be useful in capturing a meaningful picture of life on Earth.
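The automatic geographic verification described can be illustrated by a point-in-polygon cross-check of each occurrence record against its stated country; this is a sketch with hypothetical inputs, not the authors' exact procedure:

```python
from shapely.geometry import Point, shape

def passes_geo_validation(record: dict, country_polygons: dict) -> bool:
    """Cross-check: do the record's coordinates fall inside its stated country?

    `country_polygons` maps country codes to GeoJSON-like geometries
    (hypothetical input; in practice from a boundaries data set).
    """
    geom = country_polygons.get(record["country_code"])
    if geom is None:
        return False                                   # unknown country: fail
    return shape(geom).contains(Point(record["lon"], record["lat"]))

# e.g. passes_geo_validation({"country_code": "BR", "lon": -47.9, "lat": -15.8}, polygons)
```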

9.
An overview of three-dimensional geometric morphometrics and its applications in entomology
白明, 杨星科. 昆虫学报 2014, 57(9): 1105-1111
Two-dimensional (2D) data have long been the dominant data type in geometric morphometrics. They played a foundational role in the development of the field, solved many major scientific problems, and demonstrated the strong computational and problem-solving power of geometric morphometrics. However, some special scientific questions and morphological structures cannot be fully addressed with 2D data and urgently require large-scale, large-extent three-dimensional (3D) data, creating demand for a 3D extension of geometric morphometrics. More importantly, as the cost of 3D data acquisition keeps falling, large volumes of 3D data have become available, and 3D geometric morphometrics has emerged in response. This paper outlines the principles and applications of 3D geometric morphometrics, with emphasis on its similarities to and differences from the 2D approach. It reviews the two developmental stages of the 3D approach (morphological simulation and semi-quantitative comparison of small samples, followed by quantitative comparison of large samples), assesses the application of four-dimensional data and finite element methods, and points out the method's potential in entomology. Finally, prospects are discussed regarding larger sample sizes, improved hardware, higher data resolution, new algorithms, the presentation of analytical results, and 3D printing.
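The superimposition step underlying both 2D and 3D geometric morphometrics is Procrustes analysis; a minimal 3D sketch using SciPy, for illustration only:

```python
import numpy as np
from scipy.spatial import procrustes

# Two configurations of homologous 3D landmarks, shape (n_landmarks, 3).
rng = np.random.default_rng(0)
specimen_a = rng.normal(size=(12, 3))
# Same shape, but rotated, scaled, and translated:
R = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])   # rotation about the z axis
specimen_b = specimen_a @ R * 2.0 + 0.5

# Procrustes superimposition removes position, scale, and rotation,
# leaving only shape differences (the disparity).
std_a, std_b, disparity = procrustes(specimen_a, specimen_b)
print(f"shape disparity: {disparity:.6f}")          # ~0 for identical shapes
```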

10.
Three-dimensional (3D) reconstructions from tilt series in an electron microscope generally show anisotropic resolution because the tilt angle is instrumentally limited. As a consequence, the information in the z direction is blurred, making it difficult to detect the boundary of the reconstructed structures. In contrast, high-resolution topography data from microscopic surface techniques provide exactly complementary information. The combination of topographic surface and volume data leads to a better understanding of the 3D structure. The new correlation procedure presented here determines both the height scaling of the topographic surface and the relative position of surface and volume data, thus allowing the information to be combined. Experimental data for crystalline T4 bacteriophage polyheads were used to test the new method. Three-dimensional volume data were reconstructed from a negatively stained tilt series. Topographic data for both surfaces were obtained by surface relief reconstruction of electron micrographs of freeze-dried and unidirectionally metal-shadowed polyheads. The combined visualization of the volume data with the scaled and aligned surface data shows that the correlation technique yields meaningful results. The reported correlation method may be applied to surface data obtained by any microscopic technique yielding topographic data.
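As a rough illustration of the idea, and not the authors' procedure: lateral alignment by cross-correlation, followed by a least-squares fit of the unknown height scale (all inputs hypothetical):

```python
import numpy as np
from scipy.signal import correlate2d

def align_surface_to_volume(topo: np.ndarray, vol_top: np.ndarray):
    """Estimate lateral offset and height scaling of a topographic map.

    `vol_top` is a top-boundary height map extracted from the 3D
    reconstruction (e.g. first voxel above a density threshold per column).
    Sketch only: cross-correlation for the offset, then a least-squares
    fit for the height scale of the topography.
    """
    a = topo - topo.mean()
    b = vol_top - vol_top.mean()
    cc = correlate2d(b, a, mode="same")
    dy, dx = np.unravel_index(np.argmax(cc), cc.shape)
    offset = (dy - b.shape[0] // 2, dx - b.shape[1] // 2)
    # Height scale s minimizing ||s * topo_shifted - vol_top||^2:
    shifted = np.roll(np.roll(topo, offset[0], axis=0), offset[1], axis=1)
    s = float((shifted * vol_top).sum() / (shifted ** 2).sum())
    return offset, s
```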

11.
Realizing personalized medicine requires integrating diverse data types with bioinformatics. The most vital data at present are individual genomic information generated by advanced next-generation sequencing (NGS) technologies. These technologies continue to advance in terms of both decreasing cost and increasing sequencing speed, with a concomitant increase in the amount and complexity of the data. The prodigious data, together with the requisite computational pipelines for data analysis and interpretation, are stressors to IT infrastructure and the scientists conducting the work alike. Bioinformatics is increasingly becoming the rate-limiting step, with numerous challenges to be overcome in translating NGS data for personalized medicine. We review some key bioinformatics tasks, issues, and challenges in the contexts of IT requirements, data quality, analysis tools and pipelines, and validation of biomarkers.

12.
Besides the problem of searching for effective methods for data analysis, there are additional problems in handling data of high uncertainty. Uncertainty problems often arise in the analysis of ecological data, e.g. in cluster analysis. Conventional clustering methods based on Boolean logic ignore the continuous nature of ecological variables and the uncertainty of ecological data, which can result in misclassification or misinterpretation of the data structure. Clusters with fuzzy boundaries better reflect the continuous character of ecological features. The problem, however, is that the common clustering methods (like the fuzzy c-means method) are designed only for treating crisp data; that is, they provide a fuzzy partition only for crisp data (e.g. exact measurement data). This paper presents the extension and implementation of the method of fuzzy clustering of fuzzy data proposed by Yang and Liu [Yang, M.-S. and Liu, H.-H., 1999. Fuzzy clustering procedures for conical fuzzy vector data. Fuzzy Sets and Systems, 106, 189-200.]. Imprecise data can be defined as multidimensional fuzzy sets with boundaries that are not sharply formed (in the form of so-called conical fuzzy vectors). They can then be used for fuzzy clustering together with crisp data. This is particularly useful when no information is available about the variances that describe the accuracy of the data and probabilistic approaches are impossible. The method proposed by Yang and Liu has been extended and implemented in the Fuzzy Clustering System EcoFucs developed at the University of Kiel. As an example, the paper presents a fuzzy cluster analysis of chemicals according to their ecotoxicological properties. The uncertainty and imprecision of ecotoxicological data are very high because of the use of various data sources and investigation tests and the difficulty of comparing these data. The implemented method can be very helpful in searching for an adequate partition of ecological data into clusters with similar properties.
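For orientation, the baseline the paper extends is standard fuzzy c-means on crisp data; a minimal sketch follows (the extension to conical fuzzy vectors is not shown):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means for crisp data (Bezdek-style); the paper's method
    additionally clusters imprecise data given as conical fuzzy vectors."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))           # soft memberships, rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # fuzzy-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)         # standard membership update
    return centers, U

# Two well-separated blobs; U gives each point a graded membership in both clusters.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (30, 2)),
               np.random.default_rng(2).normal(3, 0.3, (30, 2))])
centers, U = fuzzy_c_means(X, c=2)
```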

13.
Hospital pharmaceutical operations are complex, data integration and analysis are incomplete, and a data warehouse system with comprehensive, integrated information is lacking. This work studies hospital drug data and applies the Business Dimensional Lifecycle method to the design, development, and deployment of a data warehouse project. The problems solved include: creating the data warehouse bus architecture, establishing the subject model, using dimensional modeling for logical modeling, the physical design of data storage, and data staging and development. The overall logical structure model is clearly designed and the construction approach is novel, yielding a sound analytical model for a hospital drug data warehouse.
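As an illustration of dimensional modeling in a drug data warehouse, here is a toy star schema with one fact table and two dimensions; the tables and columns are hypothetical, not the paper's model:

```python
import pandas as pd

# Dimension tables describe drugs and dates; the fact table records dispensings.
dim_drug = pd.DataFrame({"drug_key": [1, 2], "drug_name": ["aspirin", "amoxicillin"]})
dim_date = pd.DataFrame({"date_key": [20130101], "year": [2013], "month": [1]})
fact_dispense = pd.DataFrame({
    "drug_key": [1, 2, 1], "date_key": [20130101] * 3,
    "quantity": [10, 5, 7], "amount_cny": [12.0, 30.5, 8.4],
})

# Typical warehouse query: monthly dispensed amount per drug via star-schema joins.
report = (fact_dispense
          .merge(dim_drug, on="drug_key")
          .merge(dim_date, on="date_key")
          .groupby(["year", "month", "drug_name"])["amount_cny"].sum())
```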

14.
We present a gridded 8 km-resolution data product of the estimated composition of tree taxa at the time of Euro-American settlement of the northeastern United States and the statistical methodology used to produce the product from trees recorded by land surveyors. Composition is defined as the proportion of stems larger than approximately 20 cm diameter at breast height for 22 tree taxa, generally at the genus level. The data come from settlement-era public survey records that are transcribed and then aggregated spatially, giving count data. The domain is divided into two regions, eastern (Maine to Ohio) and midwestern (Indiana to Minnesota). Public Land Survey point data in the midwestern region (ca. 0.8-km resolution) are aggregated to a regular 8 km grid, while data in the eastern region, from Town Proprietor Surveys, are aggregated at the township level in irregularly-shaped local administrative units. The product is based on a Bayesian statistical model fit to the count data that estimates composition on the 8 km grid across the entire domain. The statistical model is designed to handle data from both the regular grid and the irregularly-shaped townships and allows us to estimate composition at locations with no data and to smooth over noise caused by limited counts in locations with data. Critically, the model also allows us to quantify uncertainty in our composition estimates, making the product suitable for applications employing data assimilation. We expect this data product to be useful for understanding the state of vegetation in the northeastern United States prior to large-scale Euro-American settlement. In addition to specific regional questions, the data product can also serve as a baseline against which to investigate how forests and ecosystems change after intensive settlement. The data product is being made available at the NIS data portal as version 1.0.
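A much-simplified stand-in for the paper's spatial model (an independent Dirichlet-multinomial per grid cell, with no spatial smoothing) still shows how composition estimates with uncertainty arise from count data; all names here are illustrative:

```python
import numpy as np

def composition_posterior(counts, alpha=0.5, n_draws=1000, seed=0):
    """Per-cell taxon composition from settlement-era tree counts.

    Dirichlet(alpha) prior + multinomial counts -> posterior draws of the
    composition vector; returns the posterior mean and a 95% interval.
    """
    rng = np.random.default_rng(seed)
    draws = rng.dirichlet(np.asarray(counts) + alpha, size=n_draws)
    return draws.mean(axis=0), np.percentile(draws, [2.5, 97.5], axis=0)

# e.g. one cell with counts for four taxa (hypothetical: oak, beech, maple, hemlock);
# a zero count still yields a nonzero estimate with wide uncertainty.
mean, ci = composition_posterior([40, 25, 5, 0])
```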

15.
Incorrect statistical methods are often used for the analysis of ordinal response data. Such data are frequently summarized into mean scores for comparisons, a fallacious practice because ordinal data are inherently not equidistant. The ubiquitous Pearson chi-square test is invalid because it ignores the ranking of ordinal data. Although some of the non-parametric statistical methods take into account the ordering of ordinal data, these methods do not accommodate statistical adjustment of confounding or assessment of effect modification, two overriding analytic goals in virtually all etiologic inference in biology and medicine. The cumulative logit model is eminently suitable for the analysis of ordinal response data. This multivariate method not only considers the ranked order inherent in ordinal response data, but it also allows adjustment of confounding and assessment of effect modification based on modest sample size. A non-technical account of the cumulative logit model is given and its applications are illustrated by two research examples. The SAS programs for the data analysis of the research examples are available from the author.
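A cumulative logit (proportional-odds) fit is available off the shelf; here is a sketch in Python with statsmodels on simulated data (the abstract's own examples use SAS):

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated ordinal response (severity 0 < 1 < 2) with an exposure and a confounder.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"exposure": rng.integers(0, 2, n), "age": rng.normal(50, 10, n)})
latent = 0.8 * df["exposure"] + 0.03 * df["age"] + rng.logistic(size=n)
df["severity"] = pd.cut(latent, [-np.inf, 1.5, 3.0, np.inf], labels=False)

# Cumulative logit model: respects the ordering of the response and adjusts
# for the confounder, as the abstract recommends.
fit = OrderedModel(df["severity"], df[["exposure", "age"]],
                   distr="logit").fit(method="bfgs", disp=False)
print(fit.summary())   # exposure coefficient = adjusted log cumulative odds ratio
```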

16.
The current study investigates existing infrastructure, its technical solutions, and implemented standards for data repositories related to integrative biodiversity research. The storage and reuse of complex biodiversity data in central databases are becoming increasingly important, particularly in attempts to cope with the impacts of environmental change on biodiversity and ecosystems. From the data side, the main challenge for biodiversity repositories is to deal with the highly interdisciplinary and heterogeneous character of standardized and unstandardized data and metadata covering information from genes to ecosystems. Furthermore, technical improvements in data acquisition techniques produce ever larger data volumes, which represent a challenge for database structure and proper data exchange. The current study is based on comprehensive in-depth interviews and an online survey addressing IT specialists involved in database development and operation. The results show that metadata are already well established, but that non-metadata remain largely unstandardized across various scientific communities. For example, only a third of all repositories in our investigation use internationally unified semantic standard checklists for taxonomy. The study also showed that database developers are mostly occupied with implementing state-of-the-art technology and solving operational problems, leaving no time to implement users' requirements. One of the main reasons for this unsatisfactory situation is the undersized and unreliable funding of most repositories, as reflected by the marginally small number of permanent IT staff members. We conclude that a sustainable data management system that fosters the future use and reuse of these valuable data resources requires the development of fewer, but more permanent, data repositories using commonly accepted standards for their long-term data. This can only be accomplished through the consolidation of hitherto widely scattered small and non-permanent repositories.

17.
In this paper, we propose a successive learning method for hetero-associative memories, such as Bidirectional Associative Memories and Multidirectional Associative Memories, using chaotic neural networks. It can distinguish unknown data from the stored known data and can learn the unknown data successively. The proposed model makes use of the difference in the response to the input data in order to distinguish unknown data from the stored known data; when input data is regarded as unknown, it is memorized. Furthermore, the proposed model can estimate and learn correct data from noisy or incomplete unknown data by considering the temporal summation of the continuous data input. In addition, similarities to the physiological findings of Freeman in the rabbit olfactory bulb are observed in the behavior of the proposed model. A series of computer simulations shows the effectiveness of the proposed model.
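For contrast, here is a classical (non-chaotic) bidirectional associative memory with a simple response-based novelty check; this is only a baseline sketch of the distinguish-by-response idea, not the proposed chaotic model:

```python
import numpy as np

def train_bam(pairs):
    """Hebbian weight matrix from bipolar (+1/-1) pattern pairs (x, y)."""
    return sum(np.outer(x, y) for x, y in pairs)

def recall(W, x, n_iter=5):
    """Bidirectional recall: iterate x -> y -> x until (roughly) stable."""
    for _ in range(n_iter):
        y = np.sign(W.T @ x); y[y == 0] = 1
        x = np.sign(W @ y);   x[x == 0] = 1
    return x, y

def is_unknown(W, x, threshold=0.9):
    """Novelty check via the response: a stored pattern settles back onto
    itself; an unknown input drifts away (low overlap after recall)."""
    x_rec, _ = recall(W, x.copy())
    return (x_rec @ x) / len(x) < threshold

x1, y1 = np.array([1, -1, 1, -1]), np.array([1, 1, -1])
W = train_bam([(x1, y1)])
print(is_unknown(W, x1))                       # False: known pattern
print(is_unknown(W, np.array([1, 1, 1, 1])))   # likely True: candidate to memorize
```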

18.
Ryu, Minho; Lee, Geonseok; Lee, Kichun. Cluster Computing 2021, 24(3): 1975-1987

In the new era of big data, numerous information and technology systems can store huge amounts of streaming data in real time, for example, in server-access logs on web application servers. The importance of anomaly detection in voluminous quantities of streaming data from such systems is rapidly increasing. One of the biggest challenges in the detection task is to carry out real-time contextual anomaly detection in streaming data with varying patterns that are visually detectable but unsuitable for a parametric model. Most anomaly detection algorithms have weaknesses in dealing with streaming time-series data containing such patterns. In this paper, we propose a novel method for online contextual anomaly detection in streaming time-series data using generalized extreme studentized deviate (GESD) tests. The GESD test is relatively accurate and efficient because it performs statistical hypothesis testing, but it is unable to handle streaming time-series data. Thus, focusing on streaming time-series data, we propose an online version of the test capable of detecting outliers under varying patterns. We perform extensive experiments with simulated data, synthetic data, and real online traffic data from Yahoo Webscope, showing a clear advantage of the proposed method, particularly for analyzing streaming data with varying patterns.
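The batch GESD test the method builds on is easy to state; here is a sketch of Rosner's test on a fixed window (the paper's online streaming extension is not reproduced):

```python
import numpy as np
from scipy import stats

def gesd_outliers(x, max_outliers, alpha=0.05):
    """Generalized ESD test (Rosner, 1983) on a window of data; returns the
    indices of detected outliers, assuming approximate normality."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x))
    flagged, n = [], len(x)
    for i in range(1, max_outliers + 1):
        mean, sd = x.mean(), x.std(ddof=1)
        j = np.argmax(np.abs(x - mean))
        R = abs(x[j] - mean) / sd                       # test statistic R_i
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = ((n - i) * t) / np.sqrt((n - i - 1 + t**2) * (n - i + 1))  # lambda_i
        flagged.append((idx[j], R > lam))
        x, idx = np.delete(x, j), np.delete(idx, j)     # remove and repeat
    # Number of outliers = largest i with R_i > lambda_i; flag points up to it.
    last = max((k for k, (_, ok) in enumerate(flagged) if ok), default=-1)
    return [flagged[k][0] for k in range(last + 1)]

# e.g. anomalies = gesd_outliers(window, max_outliers=5) on each streaming window
```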


19.
Studying brain function with functional magnetic resonance imaging (fMRI) is currently a research hotspot. Here, fMRI data were acquired using logical calculation as the cognitive task. The amplitude of low frequency fluctuation (ALFF) algorithm was applied to the fMRI data of 14 normal subjects in both the calculation task and the task-free resting state, and the two conditions were compared. By observing where the amplitude increased or decreased, low-frequency fluctuations were found in the brain regions activated by the calculation task, and the load of logical cognition also altered the default mode network; a preliminary exploration of these changes is presented.
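The ALFF computation itself is simple: the mean FFT amplitude of a voxel's BOLD time series within the low-frequency band (conventionally 0.01-0.08 Hz); a sketch that omits preprocessing such as detrending and smoothing:

```python
import numpy as np

def alff(bold_ts, tr, f_low=0.01, f_high=0.08):
    """Amplitude of low-frequency fluctuation for one voxel:
    mean FFT amplitude in the [f_low, f_high] Hz band."""
    ts = bold_ts - bold_ts.mean()
    amp = np.abs(np.fft.rfft(ts)) / len(ts)          # single-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(ts), d=tr)           # frequencies given the TR
    band = (freqs >= f_low) & (freqs <= f_high)
    return amp[band].mean()

# e.g. one voxel, 200 volumes at TR = 2 s (simulated here):
rng = np.random.default_rng(0)
value = alff(rng.normal(size=200), tr=2.0)
```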

20.
The generation of proteomic data is becoming ever more high throughput. Both the technologies and experimental designs used to generate and analyze data are becoming increasingly complex. The need for methods by which such data can be accurately described, stored and exchanged between experimenters and data repositories has been recognized. Work by the Proteome Standards Initiative of the Human Proteome Organization has laid the foundation for the development of standards by which experimental design can be described and data exchange facilitated. The Minimum Information About a Proteomic Experiment data model describes both the scope and purpose of a proteomics experiment and encompasses the development of more specific interchange formats such as the mzData model of mass spectrometry. The eXtensible Mark-up Language-MI data interchange format, which allows exchange of molecular interaction data, has already been published and major databases within this field are supplying data downloads in this format.
