Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Information Quality (IQ) is a critical factor for the success of many activities in the information age, including the development of data warehouses and implementation of data mining. The issue of IQ risk is recognized during the process of data mining; however, there is no formal methodological approach to dealing with such issues.

Consequently, it is essential to measure the risk of IQ in a data warehouse to ensure success in implementing data mining. This article presents a methodology to determine three IQ risk characteristics: accuracy, comprehensiveness, and non-membership. The methodology provides a set of quantitative models to examine how the quality risks of source information affect the quality of information outputs produced using the relational algebra operations Restriction, Projection, and Cubic product. It can be used to determine how quality risks associated with diverse data sources affect the derived data. The study also develops a data cube model and associated algebra to support IQ risk operations.


2.
MOTIVATION: Methods for analyzing cancer microarray data often face two distinct challenges: the models they infer need to perform well when classifying new tissue samples while at the same time providing insight into the patterns and gene interactions hidden in the data. State-of-the-art supervised data mining methods often cover only one of these aspects well, motivating the development of methods in which predictive models with solid classification performance can be easily communicated to the domain expert. RESULTS: Data visualization may provide an excellent approach to knowledge discovery and analysis of class-labeled data. We have previously developed an approach called VizRank that can score and rank point-based visualizations according to the degree of separation of data instances of different classes. Here we extend VizRank with techniques to uncover outliers, score features (genes) and perform classification, and demonstrate that the proposed approach is well suited for cancer microarray analysis. Using VizRank and radviz visualization on a set of previously published cancer microarray data sets, we were able to find simple, interpretable data projections that include only a small subset of genes yet clearly differentiate among different cancer types. We also report that our approach to classification through visualization achieves performance comparable to state-of-the-art supervised data mining techniques. AVAILABILITY: VizRank and radviz are implemented as part of the Orange data mining suite (http://www.ailab.si/orange). SUPPLEMENTARY INFORMATION: Supplementary data are available from http://www.ailab.si/supp/bi-cancer.

3.
MOTIVATION: Chemical carcinogenicity is an important subject in health and environmental sciences, and a reliable method is needed to identify characteristic factors for carcinogenicity. The predictive toxicology challenge (PTC) 2000-2001 has provided the opportunity for various data mining methods to evaluate their performance. The cascade model, a data mining method developed by the author, can mine for local correlations in data sets with a large number of attributes. The current paper explores the effectiveness of the method on the problem of chemical carcinogenicity. RESULTS: Rodent carcinogenicity of 417 compounds examined by the National Toxicology Program (NTP) was used as the training set. The analysis by the cascade model could obtain, for example, the rule 'Highly flexible molecules are carcinogenic, if they have no hydrogen bond acceptors in halogenated alkanes and alkenes'. The resulting rules are applied to predict the activity of 185 compounds examined by the FDA. The ROC analysis performed by the PTC organizers has shown that the current method has excellent predictive power for the female rat data. AVAILABILITY: The binary program of DISCAS 2.1 and samples of input data sets on Windows PC are available at http://www.clab.kwansei.ac.jp/mining/discas/discas.html upon request from the author. SUPPLEMENTARY INFORMATION: A summary of prediction results and cross validations is accessible via http://www.clab.kwansei.ac.jp/~okada/BIJ/BIJsupple.htm. The rules used and the prediction results for each molecule are also provided.

4.
MedlineR: an open source library in R for Medline literature data mining   Total citations: 3 (self-citations: 0, citations by others: 3)
SUMMARY: We describe an open source library written in the R programming language for Medline literature data mining. This MedlineR library includes programs to query Medline through the NCBI PubMed database; to construct the co-occurrence matrix; and to visualize the network topology of query terms. The open source nature of this library allows users to extend it freely in the statistical programming language R. To demonstrate its utility, we have built an application to analyze term association using only 10 lines of code. We provide MedlineR as a library foundation for bioinformaticians and statisticians to build more sophisticated literature data mining applications. AVAILABILITY: The library is available from http://dbsr.duke.edu/pub/MedlineR.
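The co-occurrence matrix mentioned in this abstract can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce MedlineR; the function name `cooccurrence_matrix`, the placeholder terms and the toy documents are all assumptions made for illustration, and plain substring matching stands in for a real PubMed query.

```python
from itertools import combinations


def cooccurrence_matrix(terms, documents):
    """Count, for every pair of terms, the documents that mention both.

    Hypothetical helper: matching is a naive case-insensitive substring
    test, which is an assumption and not how a literature query works.
    """
    index = {term: i for i, term in enumerate(terms)}
    size = len(terms)
    matrix = [[0] * size for _ in range(size)]
    for doc in documents:
        text = doc.lower()
        present = [t for t in terms if t.lower() in text]
        # Diagonal entries count documents that mention the term at all.
        for t in present:
            matrix[index[t]][index[t]] += 1
        # Off-diagonal entries count documents that mention both terms.
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix


# Toy placeholder data, invented for this example only.
terms = ["alpha", "beta", "gamma"]
documents = [
    "alpha binds beta",
    "beta and gamma",
    "alpha beta gamma together",
]
matrix = cooccurrence_matrix(terms, documents)
for term, row in zip(terms, matrix):
    print(term, row)
```

The symmetric matrix can be read as a weighted graph over the query terms, which is the kind of structure a network-topology view would draw.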

5.
BNArray is a systemized tool developed in R. It facilitates the construction of gene regulatory networks from DNA microarray data using Bayesian networks. Significant sub-modules of regulatory networks with high confidence are reconstructed using our extended sub-network mining algorithm for directed graphs. BNArray can handle microarray datasets with missing data. To evaluate the statistical features of the generated Bayesian networks, re-sampling procedures are used to yield collections of candidate 1st-order network sets for mining dense coherent sub-networks. AVAILABILITY: The R package and the supplementary documentation are available at http://www.cls.zju.edu.cn/binfo/BNArray/.

6.
Mining gene expression databases for association rules   Total citations: 16 (self-citations: 0, citations by others: 16)

7.
MOTIVATION: In general, the most accurate gene/protein annotations are provided by curators. Although computational methods carry weaker evidence strength, their use is unavoidable for fast, a priori discovery of protein function annotations. This paper considers the problem of assigning Gene Ontology (GO) annotations to partially annotated or newly discovered proteins. RESULTS: We present a data mining technique that computes the probabilistic relationships between GO annotations of proteins on protein-protein interaction data, and assigns highly correlated GO terms of annotated proteins to non-annotated proteins in the target set. In comparison with other techniques, probabilistic suffix tree and correlation mining techniques produce the highest prediction accuracy, with 81% precision at 45% recall. AVAILABILITY: Code is available upon request. Results and materials used are available online at http://kirac.case.edu/PROTAN.

8.
Given the growing amount of biological data, data mining methods have become an integral part of bioinformatics research. Unfortunately, standard data mining tools are often not sufficiently equipped for handling raw data such as amino acid sequences. One popular and freely available framework that contains many well-known data mining algorithms is the Waikato Environment for Knowledge Analysis (Weka). In the BioWeka project, we introduce various input formats for bioinformatics data and bioinformatics methods such as alignments to Weka. This allows users to easily combine them with Weka's classification, clustering, validation and visualization facilities on a single platform, and therefore reduces the overhead of converting data between different data formats as well as the need to write custom evaluation procedures that can deal with many different programs. We encourage users to participate in this project by adding their own components and data formats to BioWeka. AVAILABILITY: The software, documentation and tutorial are available at http://www.bioweka.org.

9.
GeneMerge--post-genomic analysis, data mining, and hypothesis testing   Total citations: 6 (self-citations: 0, citations by others: 6)
SUMMARY: GeneMerge is a web-based and standalone program written in PERL that returns a range of functional and genomic data for a given set of study genes and provides statistical rank scores for over-representation of particular functions or categories in the data set. Functional or categorical data of all kinds can be analyzed with GeneMerge, facilitating regulatory and metabolic pathway analysis, tests of population genetic hypotheses, cross-experiment comparisons, and tests of chromosomal clustering, among others. GeneMerge can perform analyses on a wide variety of genomic data quickly and easily and facilitates both data mining and hypothesis testing. AVAILABILITY: GeneMerge is available free of charge for academic use over the web and for download from: http://www.oeb.harvard.edu/hartl/lab/publications/GeneMerge.html.
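As a rough illustration of what an over-representation score can look like, the sketch below computes a hypergeometric upper-tail p-value with a Bonferroni adjustment. The abstract does not say which statistic GeneMerge uses, so the choice of test, the function names `enrichment_p_value` and `bonferroni`, and the numbers in the usage lines are all assumptions for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) under a hypergeometric model (an assumed test).

    population: genes in the background set
    annotated:  background genes carrying the category of interest
    sample:     genes in the study set
    hits:       study genes carrying the category
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


def bonferroni(p_value, tests):
    """Multiply by the number of categories tested, capped at 1."""
    return min(1.0, p_value * tests)


# Made-up numbers: 10 background genes, 3 in the category,
# a study set of 3 genes, all 3 of which are in the category.
p = enrichment_p_value(10, 3, 3, 3)
print(p, bonferroni(p, 5))
```

Exact integer binomials keep the sum free of rounding until the final division, which is convenient for small gene sets; larger sets would normally use a statistics library instead.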

10.
sMOL Explorer is a 2D ligand-based computational tool that provides three major functionalities through a Web interface: data management; information retrieval and extraction; and statistical analysis and data mining. With sMOL Explorer, users can create personal databases by adding each small molecule via a drawing interface or by uploading data files from internal and external projects into the sMOL database. The database can then be browsed and queried with textual and structural similarity searches. A molecule can also be submitted for searching against external public databases including PubChem, KEGG, DrugBank and eMolecules. Moreover, users can easily access a variety of data mining tools from Weka and R packages to perform analyses including (1) finding frequent substructures, (2) clustering molecular fingerprints, (3) identifying and removing irrelevant attributes from the data and (4) building a classification model of biological activity. AVAILABILITY: sMOL Explorer is an Open Source project and is freely available to all interested users at http://www.biotec.or.th/ISL/SMOL/.

11.
With the advent of high-throughput sequencing technology, sequences from many genomes are being deposited in public databases at a brisk rate. Open access to large amounts of expressed sequence tag (EST) data in the public databases has provided a powerful platform for simple sequence repeat (SSR) development in species where sequence information is not available. SSRs are markers of choice because of their high reproducibility, abundant polymorphism and high inter-specific transferability. Mining SSRs from ESTs requires several high-throughput computational tools that must be executed individually, which is computationally intensive and time consuming. To reduce the time lag and to streamline the cumbersome process of SSR mining from ESTs, we have developed a user-friendly, web-based EST-SSR pipeline, "EST-SSR-MARKER PIPELINE (ESMP)". This pipeline integrates EST pre-processing, clustering, assembly and the subsequent mining of SSRs from assembled EST sequences. The mining of SSRs from ESTs provides valuable information on the abundance of SSRs in ESTs and will facilitate the development of markers for genetic analysis and related applications such as marker-assisted breeding. AVAILABILITY: The database is available for free at http://bioinfo.aau.ac.in/ESMP.

12.
ToxoDB: accessing the Toxoplasma gondii genome   Total citations: 1 (self-citations: 0, citations by others: 1)
ToxoDB (http://ToxoDB.org) provides a genome resource for the protozoan parasite Toxoplasma gondii. Several sequencing projects devoted to T. gondii have been completed or are in progress: an EST project (http://genome.wustl.edu/est/index.php?toxoplasma=1), a BAC clone end-sequencing project (http://www.sanger.ac.uk/Projects/T_gondii/) and an 8X random shotgun genomic sequencing project (http://www.tigr.org/tdb/e2k1/tga1/). ToxoDB was designed to provide a central point of access for all available T. gondii data, and a variety of data mining tools useful for the analysis of unfinished, un-annotated draft sequence during the early phases of the genome project. In later stages, as more and different types of data become available (microarray, proteomic, SNP, QTL, etc.), the database will provide an integrated data analysis platform facilitating user-defined queries across the different data types.

13.
Data mining in bioinformatics using Weka   Total citations: 8 (self-citations: 0, citations by others: 8)
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, all common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. AVAILABILITY: http://www.cs.waikato.ac.nz/ml/weka.

14.
Gene co-expression network (GCN) mining identifies gene modules with highly correlated expression profiles across samples/conditions. It enables researchers to discover latent gene/molecule interactions, identify novel gene functions, and extract molecular features from certain disease/condition groups, thus helping to identify disease biomarkers. However, there is no easy-to-use tool package for mining GCN modules that are relatively small and tightly connected, and therefore convenient for downstream gene set enrichment analysis, or modules that may share common members. To address this need, we developed an online GCN mining tool package: TSUNAMI (Tools SUite for Network Analysis and MIning). TSUNAMI incorporates our state-of-the-art lmQCM algorithm to mine GCN modules from both public and user-input data (microarray, RNA-seq, or any other numerical omics data), and then performs downstream gene set enrichment analysis for the identified modules. It has several features and advantages: 1) a user-friendly interface and real-time co-expression network mining through a web server; 2) direct access to and search of the NCBI Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases, as well as user-input gene expression matrices for GCN module mining; 3) multiple co-expression analysis tools to choose from, all of which are highly flexible with regard to parameter selection; 4) identified GCN modules are summarized as eigengenes, which make it convenient for users to check their correlation with clinical traits; 5) integrated downstream Enrichr enrichment analysis and links to other gene set enrichment tools; and 6) visualization of gene loci by Circos plot at any step of the process. The web service is freely accessible at https://biolearns.medicine.iu.edu/. Source code is available at https://github.com/huangzhii/TSUNAMI/.

15.
SUMMARY: In this paper we present a data mining system that allows the application of different clustering and cluster validity algorithms to DNA microarray data. This tool may improve the quality of data analysis results and may support the prediction of the number of relevant clusters in microarray datasets. This systematic evaluation approach may significantly aid genome expression analyses for knowledge discovery applications. The developed software system may be used for clustering and validation not only in DNA microarray expression analysis but also for other biomedical and physical data. AVAILABILITY: The program is freely available for non-profit use on request at http://www.cs.tcd.ie/Nadia.Bolshakova/Machaon.html. CONTACT: Nadia.Bolshakova@cs.tcd.ie.

16.
Functional switches are often regulated by dynamic protein modifications. Assessing protein functions and their functional switches in vivo remains a great challenge. An alternative methodology based on in silico procedures may facilitate assessing the multifunctionality of proteins and, in addition, allow the functions of proteins that exhibit their functionality through transitory modifications to be predicted. Extensive research is ongoing to predict the sequence of protein modification sites and analyze their dynamic nature. This study reports an analysis of phosphorylation data from Phospho.ELM (version 3.0) and glycosylation data from OGlycBase (version 6.0) for mining association patterns using a newly developed algorithm, MAPRes. This method, MAPRes (Mining Association Patterns among preferred amino acid residues in the vicinity of amino acids targeted for post-translational modifications), is based on mining associations between significantly preferred amino acids in the neighboring sequence environment and the modification sites themselves. The association patterns obtained by association pattern/rule mining were in significant conformity with the results of different approaches. However, attempts to analyze the substrate sequence environment of phosphorylation sites catalyzed by Tyr kinases and the sequence data for O-GlcNAc modification were not successful, owing to the limited data available. Using the MAPRes algorithm to establish associations between a PTM site and its vicinal amino acids is a valid method with many potential uses: it is the first method to apply the association pattern mining technique to protein post-translational modification data.

17.
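The Tianshan snow lotus of Xinjiang (Saussurea involucrata) is highly tolerant of extreme low temperatures, which makes it a very good model plant for studying the mechanisms of low-temperature tolerance. The Xinjiang Tianshan snow lotus transcriptome annotation knowledge base (http://www.shengtingbiology.com/Saussurea KBase/index.jsp) is a comprehensive database built on web data resources. It consists of a front-end interface written in HTML, Perl, Perl CGI/DBD/DBI, Java and JavaScript, and a back-end database management system, PostgreSQL, used for data storage and retrieval, annotation and management. The knowledge base contains genome data, raw transcriptome data, quality control data, GC content, functional gene sequences and annotations, metabolic pathways of functional genes, annotation statistics for functional genes, comparative transcriptome or genome analyses between the snow lotus and other species, and bioanalysis software packages. The database not only supports research on low-temperature functional genomics and the mechanisms of low-temperature tolerance, but also provides a gene resource platform and a theoretical basis for the molecular breeding of cold-tolerant species.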

18.
19.
Data on the response of bird communities to surface mining and habitat modification are limited, with virtually no data examining the effects of mining on bird communities in and along riparian forest corridors. Bird community composition was examined using line transects from 1994 to 2000 at eight sites within and along a riparian forest corridor in southwestern Indiana that was impacted by an adjacent surface mining operation. Three habitats were sampled: closed canopy, riparian forest with no open water; fragmented canopy, riparian forest with flood plain oxbows; and reclaimed mined land with constructed ponds. Despite shifts in species composition, overall bird species richness, measured as the mean number of bird species recorded/transect route, did not differ among habitats and remained unchanged across years. More species were recorded solely on mined land than in either closed forest or forested oxbow habitats. Mined land provided stopover habitat for shorebirds and waterfowl not recorded in other habitats, and supported an assemblage of grassland-associated bird species weakly represented in the area prior to mining. A variety of wood warblers and other migrants were recorded in the forest corridor throughout the survey period, suggesting that, although surface mining reduced the width of the forest corridor, the corridor was still important habitat for movement of forest-dependent birds and non-resident bird species in migration. We suggest that surface mining and reclamation practices can be implemented near riparian forest and still provide for a diverse assemblage of bird species. These data indicate that even narrow (0.4 km wide) riparian corridors are potentially valuable in a landscape context as stopover habitats and routes of dispersal and movement of forest-dependent and migratory bird species.

20.
Recent advances in computing technology have increased interest in applying data mining to ecology. Machine learning is one of the methods used in most of these data mining applications. As is well known, approximately 80% of the resources in most data mining applications are devoted to cleaning and preprocessing the data. However, there are few studies on preprocessing the ecological data used as input to these data mining systems. In this study, we use four different feature selection methods (χ², Information Gain, Gain Ratio, and Symmetrical Uncertainty) and evaluate their effectiveness in preprocessing the input data used for inducing artificial neural networks (ANNs) and decision trees (DTs). The presence/absence of fish is the data item used to illustrate our models. Feature selection is fundamental to improving the performance of the models obtained. Classification accuracy improves when a small set of optimally selected features is used. DTs and ANNs are very useful tools when applied to modeling the presence/absence of Alburnus alburnus alborella. ANNs generally performed better than DT models.
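Three of the feature selection scores named in this abstract (Information Gain, Gain Ratio and Symmetrical Uncertainty) have standard entropy-based definitions, sketched below for discrete features. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the depth feature and the presence/absence labels are invented toy values.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """H(labels) minus the entropy of labels conditioned on the feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalised by the entropy of the feature itself."""
    split = entropy(feature)
    return information_gain(feature, labels) / split if split > 0 else 0.0


def symmetrical_uncertainty(feature, labels):
    """2 * IG / (H(feature) + H(labels)), a score between 0 and 1."""
    denom = entropy(feature) + entropy(labels)
    return 2 * information_gain(feature, labels) / denom if denom > 0 else 0.0


# Toy data invented for illustration: one feature that separates the
# classes perfectly and one that carries no information about them.
labels = ["present", "present", "absent", "absent"]
depth = ["shallow", "shallow", "deep", "deep"]
noise = ["a", "b", "a", "b"]

for name, feature in [("depth", depth), ("noise", noise)]:
    print(name,
          information_gain(feature, labels),
          gain_ratio(feature, labels),
          symmetrical_uncertainty(feature, labels))
```

Ranking features by any of these scores and keeping only the top few is one simple way to obtain the small feature subsets the abstract refers to; continuous ecological variables would first need to be discretised.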


Copyright © Beijing Qinyun Technology Development Co., Ltd.  ICP registration: 京ICP备09084417号