Similar Documents
20 similar documents found (search time: 78 ms)
1.
Data mining in bioinformatics using Weka   (total citations: 8; self: 0; others: 8)
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods, complemented by graphical user interfaces for data exploration and for the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. AVAILABILITY: http://www.cs.waikato.ac.nz/ml/weka
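As a concrete illustration of this workflow (load a single relational table, train a classifier, evaluate it), here is a minimal Python sketch. Weka itself is Java-based, so scipy and scikit-learn stand in for it here; the file name yeast.arff, a class attribute named "class", and all-numeric features are assumptions.

```python
# Minimal sketch: load one relational table (ARFF is Weka's native format),
# train a classifier, evaluate with 10-fold cross-validation.
# Assumptions: "yeast.arff" is a placeholder file, the class attribute is
# named "class", and all other attributes are numeric.
from scipy.io import arff
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data, meta = arff.loadarff("yeast.arff")
df = pd.DataFrame(data)
X = df.drop(columns=["class"]).to_numpy(dtype=float)
y = df["class"].str.decode("utf-8")   # nominal values load as bytes

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```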

2.
Interactions between chromatin segments play a large role in functional genomic assays, and developments in genomic interaction detection methods have revealed interacting topological domains within the genome. Among these methods, Hi-C plays a key role. Here, we present the Genome Interaction Tools and Resources (GITAR), software for performing a comprehensive Hi-C data analysis, including data preprocessing, normalization, and visualization, as well as analysis of topologically associated domains (TADs). GITAR is composed of two main modules: (1) HiCtool, a Python library to process and visualize Hi-C data, including TAD analysis; and (2) a processed data library, a large collection of human and mouse datasets processed using HiCtool. HiCtool leads the user step by step through a pipeline that goes from raw Hi-C data to the computation, visualization, and optimized storage of intra-chromosomal contact matrices and TAD coordinates. A large collection of standardized processed data allows users to compare different datasets in a consistent way, while saving the time needed to obtain data for visualization or additional analyses. More importantly, GITAR enables users without any programming or bioinformatics expertise to work with Hi-C data. GITAR is publicly available at http://genomegitar.org as open-source software.
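As an illustration of the kind of step such a pipeline automates, the sketch below converts a raw intra-chromosomal contact matrix into an observed/expected matrix by normalizing each diagonal (genomic distance) by its mean. This is a generic Hi-C operation, not HiCtool's actual API, and the input matrix is synthetic.

```python
# Generic Hi-C normalization sketch: observed/expected per genomic distance.
import numpy as np

def observed_over_expected(contacts: np.ndarray) -> np.ndarray:
    n = contacts.shape[0]
    oe = np.zeros_like(contacts, dtype=float)
    for d in range(n):                        # d = genomic distance in bins
        diag = np.diagonal(contacts, offset=d)
        expected = diag.mean()                # mean contact count at distance d
        if expected > 0:
            vals = diag / expected
            oe[np.arange(n - d), np.arange(d, n)] = vals
            oe[np.arange(d, n), np.arange(n - d)] = vals   # keep symmetry
    return oe

raw = np.random.poisson(5, size=(100, 100)).astype(float)
raw = (raw + raw.T) / 2                       # toy symmetric contact matrix
print(observed_over_expected(raw)[:3, :3])
```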

3.
To collect field survey data on invasive species accurately and quickly, we developed a big-data acquisition method for surveys of alien species invasions based on modern information technologies, including global navigation satellite systems (GNSS), geographic information systems (GIS), and the mobile internet, and we designed and implemented a field survey tool with user-definable data forms, 云采集 (Cloud Collection). The system uses Android phones as data collection terminals and was developed in C# and Java. Satellite positioning is used to rapidly record the locations of survey observations. By defining nine data types for survey indicators together with four auxiliary attributes (default indicator/column values, image capture, voice input, and sort order), the system links survey indicators to the data-entry interface on the mobile client, realizing a user-customizable data entry mode. The system has been applied in survey tasks of national key R&D projects, a Fujian Province major science and technology project, and the Fujian Province red imported fire ant (Solenopsis invicta) survey. Practical use shows that the system supports offline collection, data synchronization, and data query and export management for field survey data; replaces traditional pen-and-paper recording with mobile smart terminals; simplifies the field survey workflow; improves the data quality of invasive species field surveys; and provides informatization support for big-data collection in field surveys of biological invasions.
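To make the customizable-form idea concrete, here is a hypothetical Python sketch of such a schema: each survey indicator carries one of the supported data types plus the four auxiliary attributes (default column value, image capture, voice input, sort order). The class and field names are illustrative, not 云采集's actual data model.

```python
# Hypothetical form schema: indicators with a data type and the four
# auxiliary attributes described in the paper. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Indicator:
    name: str
    dtype: str              # one of the supported data types, e.g. "text", "number", "date"
    default: object = None  # default (column) value
    capture_image: bool = False
    voice_input: bool = False
    sort_order: int = 0

@dataclass
class SurveyForm:
    title: str
    indicators: list = field(default_factory=list)

    def entry_screen(self):
        """Render order of the mobile data-entry interface."""
        return sorted(self.indicators, key=lambda i: i.sort_order)

form = SurveyForm("Fire ant survey", [
    Indicator("host_plant", "text", sort_order=2),
    Indicator("nest_count", "number", default=0, sort_order=1, capture_image=True),
])
print([i.name for i in form.entry_screen()])
```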

4.
For the past year we have been using a relational database as part of an automated data collection system for cryoEM. The database is vital for keeping track of the very large number of images collected and analyzed by the automated system and essential for quantitatively evaluating the utility of methods and algorithms used in the data collection. The database can be accessed using a variety of tools including specially developed Web-based interfaces that enable a user to annotate and categorize images using a Web-based form.
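A minimal sketch of the kind of schema such a system relies on, using Python's built-in sqlite3: one row per collected image, with fields that a web form can annotate and categorize. The table layout is hypothetical, not the authors' actual database design.

```python
# Hypothetical schema: one row per collected image, annotatable via a web form.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE images (
        image_id    INTEGER PRIMARY KEY,
        session_id  INTEGER NOT NULL,      -- automated collection session
        path        TEXT NOT NULL,
        defocus_um  REAL,
        category    TEXT,                  -- set via the web-based form
        annotation  TEXT
    )""")
db.execute("INSERT INTO images (session_id, path, defocus_um) "
           "VALUES (1, 'grid3/img_0001.mrc', -2.1)")
db.execute("UPDATE images SET category = 'good ice', annotation = 'thin ice, low drift' "
           "WHERE image_id = 1")
for row in db.execute("SELECT path, category FROM images WHERE category IS NOT NULL"):
    print(row)
```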

5.
Genetic data obtained on population samples convey information about their evolutionary history. Inference methods can extract part of this information, but they require sophisticated statistical techniques that have been made available to the biologist community (through computer programs) only for simple and standard situations, typically involving a small number of samples. We propose here a computer program (DIY ABC) for inference based on approximate Bayesian computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIY ABC can be used to compare competing scenarios, estimate parameters for one or more scenarios, and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data). This article describes the key methods used in the program and outlines its main features. The analysis of one simulated and one real dataset, both with complex evolutionary scenarios, illustrates the main possibilities of DIY ABC. AVAILABILITY: The software DIY ABC is freely available at http://www.montpellier.inra.fr/CBGP/diyabc
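The core idea behind ABC programs of this kind can be shown in a few lines. The sketch below implements plain rejection ABC for a toy model (estimating a variance parameter from one summary statistic); the prior, tolerance, and model are placeholders, not DIY ABC's coalescent machinery.

```python
# Rejection ABC: draw parameters from the prior, simulate data, and keep the
# draws whose summary statistic lands close to the observed one.
import numpy as np

rng = np.random.default_rng(0)
observed_summary = 1.7                       # e.g. an observed variance-like statistic

def simulate_summary(theta: float) -> float:
    data = rng.normal(0.0, np.sqrt(theta), size=200)
    return data.var()

accepted = []
for _ in range(100_000):
    theta = rng.uniform(0.1, 10.0)           # draw from the prior
    if abs(simulate_summary(theta) - observed_summary) < 0.05:   # tolerance
        accepted.append(theta)

post = np.array(accepted)
print(f"posterior mean ~ {post.mean():.2f}, 95% interval ~ "
      f"({np.percentile(post, 2.5):.2f}, {np.percentile(post, 97.5):.2f})")
```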

6.
The LCB Data Warehouse   (total citations: 2; self: 0; others: 2)

7.
KEGGanim: pathway animations for high-throughput data   (total citations: 1; self: 0; others: 1)
MOTIVATION: Gene expression analysis with microarrays has become one of the most widely used high-throughput methods for gathering genome-wide functional data. Emerging omics fields such as proteomics and interactomics introduce new information sources. With the rise of systems biology, researchers need to consider entire complex pathways that govern individual genes and related processes. Bioinformatics methods are needed to link the existing knowledge about pathways with the growing amounts of experimental data. RESULTS: We present KEGGanim, a novel web-based tool for visualizing experimental data on biological pathways. KEGGanim produces animations and images of KEGG pathways using public or user-uploaded high-throughput data. Pathway members are coloured according to experimental measurements and animated over experimental conditions. KEGGanim visualization highlights dynamic changes across conditions and allows the user to observe important modules and key genes that influence the pathway. The simple user interface of KEGGanim provides options for filtering genes and experimental conditions. KEGGanim may be used with public or private data for 14 organisms, with a large collection of public microarray data readily available. Most common gene and protein identifiers and microarray probe sets are accepted as visualization input. AVAILABILITY: http://biit.cs.ut.ee/KEGGanim/
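The central visualization step, mapping each pathway member's measurement under each condition to a colour and cycling through conditions as animation frames, can be sketched as follows. Gene names and expression values are made up, and matplotlib's RdBu colormap is an assumption, not necessarily KEGGanim's palette.

```python
# Illustrative colouring step: one colour per gene per condition (frame).
import numpy as np
from matplotlib import cm, colors

genes = ["HK1", "PFKM", "PKM"]
conditions = ["0h", "6h", "12h"]
expr = np.array([[ 0.1,  1.2,  2.0],          # log-ratios, rows = genes
                 [-0.4,  0.3,  1.1],
                 [ 0.0, -1.5, -2.2]])

norm = colors.Normalize(vmin=-2.5, vmax=2.5)  # symmetric scale around 0
for j, cond in enumerate(conditions):         # one animation frame per condition
    frame = {g: colors.to_hex(cm.RdBu_r(norm(v))) for g, v in zip(genes, expr[:, j])}
    print(cond, frame)
```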

8.
Background, Aims and Scope  Although LCA is frequently used in product comparison, many practitioners are interested in identifying and assessing improvements within a life cycle. Thus, the goals of this work are to provide guidelines for scenario formulation for process and material alternatives within a life cycle inventory and to evaluate the usefulness of decision tree and matrix computational structures in the assessment of material and process alternatives. We assume that if the analysis goal is to guide the selection among alternatives towards reduced life cycle environmental impacts, then the analysis should estimate the inventory results in a manner that: (1) reveals the optimal set of processes with respect to minimization of each impact of interest, and (2) minimizes and organizes computational and data collection needs.

Methods  A sample industrial system is used to reveal the complexities of scenario formulation for process and material alternatives in an LCI. The system includes 4 processes, each executable in 2 different ways, as well as 1 process able to use 2 different materials interchangeably. We formulate and evaluate scenarios for this system using three different methods and find advantages and disadvantages with each. First, the single-branch decision tree method stays true to the typical construction of decision trees, such that each branch of the tree represents a single scenario. Next, the process-flow decision tree method strays from the typical construction of decision trees by following the process flow of the product system, such that multiple branches are needed to represent a single scenario. In the final method, disaggregating the demand vector, each scenario is represented by separate vectors which are combined into a matrix to allow the simultaneous solution of the inventory problem for all scenarios.

Results  For both decision tree and matrix methods, scenario formulation, data collection, and scenario analysis are facilitated in two ways. First, process alternatives that cannot actually be chosen should be modeled as sub-inventories (or as a complete LCI within an LCI). Second, material alternatives (e.g., a choice between structural materials) must be maintained within the analysis to avoid the creation of artificial multi-functional processes. Further, in the same manner that decision trees can be used to estimate 'expected value' (the sum of the probability of each scenario multiplied by its 'value'), we find that expected inventory and impact results can be defined for both decision tree and matrix methods.

Discussion  For scenario formulation, naming scenarios in a way that differentiates them from other scenarios is complex and important in the continuing development of LCI data for use in databases or LCA software. In the formulation and assessment of scenarios, decision tree methods offer some level of visual appeal and the potential for using commercially available software or traditional decision-tree solution constructs for estimating expected values (for relatively small or highly aggregated product systems). However, solving decision tree systems requires the use of sequential process scaling, which is difficult to formalize with mathematical notation. In contrast, preparation of a demand matrix does not require use of the sequential method to solve the inventory problem, but it requires careful scenario-tracking efforts.

Conclusions  Here, we recognize that improvements can be made within a product system. This recognition supports the greater use of LCA in supply chain formation and product research, development, and design. We further conclude that although both decision tree and matrix methods are formulated herein to reveal optimal life cycle scenarios, the use of demand matrices is preferred in the preparation of a formal mathematical construct. Further, for both methods, data collection and assessment are facilitated by the use of sub-inventories (a complete LCI within an LCI) for process alternatives and the full consideration of material alternatives to avoid the creation of artificial multi-functional processes.

Recommendations and Perspectives  The methods described here are used in the assessment of forest management alternatives and are being further developed to form national commodity models considering technology alternatives, national production mixes and imports, and point-to-point transportation models.
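The demand-matrix idea maps directly onto the standard matrix formulation of an LCI: with technology matrix A and intervention matrix B, solving A·S = F, where the columns of F are the scenario demand vectors, yields all scenario scaling factors at once, the inventories follow as G = B·S, and scenario probabilities then give an expected inventory. The numbers below are toy values, not the paper's case study.

```python
# Simultaneous LCI solution for all scenarios via a disaggregated demand matrix.
import numpy as np

A = np.array([[ 1.0,  0.0],      # technology matrix: process outputs/inputs
              [-0.5,  1.0]])
B = np.array([[ 2.0,  3.0]])     # interventions (e.g. kg CO2) per unit process

F = np.array([[1.0, 1.0],        # columns = scenario demand vectors
              [0.0, 0.4]])
S = np.linalg.solve(A, F)        # scaling factors for every scenario at once
G = B @ S                        # one inventory column per scenario
print("scenario inventories:", G)

p = np.array([0.7, 0.3])         # scenario probabilities
print("expected inventory:", G @ p)   # sum of probability x inventory
```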

9.
MOTIVATION: The identification of changes in gene expression in multifactorial diseases, such as breast cancer, is a major goal of DNA microarray experiments. Here we present a new data mining strategy to better analyze marginal differences in gene expression between microarray samples. The idea is based on the notion that considering a gene's behavior across a wide variety of experiments can improve the statistical reliability of identifying genes with moderate changes between samples. RESULTS: The availability of a large collection of array samples sharing the same platform in public databases, such as NCBI GEO, enabled us to re-standardize the expression intensity of a gene using its mean and variation across a wide variety of experimental conditions. This approach was evaluated via the re-identification of breast cancer-specific gene expression. It successfully prioritized several genes associated with breast tumors for which the expression difference between normal and breast cancer cells was marginal and thus would have been difficult to recognize using conventional analysis methods. Maximizing the utility of microarray data in public databases, it provides a valuable tool, particularly for the identification of previously unrecognized disease-related genes. AVAILABILITY: A user-friendly web interface (http://compbio.sookmyung.ac.kr/~lage/) was constructed to provide the present large-scale approach for the analysis of GEO microarray data (GS-LAGE server).
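The re-standardization step reduces to a per-gene z-score against a large compendium on the same platform: a gene that is tightly regulated across hundreds of arrays stands out even when its raw difference between samples is marginal. The sketch below uses random placeholder data, not GEO.

```python
# Per-gene z-scores against a compendium: stable genes with marginal shifts
# rise to the top of the ranking.
import numpy as np

rng = np.random.default_rng(1)
compendium = rng.normal(8.0, 1.0, size=(5000, 800))   # genes x public samples
compendium[42] = rng.normal(8.0, 0.05, size=800)      # a very stable gene

mu = compendium.mean(axis=1)                          # per-gene mean
sd = compendium.std(axis=1)                           # per-gene variation

sample = compendium[:, 0].copy()
sample[42] += 0.3                                     # marginal absolute change
z = (sample - mu) / sd
print("rank of gene 42 by |z|:", int((np.abs(z) > abs(z[42])).sum()) + 1)
```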

10.
Quantifying differences in resource use and waste generation between individual households and exploring the reasons for the variations observed implies the need for disaggregated data on household activities and related physical flows. The collection of disaggregated data for water use, gas use, electricity use, and mobility has been reported in the literature and is normally achieved through sensors and computational algorithms. This study focuses on collecting disaggregated data for goods consumption and related waste generation at the level of individual households. To this end, two data collection approaches were devised and evaluated: (1) triangulating shopping receipt analysis and waste component analysis and (2) tracking goods consumption and waste generation using a smartphone. A case study on two households demonstrated that it is possible to collect quantitative data on goods consumption and related waste generation on a per unit basis for individual households. The study suggested that the type of data collected can be relevant in a number of different research contexts: eco-feedback; user-centered research; living-lab research; and life cycle impacts of household consumption. The approaches presented in this study are most applicable in the context of user-centered or living-lab research. For the other contexts, alternative data sources (e.g., retailers and producers) may be better suited to data collection on larger samples, though at a lesser level of detail, compared with the two data collection approaches devised and evaluated in this study.

11.
RNA-Seq analysis in MeV   (total citations: 1; self: 0; others: 1)

12.
Increased platform heterogeneity and varying resource availability in distributed systems motivate the design of resource-aware applications, which ensure a desired performance level by continuously adapting their behavior to changing resource characteristics. In this paper, we describe an application-independent adaptation framework that simplifies the design of resource-aware applications. This framework eliminates the need for adaptation decisions to be explicitly programmed into the application by relying on two novel components: (1) a tunability interface, which exposes adaptation choices in the form of alternate application configurations while encapsulating core application functionality; and (2) a virtual execution environment, which emulates application execution under diverse resource availability enabling off-line collection of information about resulting behavior. Together, these components permit automatic run-time decisions on when to adapt by continuously monitoring resource conditions and application progress, and how to adapt by dynamically choosing an application configuration most appropriate for the prescribed user preference. We evaluate the framework using an interactive distributed image visualization application and a parallel image processing application. The framework permits automatic adaptation to changes in execution environment characteristics such as available network bandwidth or data arrival pattern by choosing a different application configuration that satisfies user preferences of output quality and timeliness.
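A hypothetical sketch of the two-part design described above: a tunability interface exposing alternate configurations while encapsulating core functionality, and a run-time loop that picks the best feasible configuration as resource measurements change. All class and method names are illustrative, not the paper's API.

```python
# Illustrative tunability interface plus a "when/how to adapt" decision loop.
from dataclasses import dataclass

@dataclass
class Configuration:
    name: str
    min_bandwidth_mbps: float   # resources this configuration needs
    quality: float              # output quality it delivers (0..1)

class TunableApp:
    """Tunability interface: core functionality stays encapsulated;
    only the alternate configurations are exposed to the framework."""
    def configurations(self) -> list[Configuration]:
        return [Configuration("full-res", 50.0, 1.0),
                Configuration("half-res", 12.0, 0.6),
                Configuration("thumbnail", 2.0, 0.2)]
    def switch_to(self, cfg: Configuration) -> None:
        print(f"adapting -> {cfg.name}")

def adapt(app: TunableApp, available_mbps: float) -> None:
    # "when to adapt": on a resource change; "how": best feasible quality
    feasible = [c for c in app.configurations() if c.min_bandwidth_mbps <= available_mbps]
    app.switch_to(max(feasible, key=lambda c: c.quality))

app = TunableApp()
for bw in (60.0, 8.0, 25.0):    # simulated bandwidth measurements
    adapt(app, bw)
```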

13.
To explore the extent to which microevolutionary inference can be made using spatial autocorrelation analysis of gene frequency surfaces, we simulated sets of surfaces for nine evolutionary scenarios, and subjected spatially-based summary statistics of these to linear discriminant analysis. Scenarios varied the amounts of dispersion, selection, migration, and deme sizes, and included: panmixia, drift, intrusion, and stepping-stone models with 0–2 migrations, 0–2 selection gradients, and migration plus selection. To discover how weak evolutionary forces could be and still allow discrimination, each scenario had both a strong and a weak configuration. Discriminant rules were calculated using one collection of data (the training set) consisting of 250 sets of 15 surfaces for each of the nine scenarios. Misclassification rates were verified against a second, entirely new set of data (the test set) equal in size. Test set misclassification rates for the 20 best discriminating variables ranged from 39.3% (weak) to 3.6% (strong), far lower than the expected rate of 88.9% absent any discriminating ability. Misclassification was highest when discriminating the number of migrational events or the presence or number of selection events. Discrimination of drift and panmixia from the other scenarios was perfect. A subsequent subjective analysis of a subset of the data by one of us yielded comparable, although somewhat higher, misclassification rates. Judging by these results, spatial autocorrelation variables describing sets of gene frequency surfaces permit some microevolutionary inferences.
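The analysis design (train a discriminant rule on one collection of simulated data, verify misclassification on an entirely new test set of equal size) can be sketched with scikit-learn. The features below are random stand-ins for the spatial autocorrelation variables, so the resulting error rate is only illustrative.

```python
# Train/test linear discriminant analysis over 9 simulated scenarios,
# 250 sets each, 20 summary variables, mirroring the study's design.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
n_scenarios, n_sets, n_vars = 9, 250, 20

def simulate():
    """One collection of data: n_sets statistic vectors per scenario."""
    X, y = [], []
    for s in range(n_scenarios):
        signature = np.linspace(0.0, s, n_vars)   # fixed scenario signature
        X.append(signature + rng.normal(0.0, 1.0, size=(n_sets, n_vars)))
        y += [s] * n_sets
    return np.vstack(X), np.array(y)

X_train, y_train = simulate()    # training set
X_test, y_test = simulate()      # entirely new test set
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
err = 1.0 - lda.score(X_test, y_test)
print(f"test-set misclassification: {err:.1%} (no discriminating ability: {8/9:.1%})")
```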

14.
15.
Transitivity Clustering is a method for partitioning biological data into groups of similar objects, such as genes. It provides integrated access to various functions addressing each step of a typical cluster analysis. To facilitate this, Transitivity Clustering is accessible online and offers three user-friendly interfaces: a powerful stand-alone version, a web interface, and a collection of Cytoscape plug-ins. In this paper, we describe three major workflows: (i) protein (super)family detection with Cytoscape, (ii) protein homology detection with incomplete gold standards and (iii) clustering of gene expression data. This protocol guides the user through the most important features of Transitivity Clustering and takes ~1 h to complete.

16.
Determination of precise and accurate protein structures by NMR generally requires weeks or even months to acquire and interpret all the necessary NMR data. However, even medium-accuracy fold information can often provide key clues about protein evolution and biochemical function(s). In this article we describe a largely automatic strategy for rapid determination of medium-accuracy protein backbone structures. Our strategy derives from ideas originally introduced by other groups for determining medium-accuracy NMR structures of large proteins using deuterated, ¹³C-, ¹⁵N-enriched protein samples with selective protonation of side-chain methyl groups (¹³CH₃). Data collection includes acquiring NMR spectra for automatically determining assignments of backbone and side-chain ¹⁵N and HN resonances, and side-chain ¹³CH₃ methyl resonances. These assignments are determined automatically by the program AutoAssign using backbone triple-resonance NMR data, together with Spin System Type Assignment Constraints (STACs) derived from side-chain triple-resonance experiments. The program AutoStructure then derives conformational constraints using these chemical shifts, amide ¹H/²H exchange, nuclear Overhauser effect spectroscopy (NOESY), and residual dipolar coupling data. The total time required for collecting such NMR data can potentially be as short as a few days. Here we demonstrate an integrated set of NMR software which can process these NMR spectra, carry out resonance assignments, interpret NOESY data, and generate medium-accuracy structures within a few days. The feasibility of this combined data collection and analysis strategy starting from raw NMR time-domain data was illustrated by automatic analysis of a medium-accuracy structure of the Z domain of Staphylococcal protein A.

17.
The Molecular Evolutionary Genetics Analysis (MEGA) software has matured to contain a large collection of methods and tools of computational molecular evolution. Here, we describe new additions that make MEGA a more comprehensive tool for building timetrees of species, pathogens, and gene families using rapid relaxed-clock methods. Methods for estimating divergence times and confidence intervals are implemented to use probability densities for calibration constraints for node-dating and sequence sampling dates for tip-dating analyses. They are supported by new options for tagging sequences with spatiotemporal sampling information, an expanded interactive Node Calibrations Editor, and an extended Tree Explorer to display timetrees. Also added is a Bayesian method for estimating neutral evolutionary probabilities of alleles in a species using multispecies sequence alignments and a machine learning method to test for the autocorrelation of evolutionary rates in phylogenies. The computer memory requirements for the maximum likelihood analysis are reduced significantly through reprogramming, and the graphical user interface has been made more responsive and interactive for very big data sets. These enhancements will improve the user experience, quality of results, and the pace of biological discovery. Natively compiled graphical user interface and command-line versions of MEGA11 are available for Microsoft Windows, Linux, and macOS from www.megasoftware.net.

18.
Klamt S, von Kamp A. BioSystems 2011, 105(2): 162-168.
CellNetAnalyzer (CNA) is a MATLAB toolbox providing computational methods for studying the structure and function of metabolic and cellular signaling networks. To allow non-experts to use these methods easily, CNA provides GUI-based interactive network maps as a means of parameter input and result visualization. However, with the availability of high-throughput data, there is a need to make CNA's functionality accessible in batch mode for automatic data processing. Furthermore, as some algorithms of CNA are of general relevance for network analysis, it would be desirable if they could be called as subroutines by other applications. For this purpose, we developed an API (application programming interface) for CNA allowing users (i) to access the content of network models in CNA, (ii) to use CNA's network analysis capabilities independently of the GUI, and (iii) to interact with the GUI to facilitate the development of graphical plugins. Here we describe the organization of network projects in CNA and the application of the new API functions to these projects. This includes the creation of network projects from scratch, loading and saving of projects and scenarios, and the application of the actual analysis methods. Furthermore, API functions for the import/export of metabolic models in SBML format and for accessing the GUI are described. Lastly, two example applications demonstrate the use and versatile applicability of CNA's API. CNA is freely available for academic use and can be downloaded from http://www.mpi-magdeburg.mpg.de/projects/cna/cna.html

19.
To measure the activity of neurons using whole-brain activity imaging, precise detection of each neuron or its nucleus is required. In the head region of the nematode C. elegans, the neuronal cell bodies are distributed densely in three-dimensional (3D) space. However, no existing computational methods of image analysis can separate them with sufficient accuracy. Here we propose a highly accurate segmentation method based on the curvatures of the iso-intensity surfaces. To obtain accurate positions of nuclei, we also developed a new procedure for least squares fitting with a Gaussian mixture model. Combining these methods enables accurate detection of densely distributed cell nuclei in a 3D space. The proposed method was implemented as a graphical user interface program that allows visualization and correction of the results of automatic detection. Additionally, the proposed method was applied to time-lapse 3D calcium imaging data, and most of the nuclei in the images were successfully tracked and measured.
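The fitting stage can be illustrated with a Gaussian mixture on synthetic 3D points from two closely spaced nuclei. Note that scikit-learn's EM-based GaussianMixture stands in here for the paper's least-squares fitting procedure, and the data are synthetic.

```python
# Recover nucleus centres from a 3D point cloud with a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
true_centres = np.array([[10.0, 10.0, 5.0],
                         [12.0, 10.5, 5.2]])   # two densely packed "nuclei"
points = np.vstack([c + rng.normal(0, 0.6, size=(400, 3)) for c in true_centres])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(points)
print("estimated centres:\n", np.round(gmm.means_, 2))
```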

20.
