Similar Articles
20 similar articles found (search time: 31 ms)
1.
Classification is a data mining task whose goal is to learn, from a training dataset, a model that can predict the class of a new data instance, while clustering aims to discover natural instance groupings within a given dataset. Learning cluster-based classification systems involves partitioning a training set into data subsets (clusters) and building a local classification model for each data cluster. The class of a new instance is predicted by first assigning the instance to its nearest cluster and then using that cluster's local classification model to predict the instance's class. In this paper, we present an ant colony optimization (ACO) approach to building cluster-based classification systems. Our ACO approach optimizes the number of clusters, the positioning of the clusters, and the choice of classification algorithm to use as the local classifier for each cluster. We also present an ensemble approach that allows the system to decide on the class of a given instance by considering the predictions of all local classifiers, employing a weighted voting mechanism based on the fuzzy degree of membership in each cluster. Our experimental evaluation employs five widely used classification algorithms (naïve Bayes, nearest neighbour, Ripper, C4.5, and support vector machines), and results are reported on a suite of 54 popular UCI benchmark datasets.
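A minimal sketch of the ensemble prediction step described above, assuming fuzzy c-means-style memberships and scikit-learn-style local classifiers (the function names and the membership formula are our illustration, not necessarily the paper's exact method):

```python
import numpy as np

# Sketch of the prediction step only; assumes integer class labels
# 0..n_classes-1 and local classifiers with a predict() method.
def fuzzy_memberships(x, centroids, m=2.0):
    """Fuzzy c-means-style degree of membership of instance x in each cluster."""
    d = np.linalg.norm(centroids - x, axis=1) + 1e-12  # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()

def ensemble_predict(x, centroids, local_classifiers, n_classes):
    """Each local classifier votes for a class, weighted by fuzzy membership."""
    weights = fuzzy_memberships(np.asarray(x), np.asarray(centroids))
    votes = np.zeros(n_classes)
    for w, clf in zip(weights, local_classifiers):
        votes[clf.predict([x])[0]] += w  # vote with the cluster's membership weight
    return int(votes.argmax())
```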

2.
We present a benchmark suite for computational Grids. It is based on the NAS Parallel Benchmarks (NPB) and is called the NAS Grid Benchmark (NGB) in this paper. We formulate NGB as a data flow graph that encapsulates an instance of an NPB code in each graph node; nodes communicate with one another by sending and receiving initialization data, and may be mapped to the same or different Grid machines. Like NPB, NGB specifies several different classes (problem sizes). NGB also specifies the generic Grid services sufficient for running the suite, and the implementor is free to choose any Grid environment. We describe a reference implementation in Java and present some scenarios for using NGB.
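To make the data-flow-graph idea concrete, here is a hedged sketch (class names and tasks are hypothetical; NGB's reference implementation is in Java, but Python is used here for brevity): each node wraps a benchmark task and fires once all of its upstream inputs have arrived.

```python
class GraphNode:
    def __init__(self, name, task, inputs=()):
        self.name, self.task, self.inputs = name, task, list(inputs)
        self.result = None

    def run(self):
        upstream = [n.result for n in self.inputs]  # data received from predecessors
        self.result = self.task(upstream)
        return self.result

def run_graph(nodes):
    """Execute nodes in an order that respects data dependencies (topological)."""
    done = set()
    while len(done) < len(nodes):
        for node in nodes:
            if node.name not in done and all(n.name in done for n in node.inputs):
                node.run()
                done.add(node.name)

# Example: two "benchmark" nodes feeding a third.
a = GraphNode("BT", lambda _: 1)
b = GraphNode("SP", lambda _: 2)
c = GraphNode("LU", lambda xs: sum(xs), inputs=(a, b))
run_graph([a, b, c])
print(c.result)  # 3
```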

3.
POPPs is a suite of inter-related software tools that allow the user to discover what is statistically 'unusual' in the composition of an unknown protein, to automatically cluster proteins into families based on peptide composition, and to search for related proteins based on peptide composition. Statistically based peptide composition provides a view of proteins that is, to some extent, orthogonal to that provided by sequence. In a test study, the POPPs suite was able to regroup into their families sets of approximately 100 randomised Pfam protein domains. The POPPs suite is used to explore the diverse set of late embryogenesis abundant (LEA) proteins.
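As an illustration of the underlying representation (our sketch, not the POPPs implementation), a protein can be summarised by its normalised short-peptide composition:

```python
from collections import Counter

def peptide_composition(sequence, k=2):
    """Normalised counts of all length-k peptides (k-mers) in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {pep: n / total for pep, n in counts.items()}

# Hypothetical toy sequence; real inputs would be full protein sequences.
comp = peptide_composition("MKVLAAGLLALA", k=2)
print(sorted(comp.items())[:3])
```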

4.
In some biomedical studies involving clustered binary responses (say, disease status), the cluster sizes can vary because some components of the cluster can be absent. When both the presence of a cluster component and the binary disease status of a present component are treated as responses of interest, we propose a novel two-stage random effects logistic regression framework. For ease of interpretation of regression effects, both the marginal probability of presence/absence of a component and the conditional probability of disease status of a present component preserve approximate logistic regression forms. We present a maximum likelihood method of estimation implementable using standard statistical software. We compare our models and the physical interpretation of regression effects with competing methods from the literature. We also present a simulation study to assess the robustness of our procedure to wrong specification of the random effects distribution and to compare finite-sample performances of estimates with existing methods. The methodology is illustrated by analyzing a study of periodontal health status in a diabetic Gullah population.
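A hedged sketch of what such a two-stage formulation can look like, in our own notation (the paper's exact parameterisation may differ): stage one models whether component j of cluster i is present, and stage two models disease status conditional on presence, each with a logistic form and cluster-level random effects.

```latex
% Stage 1: presence indicator R_ij for component j of cluster i
\operatorname{logit}\Pr(R_{ij}=1 \mid u_i) = \mathbf{x}_{ij}^{\top}\boldsymbol{\alpha} + u_i
% Stage 2: disease status Y_ij of a present component
\operatorname{logit}\Pr(Y_{ij}=1 \mid R_{ij}=1, v_i) = \mathbf{z}_{ij}^{\top}\boldsymbol{\beta} + v_i
% (u_i, v_i): possibly correlated cluster-level random effects
```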

5.
Very often, living beings seem able to change their functioning when external conditions vary. In order to study this property, we have devised abstract machines whose internal organisation changes whenever the external conditions vary. The internal organisations of these machines (or programs) are as simple as possible: functions of discrete variables. We call such machines self-modifying automata. After some transient steps, these machines stabilise, passing indefinitely through a loop called a p-cycle, or limit cycle of length p. More often than not, p equals one and the cycle reduces to a fixed point. In this case the external value v can be regarded as the index of a function f such that f_v(v) = v, and the machine is self-replicating and self-referential. Many authors in computer and natural science consider self-referential objects a key concept for understanding perception, behaviour and associations. In the third part, we study chains of automata, in which only one automaton changes its internal organisation at each step. Chains of automata perform better than single self-modifying automata: fixed points occur more frequently and transients are shorter. The performance of chains of automata improves as the number of possible values of their internal states increases, whereas the performance of single automata decreases.
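A hedged sketch of the transient-then-cycle behaviour (our construction, not the authors' exact machine): the rule table plays the role of the mutable internal organisation, the current value v selects which function f_v is applied next, and we iterate until the trajectory enters a limit cycle.

```python
def find_limit_cycle(step, state, max_iters=1000):
    """Iterate `step` from `state`; return (transient length, cycle) once a state repeats."""
    seen = {}
    trajectory = []
    for i in range(max_iters):
        if state in seen:
            start = seen[state]
            return start, trajectory[start:]  # transient length, then the cycle itself
        seen[state] = i
        trajectory.append(state)
        state = step(state)
    raise RuntimeError("no cycle found within max_iters")

# Hypothetical rule table over values 0..3: v indexes the function applied next.
rules = [lambda v: (v + 1) % 4, lambda v: 1, lambda v: (2 * v) % 4, lambda v: 3]
transient, cycle = find_limit_cycle(lambda v: rules[v](v), 0)
print(transient, cycle)  # a fixed point f_v(v) = v appears as a cycle of length 1
```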

6.
Datamonkey is a web interface to a suite of cutting-edge maximum-likelihood-based tools for identification of sites subject to positive or negative selection. The methods range from very fast data exploration to some of the most complex models available in public domain software, and are implemented to run in parallel on a cluster of computers. AVAILABILITY: http://www.datamonkey.org. In the future, we plan to expand the collection of available analytic tools and provide a package for installation on other systems.

7.
With the proliferation of quad-/multi-core microprocessors in mainstream platforms such as desktops and workstations, a large number of unused CPU cycles can be utilized for running virtual machines (VMs) as dynamic nodes in distributed environments. Grid services, and their service-oriented business broker now termed cloud computing, can deploy image-based virtualization platforms enabling agent-based resource management and dynamic fault management. In this paper we present an efficient way of utilizing heterogeneous virtual machines on idle desktops as an environment for consumption of high performance grid services. Spurious and exponential increases in dataset sizes are a constant concern in the medical and pharmaceutical industries due to the continual discovery and publication of large sequence databases. Traditional algorithms are not designed to handle large data sizes under sudden and dynamic changes in the execution environment, as previously discussed. This research was undertaken to compare our previous results with running the same test dataset on a virtual Grid platform using virtual machines (virtualization). The implemented architecture, A3pviGrid, utilizes game-theoretic optimization and agent-based team formation (coalition) algorithms to improve scalability with respect to team formation. Due to the dynamic nature of distributed systems (as discussed in our previous work), all interactions were made local within a team, transparently. This paper is a proof of concept of an experimental mini-Grid test-bed compared to running the platform on local virtual machines on a local test cluster. This was done to give every agent its own execution platform, enabling anonymity and better control of the dynamic environmental parameters. We also analyze the performance and scalability of BLAST in a multiple-virtual-node setup and present our findings. This paper is an extension of our previous research on improving the BLAST application framework using dynamic Grids on virtualization platforms such as VirtualBox.

8.
9.
We present a software system, BASIO, that allows one to segment a sequence into regions with homogeneous nucleotide composition at a desired length scale. The system can work with an arbitrary alphabet and can therefore be applied to various (e.g. protein) sequences. Several complete eukaryotic genome sequences are used to demonstrate the efficiency of the software. AVAILABILITY: The BASIO suite is available to non-commercial users free of charge as a set of executables and accompanying segmentation scenarios from http://www.imb.ac.ru/compbio/basio. To obtain the source code, contact the authors.
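A hedged sketch of one generic compositional-segmentation approach (Jensen-Shannon splitting, a common technique for this problem, not necessarily BASIO's actual algorithm): find the split point that maximises the divergence between the compositions of the left and right parts.

```python
import math
from collections import Counter

def entropy(counts, total):
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def js_divergence(seq, i):
    """Jensen-Shannon divergence between the compositions of seq[:i] and seq[i:]."""
    left, right = Counter(seq[:i]), Counter(seq[i:])
    n, nl, nr = len(seq), i, len(seq) - i
    return entropy(left + right, n) - (nl / n) * entropy(left, nl) - (nr / n) * entropy(right, nr)

def best_split(seq, min_len=10):
    """Return (position, divergence) of the strongest compositional boundary."""
    candidates = range(min_len, len(seq) - min_len)
    return max(((i, js_divergence(seq, i)) for i in candidates), key=lambda t: t[1])

pos, score = best_split("A" * 30 + "GC" * 15)
print(pos, round(score, 3))  # boundary found at position 30
```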

10.
We present the RAVEN (Reconstruction, Analysis and Visualization of Metabolic Networks) Toolbox: a software suite that allows for semi-automated reconstruction of genome-scale models. It makes use of published models and/or the KEGG database, coupled with extensive gap-filling and quality control features. The suite also contains methods for visualizing simulation results and omics data, as well as a range of methods for performing simulations and analyzing the results. The software is a useful tool for system-wide data analysis in a metabolic context and for streamlined reconstruction of metabolic networks based on protein homology. The RAVEN Toolbox workflow was applied to reconstruct a genome-scale metabolic model for the important microbial cell factory Penicillium chrysogenum Wisconsin 54-1255. The model was validated in a bibliomic study of 440 references in total, and it comprises 1471 unique biochemical reactions and 1006 ORFs. It was then used to study the roles of ATP and NADPH in the biosynthesis of penicillin, and to identify potential metabolic engineering targets for maximization of penicillin production.

11.
A totally data-based approach to the evaluation of short-term tests is proposed. The performances of 22 tests over a range of 42 chemicals (data from the literature) were studied by cluster analysis, with the tests compared solely on the basis of their responses to the chemicals. Two different clustering methods produced a coincident classification, pointing to a clear resolution of all tests into 3 groups with common characteristics. With respect to carcinogen discrimination, cluster 1 showed the highest sensitivity and the lowest specificity, cluster 3 had the opposite characteristics, and the tests of cluster 2 showed intermediate features. As far as cluster membership is concerned, the literature data on responses to chemicals indicated a strong test-system specificity, which apparently overcame both phylogeny and end-point commonality. A major characteristic of the present approach is its ability to elicit underlying patterns, knowledge of which can contribute to hypothesis formulation and be useful for practical purposes.

12.
The HORA suite (Human blOod Range vAlidator) consists of a Java application used to validate metabolomic analyses of human blood against a database that stores the normal plasma and serum concentration ranges of metabolites. The goal of HORA is to find the metabolites that are outside the normal range and to show those not present in the list provided by the user, for different concentration thresholds. It also supplies a graphical interface to manage the data, and the software can be used to compare different metabolomic techniques. HORA is open-source software and it can be accessed at . A separate file contains instructions for the installation and a brief tutorial.
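A minimal sketch of the core check (names and reference values are illustrative, not HORA's database): flag measurements outside a stored normal range and report reference metabolites missing from the user's list.

```python
# Illustrative reference ranges in mmol/L; HORA stores these in its own database.
NORMAL_RANGES = {"glucose": (3.9, 6.1), "lactate": (0.5, 2.2)}

def validate(measurements, ranges=NORMAL_RANGES):
    """Return metabolites outside their normal range and those absent from the input."""
    out_of_range = {m: v for m, v in measurements.items()
                    if m in ranges and not ranges[m][0] <= v <= ranges[m][1]}
    missing = sorted(set(ranges) - set(measurements))
    return out_of_range, missing

print(validate({"glucose": 7.8}))  # ({'glucose': 7.8}, ['lactate'])
```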

13.
In this paper we present novel, semiautomated image-analysis software to streamline the quantitative analysis of root growth and architecture in complex root systems. The software combines a vectorial representation of root objects with a powerful tracing algorithm that accommodates a wide range of image sources and qualities. The root system is treated as a collection of roots (possibly connected) that are individually represented as parsimonious sets of connected segments. Pixel coordinates and gray levels are thereby turned into intuitive biological attributes such as segment diameter and orientation, as well as distance to any other segment or topological position. As a consequence, user interaction and data analysis operate directly on biological entities (roots) and are not hampered by the spatially discrete, pixel-based nature of the original image. The software supports a sampling-based analysis of root system images, in which detailed information is collected on a limited number of roots selected by the user according to specific research requirements. The use of the software is illustrated with a time-lapse analysis of cluster root formation in lupin (Lupinus albus) and an architectural analysis of the maize (Zea mays) root system. The software, SmartRoot, is an operating-system-independent freeware based on ImageJ and relies on cross-platform standards for communication with data-analysis software.
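A hedged sketch of the vectorial representation described above (class and attribute names are hypothetical, not SmartRoot's data model): a root is a parsimonious chain of segments carrying biological attributes rather than raw pixels.

```python
from dataclasses import dataclass, field
import math

@dataclass
class Segment:
    x: float          # position in image coordinates
    y: float
    diameter: float   # biological attribute attached to the segment

@dataclass
class Root:
    segments: list = field(default_factory=list)

    def length(self):
        """Total root length as the sum of inter-segment distances."""
        return sum(math.dist((a.x, a.y), (b.x, b.y))
                   for a, b in zip(self.segments, self.segments[1:]))

root = Root([Segment(0, 0, 1.2), Segment(0, 3, 1.0), Segment(4, 3, 0.8)])
print(root.length())  # 7.0
```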

14.
BEAST 2: A Software Platform for Bayesian Evolutionary Analysis (cited by: 1; self-citations: 0; other citations: 1)
We present a new open source, extensible and flexible software platform for Bayesian evolutionary analysis called BEAST 2. This software platform is a re-design of the popular BEAST 1 platform to correct structural deficiencies that became evident as the BEAST 1 software evolved. Key among those deficiencies was the lack of post-deployment extensibility. BEAST 2 now has a fully developed package management system that allows third-party developers to write additional functionality that can be installed directly into the BEAST 2 analysis platform via a package manager, without requiring a new software release of the platform. This package architecture is showcased with a number of recently published new models encompassing birth-death-sampling tree priors, phylodynamics, and model averaging for substitution models and site partitioning. A second major improvement is the ability to read/write the entire state of the MCMC chain to/from disk, allowing it to be easily shared between multiple instances of the BEAST software. This facilitates checkpointing and better support for multi-processor and high-end computing extensions. Finally, the functionality in new packages can easily be added to the user interface (BEAUti 2) by a simple XML template-based mechanism, because BEAST 2 has been re-designed to provide greater integration between the analysis engine and the user interface so that, for example, BEAST and BEAUti use exactly the same XML file format.
This is a PLOS Computational Biology Software Article.
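A minimal sketch of the checkpoint/resume idea (our illustration; BEAST 2's actual on-disk state format and file names are its own): serialise the full chain state so another process can resume or share it.

```python
import pickle

def save_state(path, state):
    """Write the entire MCMC state to disk as a checkpoint."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def resume(path):
    """Load a checkpoint so a new process can continue the chain."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical state contents: iteration counter, current tree, RNG seed.
state = {"iteration": 10_000, "tree": "((A:0.1,B:0.1):0.2,C:0.3);", "rng_seed": 42}
save_state("chain.ckpt", state)
print(resume("chain.ckpt")["iteration"])  # 10000
```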

15.
The current status and portability of our sequence handling software. (Cited by: 94; self-citations: 15; other citations: 79)
I describe the current status of our sequence analysis software. The package contains a comprehensive suite of programs for managing large shotgun sequencing projects, a program containing 61 functions for analysing single sequences, and a program for comparing pairs of sequences for similarity. The programs that have been described before have been improved by the addition of new functions and by being made very much easier to use. The major interactive programs have 125 pages of online help available from within them. Several new programs are described, including screen editing of aligned gel readings for shotgun sequencing projects, a method to highlight errors in aligned gel readings, and new methods for searching for putative signals in sequences. We use the programs on a VAX computer, but the whole package has been rewritten to make it easy to transport to other machines. I believe the programs will now run on any machine with a FORTRAN77 compiler and sufficient memory. We are currently putting the programs onto an IBM PC XT/AT and another micro running under UNIX.

16.
In this paper, we investigate the benefits that organisations can reap by using "Cloud Computing" providers to augment the computing capacity of their local infrastructure. We evaluate the cost of seven scheduling strategies used by an organisation that operates a cluster managed by virtual machine technology and seeks to utilise resources from a remote Infrastructure as a Service (IaaS) provider to reduce the response time of its user requests. Requests for virtual machines are submitted to the organisation's cluster, but additional virtual machines are instantiated in the remote provider and added to the local cluster when there are insufficient local resources to serve the users' requests. Naïve scheduling strategies can have a great impact on the amount the organisation pays for using the remote resources, potentially increasing the overall cost of using IaaS. Therefore, in this work we investigate seven scheduling strategies that consider the use of resources from the "Cloud", to understand how these strategies achieve a balance between performance and usage cost, and how much they improve the requests' response times.
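A hedged sketch of one naive "burst to the cloud" policy (our illustration, not one of the paper's seven strategies): serve a VM request locally while capacity allows, and pay an hourly price only for the overflow placed on the IaaS provider.

```python
class HybridScheduler:
    def __init__(self, local_capacity, hourly_price):
        self.local_free = local_capacity   # free VM slots on the local cluster
        self.hourly_price = hourly_price   # IaaS price per VM-hour (illustrative)
        self.cost = 0.0

    def schedule(self, vms_needed, hours):
        """Place as many VMs locally as possible; burst the rest to the cloud."""
        local = min(vms_needed, self.local_free)
        remote = vms_needed - local
        self.local_free -= local
        self.cost += remote * hours * self.hourly_price  # pay only for overflow
        return local, remote

s = HybridScheduler(local_capacity=8, hourly_price=0.10)
print(s.schedule(vms_needed=10, hours=3), round(s.cost, 2))  # (8, 2) 0.6
```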

17.
In high performance computing (HPC), extensive experiments are frequently executed, and HPC resources (e.g. computing machines and switches) should be able to handle running several experiments in parallel. HPC typically exploits parallelism in programs, processing and data, yet the underlying network remains the one non-parallelized HPC component (i.e. there is no dynamic virtual slicing based on HPC jobs). In this paper we present an approach that utilizes software defined networking (SDN) to parallelize HPC clusters among the different running experiments. We propose to accomplish this through two major components: a passive module (network mapper/remapper) that selects, for each experiment as soon as it starts, the least busy resources in the network, and an SDN-HPC active load balancer that performs more complex and intelligent operations. The active load balancer can logically divide the network based on the experiments' host files. The goal is to reduce traffic to unnecessary hosts or ports: an HPC experiment should multicast only to the cluster nodes it uses, rather than broadcast. We use virtual tenant network modules in the OpenDaylight controller to create VLANs based on HPC experiments. In each HPC host, virtual interfaces are created to isolate traffic from the different experiments, and traffic between physical hosts that belong to the same experiment is distinguished by the VLAN ID assigned to that experiment. We evaluate the new approach using several public HPC benchmarks. Results show a significant enhancement in experiments' performance, especially when the HPC cluster runs several heavy-load experiments simultaneously. Results also show that this multicasting approach can significantly reduce the casting overhead caused by using a single cast for all resources in the HPC cluster. In comparison with InfiniBand networks, which offer interconnect services with low latency and high bandwidth, HPC services based on SDN can provide two distinct benefits that may not be possible with InfiniBand: first, integration of HPC with Ethernet enterprise networks, expanding HPC usage to much wider domains; and second, the ability to let users and their applications customize HPC services with different QoS requirements that fit the needs of those applications and optimize the usage of HPC clusters.
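A hedged sketch of the per-experiment isolation step (names are hypothetical; the paper uses OpenDaylight virtual tenant network modules rather than this logic): map each experiment's host file to its own VLAN ID so its nodes exchange traffic among themselves instead of broadcasting cluster-wide.

```python
def assign_vlans(experiments, base_vlan=100):
    """experiments: {name: [hosts]} -> {name: (vlan_id, hosts)}, one VLAN per experiment."""
    return {name: (base_vlan + i, sorted(hosts))
            for i, (name, hosts) in enumerate(sorted(experiments.items()))}

# Hypothetical host files for two concurrently running experiments.
vlans = assign_vlans({"lammps-run": ["n1", "n2"], "wrf-run": ["n3", "n4", "n5"]})
for name, (vlan, hosts) in vlans.items():
    print(f"{name}: VLAN {vlan} -> {hosts}")
```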

18.
Influenza incidence forecasting is used to facilitate better health system planning and could potentially be used to allow at-risk individuals to modify their behavior during a severe seasonal influenza epidemic or a novel respiratory pandemic. For example, the US Centers for Disease Control and Prevention (CDC) runs an annual competition to forecast influenza-like illness (ILI) at the regional and national levels in the US, based on a standard discretized incidence scale. Here, we use a suite of forecasting models to analyze type-specific incidence at the smaller spatial scale of clusters of nearby counties. We used data from point-of-care (POC) diagnostic machines over three seasons, in 10 clusters, capturing 57 counties, 1,061,891 total specimens, and 173,909 specimens positive for influenza A. Total specimens were closely correlated with comparable CDC ILI data. Mechanistic models were substantially more accurate when forecasting influenza A-positive POC data than total-specimen POC data, especially at longer lead times. Also, models that fit subpopulations of the cluster (individual counties) separately were better able to forecast clusters than models fit directly to aggregated cluster data. Public health authorities may wish to consider developing forecasting pipelines for type-specific POC data in addition to ILI data. Simple mechanistic models will likely improve forecast accuracy when applied at small spatial scales to pathogen-specific data before being scaled to larger geographical units and broader syndromic data. Highly local forecasts may enable new public health messaging to encourage at-risk individuals to temporarily reduce their social mixing during seasonal peaks, and guide public health intervention policy during potentially severe novel influenza pandemics.

19.
20.
Clusters of Symmetrical Multiprocessors (SMPs) have recently become the norm for high-performance, economical computing solutions. Multiple nodes in a cluster can be used for parallel programming with a message passing library; an alternative approach is to use a software Distributed Shared Memory (DSM) to provide a view of shared memory to the application programmer. This paper describes Strings, a high performance distributed shared memory system designed for such SMP clusters. The distinguishing feature of this system is the use of a fully multi-threaded runtime system, using kernel-level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel-level lightweight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system and illustrates the performance improvements with benchmark programs from the SPLASH-2 suite, some computational kernels, as well as a full-fledged application. We find that using multiple processes on SMP nodes provides good speedups for only a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree.

