Similar Literature (20 results)
1.
In this paper we present a scheduling strategy for workstation clusters that can effectively and fairly schedule general-purpose workloads made up of compute-bound, interactive, and I/O-intensive applications, each of which may be sequential, client-server, or parallel. The scheduling strategy allocates resources to the processes of the same parallel application in such a way that they all get the same CPU share regardless of the level of resource contention on their respective machines, and it relies on an extended stride scheduler to fairly allocate individual workstations. A simulation analysis carried out for a variety of workloads and operational conditions shows that our strategy (a) delivers good performance to all the application classes composing general-purpose workloads, (b) fairly allocates resources among competing applications, and (c) outperforms alternative strategies.
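To make the fair-share mechanism mentioned above concrete, a minimal sketch of a stride scheduler on a single workstation follows; the client names and ticket values are hypothetical, and the paper's cluster-wide extension (equalizing the share of processes belonging to the same parallel application) is not shown.

```python
# Minimal stride-scheduler sketch (single machine, fixed-length quanta).
import heapq

STRIDE1 = 1 << 20  # large constant so per-client strides stay integral

def stride_schedule(clients, quanta):
    """clients: name -> tickets. Returns the run order over `quanta` time slices."""
    queue = []
    for name, tickets in clients.items():
        stride = STRIDE1 // tickets
        heapq.heappush(queue, (0, name, stride))          # (pass, name, stride)
    order = []
    for _ in range(quanta):
        pass_val, name, stride = heapq.heappop(queue)     # lowest pass runs next
        order.append(name)
        heapq.heappush(queue, (pass_val + stride, name, stride))
    return order

# Hypothetical ticket assignment: the parallel application holds twice the tickets.
print(stride_schedule({"parallel_app": 200, "interactive": 100, "io_bound": 100}, 8))
```

Over a long run each client receives CPU quanta in proportion to its tickets, which is the property the extended stride scheduler builds on.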

2.
List scheduling algorithms are known to be efficient when the application to be executed can be described statically as a Directed Acyclic Graph (DAG) of tasks. Even when the entire DAG is known beforehand, obtaining an optimal schedule on a parallel machine is an NP-hard problem. Moreover, many programming tools propose the use of scheduling techniques based on list strategies. This paper presents an analysis of scheduling algorithms for multithreaded programs in a dynamic scenario where threads are created and destroyed during execution. We introduce an algorithm to convert DAGs, which describe applications as tasks, into Directed Cyclic Graphs (DCGs) that describe the same application written against a multithread programming interface. Our algorithm covers case studies described in previous works, successfully mapping from the abstract level of graphs to the application environment. These mappings preserve the guarantees offered by the abstract model, providing efficient scheduling of dynamic programs that follow the intended multithread model. We conclude the paper by presenting performance results obtained with list schedulers in dynamic multithreaded environments, and we compare these results with the best scheduling we could obtain with similar static task schedulers.
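For reference, the static list-scheduling baseline that such analyses start from can be sketched as follows: rank tasks by their bottom level, then greedily map each one to the processor giving the earliest start. The DAG, costs, and processor count below are invented, and the paper's DAG-to-DCG conversion for dynamically created threads is not shown.

```python
# Static list scheduling of a task DAG (illustrative only).
# dag: task -> list of successors; cost: task -> execution time.
def list_schedule(dag, cost, num_procs):
    preds = {t: [] for t in dag}
    for t, succs in dag.items():
        for s in succs:
            preds[s].append(t)

    # Priority = bottom level (longest path to an exit task, including own cost).
    blevel = {}
    def bl(t):
        if t not in blevel:
            blevel[t] = cost[t] + max((bl(s) for s in dag[t]), default=0)
        return blevel[t]

    ready_time = [0.0] * num_procs           # when each processor becomes free
    finish = {}                              # task -> finish time
    schedule = []
    for t in sorted(dag, key=bl, reverse=True):   # descending bottom level is topological
        est = max((finish[p] for p in preds[t]), default=0.0)  # predecessors done
        proc = min(range(num_procs), key=lambda p: max(ready_time[p], est))
        start = max(ready_time[proc], est)
        finish[t] = start + cost[t]
        ready_time[proc] = finish[t]
        schedule.append((t, proc, start, finish[t]))
    return schedule

dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
print(list_schedule(dag, cost, num_procs=2))
```

Dynamic multithreaded runtimes apply essentially the same greedy rule, but to the set of threads that happen to be ready at each instant rather than to a DAG known up front.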

3.
We develop a workload model based on the observed behavior of parallel computers at the San Diego Supercomputer Center and the Cornell Theory Center. This model gives us insight into the performance of strategies for scheduling moldable jobs on space-sharing parallel computers. We find that Adaptive Static Partitioning (ASP), which has been reported to work well for other workloads, does not perform as well as strategies that adapt better to system load. The best of the strategies we consider is one that explicitly reduces allocations when load is high (a variation of Sevcik's (1989) A+ strategy).

4.
Many-task computing aims to bridge the gap between two computing paradigms, high-throughput computing and high-performance computing. Many-task computing denotes high-performance computations comprising multiple distinct activities coupled via file system operations. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Traditional techniques found in the scientific community's production systems do not scale to today's largest systems, due to issues in local resource manager scalability and granularity, efficient utilization of the raw hardware, long wait queue times, and shared/parallel file system contention and scalability. To address these limitations, we adopted a "top-down" approach to building a middleware called Falkon, to support the most demanding many-task computing applications at the largest scales. Falkon (Fast and Light-weight tasK executiON framework) integrates (1) multi-level scheduling to enable dynamic resource provisioning and minimize wait queue times, (2) a streamlined task dispatcher able to achieve orders-of-magnitude higher task dispatch rates than conventional schedulers, and (3) data diffusion, which performs data caching and uses a data-aware scheduler to co-locate computational and storage resources. Micro-benchmarks have shown Falkon to achieve throughputs of over 15,000 tasks/s, scale to hundreds of thousands of processors and millions of queued tasks, and execute billions of tasks per day. Data diffusion has also been shown to improve application scalability and performance, achieving hundreds of Gb/s I/O rates on modest-sized clusters, with Tb/s I/O rates on the horizon. Falkon has shown orders-of-magnitude improvements in performance and scalability over traditional approaches to resource management across many diverse workloads and applications, at scales of billions of tasks on hundreds of thousands of processors across clusters, specialized systems, Grids, and supercomputers. Falkon's performance and scalability have enabled this new class of many-task computing applications to operate at scales previously believed impossible, with high efficiency.

5.
A Load Balancing Tool for Distributed Parallel Loops
Large-scale applications typically contain parallel loops with many iterations. The iterations of a parallel loop may have variable execution times, which translate into performance degradation of an application due to load imbalance. This paper describes a tool for load balancing parallel loops on distributed-memory systems. The tool assumes that the data for the parallel loop to be executed is already partitioned among the participating processors. The tool utilizes the MPI library for interprocessor coordination and determines processor workloads by loop scheduling techniques. The tool was designed to be independent of any application; hence, it must be supplied with a routine that encapsulates the computations for a chunk of loop iterations, as well as routines to transfer data and results between processors. Performance evaluation on a Linux cluster indicates that the tool reduces the cost of executing a simulated irregular loop by up to 81% compared with execution without load balancing. The tool is useful for parallelizing sequential applications that contain parallel loops, or as an alternative load balancing routine for existing parallel applications.
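The core idea such a tool relies on, handing out loop iterations in shrinking chunks so that faster or less loaded processors pick up more work, can be sketched without MPI. The chunk rule shown is guided self-scheduling, only one of several loop scheduling techniques such a tool might offer, and the worker speeds below are invented.

```python
# Guided self-scheduling sketch: a central dispatcher hands out shrinking
# chunks of iterations; an MPI-based tool would do the same over messages.
import math, random

def guided_chunks(total_iters, num_workers, min_chunk=1):
    remaining, start = total_iters, 0
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / num_workers))
        yield (start, start + size)          # half-open iteration range
        start += size
        remaining -= size

def simulate(total_iters=1000, num_workers=4):
    # Invented per-worker speeds stand in for heterogeneous or loaded processors.
    speed = [1.0, 0.8, 0.5, 1.2]
    busy_until = [0.0] * num_workers
    work = [0] * num_workers
    for lo, hi in guided_chunks(total_iters, num_workers):
        w = min(range(num_workers), key=lambda i: busy_until[i])  # next idle worker
        busy_until[w] += (hi - lo) * random.uniform(0.9, 1.1) / speed[w]
        work[w] += hi - lo
    return work, max(busy_until)

print(simulate())
```

Faster workers finish their chunks sooner and request more, so iteration counts end up roughly proportional to speed, which is exactly the imbalance-reducing effect the tool measures.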

6.
In large-scale parallel computing environments, resource allocation and energy-efficient techniques are required to deliver quality of service (QoS) and to reduce the operational cost of the system, since the cost of energy consumption is a dominant part of the owner's and user's budget. However, when energy efficiency is taken into account, resource allocation strategies become more difficult and QoS (i.e., queue time and response time) may be violated. This paper is therefore a comparative study of job scheduling in large-scale parallel systems that aims to (a) minimize queue time, response time, and energy consumption and (b) maximize overall system utilization. We compare thirteen job scheduling policies to analyze their behavior. The set of policies includes (a) priority-based, (b) first-fit, (c) backfilling, and (d) window-based policies. All of the policies are extensively simulated and compared using a real data center workload comprising 22,385 jobs. Based on their performance, we incorporate energy efficiency into three policies: (1) the best performer, (2) the average performer, and (3) the worst performer. We analyze (a) queue time, (b) response time, (c) slowdown ratio, and (d) energy consumption to evaluate the policies. Moreover, we present a comprehensive workload characterization for optimizing system performance and for scheduler design. Major workload characteristics, including (a) narrow, (b) wide, (c) short, and (d) long jobs, are characterized for a detailed analysis of the schedulers' performance. This study highlights the strengths and weaknesses of various job scheduling policies and helps in choosing an appropriate policy in a given scenario.

7.
Cluster computing has become an increasingly popular choice for high performance computing, mainly because of its favorable cost-performance ratio. Resource management systems (RMSs) are the key component for managing cluster resources efficiently, and their job scheduling module in particular plays a vital role in the performance of distributed parallel systems. In this paper, we empirically evaluate four resource management systems (SGE, TORQUE, Maui Scheduler, and SLURM) with a special focus on the job scheduler component. These schedulers have been evaluated on a comprehensive set of metrics, such as throughput and CPU, memory, and network utilization. Experiments were carried out on three testbeds of different sizes with a range of scheduler configurations, including FCFS, backfilling, fair-share, and SJF scheduling techniques.

8.
Bioorthogonal ‘click’ reactions have recently emerged as promising tools for chemistry and biological applications. By using a combination of two different ‘click’ reactions, ‘double-click’ strategies have been developed to attach multiple labels onto biomacromolecules. These strategies require multi-step modifications of the biomacromolecules, which can lead to heterogeneity in the final conjugates. Herein, we report the synthesis and characterization of a set of three trifunctional linkers. The linkers bear alkyne and cyclooctyne moieties capable of participating in sequential copper(I)-catalyzed and copper-free cycloaddition reactions with azides. We have also prepared a linker comprising an alkyne and a 1,2,4,5-tetrazine moiety, which allows for simultaneous cycloaddition reactions with azides and trans-cyclooctenes, respectively. These linkers can be attached to synthetic or biological macromolecules to create a platform capable of sequential or parallel ‘double-click’ labeling in biological systems. We demonstrate this potential using a generation 5 (G5) polyamidoamine (PAMAM) dendrimer in combination with the clickable linkers. The dendrimers were successfully modified with these linkers, and we demonstrate both sequential and parallel ‘double-click’ labeling with fluorescent reporters. We anticipate that these linkers will have a variety of applications, including molecular imaging and monitoring of macromolecule interactions in biological systems.

9.
During the past decade, cluster computing and mobile communication technologies have been extensively deployed and widely applied because of their enormous commercial value. Rapid technological advancement makes it feasible to integrate these two technologies, and a revolutionary application called mobile cluster computing is emerging on the horizon. Mobile cluster computing can further enhance the power of our laptops and mobile devices by running parallel applications. However, scheduling parallel applications on mobile clusters is technically challenging due to the significant communication latency and limited battery life of mobile devices. Therefore, shortening schedule length and conserving energy have become two major concerns in designing efficient and energy-aware scheduling algorithms for mobile clusters. In this paper, we propose two novel scheduling strategies aimed at balancing performance and power consumption for parallel applications running on mobile clusters. Our research focuses on scheduling precedence-constrained parallel tasks, and duplication heuristics are therefore applied to minimize communication overheads. However, existing duplication algorithms are developed with consideration of schedule length only, completely ignoring the energy consumption of clusters. In this regard, we design two energy-aware duplication scheduling algorithms, called EADUS and TEBUS, to schedule precedence-constrained parallel tasks with a complexity of O(n²), where n is the number of tasks in a parallel task set. Unlike existing duplication-based scheduling algorithms that replicate all possible predecessors of each task, the proposed algorithms judiciously replicate the predecessors of a task only if the duplication helps conserve energy. Our energy-aware scheduling strategies are conducive to balancing schedule length and energy savings for a set of precedence-constrained parallel tasks. We conducted extensive experiments using both synthetic benchmarks and real-world applications to compare our algorithms with two existing approaches. Experimental results based on simulated mobile clusters demonstrate the effectiveness and practicality of the proposed duplication-based scheduling strategies. For example, EADUS and TEBUS reduce the energy consumption of the Gaussian Elimination application by averages of 16.08% and 8.1%, with increases of merely 5.7% and 2.2% in schedule length, respectively.
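The duplication trade-off such algorithms exploit can be illustrated with a toy decision rule (not the actual EADUS/TEBUS formulation): duplicate a predecessor onto the child's node only when recomputing it costs less energy than the communication it removes, and the duplication does not delay the child. All power and time parameters below are hypothetical.

```python
# Toy energy-aware duplication check (illustrative; not the EADUS/TEBUS algorithms).
def should_duplicate(exec_time, comm_time,
                     cpu_power=6.0, net_power=2.0, idle_power=1.5):
    """Duplicate a predecessor on the child's node only if it saves energy
    and does not extend the child's earliest start time."""
    energy_dup = exec_time * cpu_power                            # recompute locally
    energy_msg = comm_time * net_power + comm_time * idle_power   # send + child waits idle
    saves_energy = energy_dup < energy_msg
    not_slower = exec_time <= comm_time                           # local copy ready no later
    return saves_energy and not_slower

print(should_duplicate(exec_time=2.0, comm_time=5.0))   # True: cheap task, long transfer
print(should_duplicate(exec_time=4.0, comm_time=1.0))   # False: duplication costs more
```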

10.
Finding statistically significant communities in networks
Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a great need for multi-purpose techniques able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies, and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools from extreme and order statistics. OSLOM can be used alone or as a refinement procedure for partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method performs comparably to the best existing algorithms on artificial benchmark graphs, and several applications on real networks are shown as well. OSLOM is implemented in freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks.
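The local-optimization idea can be illustrated with a toy greedy expansion; the fitness below is a simple internal/external edge ratio chosen only for readability, not OSLOM's order-statistics significance score, and the example graph is invented.

```python
# Toy local community expansion (illustrative; OSLOM's actual fitness is a
# statistical-significance score based on order statistics, not shown here).
def expand_community(adj, seed):
    """adj: node -> set of neighbors. Greedily grows a community around `seed`,
    adding the neighbor that most improves a simple internal/external edge ratio."""
    def fitness(comm):
        internal = sum(len(adj[v] & comm) for v in comm) / 2
        external = sum(len(adj[v] - comm) for v in comm)
        return internal / (internal + external) if internal + external else 0.0

    comm = {seed}
    improved = True
    while improved:
        improved = False
        frontier = set().union(*(adj[v] for v in comm)) - comm
        best, best_gain = None, 0.0
        for cand in frontier:
            gain = fitness(comm | {cand}) - fitness(comm)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is not None:
            comm.add(best)
            improved = True
    return comm

# Two 4-cliques joined by a single edge; expansion from node 0 recovers the first clique.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}}
print(sorted(expand_community(adj, 0)))
```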

11.
Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications on clusters of commodity systems-on-chip. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important performance gains over their architecture-oblivious counterparts, while exploiting all the resources of the AMP to deliver considerable energy efficiency.
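An asymmetric-static distribution of this kind amounts to splitting the work on C between the clusters in proportion to their sustained throughput. The sketch below illustrates that split with invented per-core rates and with NumPy matrix products standing in for the BLIS micro-kernels; it is not the paper's implementation.

```python
# Asymmetric-static partition sketch for C = A @ B on a big.LITTLE AMP.
# Per-core throughput figures are invented; real values come from calibration.
import numpy as np

def split_columns(n_cols, big_cores=4, little_cores=4,
                  big_rate=3.0, little_rate=1.0):
    """Return (columns for the big cluster, columns for the LITTLE cluster)."""
    total = big_cores * big_rate + little_cores * little_rate
    big_cols = round(n_cols * big_cores * big_rate / total)
    return big_cols, n_cols - big_cols

def gemm_asymmetric(A, B):
    n = B.shape[1]
    big_cols, _ = split_columns(n)
    C = np.empty((A.shape[0], n))
    # In the real library each slice is further blocked for the cluster's cache
    # and handed to its own team of threads; here we just compute the slices.
    C[:, :big_cols] = A @ B[:, :big_cols]        # big cluster's share
    C[:, big_cols:] = A @ B[:, big_cols:]        # LITTLE cluster's share
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(gemm_asymmetric(A, B), A @ B)
```

A dynamic strategy would instead hand out smaller column blocks on demand, which tolerates mispredicted rates at the cost of more scheduling overhead.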

12.
Heterogeneous parallel clusters of workstations are being used to solve many important computational problems. Scheduling parallel applications on the best collection of machines in a heterogeneous computing environment is a complex problem, and performance prediction is vital to good application performance, since using an ill-suited machine can slow the computation down significantly. This paper addresses the problem of network performance prediction. A new methodology is developed for characterizing network links and an application's need for network resources, making use of Performance Surfaces [3]. This Performance Surface abstraction is then used to schedule a parallel application on the resources where it will run most efficiently.

13.
Parallel file systems have been developed in recent years to ease the I/O bottleneck of high-end computing systems. These file systems offer several data layout strategies in order to meet the performance goals of specific I/O workloads. However, while a layout policy may perform well for one I/O workload, it may not perform as well for another, and peak I/O performance is rarely achieved because data access patterns are complex and application dependent. In this study, a cost-intelligent data access strategy based on the principle of application-specific optimization is proposed to improve the I/O performance of parallel file systems. We first present examples illustrating how performance differs under different data layouts. We then develop a cost model that estimates the completion time of data accesses under various layouts, so that the layout can be better matched to the application: static layout optimization can be used for applications with dominant data access patterns, and dynamic layout selection with hybrid replication can be used for applications with complex I/O patterns. Theoretical analysis and experimental testing have been conducted to verify the proposed cost-intelligent layout approach. Analytical and experimental results show that the proposed cost model is effective and that the application-specific data layout approach can provide up to a 74% performance improvement for data-intensive applications.
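A minimal version of that cost-driven selection: estimate the completion time of a request stream under each candidate layout with a simple analytical model, then pick the cheaper one. The two layouts, the per-sub-request overhead, and the device parameters below are invented for illustration and are not the paper's cost model.

```python
# Toy layout cost model (illustrative, not the paper's model).
def est_time(requests, layout, servers=4, overhead=0.0002, bw=100e6):
    """requests: list of (offset, size) in bytes. Returns estimated seconds."""
    total = 0.0
    for _, size in requests:
        if layout == "striped":
            # one sub-request per server; assume client-side overheads serialize
            total += servers * overhead + (size / servers) / bw
        else:  # "contiguous": the whole request is served by a single server
            total += overhead + size / bw
    return total

def pick_layout(requests):
    costs = {l: est_time(requests, l) for l in ("striped", "contiguous")}
    return min(costs, key=costs.get), costs

large_seq = [(i * 64 * 2**20, 64 * 2**20) for i in range(16)]   # 16 x 64 MiB reads
small_rand = [(i * 4096, 4096) for i in range(10000)]           # 10k x 4 KiB reads
print(pick_layout(large_seq))    # striping amortizes overhead over large transfers
print(pick_layout(small_rand))   # contiguous avoids paying per-server overheads
```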

14.
Reich L, Weikl TR. Proteins 2006, 63(4):1052-1058
According to the "old view," proteins fold along well-defined sequential pathways, whereas the "new view" sees protein folding as a highly parallel stochastic process on funnel-shaped energy landscapes. We have analyzed parallel and sequential processes on a large number of molecular dynamics unfolding trajectories of the protein CI2 at high temperatures. Using rigorous statistical measures, we quantify the degree of sequentiality on two structural levels. The unfolding process is highly parallel on the microstructural level of individual contacts. On a coarser, macrostructural level of contact clusters, characteristic parallel and sequential events emerge. These characteristic events can be understood from loop-closure dependencies between the contact clusters. A correlation analysis of the unfolding times of the contacts reveals a high degree of substructural cooperativity within the contact clusters.

15.
Backfilling is a simple and effective way of improving the utilization of space-sharing schedulers. Simple first-come-first-served approaches are ineffective because large jobs can fragment the available resources. Backfilling schedulers address this problem by allowing jobs to move ahead in the queue, provided that they will not delay subsequent jobs. Previous research has shown that inaccurate estimates of execution times can lead to better backfilling schedules. In the first part of this study, we characterize this effect on several workloads and show that average slowdowns can be effectively reduced by systematically lengthening estimated execution times. Further, we show that the average job slowdown metric can be addressed directly by sorting jobs by increasing execution time. Finally, we modify our sorting scheduler to ensure that incoming jobs can be given hard guarantees. The resulting scheduler is guaranteed to avoid starvation and performs significantly better than previous backfilling schedulers. In the second part of this study, we show how queue randomization, and even more so a combination of queue randomization and sorting by job length, can improve performance. These improvements exceed those of queue sorting by job length alone in simulations driven by actual estimates of job running times. We investigate the real characteristics of these estimates and show the wide range of overestimation. To exploit randomization and queue sorting even further, we eliminate guarantees from the backfilling algorithm and show significant additional improvements. Finally, we show the limited usefulness of these guarantees and show that the queue sorting criteria can be modified to prevent starvation in the modified backfilling algorithm.
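For context, the baseline that such studies modify is EASY backfilling: the blocked job at the head of the queue receives a reservation, and later jobs may jump ahead only if they do not delay it. A compact sketch of one scheduling pass follows, with invented job data and without the sorting and randomization variants discussed above.

```python
# One scheduling pass of EASY backfilling (sketch; job data invented).
# running: list of (end_time, nodes); queue: FIFO of dicts with nodes & est_runtime.
def easy_backfill(now, free_nodes, running, queue):
    started = []
    queue = list(queue)
    # 1. Start jobs from the head of the queue while they fit.
    while queue and queue[0]["nodes"] <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job["nodes"]
        running.append((now + job["est_runtime"], job["nodes"]))
        started.append(job["id"])
    if not queue:
        return started
    # 2. Compute the reservation (shadow time) for the blocked head job.
    head = queue[0]
    avail, shadow, extra = free_nodes, None, 0
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head["nodes"]:
            shadow = end                    # head job can start at this time
            extra = avail - head["nodes"]   # nodes the head job will not need
            break
    if shadow is None:                      # head job larger than the machine
        return started
    # 3. Backfill later jobs that do not delay the reservation.
    for job in queue[1:]:
        fits_now = job["nodes"] <= free_nodes
        safe = (now + job["est_runtime"] <= shadow) or (job["nodes"] <= extra)
        if fits_now and safe:
            free_nodes -= job["nodes"]
            if job["nodes"] <= extra:
                extra -= job["nodes"]
            running.append((now + job["est_runtime"], job["nodes"]))
            started.append(job["id"])
            queue.remove(job)
    return started

queue = [{"id": "A", "nodes": 8, "est_runtime": 100},
         {"id": "B", "nodes": 2, "est_runtime": 10},
         {"id": "C", "nodes": 4, "est_runtime": 500}]
print(easy_backfill(now=0, free_nodes=4, running=[(50, 6)], queue=queue))  # ['B']
```

Because the safety test uses the estimated runtimes, overestimation leaves holes that more jobs can safely fill, which is why inaccurate estimates can paradoxically improve the schedule.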

16.
This article develops a latent model and likelihood-based inference to detect temporal clustering of events. The model mimics typical processes generating the observed data. We apply model selection techniques to determine the number of clusters, and develop likelihood inference and a Monte Carlo expectation-maximization algorithm to estimate model parameters, detect clusters, and identify cluster locations. Our method differs from the classical scan statistic in that we can simultaneously detect multiple clusters of varying sizes. We illustrate the methodology with two real data applications and evaluate its efficiency through simulation studies. For the typical data-generating process, our methodology is more efficient than a competing procedure that relies on least squares.

17.
Heterogeneous networked clusters are being increasingly used as platforms for resource-intensive parallel and distributed applications. The fundamental underlying idea is to provide large amounts of processing capacity over extended periods of time by harnessing the idle and available resources on the network in an opportunistic manner. In this paper we present the design, implementation, and evaluation of a framework that uses JavaSpaces to support this type of opportunistic adaptive parallel/distributed computing over networked clusters in a non-intrusive manner. The framework targets applications exhibiting coarse-grained parallelism and has three key features: (1) portability across heterogeneous platforms, (2) minimal configuration overheads for participating nodes, and (3) automated system state monitoring (using SNMP) to ensure non-intrusive behavior. Experimental results presented in this paper demonstrate that for applications that can be broken into coarse-grained, relatively independent tasks, the opportunistic adaptive parallel computing framework can provide performance gains. Furthermore, the results indicate that monitoring and reacting to the current system state minimizes the intrusiveness of the framework.

18.
Space is a very important aspect in the simulation of biochemical systems, and the need for simulation algorithms able to cope with space is becoming more and more compelling. Complex and detailed models of biochemical systems need to deal with the movement of single molecules and particles, taking into consideration localized fluctuations, transport phenomena, and diffusion. A common drawback of spatial models lies in their complexity: models can become very large, and their simulation can be time-consuming, especially if we want to capture the system's behavior in a reliable way using stochastic methods in conjunction with a high spatial resolution. In order to deliver on the promise of systems biology to understand a system as a whole, we need to scale up the size of the models we are able to simulate, moving from sequential to parallel simulation algorithms. In this paper, we analyze Smoldyn, a widely used algorithm for stochastic simulation of chemical reactions with spatial resolution and single-molecule detail, and we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs). The implementation executes the most computationally demanding steps (diffusion, unimolecular and bimolecular reactions, and the most common cases of molecule-surface interaction) on the GPU, computing them in parallel for each molecule in the system. The implementation offers good speed-ups and real-time, high-quality graphics output.
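The per-molecule work that a GPU version parallelizes is simple to state: at every time step each molecule takes an independent Gaussian displacement scaled by its diffusion coefficient. A NumPy sketch of that diffusion step is shown below with invented parameters; on a GPU each molecule's update would be handled by one thread, and the vectorized array operation plays that role here on the CPU.

```python
# Brownian-diffusion step in the style of a particle-based simulator (sketch only).
import numpy as np

def diffuse(positions, diff_coeff, dt, rng):
    """positions: (N, 3) array in um; diff_coeff in um^2/s; dt in s."""
    sigma = np.sqrt(2.0 * diff_coeff * dt)          # rms displacement per axis
    return positions + rng.normal(0.0, sigma, positions.shape)

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(100_000, 3))      # 1e5 molecules in a 10 um box
for _ in range(100):                                 # 100 time steps of 1 ms
    pos = diffuse(pos, diff_coeff=1.0, dt=1e-3, rng=rng)
    np.clip(pos, 0.0, 10.0, out=pos)                 # crude stand-in for boundary handling
print(pos.mean(axis=0), pos.std(axis=0))
```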

19.
20.
In today’s scaled-out systems, co-scheduling data analytics work with high-priority user workloads is common, as it makes better use of the vast available hardware. User workloads are dominated by periodic patterns, with alternating periods of high and low utilization, creating promising conditions for scheduling data analytics work during low-activity periods. To this end, we show the effectiveness of machine learning models in accurately predicting user workload intensities, essentially suggesting the most opportune time to co-schedule data analytics work. Yet machine learning models cannot predict the effects of performance interference when co-scheduling is employed, as this constitutes a “new” observation. In tiered storage systems in particular, the hierarchical design makes performance interference even more complex, so accurate performance prediction is more challenging. Here, we quantify the unknown performance effects of workload co-scheduling by enhancing machine learning models with queueing theory models, developing a hybrid approach that can accurately predict performance and guide scheduling decisions in a tiered storage system. Using traces from commercial systems, we illustrate that queueing theory and machine learning models can be used in synergy to overcome their respective weaknesses and deliver robust co-scheduling solutions that achieve high performance.
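One way to read such a hybrid approach: a learned model forecasts the user workload intensity for the next window, and a queueing formula turns the combined user-plus-analytics load into a predicted response time that the scheduler checks against a target. The sketch below uses a trivial moving-average "model" and an M/M/1 approximation purely for illustration; neither is the paper's actual method, and all rates are invented.

```python
# Hybrid prediction sketch (illustrative; not the paper's models).
def forecast_rate(history, window=4):
    """Moving-average stand-in for the machine-learning intensity model."""
    return sum(history[-window:]) / min(window, len(history))

def mm1_response_time(arrival_rate, service_rate):
    if arrival_rate >= service_rate:
        return float("inf")                    # unstable: queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

def can_coschedule(history, analytics_rate, service_rate=120.0, slo=0.05):
    """Co-schedule only if the predicted response time stays within the target."""
    predicted_user = forecast_rate(history)
    rt = mm1_response_time(predicted_user + analytics_rate, service_rate)
    return rt <= slo, rt

hourly_user_rates = [90, 85, 60, 30, 20, 15]     # requests/s, entering a quiet period
print(can_coschedule(hourly_user_rates, analytics_rate=40.0))     # likely acceptable
print(can_coschedule([100, 110, 105, 100], analytics_rate=40.0))  # likely violates target
```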
