1.

Key Message Approach to Optimize Communication of Parallel Applications on Clusters

Ming Zhu Wentong Cai Bu-Sung Lee 《Cluster computing》2003,6(3):253-265

Over the past few years, cluster/distributed computing has been gaining popularity. The proliferation of cluster/distributed computing is due to the improved performance and increased reliability of these systems. Many parallel programming languages and related parallel programming models have become widely accepted. However, one of the major shortcomings of running parallel applications on cluster/distributed computing environments is the high communication overhead incurred. To reduce the communication overhead, and thus the completion time of a parallel application, this paper describes a simple, efficient and portable Key Message (KM) approach to support parallel computing on cluster/distributed computing environments. To demonstrate the advantage of the KM approach, a prototype runtime system has been implemented and evaluated. Our preliminary experimental results show that the KM approach yields a larger improvement in the communication of a parallel application when the network background load increases or the computation-to-communication ratio of the application decreases.

2.

Performance portability on EARTH: a case study across several parallel architectures

Weirong Zhu Yanwei Niu Guang R. Gao 《Cluster computing》2007,10(2):115-126

Due to the increasing diversity of parallel architectures and the increasing development time for parallel applications, performance portability has become one of the major considerations when designing the next generation of parallel program execution models, APIs, and runtime system software. This paper analyzes both code portability and performance portability of parallel programs for fine-grained multi-threaded execution and architecture models. We concentrate on one particular event-driven fine-grained multi-threaded execution model, EARTH, and discuss several design considerations of the EARTH model and runtime system that contribute to the performance portability of parallel applications. We believe that these are important issues for future high-end computing system software design. Four representative benchmarks were run on several different parallel architectures, including two clusters listed in the 23rd supercomputer TOP500 list. The results demonstrate that EARTH-based programs can achieve robust performance portability across the selected hardware platforms without any code modification or tuning.

3.

Performance Analysis of a Myrinet-Based Cluster

Teddy Surya Gunawan Wentong Cai 《Cluster computing》2003,6(4):299-313

In recent years, there has been a growing interest in the cluster system as an accepted form of supercomputing, due to its high performance at an affordable cost. This paper presents a performance analysis of a Myrinet-based cluster. The communication performance and the effect of background load on parallel applications were analyzed. For point-to-point communication, it was found that an extension to Hockney's model was required to estimate the performance. The proposed model suggested that two message-size ranges should be used for the performance metrics to cope with the cache effect. Moreover, based on the extension of the point-to-point communication model, Xu and Hwang's model for collective communication performance was also extended. Results showed that our models estimate the communication performance better than the previous models. Finally, the interference of other user processes with the cluster system is evaluated by using synthetic background load generation programs.
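For orientation, Hockney's model estimates the time to send an m-byte message as a startup latency plus m divided by the asymptotic bandwidth. The sketch below shows one way a two-range variant could look, with separate parameters below and above a message-size threshold. The threshold, the parameter values, and the names `HockneyRange` and `transfer_time` are assumptions for illustration, not the paper's fitted model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HockneyRange:
    latency_s: float       # startup time t0, in seconds
    bandwidth_bps: float   # asymptotic bandwidth r_inf, in bytes per second


def transfer_time(msg_bytes: int, small: HockneyRange, large: HockneyRange,
                  threshold_bytes: int) -> float:
    """Estimate point-to-point time with Hockney's T(m) = t0 + m / r_inf,
    using one parameter set per message-size range (hypothetical threshold)."""
    params = small if msg_bytes <= threshold_bytes else large
    return params.latency_s + msg_bytes / params.bandwidth_bps


# Hypothetical parameters: messages that fit in cache see a faster path.
in_cache = HockneyRange(latency_s=10e-6, bandwidth_bps=100e6)
out_of_cache = HockneyRange(latency_s=20e-6, bandwidth_bps=50e6)
estimate = transfer_time(1000, in_cache, out_of_cache, threshold_bytes=32768)
```

In practice the two parameter sets would be fitted separately to ping-pong measurements on each side of the threshold.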

4.

Parallel FFT on ATM‐based networks of workstations

Suresh Chalasani Parameswaran Ramanathan 《Cluster computing》1998,1(1):13-26

We consider parallel computing on a network of workstations using a connection-oriented protocol (e.g., Asynchronous Transfer Mode) for data communication. In a connection-oriented protocol, a virtual circuit of guaranteed bandwidth is established for each pair of communicating workstations. Since not all virtual circuits have the same guaranteed bandwidth, a parallel application must deal with the unequal bandwidths between workstations. Since most work on the design of parallel algorithms assumes equal bandwidths on all the communication links, these algorithms often do not perform well when executed on networks of workstations using connection-oriented protocols. In this paper, we first evaluate the performance degradation caused by unequal bandwidths on the execution of conventional parallel algorithms such as the fast Fourier transform and bitonic sort. We then present a strategy based on dynamic redistribution of data points to reduce the bottlenecks caused by unequal bandwidths. We also extend this strategy to deal with processor heterogeneity. Using analysis and simulation we show that there is a considerable reduction in the runtime if the proposed redistribution strategy is adopted. The basic idea presented in this paper can also be used to improve the runtimes of other parallel applications in connection-oriented environments. This revised version was published online in July 2006 with corrections to the Cover Date.
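To make the idea of redistributing data points under unequal bandwidths concrete, the sketch below assigns points in proportion to each link's guaranteed bandwidth. This is an assumed, simplified policy chosen for illustration; the abstract does not state the paper's actual redistribution rule, and `proportional_split` is a hypothetical helper.

```python
def proportional_split(n_points: int, bandwidths: list[float]) -> list[int]:
    """Split n_points across links in proportion to their bandwidths.

    Uses largest-remainder rounding so the counts always sum to n_points.
    """
    total = sum(bandwidths)
    shares = [n_points * b / total for b in bandwidths]
    counts = [int(s) for s in shares]          # floor of each ideal share
    leftover = n_points - sum(counts)
    # Hand the remaining points to the links with the largest fractional part.
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# A link with twice the bandwidth receives twice the data.
counts = proportional_split(1024, [10.0, 10.0, 20.0])
```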

5.

Automated and dynamic abstraction of MPI application performance

Anna Sikora Tomàs Margalef Josep Jorba 《Cluster computing》2016,19(3):1105-1137

Developing an efficient parallel application is not an easy task, and achieving a good performance requires a thorough understanding of the program’s behavior. Careful performance analysis and optimization are crucial. To help developers or users of these applications to analyze the program’s behavior, it is necessary to provide them with an abstraction of the application performance. In this paper, we propose a dynamic performance abstraction technique, which enables the automated discovery of causal execution paths, composed of communication and computational activities, in MPI parallel programs. This approach enables autonomous and low-overhead execution monitoring that generates performance knowledge about application behavior for the purpose of online performance diagnosis. Our performance abstraction technique reflects an application behavior and is made up of elements correlated with high-level program structures, such as loops and communication operations. Moreover, it characterizes all elements with statistical execution profiles. We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance abstraction technique proved effective for low-overhead capturing of the program’s behavior and facilitated performance understanding.

6.

Performance analysis of MPI collective operations

Jelena Pješivac-Grbović Thara Angskun George Bosilca Graham E. Fagg Edgar Gabriel Jack J. Dongarra 《Cluster computing》2007,10(2):127-143

Previous studies of application usage show that the performance of collective communications is critical for high-performance computing. Despite active research in the field, a solution to the collective communication optimization problem that is both general and feasible is still missing. In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP, to collective operations. We compare the predictions from the models against experimentally gathered data and, using these results, construct an optimal decision function for the broadcast collective. We quantitatively compare the quality of the model-based decision functions to the experimentally optimal one. Additionally, in this work, we introduce a new form of an optimized tree-based broadcast algorithm, splitted-binary. Our results show that all of the models can provide useful insights into various aspects of the different algorithms as well as their relative performance. Still, based on our findings, we believe that complete reliance on models would not yield optimal results. In addition, our experimental results have identified the gap parameter as being the most critical for accurate modeling of both the classical point-to-point-based pipeline and our extensions to fan-out topologies.
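A model-based decision function of the kind described can be sketched by evaluating a cost formula per algorithm and picking the cheapest. The example below uses common textbook Hockney-style costs for linear, binomial-tree, and pipelined broadcast, with alpha as per-message latency and beta as time per byte. The formulas, parameter values, and function names are assumptions for illustration and are not taken from the paper's fitted models.

```python
import math


def bcast_costs(p: int, msg_bytes: int, alpha: float, beta: float,
                seg_bytes: int) -> dict[str, float]:
    """Estimated broadcast times for p processes under a Hockney-style model."""
    n_segments = math.ceil(msg_bytes / seg_bytes)
    seg = min(seg_bytes, msg_bytes)
    tree_depth = (p - 1).bit_length()   # integer form of ceil(log2(p))
    return {
        # Root sends the whole message to each of the other p - 1 processes.
        "linear": (p - 1) * (alpha + beta * msg_bytes),
        # Number of senders doubles each round.
        "binomial": tree_depth * (alpha + beta * msg_bytes),
        # Segments stream down a chain of p processes.
        "pipeline": (p + n_segments - 2) * (alpha + beta * seg),
    }


def choose_bcast(p: int, msg_bytes: int, alpha: float, beta: float,
                 seg_bytes: int) -> str:
    """Decision function: name of the algorithm with the lowest modeled cost."""
    costs = bcast_costs(p, msg_bytes, alpha, beta, seg_bytes)
    return min(costs, key=costs.get)


# Hypothetical parameters: 10 us latency, 10 ns per byte.
small = choose_bcast(8, 8, alpha=1e-5, beta=1e-8, seg_bytes=1024)
large = choose_bcast(8, 1 << 20, alpha=1e-5, beta=1e-8, seg_bytes=8192)
```

With these illustrative numbers the model favors the binomial tree for the 8-byte message and the segmented pipeline for the 1 MiB message, which is the kind of crossover a decision function is meant to capture.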

7.

Reverse computation for rollback-based fault tolerance in large parallel systems

Kalyan S. Perumalla Alfred J. Park 《Cluster computing》2014,17(2):303-313

Reverse computation is presented here as an important future direction in addressing the challenge of fault tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing with measurements such as cache miss ratios, TLB misses and memory usage indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
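The core idea of reverse computation is that rollback is done by running inverse operations rather than by restoring a saved copy of the state. The toy sketch below shows this for an integer state with an invertible update; the state layout and the functions `forward`, `reverse`, `run`, and `roll_back` are hypothetical and much simpler than the particle simulation in the paper.

```python
def forward(state: dict, step: int) -> None:
    """One invertible update: no copy of the old state is kept."""
    state["x"] += state["v"]
    state["v"] += step


def reverse(state: dict, step: int) -> None:
    """Exact inverse of forward(), applying the inverse operations in reverse order."""
    state["v"] -= step
    state["x"] -= state["v"]


def run(state: dict, first: int, last: int) -> None:
    for step in range(first, last + 1):
        forward(state, step)


def roll_back(state: dict, last: int, target: int) -> None:
    """Undo steps last, last - 1, ..., target + 1 to recover the state after `target`."""
    for step in range(last, target, -1):
        reverse(state, step)


# Run five steps, then suppose a fault invalidates steps 4 and 5.
state = {"x": 0, "v": 1}
run(state, 1, 5)
roll_back(state, last=5, target=3)
```

Integers are used deliberately: floating-point updates are generally not exactly invertible, so a real implementation needs extra care there.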

8.

Implementing noncollective parallel I/O in cluster environments using Active Message communication

Jarek Nieplocha Holger Dachsel Ian Foster 《Cluster computing》1999,2(4):271-279

A cost-effective secondary storage architecture for parallel computers is to distribute storage across all processors, which then engage in either computation or I/O, depending on the demands of the moment. A difficulty associated with this architecture is that access to storage on another processor typically requires the cooperation of that processor, which can be hard to arrange if the processor is engaged in other computation. One partial solution to this problem is to require that remote I/O operations occur only via collective calls. In this paper, we describe an alternative approach based on the use of single-sided communication operations such as Active Messages. We present an implementation of this basic approach called Distant I/O (DIO) and present experimental results that quantify the low-level performance of DIO mechanisms. This technique is exploited to support a noncollective parallel shared-file model for a large out-of-core scientific application with very high I/O bandwidth requirements. The achieved performance exceeds by a wide margin the performance of a well-equipped PIOFS parallel filesystem on the IBM SP.

9.

Delta Execution: A preemptive Java thread migration mechanism

Matchy J.M. Ma Cho-Li Wang Francis C.M. Lau 《Cluster computing》2000,3(2):83-94

Delta Execution is a preemptive and transparent thread migration mechanism for supporting load distribution and balancing in a cluster of workstations. The design of Delta Execution allows the execution system to migrate threads of a Java application to different nodes of a cluster so as to achieve parallel execution. The approach is to break down and group the execution context of a migrating thread into sets of consecutive machine-dependent and machine-independent execution sub-contexts. Each set of machine-independent sub-contexts, also known as a delta set, is then migrated to a remote node in a regulated manner for continuing the execution. Since Delta Execution is implemented at the virtual machine level, all the migration-related activities are conducted transparently with respect to the applications. No new migration-related instructions need to be added to the programs, and existing applications can immediately benefit from the parallel execution capability of Delta Execution without any code modification. Furthermore, because the Delta Execution approach identifies and migrates only the machine-independent part of a thread's execution context, the implementation is reasonably manageable and the resulting software is portable. This revised version was published online in July 2006 with corrections to the Cover Date.

10.

A Load Balancing Tool for Distributed Parallel Loops

Ricolindo L. Cariño Ioana Banicescu 《Cluster computing》2005,8(4):313-321

Large-scale applications typically contain parallel loops with many iterates. The iterates of a parallel loop may have variable execution times, which translate into performance degradation of an application due to load imbalance. This paper describes a tool for load balancing parallel loops on distributed-memory systems. The tool assumes that the data for a parallel loop to be executed is already partitioned among the participating processors. The tool utilizes the MPI library for interprocessor coordination, and determines processor workloads by loop scheduling techniques. The tool was designed independently of any application; hence, it must be supplied with a routine that encapsulates the computations for a chunk of loop iterates, as well as the routines to transfer data and results between processors. Performance evaluation on a Linux cluster indicates that the tool reduces the cost of executing a simulated irregular loop without load balancing by up to 81%. The tool is useful for parallelizing sequential applications with parallel loops, or as an alternate load balancing routine for existing parallel applications.
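Loop scheduling techniques decide how many iterates each processor takes per request. As one classic example, guided self-scheduling hands out chunks of size ceil(remaining / P), so chunks start large and shrink toward the end of the loop. The sketch below illustrates that rule only; the abstract does not say which schemes the tool implements, and `guided_chunks` is a hypothetical name.

```python
import math
from collections.abc import Iterator


def guided_chunks(total_iterates: int, workers: int) -> Iterator[int]:
    """Yield chunk sizes using guided self-scheduling: ceil(remaining / workers)."""
    remaining = total_iterates
    while remaining > 0:
        chunk = math.ceil(remaining / workers)
        yield chunk
        remaining -= chunk


# Chunk sizes for 100 iterates shared by 4 workers.
chunks = list(guided_chunks(100, 4))
```

Large early chunks keep scheduling overhead low, while the small late chunks let fast processors absorb leftover work from slow ones.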

11.

cljam: a library for handling DNA sequence alignment/map (SAM) with parallel processing

Toshiki Takeuchi Atsuo Yamada Takashi Aoki Kunihiro Nishimura 《Source code for biology and medicine》2016,11(1):12

Background

Next-generation sequencing can determine DNA bases and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format and the compressed binary version (BAM) of it. SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and can execute fast. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. For the accumulation of next-generation sequencing data, a simple parallelization program, which can support cloud and PC cluster environments, is required.

Results

We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure.

Conclusions

Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.

12.

Hardware-Assisted Characterization of NAS Benchmarks

W.E. Cohen R.K. Gaede W.D. Garrett 《Cluster computing》2001,4(3):189-196

The UAH Logging, Trace Recording, and Analysis instrumentation (ULTRA) provides highly repeatable (0.0002% variation) application instruction counts for parallel programs which are invariant to the communication network used, the number of processors used, and the MPI communication library used. ULTRA, implemented as an MPI profiling wrapper, avoids the data collection system artifacts of time-based measurements by using instruction counts as the basic measure of work performed and records the operation performed and the amount of data sent for each network operation. These measurements can be scaled appropriately for various target architectures. ULTRA's instrumentation overhead is minimized by using the Pentium II processor's performance monitoring hardware, allowing large, production-run applications to be quickly characterized. Traces of the NAS benchmarks representing 6.67×10¹² application instructions were generated by ULTRA. The application instructions executed per byte injected into the network and the instructions executed per message sent were computed from the traces. These values can be scaled by the expected processor performance to estimate the minimum network performance required to support the programs. It is impossible to use time-based measurements for this purpose due to measurement artifacts caused by the background processes and the communication network of the data collection system.

13.

Scalable PGAS collective operations in NUMA clusters

Damián A. Mallón Guillermo L. Taboada Carlos Teijeiro Jorge González-Domínguez Andrés Gómez Brian Wibecan 《Cluster computing》2014,17(4):1473-1495

The increasing number of cores per processor is making manycore-based systems pervasive. This involves dealing with multiple levels of memory in non-uniform memory access (NUMA) systems and processor core hierarchies, accessible via complex interconnects in order to dispatch the increasing amount of data required by the processing elements. The key to efficient and scalable provision of data is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes more important in these systems, to avoid unnecessary synchronization between pairs of processes in collective operations implemented in terms of two-sided point-to-point functions. This work proposes a series of algorithms that provide good performance and scalability in collective operations, based on the use of hierarchical trees, overlapping one-sided communications, message pipelining and the available NUMA binding features. An implementation has been developed for Unified Parallel C, a Partitioned Global Address Space language, which presents a shared memory view across the nodes for programmability, while keeping private memory regions for performance. The performance evaluation of the proposed implementation, conducted on five representative systems (JuRoPA, JUDGE, Finis Terrae, SVG and Superdome), has shown generally good performance and scalability, even outperforming MPI in some cases, which confirms the suitability of the developed algorithms for manycore architectures.

14.

Optimizing dataflow applications on heterogeneous environments

George Teodoro Timothy D. R. Hartley Umit V. Catalyurek Renato Ferreira 《Cluster computing》2012,15(2):125-144

The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most of the current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.

15.

Speed and accuracy improvement of higher-order epistasis detection on CUDA-enabled GPUs

Daniel Jünger Christian Hundt Jorge González Domínguez Bertil Schmidt 《Cluster computing》2017,20(3):1899-1908

The discovery of higher-order epistatic interactions is an important task in the field of genome-wide association studies, as it allows for the identification of complex interaction patterns between multiple genetic markers. Some existing brute-force approaches explore the whole space of k-interactions in an exhaustive manner, resulting in almost intractable execution times. Computational cost can be reduced drastically by restricting the search space with suitable preprocessing filters which prune unpromising candidates. Other approaches mitigate the execution time by employing massively parallel accelerators in order to benefit from the vast computational resources of these architectures. In this paper, we combine a novel preprocessing filter, namely SingleMI, with massively parallel computation on modern GPUs to further accelerate epistasis discovery. Our implementation improves both the runtime and accuracy when compared to a previous GPU counterpart that employs mutual information clustering for prefiltering. SingleMI is open source software and publicly available at: https://github.com/sleeepyjack/singlemi/.

16.

Design issues for a high-performance distributed shared memory on symmetrical multiprocessor clusters

Sumit Roy Vipin Chaudhary 《Cluster computing》1999,2(3):177-186

Clusters of Symmetrical Multiprocessors (SMPs) have recently become the norm for high-performance economical computing solutions. Multiple nodes in a cluster can be used for parallel programming using a message passing library. An alternate approach is to use a software Distributed Shared Memory (DSM) to provide a view of shared memory to the application programmer. This paper describes Strings, a high-performance distributed shared memory system designed for such SMP clusters. The distinguishing feature of this system is the use of a fully multi-threaded runtime system, using kernel-level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel-level lightweight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system and illustrates the performance improvements with benchmark programs from the SPLASH-2 suite, some computational kernels, as well as a full-fledged application. It is found that using multiple processes on SMP nodes provides good speedups only for a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree. This revised version was published online in July 2006 with corrections to the Cover Date.

17.

Parallel application-level behavioral attributes for performance and energy management of high-performance computing systems

Jeffrey J. Evans Charles E. Lucas 《Cluster computing》2013,16(1):91-115

Run time variability of parallel applications continues to present significant challenges to their performance and energy efficiency in high-performance computing (HPC) systems. When run times are extended and unpredictable, application developers perceive this as a degradation of system (or subsystem) performance. Extended run times directly contribute to proportionally higher energy consumption, potentially negating efforts by applications, or the HPC system, to optimize energy consumption using low-level control techniques, such as dynamic voltage and frequency scaling (DVFS). Therefore, successful systemic management of application run time performance can result in less wasted energy, or even energy savings. We have been studying run time variability in terms of communication time, from the perspective of the application, focusing on the interconnection network. More recently, our focus has shifted to developing a more complete understanding of the effects of HPC subsystem interactions on parallel applications. In this context, the set of executing applications on the HPC system is treated as a subsystem, along with more traditional subsystems like the communication subsystem, storage subsystem, etc. To gain insight into the run time variability problem, our earlier work developed a framework to emulate parallel applications (PACE) that stresses the communication subsystem. Evaluation of run time sensitivity to network performance of real applications is performed with a tool called PARSE, which uses PACE. In this paper, we propose a model defining application-level behavioral attributes that collectively describe how applications behave in terms of their run time performance, as functions of their process distribution on the system (spatial locality) and subsystem interactions (communication subsystem degradation). These subsystem interactions are produced when multiple applications execute concurrently on the same HPC system. We also revisit our evaluation framework and tools to demonstrate the flexibility of our application characterization techniques, and the ease with which attributes can be quantified. The validity of the model is demonstrated using our tools with several parallel benchmarks and application fragments. Results suggest that it is possible to articulate application-level behavioral attributes as a tuple of numeric values that describe coarse-grained performance behavior.

18.

Extending the scope of the controlled logical clock

Daniel Becker Markus Geimer Rolf Rabenseifner Felix Wolf 《Cluster computing》2013,16(1):171-189

Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective, because inaccurate relative event timings may misrepresent the logical event order and lead to errors when quantifying the impact of certain behaviors, or confuse the users of time-line visualization tools by showing messages flowing backward in time. In our earlier work, we developed a scalable algorithm called the controlled logical clock that eliminates inconsistent inter-process timings postmortem in traces of pure MPI applications, potentially running on large processor configurations. In this paper, we first demonstrate that our algorithm also proves beneficial in computational grids, where a single application is executed using the combined computational power of several geographically dispersed clusters. Second, we present an extended version of the algorithm that, in addition to message-passing event semantics, also preserves and restores shared-memory event semantics, enabling the correction of traces from hybrid applications.
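The inconsistency being repaired is a violation of the clock condition: a receive must not be timestamped earlier than its matching send plus the minimum message latency. The sketch below enforces only that condition, shifting a violating receive and all later events on the same process forward by the same amount. It is a deliberately simplified stand-in, not the controlled logical clock algorithm itself, and the trace layout and the name `enforce_clock_condition` are assumptions.

```python
def enforce_clock_condition(ts: dict[int, list[int]],
                            messages: list[tuple[int, int, int, int]],
                            min_latency: int) -> dict[int, list[int]]:
    """Repair receive timestamps in place so recv >= send + min_latency.

    ts maps a process id to its event timestamps in local order.
    Each message is (send_proc, send_idx, recv_proc, recv_idx).
    Assumes the messages describe a causally consistent (acyclic) trace;
    otherwise the loop would not terminate.
    """
    changed = True
    while changed:
        changed = False
        for send_proc, send_idx, recv_proc, recv_idx in messages:
            earliest = ts[send_proc][send_idx] + min_latency
            shift = earliest - ts[recv_proc][recv_idx]
            if shift > 0:
                # Move the receive and every later local event forward,
                # which preserves the local intervals after the receive.
                for k in range(recv_idx, len(ts[recv_proc])):
                    ts[recv_proc][k] += shift
                changed = True
    return ts


# Process 1 appears to receive at t=5 a message that process 0 sent at t=10.
trace = {0: [10, 20], 1: [5, 30]}
enforce_clock_condition(trace, [(0, 0, 1, 0)], min_latency=2)
```

Integer timestamps (for example microseconds) keep the comparison exact; repeating the pass until nothing changes lets a correction on one process propagate to messages it sends later.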

19.

A stochastic approach to estimating earliest start times of nodes for scheduling DAGs on heterogeneous distributed computing systems

Ankur Kamthe Soo-Young Lee 《Cluster computing》2011,14(4):377-395

Previously, DAG scheduling schemes used the mean (average) of computation or communication time in dealing with temporal heterogeneity. However, it is not optimal to consider only the means of computation and communication times in DAG scheduling on a temporally (and spatially) heterogeneous distributed computing system. In this paper, it is proposed that the second order moments of computation and communication times, such as the standard deviations, be taken into account in addition to their means, in scheduling “stochastic” DAGs. An effective scheduling approach which accurately estimates the earliest start time of each node and derives a schedule leading to a shorter average parallel execution time has been developed. Through an extensive computer simulation, it has been shown that a significant improvement (reduction) in the average parallel execution times of stochastic DAGs can be achieved by the proposed approach.
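One way to see why means alone are insufficient: a node's earliest start time is the maximum of its predecessors' arrival times, and the expected maximum of random variables exceeds the maximum of their means. The sketch below uses Clark's classical formula for the expected maximum of two independent normal variables to show the effect. This is a swapped-in textbook result used for illustration, not the estimator proposed in the paper, and the function names are hypothetical.

```python
import math


def _normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_max_normal(mu1: float, sd1: float, mu2: float, sd2: float) -> float:
    """E[max(X, Y)] for independent X ~ N(mu1, sd1^2), Y ~ N(mu2, sd2^2) (Clark's formula)."""
    theta = math.hypot(sd1, sd2)
    if theta == 0.0:
        return max(mu1, mu2)       # both arrival times are deterministic
    a = (mu1 - mu2) / theta
    return mu1 * _normal_cdf(a) + mu2 * _normal_cdf(-a) + theta * _normal_pdf(a)


# Two predecessors both arrive at time 10 on average, with std. dev. 3 and 4.
mean_only_estimate = max(10.0, 10.0)
stochastic_estimate = expected_max_normal(10.0, 3.0, 10.0, 4.0)
```

Here the mean-only estimate is 10, while the expected start time is close to 12, so a scheduler that ignores the standard deviations would systematically underestimate start times.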

20.

Parallel Clustering Algorithm for Large-Scale Biological Data Sets

Minchao Wang Wu Zhang Wang Ding Dongbo Dai Huiran Zhang Hao Xie Luonan Chen Yike Guo Jiang Xie 《PloS one》2014,9(4)

Backgrounds

The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data sets based on the similarities between data pairs, the similarity matrix, which takes a long time to construct, is required before running the affinity propagation algorithm.

Methods

Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes.

Result

A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
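The reported figures can be put in context with two standard derived metrics: parallel efficiency (speedup divided by core count) and the Karp-Flatt experimentally determined serial fraction. The sketch below computes both from the numbers in the abstract; the metrics are standard, but applying them here is our own illustration and not part of the paper's analysis.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def karp_flatt_serial_fraction(speedup: float, cores: int) -> float:
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# Figures from the abstract: speedup of 100 on 128 cores.
efficiency = parallel_efficiency(100.0, 128)
serial_fraction = karp_flatt_serial_fraction(100.0, 128)
```

A speedup of 100 on 128 cores corresponds to roughly 78% efficiency and an effective serial fraction of about 0.2%.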