Found 20 similar documents; search took 15 ms
1.
Over the past few years, cluster/distributed computing has been gaining popularity. The proliferation of cluster/distributed computing is due to the improved performance and increased reliability of these systems. Many parallel programming languages and related parallel programming models have become widely accepted. However, one of the major shortcomings of running parallel applications in cluster/distributed computing environments is the high communication overhead incurred. To reduce the communication overhead, and thus the completion time of a parallel application, this paper describes a simple, efficient and portable Key Message (KM) approach to support parallel computing in cluster/distributed computing environments. To demonstrate the advantage of the KM approach, a prototype runtime system has been implemented and evaluated. Our preliminary experimental results show that the KM approach yields greater improvements in the communication performance of a parallel application as the network background load increases or as the application's computation-to-communication ratio decreases.
2.
Due to the increasing diversity of parallel architectures and the growing development time of parallel applications, performance portability has become one of the major considerations when designing the next generation of parallel program execution models, APIs, and runtime system software. This paper analyzes both the code portability and the performance portability of parallel programs for fine-grained multi-threaded execution and architecture models. We concentrate on one particular event-driven fine-grained multi-threaded execution model, EARTH, and discuss several design considerations of the EARTH model and runtime system that contribute to the performance portability of parallel applications. We believe that these are important issues for future high-end computing system software design. Four representative benchmarks were run on several different parallel architectures, including two clusters listed in the 23rd TOP500 supercomputer list. The results demonstrate that EARTH-based programs can achieve robust performance portability across the selected hardware platforms without any code modification or tuning.
3.
In recent years, there has been growing interest in the cluster system as an accepted form of supercomputing, due to its high performance at an affordable cost. This paper presents a performance analysis of a Myrinet-based cluster. The communication performance and the effect of background load on parallel applications were analyzed. For point-to-point communication, it was found that an extension to Hockney's model was required to estimate the performance. The proposed model suggests that two message-size ranges should be used for the performance metrics to cope with the cache effect. Moreover, based on the extended point-to-point communication model, Xu and Hwang's model for collective communication performance was also extended. Results showed that our models estimate the communication performance better than the previous models. Finally, the interference of other user processes with the cluster system was evaluated using synthetic background-load generation programs.
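The abstract does not give the extended model's exact form, but the idea of a two-range Hockney fit can be sketched as follows (the cutoff and the latency/bandwidth parameters below are illustrative assumptions, not the paper's measured values):

```python
def hockney_time(n, alpha, beta):
    """Hockney model: t(n) = alpha + n / beta (startup latency + size/bandwidth)."""
    return alpha + n / beta

def two_range_time(n, cutoff, small_params, large_params):
    """Piecewise Hockney fit: separate (alpha, beta) pairs below and above a
    cache-related message-size cutoff, as the abstract's extension suggests."""
    alpha, beta = small_params if n <= cutoff else large_params
    return hockney_time(n, alpha, beta)

# Hypothetical parameters: 20 us latency and 100 MB/s for cache-resident
# messages, 30 us and 80 MB/s beyond a 64 KiB cutoff.
t_small = two_range_time(4096, 65536, (20e-6, 100e6), (30e-6, 80e6))
t_large = two_range_time(1 << 20, 65536, (20e-6, 100e6), (30e-6, 80e6))
```

Fitting the two parameter pairs separately to small- and large-message measurements is what lets the model track the bandwidth drop once messages no longer fit in cache.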
4.
We consider parallel computing on a network of workstations using a connection-oriented protocol (e.g., Asynchronous Transfer Mode) for data communication. In a connection-oriented protocol, a virtual circuit of guaranteed bandwidth is established for each pair of communicating workstations. Since all virtual circuits do not have the same guaranteed bandwidth, a parallel application must deal with the unequal bandwidths between workstations. Since most work on the design of parallel algorithms assumes equal bandwidths on all the communication links, these algorithms often do not perform well when executed on networks of workstations using connection-oriented protocols. In this paper, we first evaluate the performance degradation caused by unequal bandwidths on the execution of conventional parallel algorithms such as the fast Fourier transform and bitonic sort. We then present a strategy based on dynamic redistribution of data points to reduce the bottlenecks caused by unequal bandwidths. We also extend this strategy to deal with processor heterogeneity. Using analysis and simulation, we show that there is a considerable reduction in the runtime if the proposed redistribution strategy is adopted. The basic idea presented in this paper can also be used to improve the runtimes of other parallel applications in connection-oriented environments.
This revised version was published online in July 2006 with corrections to the Cover Date.
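A minimal sketch of the redistribution idea, assuming shares of data points are made proportional to each workstation's guaranteed circuit bandwidth so that no slow link becomes the bottleneck (the bandwidth figures are hypothetical, and the paper's strategy is dynamic rather than this one-shot split):

```python
def redistribute(total_points, bandwidths):
    """Assign data points to workstations in proportion to the guaranteed
    bandwidth of their virtual circuits, so per-link transfer times roughly
    equalize. A simplified, static sketch of the paper's idea."""
    total_bw = sum(bandwidths)
    shares = [total_points * bw // total_bw for bw in bandwidths]
    # Hand any integer-rounding remainder to the fastest link.
    shares[bandwidths.index(max(bandwidths))] += total_points - sum(shares)
    return shares

# Three workstations on circuits of 100, 50, and 50 Mbit/s (hypothetical).
shares = redistribute(1000, [100, 50, 50])
```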
5.
Developing an efficient parallel application is not an easy task, and achieving good performance requires a thorough understanding of the program's behavior. Careful performance analysis and optimization are crucial. To help developers or users of these applications analyze the program's behavior, it is necessary to provide them with an abstraction of the application's performance. In this paper, we propose a dynamic performance abstraction technique, which enables the automated discovery of causal execution paths, composed of communication and computational activities, in MPI parallel programs. This approach enables autonomous and low-overhead execution monitoring that generates performance knowledge about application behavior for the purpose of online performance diagnosis. Our performance abstraction technique reflects the application's behavior and is made up of elements correlated with high-level program structures, such as loops and communication operations. Moreover, it characterizes all elements with statistical execution profiles. We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance abstraction technique proved effective for low-overhead capture of the program's behavior and facilitated performance understanding.
6.
Performance analysis of MPI collective operations  (Cited by: 1; self-citations: 0; citations by others: 1)
Jelena Pješivac-Grbović Thara Angskun George Bosilca Graham E. Fagg Edgar Gabriel Jack J. Dongarra 《Cluster computing》2007,10(2):127-143
Previous studies of application usage show that the performance of collective communications is critical for high-performance computing. Despite active research in the field, a solution to the collective communication optimization problem that is both general and feasible is still missing.
In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP, to collective operations. We compare the predictions from these models against experimentally gathered data and, using these results, construct an optimal decision function for the broadcast collective. We quantitatively compare the quality of the model-based decision functions to the experimentally optimal one. Additionally, we introduce a new form of optimized tree-based broadcast algorithm, splitted-binary.
Our results show that all of the models can provide useful insights into various aspects of the different algorithms as well as their relative performance. Still, based on our findings, we believe that complete reliance on models would not yield optimal results. In addition, our experimental results identify the gap parameter as the most critical for accurate modeling of both the classical point-to-point-based pipeline and our extensions to fan-out topologies.
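As a rough illustration of a model-based decision function, the following sketch extends the Hockney model to two broadcast algorithms and picks the predicted winner. The cost formulas and parameters are simplified assumptions for illustration, not the paper's fitted Hockney/LogGP/PLogP models, and splitted-binary itself is not modeled here:

```python
import math

def binomial_bcast(P, m, alpha, beta):
    """Hockney-style estimate for a binomial-tree broadcast:
    ceil(log2 P) rounds, each forwarding the whole m-byte message."""
    return math.ceil(math.log2(P)) * (alpha + m / beta)

def pipelined_bcast(P, m, seg, alpha, beta):
    """Estimate for a segmented, pipelined chain broadcast: k segments of
    size seg flow down a chain of P-1 hops, overlapping in a pipeline."""
    k = math.ceil(m / seg)
    return (P - 1 + k - 1) * (alpha + seg / beta)

def decide(P, m, alpha, beta, seg=8192):
    """Toy decision function: run each model and pick the predicted-fastest
    algorithm for this (process count, message size) point."""
    return min((binomial_bcast(P, m, alpha, beta), "binomial"),
               (pipelined_bcast(P, m, seg, alpha, beta), "pipelined"))[1]
```

As in the paper's measurements, the model prefers a tree for small messages (latency-bound) and pipelining for large ones (bandwidth-bound).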
7.
Reverse computation is presented here as an important future direction in addressing the challenge of fault tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing with measurements such as cache miss ratios, TLB misses and memory usage indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
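The core idea, undoing events by inverting their operations instead of restoring checkpoints, can be sketched on a toy state. This illustrates the principle only, not the paper's ideal-gas simulator:

```python
def forward(state, event):
    """Apply an event reversibly: the update uses only invertible operations
    (addition), so no checkpoint of the old state is needed."""
    state["x"] += event["dx"]
    state["n_events"] += 1
    return state

def reverse(state, event):
    """Undo an event by inverting each operation in reverse order,
    reconstructing the prior state instead of restoring a saved copy."""
    state["n_events"] -= 1
    state["x"] -= event["dx"]
    return state
```

Because recovery re-derives state rather than reloading it, the memory and I/O pressure of periodic checkpoints disappears, which is what makes the approach attractive at very large scale.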
8.
A cost-effective secondary storage architecture for parallel computers is to distribute storage across all processors, which then engage in either computation or I/O, depending on the demands of the moment. A difficulty associated with this architecture is that access to storage on another processor typically requires the cooperation of that processor, which can be hard to arrange if the processor is engaged in other computation. One partial solution to this problem is to require that remote I/O operations occur only via collective calls. In this paper, we describe an alternative approach based on the use of single-sided communication operations such as Active Messages. We present an implementation of this basic approach called Distant I/O and present experimental results that quantify the low-level performance of DIO mechanisms. This technique is exploited to support a non-collective parallel shared-file model for a large out-of-core scientific application with very high I/O bandwidth requirements. The achieved performance exceeds by a wide margin the performance of a well-equipped PIOFS parallel filesystem on the IBM SP.
9.
Delta Execution is a preemptive and transparent thread migration mechanism for supporting load distribution and balancing in a cluster of workstations. The design of Delta Execution allows the execution system to migrate threads of a Java application to different nodes of a cluster so as to achieve parallel execution. The approach is to break down and group the execution context of a migrating thread into sets of consecutive machine-dependent and machine-independent execution sub-contexts. Each set of machine-independent sub-contexts, also known as a delta set, is then migrated to a remote node in a regulated manner for continuing the execution. Since Delta Execution is implemented at the virtual machine level, all the migration-related activities are conducted transparently with respect to the applications. No new migration-related instructions need to be added to the programs, and existing applications can immediately benefit from the parallel execution capability of Delta Execution without any code modification. Furthermore, because the Delta Execution approach identifies and migrates only the machine-independent part of a thread's execution context, the implementation is reasonably manageable and the resulting software is portable.
This revised version was published online in July 2006 with corrections to the Cover Date.
10.
A Load Balancing Tool for Distributed Parallel Loops  (Cited by: 1; self-citations: 0; citations by others: 1)
Large-scale applications typically contain parallel loops with many iterates. The iterates of a parallel loop may have variable execution times, which translate into performance degradation of an application due to load imbalance. This paper describes a tool for load balancing parallel loops on distributed-memory systems. The tool assumes that the data for a parallel loop to be executed is already partitioned among the participating processors. The tool utilizes the MPI library for interprocessor coordination, and determines processor workloads by loop scheduling techniques. The tool was designed independent of any application; hence, it must be supplied with a routine that encapsulates the computations for a chunk of loop iterates, as well as routines to transfer data and results between processors. Performance evaluation on a Linux cluster indicates that the tool reduces the cost of executing a simulated irregular loop by up to 81% compared to execution without load balancing. The tool is useful for parallelizing sequential applications with parallel loops, or as an alternate load balancing routine for existing parallel applications.
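One classic loop scheduling technique a tool like this can determine workloads with is guided self-scheduling, sketched below; the tool's actual scheduling methods and chunk-size rules may differ:

```python
def guided_chunks(total_iters, n_procs, min_chunk=1):
    """Guided self-scheduling: each chunk handed out is (remaining / P)
    iterations, so early chunks are large (low scheduling overhead) and
    late chunks are small (good balance for variable iterate times)."""
    chunks, remaining = [], total_iters
    while remaining > 0:
        c = max(min_chunk, remaining // n_procs)
        c = min(c, remaining)
        chunks.append(c)
        remaining -= c
    return chunks
```

Processors request the next chunk as they finish, so a processor stuck on a long iterate naturally takes fewer chunks overall.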
11.
Background
Next-generation sequencing can determine DNA bases, and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format or its compressed binary version (BAM). SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and executes quickly. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. To keep pace with the accumulation of next-generation sequencing data, a simple parallelization program that can support cloud and PC cluster environments is required.
Results
We have developed cljam using the Clojure programming language, which simplifies parallel programming, to handle SAM/BAM data. Cljam can run in a Java runtime environment (e.g., Windows, Linux, Mac OS X) with Clojure.
Conclusions
Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The cljam code is written in Clojure and has fewer lines than other similar tools.
12.
The UAH Logging, Trace Recording, and Analysis instrumentation (ULTRA) provides highly repeatable (0.0002% variation) application instruction counts for parallel programs, which are invariant to the communication network used, the number of processors used, and the MPI communication library used. ULTRA, implemented as an MPI profiling wrapper, avoids the data-collection-system artifacts of time-based measurements by using instruction counts as the basic measure of work performed, and it records the operation performed and the amount of data sent for each network operation. These measurements can be scaled appropriately for various target architectures. ULTRA's instrumentation overhead is minimized by using the Pentium II processor's performance monitoring hardware, allowing large, production-run applications to be quickly characterized. Traces of the NAS benchmarks representing 6.67×10¹² application instructions were generated by ULTRA. The application instructions executed per byte injected into the network and the instructions executed per message sent were computed from the traces. These values can be scaled by the expected processor performance to estimate the minimum network performance required to support the programs. It is impossible to use time-based measurements for this purpose due to measurement artifacts caused by the background processes and the communication network of the data collection system.
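The scaling step described above amounts to one division: dividing the target processor's instruction rate by the trace's instructions-per-byte figure yields the minimum sustained network bandwidth. A sketch with hypothetical numbers (not values from the ULTRA traces):

```python
def min_network_bandwidth(instr_per_byte, proc_instr_per_sec):
    """Scale a trace's instructions-executed-per-byte-injected by the target
    processor's instruction rate to estimate the minimum network bandwidth
    (in bytes/s) at which the network does not throttle the computation."""
    return proc_instr_per_sec / instr_per_byte

# Hypothetical: 2000 instructions executed per byte injected, on a
# 1 Ginstr/s processor -> the network must sustain about 500 KB/s per process.
b = min_network_bandwidth(2000, 1e9)
```

Because the numerator is a property of the target machine and the denominator a property of the trace, the same trace can size the network for any candidate architecture, which is exactly what time-based measurements cannot do.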
13.
Damián A. Mallón Guillermo L. Taboada Carlos Teijeiro Jorge González-Domínguez Andrés Gómez Brian Wibecan 《Cluster computing》2014,17(4):1473-1495
The increasing number of cores per processor is making manycore-based systems pervasive. This involves dealing with multiple levels of memory in non-uniform memory access (NUMA) systems and processor core hierarchies, accessible via complex interconnects, in order to dispatch the increasing amount of data required by the processing elements. The key to efficient and scalable provision of data is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communications becomes more important in these systems, to avoid unnecessary synchronization between pairs of processes in collective operations implemented in terms of two-sided point-to-point functions. This work proposes a series of algorithms that provide good performance and scalability in collective operations, based on the use of hierarchical trees, overlapping one-sided communications, message pipelining, and the available NUMA binding features. An implementation has been developed for Unified Parallel C, a Partitioned Global Address Space language, which presents a shared-memory view across the nodes for programmability, while keeping private memory regions for performance. The performance evaluation of the proposed implementation, conducted on five representative systems (JuRoPA, JUDGE, Finis Terrae, SVG and Superdome), has shown generally good performance and scalability, even outperforming MPI in some cases, which confirms the suitability of the developed algorithms for manycore architectures.
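The benefit of a hierarchical tree can be seen by counting communication rounds: a flat binomial broadcast pays network latency on every round, while a two-level tree pays it only on the inter-node rounds, with the intra-node rounds becoming cheap shared-memory copies. A round-count sketch (a simplification of the paper's algorithms, which also overlap one-sided transfers and pipeline messages):

```python
import math

def flat_rounds(P):
    """Rounds of a flat binomial-tree broadcast over P processes,
    every one of which crosses the network in the worst case."""
    return math.ceil(math.log2(P))

def hierarchical_rounds(nodes, cores_per_node):
    """Two-level binomial tree: broadcast among node leaders first
    (network rounds), then within each node concurrently (shared-memory
    rounds). Only the first count pays network latency."""
    return math.ceil(math.log2(nodes)), math.ceil(math.log2(cores_per_node))
```

For 64 nodes of 16 cores (1024 processes), the flat tree needs 10 network rounds while the hierarchical tree needs only 6, plus 4 cheap intra-node rounds.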
14.
George Teodoro Timothy D. R. Hartley Umit V. Catalyurek Renato Ferreira 《Cluster computing》2012,15(2):125-144
The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most of the current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.
15.
Daniel Jünger Christian Hundt Jorge González Domínguez Bertil Schmidt 《Cluster computing》2017,20(3):1899-1908
The discovery of higher-order epistatic interactions is an important task in the field of genome-wide association studies, which allows for the identification of complex interaction patterns between multiple genetic markers. Some existing brute-force approaches explore the whole space of k-interactions in an exhaustive manner, resulting in almost intractable execution times. Computational cost can be reduced drastically by restricting the search space with suitable preprocessing filters that prune unpromising candidates. Other approaches mitigate the execution time by employing massively parallel accelerators in order to benefit from the vast computational resources of these architectures. In this paper, we combine a novel preprocessing filter, namely SingleMI, with massively parallel computation on modern GPUs to further accelerate epistasis discovery. Our implementation improves both the runtime and accuracy when compared to a previous GPU counterpart that employs mutual information clustering for prefiltering. SingleMI is open source software and publicly available at: https://github.com/sleeepyjack/singlemi/.
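The abstract does not define SingleMI's exact score, but a mutual-information filter of the kind described can be sketched as scoring each marker against the phenotype and pruning low scorers before the exhaustive k-interaction search. This is a simplified illustration, not the paper's filter:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x,y) of p(x,y) * log2(p(x,y) / (p(x) * p(y))),
    estimated from paired samples. A filter scores each genetic marker xs
    against the phenotype ys and keeps only high-scoring candidates."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))          # joint counts
    px, py = Counter(xs), Counter(ys)   # marginal counts
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A marker perfectly predictive of a binary phenotype scores 1 bit; an independent marker scores near 0, so a threshold on this score prunes the search space cheaply.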
16.
Clusters of Symmetric Multiprocessors (SMPs) have recently become the norm for high-performance economical computing solutions. Multiple nodes in a cluster can be used for parallel programming using a message passing library. An alternate approach is to use a software Distributed Shared Memory (DSM) to provide a view of shared memory to the application programmer. This paper describes Strings, a high-performance distributed shared memory system designed for such SMP clusters. The distinguishing feature of this system is the use of a fully multi-threaded runtime system, using kernel-level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel-level lightweight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system and illustrates the performance improvements with benchmark programs from the SPLASH-2 suite, some computational kernels, as well as a full-fledged application. It is found that using multiple processes on SMP nodes provides good speedups for only a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree.
This revised version was published online in July 2006 with corrections to the Cover Date.
17.
Parallel application-level behavioral attributes for performance and energy management of high-performance computing systems  (Cited by: 1; self-citations: 0; citations by others: 1)
Run-time variability of parallel applications continues to present significant challenges to their performance and energy efficiency in high-performance computing (HPC) systems. When run times are extended and unpredictable, application developers perceive this as a degradation of system (or subsystem) performance. Extended run times directly contribute to proportionally higher energy consumption, potentially negating efforts by applications, or the HPC system, to optimize energy consumption using low-level control techniques, such as dynamic voltage and frequency scaling (DVFS). Therefore, successful systemic management of application run-time performance can result in less wasted energy, or even energy savings. We have been studying run-time variability in terms of communication time, from the perspective of the application, focusing on the interconnection network. More recently, our focus has shifted to developing a more complete understanding of the effects of HPC subsystem interactions on parallel applications. In this context, the set of executing applications on the HPC system is treated as a subsystem, along with more traditional subsystems like the communication subsystem, storage subsystem, etc. To gain insight into the run-time variability problem, our earlier work developed a framework to emulate parallel applications (PACE) that stresses the communication subsystem. Evaluation of the run-time sensitivity of real applications to network performance is performed with a tool called PARSE, which uses PACE. In this paper, we propose a model defining application-level behavioral attributes that collectively describe how applications behave in terms of their run-time performance, as functions of their process distribution on the system (spatial locality) and subsystem interactions (communication subsystem degradation). These subsystem interactions are produced when multiple applications execute concurrently on the same HPC system.
We also revisit our evaluation framework and tools to demonstrate the flexibility of our application characterization techniques, and the ease with which attributes can be quantified. The validity of the model is demonstrated using our tools with several parallel benchmarks and application fragments. Results suggest that it is possible to articulate application-level behavioral attributes as a tuple of numeric values that describe coarse-grained performance behavior.
18.
Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective, because inaccurate relative event timings may misrepresent the logical event order and lead to errors when quantifying the impact of certain behaviors, or confuse the users of time-line visualization tools by showing messages flowing backward in time. In our earlier work, we developed a scalable algorithm called the controlled logical clock that eliminates inconsistent inter-process timings postmortem in traces of pure MPI applications, potentially running on large processor configurations. In this paper, we first demonstrate that our algorithm also proves beneficial in computational grids, where a single application is executed using the combined computational power of several geographically dispersed clusters. Second, we present an extended version of the algorithm that, in addition to message-passing event semantics, also preserves and restores shared-memory event semantics, enabling the correction of traces from hybrid applications.
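A heavily simplified sketch of the postmortem correction idea: any receive that appears to precede its matching send plus a minimal message latency is shifted forward. The actual controlled logical clock also propagates such corrections to subsequent events on the same process and bounds the distortion, which this toy version omits:

```python
def correct_timestamps(events, min_latency=1e-6):
    """Restore logical message order in a trace: a receive must not be
    timestamped earlier than its matching send plus a minimal latency.
    `events` is a list of dicts with 'type', 'time', and (for receives)
    'send_time', the timestamp of the matching send."""
    corrected = []
    for ev in events:
        t = ev["time"]
        if ev["type"] == "recv" and t < ev["send_time"] + min_latency:
            t = ev["send_time"] + min_latency  # shift to restore causality
        corrected.append({**ev, "time": t})
    return corrected
```

After correction, no time-line visualization can show a message flowing backward in time, since every receive is at or after its send.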
19.
Previously, DAG scheduling schemes used the mean (average) of computation or communication time in dealing with temporal heterogeneity. However, it is not optimal to consider only the means of computation and communication times when scheduling DAGs on a temporally (and spatially) heterogeneous distributed computing system. In this paper, it is proposed that second-order moments of computation and communication times, such as the standard deviations, be taken into account in addition to their means when scheduling “stochastic” DAGs. An effective scheduling approach has been developed which accurately estimates the earliest start time of each node and derives a schedule leading to a shorter average parallel execution time. Through extensive computer simulation, it has been shown that a significant improvement (reduction) in the average parallel execution times of stochastic DAGs can be achieved by the proposed approach.
20.
Minchao Wang Wu Zhang Wang Ding Dongbo Dai Huiran Zhang Hao Xie Luonan Chen Yike Guo Jiang Xie 《PloS one》2014,9(4)