期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Batchu Rajanikanth Dandass Yoginder S. Skjellum Anthony Beddhu Murali 《Cluster computing》2004,7(4):303-315

Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incorporate high overhead because of their inability to utilize application level information.This paper describes the design and implementation of MPI/FT, a high-performance MPI-1.2 implementation enhanced with low-overhead functionality to detect and recover from process failures. The strategy behind MPI/FT is that fault tolerance in message-passing middleware can be optimized based on an application's execution model derived from its communication topology and parallel programming semantics. MPI/FT exploits the specific characteristics of two parallel application execution models in order to optimize performance. MPI/FT also introduces the self-checking thread that monitors the functioning of the middleware itself. User aware checkpointing and user-assisted recovery are compatible with MPI/FT and complement the techniques used here.This paper offers a classification of MPI applications for fault tolerant MPI purposes and MPI/FT implementation discussed here provides different middleware versions specifically tailored to each of the two models studied in detail. The interplay of various parameters affecting the cost of fault tolerance is investigated. Experimental results demonstrate that the approach used to design and implement MPI/FT results in a low-overhead MPI-based fault tolerant communication middleware implementation. 相似文献

2.

A resource query interface for network-aware applications 总被引：2，自引：0，他引：2

Bruce Lowekamp Nancy Miller Thomas Gross Peter Steenkiste Jaspal Subhlok Dean Sutherland 《Cluster computing》1999,2(2):139-151

Networked systems provide a cost-effective platform for parallel computing, but the applications have to deal with the changing availability of computation and communication resources. Network-awareness is a recent attempt to bridge the gap between the realities of networks and the demands of applications. Network-aware applications obtain information about their execution environment and dynamically adapt to enhance their performance. Adaptation is especially important for synchronous parallel applications because a single busy communication link can become the bottleneck and degrade overall performance dramatically. This paper presents Remos, a uniform API that allows applications to obtain relevant network information, and reports on the development of parallel applications in this environment. The challenges in defining a uniform interface include network heterogeneity, diversity and variability in network traffic, and resource sharing in the network and even inside an application. The first implementation of the Remos interface uses SNMP to monitor IP-based networks. This paper reports on our methodology for developing adaptive parallel applications for high-speed networks with Remos and presents experimental results using applications generated by the Fx parallelizing compiler. The results highlight the importance and effectiveness of adaptive parallel computing. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

3.

Parallel application-level behavioral attributes for performance and energy management of high-performance computing systems 总被引：1，自引：0，他引：1

Jeffrey J. Evans Charles E. Lucas 《Cluster computing》2013,16(1):91-115

Run time variability of parallel applications continues to present significant challenges to their performance and energy efficiency in high-performance computing (HPC) systems. When run times are extended and unpredictable, application developers perceive this as a degradation of system (or subsystem) performance. Extended run times directly contribute to proportionally higher energy consumption, potentially negating efforts by applications, or the HPC system, to optimize energy consumption using low-level control techniques, such as dynamic voltage and frequency scaling (DVFS). Therefore, successful systemic management of application run time performance can result in less wasted energy, or even energy savings. We have been studying run time variability in terms of communication time, from the perspective of the application, focusing on the interconnection network. More recently, our focus has shifted to developing a more complete understanding of the effects of HPC subsystem interactions on parallel applications. In this context, the set of executing applications on the HPC system is treated as a subsystem, along with more traditional subsystems like the communication subsystem, storage subsystem, etc. To gain insight into the run time variability problem, our earlier work developed a framework to emulate parallel applications (PACE) that stresses the communication subsystem. Evaluation of run time sensitivity to network performance of real applications is performed with a tool called PARSE, which uses PACE. In this paper, we propose a model defining application-level behavioral attributes, that collectively describes how applications behave in terms of their run time performance, as functions of their process distribution on the system (spacial locality), and subsystem interactions (communication subsystem degradation). These subsystem interactions are produced when multiple applications execute concurrently on the same HPC system. We also revisit our evaluation framework and tools to demonstrate the flexibility of our application characterization techniques, and the ease with which attributes can be quantified. The validity of the model is demonstrated using our tools with several parallel benchmarks and application fragments. Results suggest that it is possible to articulate application-level behavioral attributes as a tuple of numeric values that describe course-grained performance behavior. 相似文献

4.

A middleware approach for pipelining communications in clusters

Sevin Fide Stephen Jenks 《Cluster computing》2007,10(4):409-424

The Pipelining Communications Middleware (PCM) approach provides a flexible, simple, high-performance mechanism to connect parallel programs running on high performance computers or clusters. This approach enables parallel programs to communicate and coordinate with each other to address larger problems than a single program can solve. The motivation behind the PCM approach grew out of using files as an intermediate transfer stage between processing by different programs. Our approach supersedes this practice by using streaming data set transfers as an “online” communication channel between simultaneously active parallel programs. Thus, the PCM approach addresses the issue of sending data from a parallel program to another parallel program without exposing details such as number of nodes allocated to the program, specific node identifiers, etc. This paper outlines and analyzes our proposed computation and communication model to provide efficient and convenient communications between parallel programs running on high performance computing systems or clusters. We also discuss the PCM challenges as well as current PCM implementations. Our approach achieves scalability, transparency, coordination, synchronization and flow control, and efficient programming. We experimented with data parallel applications to evaluate the performance of the PCM approach. Our experiment results show that the PCM approach achieves nearly ideal throughput that scales linearly with the underlying network medium speed. PCM performs well with small and large data transfers. Furthermore, our experiments show that network infrastructure plays the most significant role in the PCM performance.

Stephen JenksEmail:

相似文献

5.

Scheduling parallel applications in distributed networks

Jon B. Weissman Xin Zhao 《Cluster computing》1998,1(1):109-118

Prophet is a run-time scheduling system designed to support the efficient execution of parallel applications written in the Mentat programming language (Grimshaw, 1993). Prior results demonstrated that SPMD applications could be scheduled automatically in an Ethernet-based local-area workstation network with good performance (Weissman and Grimshaw, 1994 and 1995). This paper describes our recent efforts to extend Prophet along several dimensions: improved overhead control, greater resource sharing, greater resource heterogeneity, wide-area scheduling, and new application types. We show that both SPMD and task parallel applications can be scheduled effectively in a shared heterogeneous LAN environment containing ethernet and ATM networks by exploiting the application structure and dynamic run-time information. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

6.

Modular implementation of dynamic algorithm switching in parallel simulations

Pilsung Kang 《Cluster computing》2012,15(3):321-332

We present a modular approach to implementing dynamic algorithm switching for parallel scientific software. By using a compositional framework based on function call interception techniques, our proposed method transparently integrates algorithm switching code with a given program without directly modifying the original code structure. Through fine-grained control of algorithmic behavior of an application at the level of functions, our approach supports design and implementation of application-specific switching scenarios in a modular way. Our approach encourages algorithm switching to dynamically perform at the loop end of a parallel simulation, where cooperating processes in concurrent execution typically synchronize and intermediate computation results are consistent. In this way, newly added switching operations do not cause race conditions that may produce unreliable computation results in parallel simulations. By applying our method to a real-world scientific application and adapting its algorithmic behavior to the properties of input problems, we demonstrate the applicability and effectiveness of our approach to constructing efficient parallel simulations. 相似文献

7.

Optimizing MPI collective communication by orthogonal structures

Matthias Kühnemann Thomas Rauber Gudula Rünger 《Cluster computing》2006,9(3):257-279

MPI collective communication operations to distribute or gather data are used for many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and a Beowulf cluster with single-processor nodes. But also on a Cray T3E a significant performance improvement can be obtained by a careful selection of the processor structure. The use of these optimized communication operations can reduce the execution time of data parallel implementations of complex application programs significantly without requiring any other change of the computation and communication structure. We present runtime functions for the modeling of two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs. 相似文献

8.

Optimizing dataflow applications on heterogeneous environments

George Teodoro Timothy D. R. Hartley Umit V. Catalyurek Renato Ferreira 《Cluster computing》2012,15(2):125-144

The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most of the current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster. 相似文献

9.

Performance prediction with skeletons

Sukhdeep Sodhi Jaspal Subhlok Qiang Xu 《Cluster computing》2008,11(2):151-165

The performance skeleton of an application is a short running program whose performance in any scenario reflects the performance of the application it represents. Specifically, the execution time of the performance skeleton is a small fixed fraction of the execution time of the corresponding application in any execution environment. Such a skeleton can be employed to quickly estimate the performance of a large application under existing network and node sharing. This paper presents a framework for automatic construction of performance skeletons of a specified execution time and evaluates their use in performance prediction with CPU and network sharing. The approach is based on capturing the execution behavior of an application and automatically generating a synthetic skeleton program that reflects that execution behavior. The paper demonstrates that performance skeletons running for a few seconds can predict the application execution time fairly accurately. Relationship of skeleton execution time, application characteristics, and nature of resource sharing, to accuracy of skeleton based performance prediction, is analyzed in detail. The goal of this research is accurate performance estimation in heterogeneous and shared computational grids.

Jaspal Subhlok (Corresponding author)Email:

相似文献

10.

Time-sharing parallel applications through performance-targeted feedback-controlled real-time scheduling

Bin Lin Ananth I. Sundararaj Peter A. Dinda 《Cluster computing》2008,11(3):273-285

Most parallel machines, such as clusters, are space-shared in order to isolate batch parallel applications from each other and optimize their performance. However, this leads to low utilization or potentially long waiting times. We propose a self-adaptive approach to time-sharing such machines that provides isolation and allows the execution rate of an application to be tightly controlled by the administrator. Our approach combines a periodic real-time scheduler on each node with a global feedback-based control system that governs the local schedulers. We have developed an online system that implements our approach. The system takes as input a target execution rate for each application, and automatically and continuously adjusts the applications’ real-time schedules to achieve those rates with proportional CPU utilization. Target rates can be dynamically adjusted. Applications are performance-isolated from each other and from other work that is not using our system. We present an extensive evaluation that shows that the system remains stable with low response times, and that our focus on CPU isolation and control does not come at the significant expense of network I/O, disk I/O, or memory isolation.

Peter A. DindaEmail:

相似文献

11.

Delta Execution: A preemptive Java thread migration mechanism

Matchy J.M. Ma Cho-Li Wang Francis C.M. Lau 《Cluster computing》2000,3(2):83-94

Delta Execution is a preemptive and transparent thread migration mechanism for supporting load distribution and balancing in a cluster of workstations. The design of Delta Execution allows the execution system to migrate threads of a Java application to different nodes of a cluster so as to achieve parallel execution. The approach is to break down and group the execution context of a migrating thread into sets of consecutive machine-dependent and machine-independent execution sub-contexts. Each set of machine-independent sub-contexts, also known as a delta set, is then migrated to a remote node in a regulated manner for continuing the execution. Since Delta Execution is implemented at the virtual machine level, all the migration-related activities are conducted transparently with respect to the applications. No new migration-related instructions need to be added to the programs and existing applications can immediately benefit from the parallel execution capability of Delta Execution without any code modification. Furthermore, because the Delta Execution approach identifies and migrates only the machine-independent part of a thread's execution context, the implementation is therefore reasonably manageable and the resulting software is portable. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

12.

Seamless Paxos coordinators

Gustavo M. D. Vieira Islene C. Garcia Luiz E. Buzato 《Cluster computing》2014,17(2):463-473

The Paxos algorithm requires a single correct coordinator process to operate. After a failure, the replacement of the coordinator may lead to a temporary unavailability of the application implemented atop Paxos. So far, this unavailability has been addressed by reducing the coordinator replacement rate through the use of stable coordinator selection algorithms. We have observed that the cost of recovery of the newly elected coordinator’s state is at the core of this unavailability problem. In this paper we present a new technique to manage coordinator replacement that allows the recovery to occur concurrently with new consensus rounds. Experimental results show that our seamless approach effectively solves the temporary unavailability problem, its adoption entails uninterrupted execution of the application. Our solution removes the restriction that the occurrence of coordinator replacements is something to be avoided, allowing the decoupling of the application execution from the accuracy of the mechanism used to choose a coordinator. This result increases the performance of the application even in the presence of failures, it is of special importance to the autonomous operation of replicated applications that have to adapt to varying network conditions and partial failures. 相似文献

13.

Supporting parallel applications on clusters of workstations: The Virtual Communication Machine‐based architecture

Marcel‐Cătălin Roşu Karsten Schwan Richard Fujimoto 《Cluster computing》1998,1(1):51-67

This paper presents a novel networking architecture designed for communication intensive parallel applications running on clusters of workstations (COWs) connected by high speed networks. The architecture addresses what is considered one of the most important problems of cluster-based parallel computing: the inherent inability of scaling the performance of communication software along with the host CPU performance. The Virtual Communication Machine (VCM), resident on the network coprocessor, presents a scalable software solution by providing configurable communication functionality directly accessible at user-level. The VCM architecture is configurable in that it enables the transfer to the VCM of selected communication-related functionality that is traditionally part of the application and/or the host kernel. Such transfers are beneficial when a significant reduction of the host CPU's load translates into a small increase in the coprocessor's load. The functionality implemented by the coprocessor is available at the application level as VCM instructions. Host CPU(s) and coprocessor interact through shared memory regions, thereby avoiding expensive CPU context switches. The host kernel is not involved in this interaction; it simply “connects” the application to the VCM during the initialization phase and is called infrequently to handle exceptional conditions. Protection is enforced by the VCM based on information supplied by the kernel. The VCM-based communication architecture admits low cost and open implementations, as demonstrated by its current ATM-based implementation based on off-the-shelf hardware components and using standard AAL5 packets. The architecture makes it easy to implement communication software that exhibits negligible overheads on the host CPU(s) and offers latencies and bandwidths close to the hardware limits of the underlying network. These characteristics are due to the VCM's support for zero-copy messaging with gather/scatter capabilities and the VCM's direct access to any data structure in an application's address space. This paper describes two versions of an ATM-based VCM implementation, which differ in the way they use the memory on the network adapter. Their performance under heavy load is compared in the context of a synthetic client/server application. The same application is used to evaluate the scalability of the architecture to multiple VCM-based network interfaces per host. Parallel implementations of the Traveling Salesman Problem and of Georgia Tech Time Warp, an engine for discrete-event simulation, are used to demonstrate VCM functionality and the high performance of its implementation. The distributed- and shared-memory versions of these two applications exhibit comparable performance, despite the significant cost-performance advantage of the distributed-memory platform. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

14.

DataSpaces: an interaction and coordination framework for?coupled simulation workflows

Ciprian Docan Manish Parashar Scott Klasky 《Cluster computing》2012,15(2):163-181

Emerging high-performance distributed computing environments are enabling new end-to-end formulations in science and engineering that involve multiple interacting processes and data-intensive application workflows. For example, current fusion simulation efforts are exploring coupled models and codes that simultaneously simulate separate application processes, such as the core and the edge turbulence. These components run on different high performance computing resources, need to interact at runtime with each other and with services for data monitoring, data analysis and visualization, and data archiving. As a result, they require efficient and scalable support for dynamic and flexible couplings and interactions, which remains a challenge. This paper presents DataSpaces a flexible interaction and coordination substrate that addresses this challenge. DataSpaces essentially implements a semantically specialized virtual shared space abstraction that can be associatively accessed by all components and services in the application workflow. It enables live data to be extracted from running simulation components, indexes this data online, and then allows it to be monitored, queried and accessed by other components and services via the space using semantically meaningful operators. The underlying data transport is asynchronous, low-overhead and largely memory-to-memory. The design, implementation, and experimental evaluation of DataSpaces using a coupled fusion simulation workflow is presented. 相似文献

15.

Parallel FFT on ATM‐based networks of workstations

Suresh Chalasani Parameswaran Ramanathan 《Cluster computing》1998,1(1):13-26

We consider parallel computing on a network of workstations using a connection-oriented protocol (e.g., Asynchronous Transfer Mode) for data communication. In a connection-oriented protocol, a virtual circuit of guaranteed bandwidth is established for each pair of communicating workstations. Since all virtual circuits do not have the same guaranteed bandwidth, a parallel application must deal with the unequal bandwidths between workstations. Since most works in the design of parallel algorithms assume equal bandwidths on all the communication links, they often do not perform well when executed on networks of workstations using connection-oriented protocols. In this paper, we first evaluate the performance degradation caused by unequal bandwidths on the execution of conventional parallel algorithms such as the fast Fourier transform and bitonic sort. We then present a strategy based on dynamic redistribution of data points to reduce the bottlenecks caused by unequal bandwidths. We also extend this strategy to deal with processor heterogeneity. Using analysis and simulation we show that there is a considerable reduction in the runtime if the proposed redistribution strategy is adopted. The basic idea presented in this paper can also be used to improve the runtimes of other parallel applications in connection-oriented environments. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

16.

Performance portability on EARTH: a case study across several parallel architectures

Weirong Zhu Yanwei Niu Guang R. Gao 《Cluster computing》2007,10(2):115-126

Due to the increase of the diversity of parallel architectures, and the increasing development time for parallel applications, performance portability has become one of the major considerations when designing the next generation of parallel program execution models, APIs, and runtime system software. This paper analyzes both code portability and performance portability of parallel programs for fine-grained multi-threaded execution and architecture models. We concentrate on one particular event-driven fine-grained multi-threaded execution model—EARTH, and discuss several design considerations of the EARTH model and runtime system that contribute to the performance portability of parallel applications. We believe that these are important issues for future high end computing system software design. Four representative benchmarks were conducted on several different parallel architectures, including two clusters listed in the 23rd supercomputer TOP500 list. The results demonstrate that EARTH based programs can achieve robust performance portability across the selected hardware platforms without any code modification or tuning. 相似文献

17.

On-line feedback-based automatic resource configuration for distributed applications

Hao Liu Søren-Aksel Sørensen 《Cluster computing》2010,13(4):397-419

A key problem in executing performance critical applications on distributed computing environments (e.g. the Grid) is the selection of resources. Research related to “automatic resource selection” aims to allocate resources on behalf of users to optimize the execution performance. However, most of current approaches are based on the static principle (i.e. resource selection is performed prior to execution) and need detailed application-specific information. In the paper, we introduce a novel on-line automatic resource selection approach. This approach is based on a simple control theory: the application continuously reports the Execution Satisfaction Degree (ESD) to the middleware Application Agent (AA), which relies on the reported ESD values to learn the execution behavior and tune the computing environment by adding/replacing/deleting resources during the execution in order to satisfy users’ performance requirements. We introduce two different policies applied to this approach to enable the AA to learn and tune the computing environment: the Utility Classification policy and the Desired Processing Power Estimation (DPPE) policy. Each policy is validated by an iterative application and a non-iterative application to demonstrate that both policies are effective to support most kinds of applications. 相似文献

18.

The design and evaluation of a virtual distributed computing environment

Haluk Topcuoglu Salim Hariri Dongmin Kim Yoonhee Kim Xue Bing Baoqing Ye Ilkyeun Ra Jon Valente 《Cluster computing》1998,1(1):81-93

Current advances in high-speed networks such as ATM and fiber-optics, and software technologies such as the JAVA programming language and WWW tools, have made network-based computing a cost-effective, high-performance distributed computing environment. Metacomputing, a special subset of network-based computing, is a well-integrated execution environment derived by combining diverse and distributed resources such as MPPs, workstations, mass storage, and databases that show a heterogeneous nature in terms of hardware, software, and organization. In this paper we present the Virtual Distributed Computing Environment (VDCE), a metacomputing environment currently being developed at Syracuse University. VDCE provides an efficient web-based approach for developing, evaluating, and visualizing large-scale distributed applications that are based on predefined task libraries on diverse platforms. The VDCE task libraries relieve end-users of tedious task implementations and also support reusability. The VDCE software architecture is described in terms of three modules: (a) the Application Editor, a user-friendly application development environment that generates the Application Flow Graph (AFG) of an application; (b) the Application Scheduler, which provides an efficient task-to-resource mapping of AFG; and (c) the VDCE Runtime System, which is responsible for running and managing application execution and for monitoring the VDCE resources. We present experimental results of an application execution on the VDCE prototype for evaluating the performance of different machine and network configurations. We also show how the VDCE can be used as a problem-solving environment on which large-scale, network-centric applications can be developed by a novice programmer rather than by an expert in low-level details of parallel programming languages. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

19.

Extending the scope of the controlled logical clock

Daniel?Becker Email author Markus?Geimer Rolf?Rabenseifner Felix?Wolf 《Cluster computing》2013,16(1):171-189

Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective because inaccurate relative event timings may misrepresent the logical event order and lead to errors when quantifying the impact of certain behaviors or confuse the users of time-line visualization tools by showing messages flowing backward in time. In our earlier work, we have developed a scalable algorithm called the controlled logical clock that eliminates inconsistent inter-process timings postmortem in traces of pure MPI applications, potentially running on large processor configurations. In this paper, we first demonstrate that our algorithm also proves beneficial in computational grids, where a single application is executed using the combined computational power of several geographically dispersed clusters. Second, we present an extended version of the algorithm that—in addition to message-passing event semantics—also preserves and restores shared-memory event semantics, enabling the correction of traces from hybrid applications. 相似文献

20.

A Framework for Automatic Adaptation of Tunable Distributed Applications

Fangzhe Chang Vijay Karamcheti 《Cluster computing》2001,4(1):49-62

Increased platform heterogeneity and varying resource availability in distributed systems motivate the design of resource-aware applications, which ensure a desired performance level by continuously adapting their behavior to changing resource characteristics. In this paper, we describe an application-independent adaptation framework that simplifies the design of resource-aware applications. This framework eliminates the need for adaptation decisions to be explicitly programmed into the application by relying on two novel components: (1) a tunability interface, which exposes adaptation choices in the form of alternate application configurations while encapsulating core application functionality; and (2) a virtual execution environment, which emulates application execution under diverse resource availability enabling off-line collection of information about resulting behavior. Together, these components permit automatic run-time decisions on when to adapt by continuously monitoring resource conditions and application progress, and how to adapt by dynamically choosing an application configuration most appropriate for the prescribed user preference. We evaluate the framework using an interactive distributed image visualization application and a parallel image processing application. The framework permits automatic adaptation to changes in execution environment characteristics such as available network bandwidth or data arrival pattern by choosing a different application configuration that satisfies user preferences of output quality and timeliness. 相似文献