Similar Documents
20 similar documents found.
1.
Over the past few years, cluster/distributed computing has been gaining popularity, driven by the improved performance and increased reliability of these systems. Many parallel programming languages and related parallel programming models have become widely accepted. However, one of the major shortcomings of running parallel applications on cluster/distributed computing environments is the high communication overhead incurred. To reduce the communication overhead, and thus the completion time of a parallel application, this paper describes a simple, efficient and portable Key Message (KM) approach to support parallel computing on cluster/distributed computing environments. To demonstrate the advantage of the KM approach, a prototype runtime system has been implemented and evaluated. Our preliminary experimental results show that the KM approach yields greater improvement in the communication performance of a parallel application as the network background load increases or as the application's computation-to-communication ratio decreases.

2.
The increases in multi-core processor parallelism and in the flexibility of many-core accelerator processors, such as GPUs, have turned traditional SMP systems into hierarchical, heterogeneous computing environments. Fully exploiting these improvements in parallel system design remains an open problem. Moreover, most current tools for the development of parallel applications for hierarchical systems concentrate on the use of only a single processor type (e.g., accelerators) and do not coordinate several heterogeneous processors. Here, we show that making use of all of the heterogeneous computing resources can significantly improve application performance. Our approach, which consists of optimizing applications at run-time by efficiently coordinating application task execution on all available processing units, is evaluated in the context of replicated dataflow applications. The proposed techniques were developed and implemented in an integrated run-time system targeting both intra- and inter-node parallelism. The experimental results with a real-world complex biomedical application show that our approach nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.
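To make the coordination idea concrete, the following is a minimal sketch (not the authors' run-time system) of demand-driven scheduling across heterogeneous processing units: each device pulls the next task of a replicated dataflow stage as soon as it becomes idle, so faster devices naturally absorb more work. Device names and the per-device task functions are hypothetical placeholders.

```python
import queue
import threading

def worker(device_name, run_on_device, tasks, results):
    # Pull tasks until the shared queue is empty; idle devices never wait on busy ones.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        results.append((device_name, run_on_device(task)))

def run_replicated_stage(task_list, device_impls):
    """device_impls maps a device name ('cpu0', 'gpu0', ...) to a callable for that device."""
    tasks = queue.Queue()
    for t in task_list:
        tasks.put(t)
    results = []
    threads = [threading.Thread(target=worker, args=(name, fn, tasks, results))
               for name, fn in device_impls.items()]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

if __name__ == "__main__":
    # Example: a CPU and a (simulated) GPU share one stage's tasks.
    impls = {"cpu0": lambda x: x * x, "gpu0": lambda x: x * x}
    print(len(run_replicated_stage(list(range(100)), impls)))
```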

3.
A collection of virtual machines (VMs) interconnected with an overlay network with a layer 2 abstraction has proven to be a powerful, unifying abstraction for adaptive distributed and parallel computing on loosely-coupled environments. It is now feasible to allow VMs hosting high performance computing (HPC) applications to seamlessly bridge distributed cloud resources and tightly-coupled supercomputing and cluster resources. However, to achieve the application performance that the tightly-coupled resources are capable of, it is important that the overlay network not introduce significant overhead relative to the native hardware, which is not the case for current user-level tools, including our own existing VNET/U system. In response, we describe the design, implementation, and evaluation of a virtual networking system that has negligible latency and bandwidth overheads in 1–10 Gbps networks. Our system, VNET/P, is directly embedded into our publicly available Palacios virtual machine monitor (VMM). VNET/P achieves native performance on 1 Gbps Ethernet networks and very high performance on 10 Gbps Ethernet networks. The NAS benchmarks generally achieve over 95% of their native performance at both 1 and 10 Gbps. We have further demonstrated that VNET/P can operate successfully over more specialized tightly-coupled networks, such as InfiniBand and Cray Gemini. Our results suggest it is feasible to extend a software-based overlay network designed for computing at wide-area scales into tightly-coupled environments.

4.
I/O-intensive applications have posed great challenges to computational scientists. A major problem of these applications is that users have to sacrifice performance requirements in order to satisfy storage capacity requirements in a conventional computing environment. Further performance improvement is impeded by the physical nature of these storage media even when state-of-the-art I/O optimizations are employed. In this paper, we present a distributed multi-storage resource architecture, which can satisfy both performance and capacity requirements by employing multiple storage resources. Compared to a traditional single-storage-resource architecture, our architecture provides a more flexible and reliable computing environment. This architecture can bring new opportunities for high performance computing as well as inherit state-of-the-art I/O optimization approaches that have already been developed. It provides application users with high-performance storage access even when a single large local storage archive is not available to them. We also develop an Application Programming Interface (API) that provides transparent management of and access to various storage resources in our computing environment. Since I/O usually dominates the performance of I/O-intensive applications, we establish an I/O performance prediction mechanism, consisting of a performance database and a prediction algorithm, to help users better evaluate and schedule their applications. A tool is also developed to help users automatically generate the performance data stored in the databases. The experiments show that our multi-storage resource architecture is a promising platform for high performance distributed computing.
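As an illustration of the prediction idea, the sketch below estimates I/O time per storage resource from recorded latency and bandwidth figures and picks the cheapest resource for a given request size. The resource names, the linear cost model, and the numbers are assumptions for illustration, not the paper's actual performance database or prediction algorithm.

```python
# Hypothetical performance records: resource name -> (startup latency in s, bandwidth in MB/s)
performance_db = {
    "local_disk":   (0.005, 120.0),
    "remote_nfs":   (0.020,  60.0),
    "tape_archive": (8.000, 200.0),
}

def predict_io_time(resource, size_mb):
    # Simple linear model: fixed startup cost plus transfer time.
    latency, bandwidth = performance_db[resource]
    return latency + size_mb / bandwidth

def choose_resource(size_mb):
    return min(performance_db, key=lambda r: predict_io_time(r, size_mb))

print(choose_resource(10))      # small request: the low-latency resource wins
print(choose_resource(50000))   # huge request: the high-bandwidth archive wins
```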

5.
TACO is a template library that implements higher-order parallel operations on distributed object sets by means of reusable topology classes and C++ function templates. In this paper we discuss an experimental application that exploits TACO's distributed object groups and collective operations for computing the similarity between groups of molecular sequences, a computationally intensive core problem in molecular biology research. In particular we show how TACO's distributed collections can be conveniently combined with well-known concepts found in the C++ Standard Template Library (STL) to solve matching and sorting problems effectively on distributed hardware platforms. The resulting implementation is concise and gives excellent parallel performance on PC and workstation clusters.

6.
Cactus Tools for Grid Applications
Cactus is an open source problem solving environment designed for scientists and engineers. Its modular structure facilitates parallel computation across different architectures and collaborative code development between different groups. The Cactus Code originated in the academic research community, where it has been developed and used over many years by a large international collaboration of physicists and computational scientists. We discuss here how the intensive computing requirements of physics applications now using the Cactus Code encourage the use of distributed computing and metacomputing, and detail how its design makes it an ideal application test-bed for Grid computing. We describe the development of tools, and the experiments which have already been performed in a Grid environment with Cactus, including distributed simulations, remote monitoring and steering, and data handling and visualization. Finally, we discuss how Grid portals, such as those already developed for Cactus, will open the door to global computing resources for scientific users.

7.
Current advances in high-speed networks such as ATM and fiber-optics, and software technologies such as the Java programming language and WWW tools, have made network-based computing a cost-effective, high-performance distributed computing environment. Metacomputing, a special subset of network-based computing, is a well-integrated execution environment derived by combining diverse and distributed resources such as MPPs, workstations, mass storage, and databases that show a heterogeneous nature in terms of hardware, software, and organization. In this paper we present the Virtual Distributed Computing Environment (VDCE), a metacomputing environment currently being developed at Syracuse University. VDCE provides an efficient web-based approach for developing, evaluating, and visualizing large-scale distributed applications that are based on predefined task libraries on diverse platforms. The VDCE task libraries relieve end-users of tedious task implementations and also support reusability. The VDCE software architecture is described in terms of three modules: (a) the Application Editor, a user-friendly application development environment that generates the Application Flow Graph (AFG) of an application; (b) the Application Scheduler, which provides an efficient task-to-resource mapping of the AFG; and (c) the VDCE Runtime System, which is responsible for running and managing application execution and for monitoring the VDCE resources. We present experimental results of an application execution on the VDCE prototype, evaluating the performance of different machine and network configurations. We also show how the VDCE can be used as a problem-solving environment on which large-scale, network-centric applications can be developed by a novice programmer rather than by an expert in the low-level details of parallel programming languages.
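The abstract does not detail the Application Scheduler's algorithm; as a generic illustration of task-to-resource mapping over an application flow graph, the sketch below schedules tasks in dependency order onto whichever resource gives the earliest finish time. The graph, task costs, and resource speeds are hypothetical, not VDCE's scheduler.

```python
from collections import deque

def greedy_map(afg, costs, resources):
    """afg: {task: [predecessors]}, costs: {task: work}, resources: {name: speed}."""
    indeg = {t: len(p) for t, p in afg.items()}
    ready = deque(t for t, d in indeg.items() if d == 0)
    free_at = {r: 0.0 for r in resources}
    finish, placement = {}, {}
    while ready:
        task = ready.popleft()
        start_lb = max((finish[p] for p in afg[task]), default=0.0)
        # Earliest-finish-time choice among all resources.
        best = min(resources, key=lambda r: max(free_at[r], start_lb) + costs[task] / resources[r])
        start = max(free_at[best], start_lb)
        finish[task] = start + costs[task] / resources[best]
        free_at[best] = finish[task]
        placement[task] = best
        for t, preds in afg.items():        # release successors whose predecessors are done
            if task in preds:
                indeg[t] -= 1
                if indeg[t] == 0:
                    ready.append(t)
    return placement, max(finish.values())

afg = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(greedy_map(afg, {"a": 4, "b": 8, "c": 2, "d": 4}, {"fast": 2.0, "slow": 1.0}))
```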

8.
A cluster is a group of interconnected computers that acts as a single system to provide users with computing resources; each computer is a node of the cluster. With the rapid development of computer technology, cluster computing, with its high performance–cost ratio, has been widely applied in distributed parallel computing. For the large-scale data of a group enterprise, a heterogeneous data integration model was built in a cluster environment based on cluster computing, XML technology and ontology theory. The model provides users with unified and transparent access interfaces. Based on cluster computing, this work solves the heterogeneous data integration problem by means of ontology and XML technology, and achieves better application results than a traditional data integration model. It was also shown that the model improves the computing capacity of the system with a high performance–cost ratio. It is thus hoped to provide support for the decision-making of enterprise managers.

9.
A key problem in executing performance-critical applications on distributed computing environments (e.g. the Grid) is the selection of resources. Research related to “automatic resource selection” aims to allocate resources on behalf of users to optimize execution performance. However, most current approaches are based on a static principle (i.e. resource selection is performed prior to execution) and need detailed application-specific information. In this paper, we introduce a novel on-line automatic resource selection approach. This approach is based on a simple control theory: the application continuously reports the Execution Satisfaction Degree (ESD) to the middleware Application Agent (AA), which relies on the reported ESD values to learn the execution behavior and tune the computing environment by adding/replacing/deleting resources during the execution in order to satisfy users’ performance requirements. We introduce two different policies applied to this approach to enable the AA to learn and tune the computing environment: the Utility Classification policy and the Desired Processing Power Estimation (DPPE) policy. Each policy is validated with an iterative application and a non-iterative application to demonstrate that both policies are effective for supporting most kinds of applications.
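A minimal sketch of the feedback idea follows: the application periodically reports an ESD value in [0, 1], and an agent grows or shrinks the resource pool accordingly. The thresholds, step sizes, and node limit are illustrative assumptions; they stand in for, but do not reproduce, the paper's Utility Classification and DPPE policies.

```python
def tune_resources(reported_esd, current_nodes, low=0.8, high=0.95, max_nodes=64):
    if reported_esd < low and current_nodes < max_nodes:
        return current_nodes + 1          # under-performing: add a resource
    if reported_esd > high and current_nodes > 1:
        return current_nodes - 1          # over-provisioned: release a resource
    return current_nodes                  # within the target band: keep the current scale

nodes = 4
for esd in [0.60, 0.72, 0.85, 0.97, 0.96, 0.90]:   # ESD values reported over time
    nodes = tune_resources(esd, nodes)
    print(esd, "->", nodes)
```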

10.
Clusters of Symmetrical Multiprocessors (SMPs) have recently become the norm for high-performance, economical computing solutions. Multiple nodes in a cluster can be used for parallel programming using a message passing library. An alternate approach is to use a software Distributed Shared Memory (DSM) to provide a view of shared memory to the application programmer. This paper describes Strings, a high performance distributed shared memory system designed for such SMP clusters. The distinguishing feature of this system is the use of a fully multi-threaded runtime system, using kernel-level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel-level lightweight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system and illustrates the performance improvements with benchmark programs from the SPLASH-2 suite, some computational kernels, as well as a full-fledged application. It is found that using multiple processes on SMP nodes provides good speedups only for a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree.

11.
Emerging high-performance distributed computing environments are enabling new end-to-end formulations in science and engineering that involve multiple interacting processes and data-intensive application workflows. For example, current fusion simulation efforts are exploring coupled models and codes that simultaneously simulate separate application processes, such as the core and the edge turbulence. These components run on different high performance computing resources, and need to interact at runtime with each other and with services for data monitoring, data analysis and visualization, and data archiving. As a result, they require efficient and scalable support for dynamic and flexible couplings and interactions, which remains a challenge. This paper presents DataSpaces, a flexible interaction and coordination substrate that addresses this challenge. DataSpaces essentially implements a semantically specialized virtual shared space abstraction that can be associatively accessed by all components and services in the application workflow. It enables live data to be extracted from running simulation components, indexes this data online, and then allows it to be monitored, queried and accessed by other components and services via the space using semantically meaningful operators. The underlying data transport is asynchronous, low-overhead and largely memory-to-memory. The design, implementation, and experimental evaluation of DataSpaces using a coupled fusion simulation workflow are presented.
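To illustrate the shared-space style of coupling, here is a toy, single-process sketch: one component puts named, versioned data regions into a space, and another queries them by name and bounding box. The put/query names and the in-memory store are illustrative only; the real DataSpaces substrate is distributed, indexed online, and uses low-overhead memory-to-memory transport.

```python
class ToySpace:
    def __init__(self):
        self._store = []   # list of (name, version, lower corner, upper corner, payload)

    def put(self, name, version, lower, upper, payload):
        self._store.append((name, version, tuple(lower), tuple(upper), payload))

    def query(self, name, version, lower, upper):
        """Return payloads whose region overlaps the requested bounding box."""
        def overlaps(lo1, hi1, lo2, hi2):
            # Regions overlap iff they overlap in every dimension.
            return all(a <= d and c <= b for a, b, c, d in zip(lo1, hi1, lo2, hi2))
        return [p for n, v, lo, hi, p in self._store
                if n == name and v == version and overlaps(lo, hi, lower, upper)]

space = ToySpace()
space.put("turbulence", 1, (0, 0), (63, 63), "block A")      # written by one component
space.put("turbulence", 1, (64, 0), (127, 63), "block B")
print(space.query("turbulence", 1, (32, 0), (96, 63)))       # read by another: both blocks overlap
```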

12.
Advances in multicore technology are leading to processors with tens and soon hundreds of cores in a single socket, resulting in an ever-growing gap between computing power and the available memory and I/O bandwidths for data handling. It would be beneficial if some of the computing power could be transformed into gains in I/O efficiency, thereby reducing this speed disparity between computing and I/O. In this paper, we design and implement a NEarline data COmpression and DECompression (neCODEC) scheme for data-intensive parallel applications. Several salient techniques are introduced in neCODEC, including asynchronous compression threads, elastic file representation, distributed metadata handling, and balanced subfile distribution. Our performance evaluation indicates that neCODEC can improve the performance of a variety of data-intensive microbenchmarks and scientific applications. In particular, neCODEC is capable of increasing the effective bandwidth of S3D, a combustion simulation code, by more than 5 times.
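The following is a minimal sketch of the asynchronous-compression idea only: the application thread hands raw chunks to a background thread, which compresses them with zlib before writing, trading spare CPU cycles for reduced I/O volume. The chunking, file name, and compression level are illustrative assumptions, not neCODEC's file representation or metadata handling.

```python
import queue
import threading
import zlib

def compressor(work, out_path):
    # Background thread: drain the queue, compress, and write the smaller payload.
    with open(out_path, "wb") as out:
        while True:
            chunk = work.get()
            if chunk is None:              # sentinel: producer is done
                return
            out.write(zlib.compress(chunk, 1))   # fast level, nearline-friendly

work = queue.Queue(maxsize=8)
t = threading.Thread(target=compressor, args=(work, "field.dat.z"))
t.start()
for _ in range(16):                        # the "simulation" keeps computing while I/O compresses
    work.put(bytes(1024 * 1024))           # highly compressible placeholder data
work.put(None)
t.join()
```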

13.
Brain-computer interaction (BCI) and physiological computing are terms that refer to using processed neural or physiological signals to influence human interaction with computers, the environment, and each other. A major challenge in developing these systems arises from the large individual differences typically seen in neural/physiological responses. As a result, many researchers use individually-trained recognition algorithms to process this data. In order to minimize time, cost, and barriers to use, there is a need to minimize the amount of individual training data required, or equivalently, to increase the recognition accuracy without increasing the number of user-specific training samples. One promising method for achieving this is collaborative filtering, which combines training data from the individual subject with additional training data from other, similar subjects. This paper describes a successful application of a collaborative filtering approach intended for a BCI system. This approach is based on transfer learning (TL), active class selection (ACS), and a mean squared difference user-similarity heuristic. The resulting BCI system uses neural and physiological signals for automatic task difficulty recognition. TL improves the learning performance by combining a small number of user-specific training samples with a large number of auxiliary training samples from other, similar subjects. ACS optimally selects the classes to generate user-specific training samples. Experimental results on 18 subjects, using both nearest neighbors and support vector machine classifiers, demonstrate that the proposed approach can significantly reduce the number of user-specific training samples required. This collaborative filtering approach should also generalize to handling individual differences in many other applications that involve human neural or physiological data, such as affective computing.
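A small sketch of the user-similarity part of this idea: auxiliary subjects whose feature responses are close (low mean squared difference) to the new user's few labeled samples are given larger weight when their data is borrowed. The feature vectors and the 1/(1+MSD) weighting rule are illustrative assumptions; the paper combines its heuristic with TL and ACS rather than using this exact formula.

```python
def mean_squared_difference(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def similarity_weights(new_user_mean, aux_user_means):
    """Map each auxiliary subject to a weight in (0, 1]; closer subjects weigh more."""
    return {subj: 1.0 / (1.0 + mean_squared_difference(new_user_mean, mu))
            for subj, mu in aux_user_means.items()}

new_user = [0.2, 0.5, 0.1]                               # mean feature response of the new user
aux = {"s1": [0.25, 0.45, 0.15], "s2": [0.9, 0.1, 0.8]}  # mean responses of two auxiliary subjects
print(similarity_weights(new_user, aux))                 # s1 gets a much larger weight than s2
```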

14.
Debugging is an essential part of parallel and distributed processing. However, developing a parallel and distributed debugger is difficult, especially for cluster computing, where heterogeneity is present. In this paper, we first give a survey of current debugging techniques and existing tools, and then present a client–server debugging model. Based on this model, we discuss in detail the design and development of a practical, scalable distributed debugging system for cluster computing, and give two case studies to show how the distributed debugging system efficiently supports debugging message-passing programs such as various MPI and PVM programs. The newly developed distributed debugger is built on the sequential debuggers gdb and dbx. It has the capability of scaling to handle hundreds of processes. Its interfaces are completely implemented in Java, and its graphical user interface is the same on all computing platforms. In addition, it is portable and easy to learn and use.

15.
Taking advantage of distributed storage technology and virtualization technology, cloud storage systems provide virtual machine clients with customizable storage services. They can be divided into two types: distributed file systems and block-level storage systems. Existing block-level storage systems have two disadvantages. First, some of them are tightly coupled with their cloud computing environments, so it is hard to extend them to support other cloud computing platforms. Second, the volume server is a bottleneck that seriously affects the performance and reliability of the whole system. In this paper we present ORTHRUS, a lightweight block-level storage system for clouds based on virtualization technology. We first design an architecture with multiple volume servers and its workflows, which improves system performance and avoids the single volume server bottleneck. Second, we propose a Listen-Detect-Switch mechanism for ORTHRUS to deal with volume server failures. Finally, we design a strategy that dynamically balances load between the multiple volume servers. We characterize machine capability and load quantity with a black-box model, and implement the dynamic load balance strategy based on a genetic algorithm. Extensive experimental results show that the aggregated I/O throughput of ORTHRUS is significantly improved (approximately two times that of a single volume server), and that both I/O throughput and IOPS are further improved by our dynamic load balance strategy (by about 1.8 and 1.2 times, respectively).
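As a deliberately simplified illustration of capability-aware placement (the paper's actual strategy is genetic-algorithm based), the sketch below routes each new volume to the server with the lowest load-to-capability ratio, so faster servers absorb proportionally more volumes. Server names, capabilities, and loads are made-up numbers.

```python
servers = {"vs1": {"capability": 4.0, "load": 0.0},
           "vs2": {"capability": 2.0, "load": 0.0},
           "vs3": {"capability": 1.0, "load": 0.0}}

def place(volume_load):
    # Pick the volume server that would have the lowest load/capability ratio after placement.
    target = min(servers,
                 key=lambda s: (servers[s]["load"] + volume_load) / servers[s]["capability"])
    servers[target]["load"] += volume_load
    return target

print([place(1.0) for _ in range(7)])   # placements roughly follow the 4:2:1 capability ratio
```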

16.
The delivery of scalable, rich multimedia applications and services on the Internet requires sophisticated technologies for transcoding, distributing, and streaming content. Cloud computing provides an infrastructure for such technologies, but specific challenges still remain in the areas of task management, load balancing, and fault tolerance. To address these issues, we propose a cloud-based distributed multimedia streaming service (CloudDMSS), which is designed to run on all major cloud computing services. CloudDMSS is highly adapted to the structure and policies of Hadoop, and thus has additional capacities for transcoding, task distribution, load balancing, and content replication and distribution. To satisfy the design requirements of our service architecture, we propose four important algorithms: content replication, system recovery for Hadoop distributed multimedia streaming, cloud multimedia management, and streaming resource-based connection (SRC) for streaming job distribution. To evaluate the proposed system, we conducted several performance tests on a local testbed: transcoding, streaming job distribution using SRC, streaming service deployment, and robustness to data node and task failures. In addition, we performed three different tests in an actual cloud computing environment, Cloudit 2.0: transcoding, streaming job distribution using SRC, and streaming service deployment.

17.
Heterogeneous networked clusters are being increasingly used as platforms for resource-intensive parallel and distributed applications. The fundamental underlying idea is to provide large amounts of processing capacity over extended periods of time by harnessing the idle and available resources on the network in an opportunistic manner. In this paper we present the design, implementation and evaluation of a framework that uses JavaSpaces to support this type of opportunistic adaptive parallel/distributed computing over networked clusters in a non-intrusive manner. The framework targets applications exhibiting coarse grained parallelism and has three key features: (1) portability across heterogeneous platforms, (2) minimal configuration overheads for participating nodes, and (3) automated system state monitoring (using SNMP) to ensure non-intrusive behavior. Experimental results presented in this paper demonstrate that for applications that can be broken into coarse-grained, relatively independent tasks, the opportunistic adaptive parallel computing framework can provide performance gains. Furthermore, the results indicate that monitoring and reacting to the current system state minimizes the intrusiveness of the framework.

18.
Recently, software distributed shared memory (DSM) systems have successfully provided an easy user interface to parallel user applications on distributed systems. To improve program performance, most DSM systems greedily utilize all of the available processors in a computer network to execute user programs. However, using more processors to execute a program does not necessarily guarantee better performance. The overhead of parallelizing a program increases with the number of processors used for execution. If the performance gain from parallelism cannot compensate for this overhead, increasing the number of execution processors results in performance degradation and wasted resources. In this paper, we propose a mechanism that dynamically finds a suitable system scale to optimize the performance of DSM applications according to run-time information. The experimental results show that the proposed mechanism can precisely predict the processor count that will yield the best performance and then effectively optimize the performance of the test applications by adapting the system scale according to the predicted result.
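An illustrative model of why "more processors" is not always faster: if per-node overhead grows with the system scale, total time T(p) = W/p + c*p has a sweet spot well below the full cluster size. The cost model and constants below are assumptions for illustration, not the paper's run-time prediction mechanism.

```python
def predicted_time(p, work=1000.0, overhead_per_node=2.0):
    # Computation shrinks with p; coordination overhead grows with p.
    return work / p + overhead_per_node * p

best = min(range(1, 65), key=predicted_time)
print(best, round(predicted_time(best), 1))   # ~22 nodes, not 64, minimizes this model
```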

19.
In complex software systems, modularity and readability tend to be degraded owing to inseparable interactions between concerns, which are distinct features in a program. Such interactions result in tangled code that is hard to develop and maintain. Aspect-Oriented Programming (AOP) is a powerful method for modularizing source code and for decoupling cross-cutting concerns. A decade of growing research on AOP has brought the paradigm into many exciting areas. However, pioneering work on AOP has not flourished enough to enrich the design of distributed systems using the refined AOP paradigm. This article investigates three case studies that cover time-honored issues such as fault-tolerant computing, network heterogeneity, and object replication in the cluster computing community using the AOP paradigm. The aspects that we define here are simple, intuitive, and reusable. Our extensive experience shows that (i) AOP can improve the modularity of cluster computing software by separating the source code into base and instrumented parts, and (ii) AOP helps developers to deploy additional features to legacy cluster computing software without harming code modularity and system performance.

20.
Both distributed systems and multicore systems are difficult programming environments. Although the expert programmer may be able to carefully tune these systems to achieve high performance, the non-expert may struggle. We argue that high level abstractions are an effective way of making parallel computing accessible to the non-expert. An abstraction is a regularly structured framework into which a user may plug in simple sequential programs to create very large parallel programs. By virtue of a regular structure and declarative specification, abstractions may be materialized on distributed, multicore, and distributed multicore systems with robust performance across a wide range of problem sizes. In previous work, we presented the All-Pairs abstraction for computing on distributed systems of single CPUs. In this paper, we extend All-Pairs to multicore systems, and introduce the Wavefront and Makeflow abstractions, which represent a number of problems in economics and bioinformatics. We demonstrate good scaling of both abstractions up to 32 cores on one machine and hundreds of cores in a distributed system.
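To show what "plugging a sequential program into a regular structure" means, here is a minimal sequential sketch in the spirit of the Wavefront abstraction: the user supplies a simple function f, and the framework fills an n-by-n grid where each cell depends on its left, lower, and diagonal neighbors; cells on the same anti-diagonal are independent and are the units a real implementation would run in parallel across cores or nodes. The function names and the toy recurrence are illustrative, not the authors' API.

```python
def wavefront(n, f, boundary=0):
    grid = [[boundary] * (n + 1) for _ in range(n + 1)]
    for d in range(2, 2 * n + 1):                      # sweep anti-diagonals in dependency order
        for i in range(max(1, d - n), min(n, d - 1) + 1):
            j = d - i
            grid[i][j] = f(grid[i - 1][j], grid[i][j - 1], grid[i - 1][j - 1])
    return grid[n][n]

# Example user function: a toy dynamic-programming recurrence plugged into the frame.
print(wavefront(4, lambda up, left, diag: max(up, left, diag) + 1))
```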
