期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

High Performance Network of PC Cluster Maestro

Koichi Wada Shinichi Yamagiwa Munehiro Fukuda 《Cluster computing》2002,5(1):33-42

This paper presents a design, an architecture, and performance evaluation of high-performance network of PC cluster, called Maestro. Most networks of recent clusters have been organized based on WAN or LAN technology, due to their market availability. However, communication protocols and functions of such conventional networks are not optimal for parallel computing, which requires low latency and high bandwidth communication. In this paper, we propose two optimizations for high-performance communication: (1) transferring in burst as many packets as the receiving buffer accepts at once, and (2) having each hardware component pass one data unit to another in a pipelined manner. We have developed a network interface and a switch, which are composed of dedicated hardware modules to realize these optimizations. An implementatin of the message passing library developed on Maestro cluster is also described. Performance evaluation shows that the proposed optimizations can extract the potential performance of the physical layer efficiently and improve the performance in communication. 相似文献

2.

Block‐cyclic redistribution over heterogeneous networks

Prashanth B. Bhat Viktor K. Prasanna C.S. Raghavendra 《Cluster computing》2000,3(1):25-34

Clusters of workstations and networked parallel computing systems are emerging as promising computational platforms for HPC applications. The processors in such systems are typically interconnected by a collection of heterogeneous networks such as Ethernet, ATM, and FDDI, among others. In this paper, we develop techniques to perform block-cyclic redistribution over P processors interconnected by such a collection of heterogeneous networks. We represent the communication scheduling problem using a timing diagram formalism. Here, each interprocessor communication event is represented by a rectangle whose height denotes the time to perform this event over the heterogeneous network. The communication scheduling problem is then one of appropriately positioning the rectangles so as to minimize the completion time of all the communication events. For the important case where the block size changes by a factor of K, we develop a heuristic algorithm whose completion time is at most twice the optimal. The running time of the heuristic is O(PK ²). Our heuristic algorithm is adaptive to variations in network performance, and derives schedules at run-time, based on current information about the available network bandwidth. Our experimental results show that our schedules always have communication times that are very close to optimal. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

3.

ACS: An adaptive communication system for heterogeneous wide-area ATM clusters

Sung-Yong Park Salim Hariri 《Cluster computing》1999,2(3):229-246

This paper presents an architecture, implementation, and performance evaluation of an adaptive message-passing system for a heterogeneous wide-area ATM cluster that we call the Adaptive Communication System (ACS). ACS uses multithreading to provide efficient techniques for overlapping computation and communication in wide-area computing. By separating control and data activities, ACS eliminates unnecessary control transfers over the data path. This optimizes the data path and improves the performance. ACS supports several different flow control algorithms, error control algorithms, and multicasting algorithms. Furthermore, ACS allows programmers to select at runtime the suitable communication schemes per-connection basis to meet the requirements of a given application. ACS provides three application communication interfaces: Socket Communication Interface (SCI), ATM Communication Interface (ACI), and High Performance Interface (HPI) to support various classes of applications. The SCI is provided mainly for applications that must be portable to many different computing platforms. The ACI provides services that are compatible with ATM connection oriented services where each connection can be configured to meet the Quality of Service (QOS) requirements of that connection. This allows programmers to fully utilize the benefits of the ATM network. The HPI supports applications that demand low-latency and high-throughput communication services. In this interface, ACS uses read/write trap routines to reduce latency and data transfer time, and to avoid using traditional communication protocols. We analyze and compare the performance of ACS with those of other message-passing systems such as p4, PVM, and MPI in terms of point-to-point, multicasting, and application performance. The benchmarking results show that ACS outperforms other message-passing systems and provides flexible communication services for various classes of applications. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

4.

Supporting parallel applications on clusters of workstations: The Virtual Communication Machine‐based architecture

Marcel‐Cătălin Roşu Karsten Schwan Richard Fujimoto 《Cluster computing》1998,1(1):51-67

This paper presents a novel networking architecture designed for communication intensive parallel applications running on clusters of workstations (COWs) connected by high speed networks. The architecture addresses what is considered one of the most important problems of cluster-based parallel computing: the inherent inability of scaling the performance of communication software along with the host CPU performance. The Virtual Communication Machine (VCM), resident on the network coprocessor, presents a scalable software solution by providing configurable communication functionality directly accessible at user-level. The VCM architecture is configurable in that it enables the transfer to the VCM of selected communication-related functionality that is traditionally part of the application and/or the host kernel. Such transfers are beneficial when a significant reduction of the host CPU's load translates into a small increase in the coprocessor's load. The functionality implemented by the coprocessor is available at the application level as VCM instructions. Host CPU(s) and coprocessor interact through shared memory regions, thereby avoiding expensive CPU context switches. The host kernel is not involved in this interaction; it simply “connects” the application to the VCM during the initialization phase and is called infrequently to handle exceptional conditions. Protection is enforced by the VCM based on information supplied by the kernel. The VCM-based communication architecture admits low cost and open implementations, as demonstrated by its current ATM-based implementation based on off-the-shelf hardware components and using standard AAL5 packets. The architecture makes it easy to implement communication software that exhibits negligible overheads on the host CPU(s) and offers latencies and bandwidths close to the hardware limits of the underlying network. These characteristics are due to the VCM's support for zero-copy messaging with gather/scatter capabilities and the VCM's direct access to any data structure in an application's address space. This paper describes two versions of an ATM-based VCM implementation, which differ in the way they use the memory on the network adapter. Their performance under heavy load is compared in the context of a synthetic client/server application. The same application is used to evaluate the scalability of the architecture to multiple VCM-based network interfaces per host. Parallel implementations of the Traveling Salesman Problem and of Georgia Tech Time Warp, an engine for discrete-event simulation, are used to demonstrate VCM functionality and the high performance of its implementation. The distributed- and shared-memory versions of these two applications exhibit comparable performance, despite the significant cost-performance advantage of the distributed-memory platform. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

5.

Object Placement Using Performance Surfaces

André Turgeon Quinn Snell Mark Clement 《Cluster computing》2001,4(3):263-273

Heterogeneous parallel clusters of workstations are being used to solve many important computational problems. Scheduling parallel applications on the best collection of machines in a heterogeneous computing environment is a complex problem. Performance prediction is vital to good application performance in this environment since utilization of an ill-suited machine can slow the computation down significantly. This paper addresses the problem of network performance prediction. A new methodology for characterizing network links and application's need for network resources is developed which makes use of Performance Surfaces [3]. This Performance Surface abstraction is used to schedule a parallel application on resources where it will run most efficiently. 相似文献

6.

Fast VMM-based overlay networking for bridging the cloud and high performance computing

Lei Xia Zheng Cui John Lange Yuan Tang Peter Dinda Patrick Bridges 《Cluster computing》2014,17(1):39-59

A collection of virtual machines (VMs) interconnected with an overlay network with a layer 2 abstraction has proven to be a powerful, unifying abstraction for adaptive distributed and parallel computing on loosely-coupled environments. It is now feasible to allow VMs hosting high performance computing (HPC) applications to seamlessly bridge distributed cloud resources and tightly-coupled supercomputing and cluster resources. However, to achieve the application performance that the tightly-coupled resources are capable of, it is important that the overlay network not introduce significant overhead relative to the native hardware, which is not the case for current user-level tools, including our own existing VNET/U system. In response, we describe the design, implementation, and evaluation of a virtual networking system that has negligible latency and bandwidth overheads in 1–10 Gbps networks. Our system, VNET/P, is directly embedded into our publicly available Palacios virtual machine monitor (VMM). VNET/P achieves native performance on 1 Gbps Ethernet networks and very high performance on 10 Gbps Ethernet networks. The NAS benchmarks generally achieve over 95 % of their native performance on both 1 and 10 Gbps. We have further demonstrated that VNET/P can operate successfully over more specialized tightly-coupled networks, such as Infiniband and Cray Gemini. Our results suggest it is feasible to extend a software-based overlay network designed for computing at wide-area scales into tightly-coupled environments. 相似文献

7.

Optimizing transport protocol parameters for large scale PC cluster and its evaluation with parallel data mining

Masato Oguchi Masaru Kitsuregawa 《Cluster computing》2000,3(1):15-23

Recently, PC clusters have come to be studied intensively for large scale parallel computers of the next generation. ATM technology is a strong candidate as a de facto standard of high speed communication networks. Therefore, an ATM-connected PC cluster is a promising platform from the cost/performance point of view, as a future high performance computing environment. Data intensive applications, such as data mining and ad hoc query processing in databases, are considered very important for massively parallel processors, as well as for conventional scientific calculations. Thus, investigating the feasibility of applications on an ATM-connected PC cluster is meaningful. In this paper, an ATM-connected PC cluster consisting of 100 PCs is reported, and characteristics of a transport layer protocol for the PC cluster are evaluated. Point-to-point communication performance is measured and discussed, when a TCP window size parameter is changed. Parallel data mining is implemented and evaluated on the cluster. Retransmission caused by cell loss at the ATM switch is analyzed, and parameters of retransmission mechanism suitable for parallel processing on the large scale PC cluster are clarified. Default TCP protocol cannot provide good performance, since a lot of collisions happen during all-to-all multicasting executed on the large scale PC cluster. Using TCP parameters with the proposed optimization, performance improvement is achieved for parallel data mining on 100 PCs. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

8.

Fast parallel Markov clustering in bioinformatics using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format

Bustamam A Burrage K Hamilton NA 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2012,9(3):679-692

Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However,with increasing vast amount of data on biological networks, performance and scalability issues are becoming a critical limiting factor in applications. Meanwhile, GPU computing, which uses CUDA tool for implementing a massively parallel computing environment in the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve substantial performance gains over CPU approaches. The use of on-chip memory on the GPU is efficiently lowering the latency time, thus, circumventing a major issue in other parallel computing environments, such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized ELLPACK-R sparse format to allow the effective and fine-grain massively parallel processing to cope with the sparse nature of interaction networks data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on CPU. Thus, large-scale parallel computation on off-the-shelf desktop-machines, that were previously only possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data. 相似文献

9.

Planning a secure and reliable IoT-enabled FOG-assisted computing infrastructure for healthcare

Ali Hafiz Munsub Liu Jun Bukhari Syed Ahmad Chan Rauf Hafiz Tayyab 《Cluster computing》2022,25(3):2143-2161

Transmitting electronic medical records (EMR) and other communication in modern Internet of Things (IoT) healthcare ecosystem is both delay and integrity-sensitive. Transmitting and computing volumes of EMR data on traditional clouds away from healthcare facilities is a main source of trust-deficit using IoT-enabled applications. Reliable IoT-enabled healthcare (IoTH) applications demand careful deployment of computing and communication infrastructure (CnCI). This paper presents a FOG-assisted CnCI model for reliable healthcare facilities. Planning a secure and reliable CnCI for IoTH networks is a challenging optimization task. We proposed a novel mathematical model (i.e., integer programming) to plan FOG-assisted CnCI for IoTH networks. It considers wireless link interfacing gateways as a virtual machine (VM). An IoTH network contains three wirelessly communicating nodes: VMs, reduced computing power gateways (RCPG), and full computing power gateways (FCPG). The objective is to minimize the weighted sum of infrastructure and operational costs of the IoTH network planning. Swarm intelligence-based evolutionary approach is used to solve IoTH networks planning for superior quality solutions in a reasonable time. The discrete fireworks algorithm with three local search methods (DFWA-3-LSM) outperformed other experimented algorithms in terms of average planning cost for all experimented problem instances. The DFWA-3-LSM lowered the average planning cost by 17.31%, 17.23%, and 18.28% when compared against discrete artificial bee colony with 3 LSM (DABC-3-LSM), low-complexity biogeography-based optimization (LC-BBO), and genetic algorithm, respectively. Statistical analysis demonstrates that the performance of DFWA-3-LSM is better than other experimented algorithms. The proposed mathematical model is envisioned for secure, reliable and cost-effective EMR data manipulation and other communication in healthcare.

相似文献

10.

Scalable remote homology detection and fold recognition in massive protein networks

Raphael Petegrosso Zhuliu Li Molly A. Srour Yousef Saad Wei Zhang Rui Kuang 《Proteins》2019,87(6):478-491

The global connectivities in very large protein similarity networks contain traces of evolution among the proteins for detecting protein remote evolutionary relations or structural similarities. To investigate how well a protein network captures the evolutionary information, a key limitation is the intensive computation of pairwise sequence similarities needed to construct very large protein networks. In this article, we introduce label propagation on low-rank kernel approximation (LP-LOKA) for searching massively large protein networks. LP-LOKA propagates initial protein similarities in a low-rank graph by Nyström approximation without computing all pairwise similarities. With scalable parallel implementations based on distributed-memory using message-passing interface and Apache-Hadoop/Spark on cloud, LP-LOKA can search protein networks with one million proteins or more. In the experiments on Swiss-Prot/ADDA/CASP data, LP-LOKA significantly improved protein ranking over the widely used HMM-HMM or profile-sequence alignment methods utilizing large protein networks. It was observed that the larger the protein similarity network, the better the performance, especially on relatively small protein superfamilies and folds. The results suggest that computing massively large protein network is necessary to meet the growing need of annotating proteins from newly sequenced species and LP-LOKA is both scalable and accurate for searching massively large protein networks. 相似文献

11.

Key Message Approach to Optimize Communication of Parallel Applications on Clusters

Ming Zhu Wentong Cai Bu-Sung Lee 《Cluster computing》2003,6(3):253-265

Over the past few years, cluster/distributed computing has been gaining popularity. The proliferation of the cluster/distributed computing is due to the improved performance and increased reliability of these systems. Many parallel programming languages and related parallel programming models have become widely accepted. However, one of the major shortcomings of running parallel applications on cluster/distributed computing environments is the high communication overhead incurred. To reduce the communication overhead, and thus the completion time of a parallel application, this paper describes a simple, efficient and portable Key Message (KM) approach to support parallel computing on cluster/distributed computing environments. To demonstrate the advantage of the KM approach, a prototype runtime system has been implemented and evaluated. Our preliminary experimental results show that the KM approach has better improvement on communication of a parallel application when network background load increases or the computation to communication ratio of the application decreases. 相似文献

12.

Adaptive resource utilization and remote access capabilities in high‐performance distributed systems: The Open HPC++ approach

Shridhar Diwan Dennis Gannon 《Cluster computing》2000,3(1):1-14

Development of high-performance distributed applications, called metaapplications, is extremely challenging because of their complex runtime environment coupled with their requirements of high-performance and Quality of Service (QoS). Such applications typically run on a set of heterogeneous machines with dynamically varying loads, connected by heterogeneous networks possibly supporting a wide variety of communication protocols. In spite of the size and complexity of such applications, they must provide the high-performance and QoS mandated by their users. In order to achieve the goal of high-performance, they need to adaptively utilize their computational and communication resources. Apart from the requirements of adaptive resource utilization, such applications have a third kind of requirement related to remote access QoS. Different clients, although accessing a single server resource, may have differing QoS requirements from their remote connections. A single server resource may also need to provide different QoS for different clients, depending on various issues such as the amount of trust between the server and a given client. These QoS requirements can be encapsulated under the abstraction of remote access capabilities. Metaapplications need to address all the above three requirements in order to achieve the goal of high-performance and satisfy user expectations of QoS. This paper presents Open HPC++, a programming environment for high-performance applications running in a complex and heterogeneous run-time environment. Open HPC++ provides application level tools and mechanisms to satisfy application requirements of adaptive resource utilization and remote access capabilities. Open HPC++ is designed on the lines of CORBA and uses an Object Request Broker (ORB) to support seamless communication between distributed application components. In order to provide adaptive utilization of communication resources, it uses the principle of open implementation to open up the communication mechanisms of its ORB. By virtue of its open architecture, the ORB supports multiple, possibly custom, communication protocols, along with automatic and user controlled protocol selection at run-time. An extension of the same mechanism is used to support the concept of remote access capabilities. In order to support adaptive utilization of computational resources, Open HPC++ also provides a flexible yet powerful set of load-balancing mechanisms that can be used to implement custom load-balancing strategies. The paper also presents performance evaluations of Open HPC++ adaptivity and load-balancing mechanisms. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

13.

Scheduling parallel applications in distributed networks

Jon B. Weissman Xin Zhao 《Cluster computing》1998,1(1):109-118

Prophet is a run-time scheduling system designed to support the efficient execution of parallel applications written in the Mentat programming language (Grimshaw, 1993). Prior results demonstrated that SPMD applications could be scheduled automatically in an Ethernet-based local-area workstation network with good performance (Weissman and Grimshaw, 1994 and 1995). This paper describes our recent efforts to extend Prophet along several dimensions: improved overhead control, greater resource sharing, greater resource heterogeneity, wide-area scheduling, and new application types. We show that both SPMD and task parallel applications can be scheduled effectively in a shared heterogeneous LAN environment containing ethernet and ATM networks by exploiting the application structure and dynamic run-time information. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

14.

Parallel FFT on ATM‐based networks of workstations

Suresh Chalasani Parameswaran Ramanathan 《Cluster computing》1998,1(1):13-26

We consider parallel computing on a network of workstations using a connection-oriented protocol (e.g., Asynchronous Transfer Mode) for data communication. In a connection-oriented protocol, a virtual circuit of guaranteed bandwidth is established for each pair of communicating workstations. Since all virtual circuits do not have the same guaranteed bandwidth, a parallel application must deal with the unequal bandwidths between workstations. Since most works in the design of parallel algorithms assume equal bandwidths on all the communication links, they often do not perform well when executed on networks of workstations using connection-oriented protocols. In this paper, we first evaluate the performance degradation caused by unequal bandwidths on the execution of conventional parallel algorithms such as the fast Fourier transform and bitonic sort. We then present a strategy based on dynamic redistribution of data points to reduce the bottlenecks caused by unequal bandwidths. We also extend this strategy to deal with processor heterogeneity. Using analysis and simulation we show that there is a considerable reduction in the runtime if the proposed redistribution strategy is adopted. The basic idea presented in this paper can also be used to improve the runtimes of other parallel applications in connection-oriented environments. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

15.

A Framework for Adaptive Cluster Computing Using JavaSpaces

Jyoti Batheja Manish Parashar 《Cluster computing》2003,6(3):201-213

Heterogeneous networked clusters are being increasingly used as platforms for resource-intensive parallel and distributed applications. The fundamental underlying idea is to provide large amounts of processing capacity over extended periods of time by harnessing the idle and available resources on the network in an opportunistic manner. In this paper we present the design, implementation and evaluation of a framework that uses JavaSpaces to support this type of opportunistic adaptive parallel/distributed computing over networked clusters in a non-intrusive manner. The framework targets applications exhibiting coarse grained parallelism and has three key features: (1) portability across heterogeneous platforms, (2) minimal configuration overheads for participating nodes, and (3) automated system state monitoring (using SNMP) to ensure non-intrusive behavior. Experimental results presented in this paper demonstrate that for applications that can be broken into coarse-grained, relatively independent tasks, the opportunistic adaptive parallel computing framework can provide performance gains. Furthermore, the results indicate that monitoring and reacting to the current system state minimizes the intrusiveness of the framework. 相似文献

16.

Cluster optimization algorithm based on CPU and GPU hybrid architecture

Yin Fei Shi Feng 《Cluster computing》2022,25(4):2601-2611

With the rapid development of network technology and parallel computing, clusters formed by connecting a large number of PCs with high-speed networks have gradually replaced the status of supercomputers in scientific research and production and high-performance computing with cost-effective advantages. The research purpose of this paper is to integrate the Kriging proxy model method and energy efficiency modeling method into a cluster optimization algorithm of CPU and GPU hybrid architecture. This paper proposes a parallel computing model for large-scale CPU/GPU heterogeneous high-performance computing systems, which can effectively describe the computing capabilities and various communication behaviors of CPU/GPU heterogeneous systems, and finally provide algorithm optimization for CPU/GPU heterogeneous clusters. According to the GPU architecture, an efficient method of constructing a Kriging proxy model and an optimized search algorithm are designed. The experimental results in this paper show that the construction of the Kriging proxy model can obtain a 220 times speedup ratio, and the search algorithm can reach an 8 times speedup ratio. It can be seen that this heterogeneous cluster optimization algorithm has high feasibility.

相似文献

17.

Finding a suitable system scale to optimize program performance on software DSM systems

Yi-Chang Zhuang Jyh-Biau Chang Tyng-Yeu Liang Ce-Kuen Shieh Laurence Tianruo Yang 《Cluster computing》2006,9(3):223-236

Recently, software distributed shared memory systems have successfully provided an easy user interface to parallel user applications on distributed systems. In order to prompt program performance, most of DSM systems usually were greedy to utilize all of available processors in a computer network to execute user programs. However, using more processors to execute programs cannot necessarily guarantee to obtain better program performance. The overhead of paralleling programs is increased by the addition in the number of processors used for program execution. If the performance gain from program parallel cannot compensate for the overhead, increasing the number of execution processors will result in performance degradation and resource waste. In this paper, we proposed a mechanism to dynamically find a suitable system scale to optimize performance for DSM applications according to run-time information. The experimental results show that the proposed mechanism can precisely predict the processor number that will result in the best performance and then effectively optimize the performance of the test applications by adapting system scale according to the predicted result. Yi-Chang Zhuang received his B.S., M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University in 1995, 1997, and 2004. He is currently working as an engineer at Industrial Technology Research Institute in Taiwan. His research interests include object-based storage, file systems, distributed systems, and grid computing. Jyh-Biau Chang is currently an assistant professor at the Information Management Department of Leader University in Taiwan. He received his B.S., M.S. and Ph.D. degrees from Electrical Engineering Department of National Cheng Kung University in 1994, 1996, and 2005. His research interest is focused on cluster and grid computing, parallel and distributed system, and operating system. Tyng-Yeu Liang is currently an assistant professor who teaches and studies at Department of Electrical Engineering, National Kaohsiung University of Applied Sciences in Taiwan. He received his B.S., M.S. and Ph.D. degrees from National Cheng Kung University in 1992, 1994, and 2000. His study is interested in cluster and grid computing, image processing and multimedia. Ce-Kuen Shieh currently is a professor at the Electrical Engineering Department of National Cheng Kung University in Taiwan. He is also the chief of computation center at National Cheng Kung University. He received his Ph.D. degree from the Department of Electrical Engineering of National Cheng Kung University in 1988. He was the chairman of the Electrical Engineering Department of National Cheng Kung University from 2002 to 2005. His research interest is focused on computer network, and parallel and distributed system. Laurence T. Yang is a professor at the Department of Computer Science, St. Francis Xavier University, Canada. His research includes high performance computing and networking, embedded systems, ubiquitous/pervasive computing and intelligence, and autonomic and trusted computing. 相似文献

18.

The design and evaluation of a virtual distributed computing environment

Haluk Topcuoglu Salim Hariri Dongmin Kim Yoonhee Kim Xue Bing Baoqing Ye Ilkyeun Ra Jon Valente 《Cluster computing》1998,1(1):81-93

Current advances in high-speed networks such as ATM and fiber-optics, and software technologies such as the JAVA programming language and WWW tools, have made network-based computing a cost-effective, high-performance distributed computing environment. Metacomputing, a special subset of network-based computing, is a well-integrated execution environment derived by combining diverse and distributed resources such as MPPs, workstations, mass storage, and databases that show a heterogeneous nature in terms of hardware, software, and organization. In this paper we present the Virtual Distributed Computing Environment (VDCE), a metacomputing environment currently being developed at Syracuse University. VDCE provides an efficient web-based approach for developing, evaluating, and visualizing large-scale distributed applications that are based on predefined task libraries on diverse platforms. The VDCE task libraries relieve end-users of tedious task implementations and also support reusability. The VDCE software architecture is described in terms of three modules: (a) the Application Editor, a user-friendly application development environment that generates the Application Flow Graph (AFG) of an application; (b) the Application Scheduler, which provides an efficient task-to-resource mapping of AFG; and (c) the VDCE Runtime System, which is responsible for running and managing application execution and for monitoring the VDCE resources. We present experimental results of an application execution on the VDCE prototype for evaluating the performance of different machine and network configurations. We also show how the VDCE can be used as a problem-solving environment on which large-scale, network-centric applications can be developed by a novice programmer rather than by an expert in low-level details of parallel programming languages. This revised version was published online in July 2006 with corrections to the Cover Date. 相似文献

19.

Developing Subdomain Allocation Algorithms Based on Spatial and Communicational Constraints to Accelerate Dust Storm Simulation 总被引：1，自引：0，他引：1

Zhipeng Gui Manzhu Yu Chaowei Yang Yunfeng Jiang Songqing Chen Jizhe Xia Qunying Huang Kai Liu Zhenlong Li Mohammed Anowarul Hassan Baoxuan Jin 《PloS one》2016,11(4)

Dust storm has serious disastrous impacts on environment, human health, and assets. The developments and applications of dust storm models have contributed significantly to better understand and predict the distribution, intensity and structure of dust storms. However, dust storm simulation is a data and computing intensive process. To improve the computing performance, high performance computing has been widely adopted by dividing the entire study area into multiple subdomains and allocating each subdomain on different computing nodes in a parallel fashion. Inappropriate allocation may introduce imbalanced task loads and unnecessary communications among computing nodes. Therefore, allocation is a key factor that may impact the efficiency of parallel process. An allocation algorithm is expected to consider the computing cost and communication cost for each computing node to minimize total execution time and reduce overall communication cost for the entire simulation. This research introduces three algorithms to optimize the allocation by considering the spatial and communicational constraints: 1) an Integer Linear Programming (ILP) based algorithm from combinational optimization perspective; 2) a K-Means and Kernighan-Lin combined heuristic algorithm (K&K) integrating geometric and coordinate-free methods by merging local and global partitioning; 3) an automatic seeded region growing based geometric and local partitioning algorithm (ASRG). The performance and effectiveness of the three algorithms are compared based on different factors. Further, we adopt the K&K algorithm as the demonstrated algorithm for the experiment of dust model simulation with the non-hydrostatic mesoscale model (NMM-dust) and compared the performance with the MPI default sequential allocation. The results demonstrate that K&K method significantly improves the simulation performance with better subdomain allocation. This method can also be adopted for other relevant atmospheric and numerical modeling. 相似文献

20.

Parallel application-level behavioral attributes for performance and energy management of high-performance computing systems 总被引：1，自引：0，他引：1

Jeffrey J. Evans Charles E. Lucas 《Cluster computing》2013,16(1):91-115

Run time variability of parallel applications continues to present significant challenges to their performance and energy efficiency in high-performance computing (HPC) systems. When run times are extended and unpredictable, application developers perceive this as a degradation of system (or subsystem) performance. Extended run times directly contribute to proportionally higher energy consumption, potentially negating efforts by applications, or the HPC system, to optimize energy consumption using low-level control techniques, such as dynamic voltage and frequency scaling (DVFS). Therefore, successful systemic management of application run time performance can result in less wasted energy, or even energy savings. We have been studying run time variability in terms of communication time, from the perspective of the application, focusing on the interconnection network. More recently, our focus has shifted to developing a more complete understanding of the effects of HPC subsystem interactions on parallel applications. In this context, the set of executing applications on the HPC system is treated as a subsystem, along with more traditional subsystems like the communication subsystem, storage subsystem, etc. To gain insight into the run time variability problem, our earlier work developed a framework to emulate parallel applications (PACE) that stresses the communication subsystem. Evaluation of run time sensitivity to network performance of real applications is performed with a tool called PARSE, which uses PACE. In this paper, we propose a model defining application-level behavioral attributes, that collectively describes how applications behave in terms of their run time performance, as functions of their process distribution on the system (spacial locality), and subsystem interactions (communication subsystem degradation). These subsystem interactions are produced when multiple applications execute concurrently on the same HPC system. We also revisit our evaluation framework and tools to demonstrate the flexibility of our application characterization techniques, and the ease with which attributes can be quantified. The validity of the model is demonstrated using our tools with several parallel benchmarks and application fragments. Results suggest that it is possible to articulate application-level behavioral attributes as a tuple of numeric values that describe course-grained performance behavior. 相似文献