Similar Articles (20 results)
1.
Reverse computation is presented here as an important future direction in addressing the challenge of fault tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing with measurements such as cache miss ratios, TLB misses and memory usage indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
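To make the mechanism concrete, the following is a minimal sketch (hypothetical names, not the paper's simulator) of the forward/reverse event-handler pattern that reverse computation relies on: each event records only the few values that cannot be recomputed, and a paired reverse handler undoes the forward handler exactly, so no periodic checkpoint of the full state is needed.

```c
/* Minimal reverse-computation sketch: a tiny per-event record plus an
 * inverse handler replaces a full state checkpoint. */
#include <stdint.h>

typedef struct { double x, v; uint32_t rng_state; } particle_state;
typedef struct { double dt; uint32_t saved_rng; } move_event;

/* Forward handler: advance the particle and step the RNG. */
static void move_forward(particle_state *s, move_event *e) {
    e->saved_rng = s->rng_state;            /* tiny per-event record, not a full checkpoint */
    s->x += s->v * e->dt;                   /* arithmetically reversible update */
    s->rng_state = s->rng_state * 1664525u + 1013904223u;  /* LCG step */
}

/* Reverse handler: undo the forward handler exactly, in reverse order. */
static void move_reverse(particle_state *s, const move_event *e) {
    s->x -= s->v * e->dt;
    s->rng_state = e->saved_rng;            /* restore the one value that was overwritten */
}
```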

2.
ClustalW-MPI: ClustalW analysis using distributed and parallel computing
ClustalW is a tool for aligning multiple protein or nucleotide sequences. The alignment is achieved via three steps: pairwise alignment, guide-tree generation and progressive alignment. ClustalW-MPI is a distributed and parallel implementation of ClustalW. All three steps have been parallelized to reduce the execution time. The software uses a message-passing library called MPI (Message Passing Interface) and runs on distributed workstation clusters as well as on traditional parallel computers.
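The pairwise-alignment stage dominates the work and parallelizes naturally. Below is a minimal sketch (assumed names such as align_pair; not the ClustalW-MPI source) of how the N(N-1)/2 pairwise alignments could be distributed round-robin over MPI ranks, with the partial distance matrices combined on one rank.

```c
/* Sketch: distribute pairwise alignment tasks over MPI ranks and gather
 * the distance matrix on rank 0.  align_pair() is a hypothetical scorer. */
#include <mpi.h>
#include <stdlib.h>

extern double align_pair(int i, int j);   /* hypothetical pairwise alignment score */

void pairwise_distances(int nseq, double *dist /* nseq*nseq, significant on rank 0 */) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *local = calloc((size_t)nseq * nseq, sizeof(double));
    long pair = 0;
    for (int i = 0; i < nseq; ++i)
        for (int j = i + 1; j < nseq; ++j, ++pair)
            if (pair % size == rank)              /* round-robin work distribution */
                local[i * nseq + j] = align_pair(i, j);

    /* Sum the partial matrices; only rank 0 needs the full result. */
    MPI_Reduce(local, dist, nseq * nseq, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    free(local);
}
```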

3.
Developing an efficient parallel application is not an easy task, and achieving good performance requires a thorough understanding of the program’s behavior. Careful performance analysis and optimization are crucial. To help developers or users of these applications analyze the program’s behavior, it is necessary to provide them with an abstraction of the application’s performance. In this paper, we propose a dynamic performance abstraction technique, which enables the automated discovery of causal execution paths, composed of communication and computational activities, in MPI parallel programs. This approach enables autonomous, low-overhead execution monitoring that generates performance knowledge about application behavior for the purpose of online performance diagnosis. Our performance abstraction technique reflects the application’s behavior and is made up of elements correlated with high-level program structures, such as loops and communication operations. Moreover, it characterizes all elements with statistical execution profiles. We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance abstraction technique proved effective at capturing the program’s behavior with low overhead and facilitated performance understanding.
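Such library-level monitoring can be realized without touching the application by interposing on the message-passing layer. The sketch below is illustrative only (not the authors' tool) and uses the standard MPI profiling interface (PMPI) to intercept MPI_Send and accumulate a simple statistical profile.

```c
/* Sketch: a PMPI wrapper that counts and times MPI_Send calls and reports
 * the profile when the application finalizes MPI. */
#include <mpi.h>
#include <stdio.h>

static long   send_count   = 0;
static double send_seconds = 0.0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);  /* call the real implementation */
    send_seconds += MPI_Wtime() - t0;
    ++send_count;                                           /* build a statistical profile */
    return rc;
}

int MPI_Finalize(void) {
    fprintf(stderr, "MPI_Send: %ld calls, %.3f s total\n", send_count, send_seconds);
    return PMPI_Finalize();
}
```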

4.
In this paper, we present a fault tolerant and recovery system called FRASystem (Fault Tolerant & Recovery Agent System) that uses multiple agents in distributed computing systems. Previous rollback-recovery protocols depended on the underlying communication layer and operating system, which degraded computing performance. We propose a rollback-recovery protocol that works independently of the operating system and thereby improves portability and extensibility. We define four types of agents: (1) a recovery agent performs the rollback-recovery protocol after a failure, (2) an information agent builds domain knowledge (fault tolerance rules and information) during failure-free operation, (3) a facilitator agent controls the communication between agents, and (4) a garbage collection agent discards fault tolerance information that is no longer useful. Since agent failures may lead to inconsistent system states and a domino effect, we propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the accumulation of saved fault tolerance information in stable storage. We implemented a prototype of FRASystem using Java and CORBA and evaluated the proposed rollback-recovery protocol experimentally. The simulation results indicate that the performance of our protocol is better than that of previous rollback-recovery protocols that use independent checkpointing and pessimistic message logging without agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem is not dependent on an operating system, and (3) FRASystem provides portability and extensibility.

5.
Effective overlap of computation and communication is a well understood technique for latency hiding and can yield significant performance gains for applications on high-end computers. In this paper, we propose an instrumentation framework for message-passing systems to characterize the degree of overlap of communication with computation in the execution of parallel applications. The inability to obtain precise time-stamps for pertinent communication events is a significant problem, and is addressed by generation of minimum and maximum bounds on achieved overlap. The overlap measures can aid application developers and system designers in investigating scalability issues. The approach has been used to instrument two MPI implementations as well as the ARMCI system. The implementation resides entirely within the communication library and thus integrates well with existing approaches that operate outside the library. The utility of the framework is demonstrated by analyzing communication-computation overlap for micro-benchmarks and the NAS benchmarks, and the insights obtained are used to modify the NAS SP benchmark, resulting in improved overlap.
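As a rough illustration of how overlap can be bounded from timestamps taken around communication calls (a simplified stand-in for the paper's in-library instrumentation), the sketch below times the initiation, the independent computation, and the exposed wait of a nonblocking send; a near-zero wait indicates that most of the transfer was hidden behind the computation.

```c
/* Sketch: bound the communication-computation overlap of a nonblocking send
 * from three timestamps around MPI_Isend, the computation, and MPI_Wait. */
#include <mpi.h>
#include <stdio.h>

void send_with_overlap(const double *buf, int n, int dest, void (*compute)(void)) {
    MPI_Request req;
    double t0 = MPI_Wtime();
    MPI_Isend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
    double t1 = MPI_Wtime();             /* initiation cost */
    compute();                           /* independent computation */
    double t2 = MPI_Wtime();
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double t3 = MPI_Wtime();             /* exposed (non-overlapped) completion cost */

    printf("init %.6f s, compute %.6f s, exposed wait %.6f s\n",
           t1 - t0, t2 - t1, t3 - t2);
}
```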

6.
Tadjfar M, Himeno R. Biorheology, 2002, 39(3-4): 379-384
A parallel, time-accurate flow solver is devised to study the human cardio-vascular system. The solver is capable of dealing with moving boundaries and moving grids, and it is designed to handle complex, three-dimensional vascular systems. The computational domain is divided into multiple block subdomains. At each cross section the plane is divided into twelve sub-zones to allow flexibility in handling complex geometries and, if needed, appropriate parallel data partitioning. The unsteady, three-dimensional, incompressible Navier-Stokes equations are solved numerically using a finite volume method that is second-order accurate in time and uses third-order upwind differencing, based on pseudo-compressibility and a dual time-stepping technique. For parallel execution, the flow domain is partitioned. Communication between the subdomains of the flow on Riken's VPP/700E supercomputer is implemented using the MPI message-passing library. A series of numerical simulations of biologically relevant flows is used to validate this code.
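As an illustration of the interblock communication such a partitioned solver performs each (pseudo-)time step, the following sketch (not the authors' code; a 1-D decomposition is assumed for brevity) exchanges one layer of ghost cells with the left and right neighbor subdomains using MPI.

```c
/* Sketch: ghost-cell exchange for a 1-D partitioned field u[0..n+1],
 * where u[0] and u[n+1] are ghost cells and u[1..n] are owned cells. */
#include <mpi.h>

void exchange_ghost_cells(double *u, int n /* interior cells */, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* First interior cell goes to the left neighbor; right ghost comes from the right. */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    /* Last interior cell goes to the right neighbor; left ghost comes from the left. */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
}
```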

7.
This paper presents an architecture, implementation, and performance evaluation of an adaptive message-passing system for a heterogeneous wide-area ATM cluster that we call the Adaptive Communication System (ACS). ACS uses multithreading to provide efficient techniques for overlapping computation and communication in wide-area computing. By separating control and data activities, ACS eliminates unnecessary control transfers over the data path, which optimizes the data path and improves performance. ACS supports several different flow control, error control, and multicasting algorithms. Furthermore, ACS allows programmers to select suitable communication schemes at runtime on a per-connection basis to meet the requirements of a given application. ACS provides three application communication interfaces to support various classes of applications: the Socket Communication Interface (SCI), the ATM Communication Interface (ACI), and the High Performance Interface (HPI). The SCI is provided mainly for applications that must be portable to many different computing platforms. The ACI provides services that are compatible with ATM connection-oriented services, where each connection can be configured to meet its Quality of Service (QOS) requirements; this allows programmers to fully utilize the benefits of the ATM network. The HPI supports applications that demand low-latency and high-throughput communication services. In this interface, ACS uses read/write trap routines to reduce latency and data transfer time, and to avoid using traditional communication protocols. We analyze and compare the performance of ACS with that of other message-passing systems such as p4, PVM, and MPI in terms of point-to-point, multicasting, and application performance. The benchmarking results show that ACS outperforms the other message-passing systems and provides flexible communication services for various classes of applications.

8.
Metaheuristics are gaining increasing recognition in many research areas, computational systems biology among them. Recent advances in metaheuristics can help locate the vicinity of the global solution in reasonable computation times, with Differential Evolution (DE) being one of the most popular methods. However, for most realistic applications, DE still requires excessive computation times. With the advent of cloud computing, effortless access to large numbers of distributed resources has become more feasible, and new distributed frameworks, like Spark, have been developed to deal with large-scale computations on commodity clusters and cloud resources. In this paper we propose a parallel implementation of an enhanced DE using Spark. The proposal drastically reduces the execution time by including a selected local search and exploiting the available distributed resources. The performance of the proposal has been thoroughly assessed using challenging parameter estimation problems from the domain of computational systems biology. Two different platforms have been used for the evaluation: a local cluster and the Microsoft Azure public cloud. Additionally, it has also been compared with other parallel approaches: another cloud-based solution (a MapReduce implementation) and a traditional HPC solution (an MPI implementation).
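For reference, the core DE step that the paper parallelizes looks roughly like the sketch below: DE/rand/1 mutation, binomial crossover, and greedy selection for one target vector. Names and control parameters (F, CR, the objective function) are illustrative assumptions, not the authors' settings.

```c
/* Sketch: one DE/rand/1/bin update of population member i. */
#include <stdlib.h>

double objective(const double *x, int d);   /* hypothetical cost function to minimize */

void de_step(double **pop, int np, int d, int i, double F, double CR) {
    /* pick three distinct random individuals, all different from i */
    int r1, r2, r3;
    do { r1 = rand() % np; } while (r1 == i);
    do { r2 = rand() % np; } while (r2 == i || r2 == r1);
    do { r3 = rand() % np; } while (r3 == i || r3 == r1 || r3 == r2);

    double trial[64];                        /* sketch assumes dimension d <= 64 */
    int jrand = rand() % d;                  /* force at least one mutated component */
    for (int j = 0; j < d; ++j) {
        double mutant = pop[r1][j] + F * (pop[r2][j] - pop[r3][j]);   /* mutation */
        trial[j] = (j == jrand || (double)rand() / RAND_MAX < CR)     /* binomial crossover */
                   ? mutant : pop[i][j];
    }
    if (objective(trial, d) <= objective(pop[i], d))                  /* greedy selection */
        for (int j = 0; j < d; ++j) pop[i][j] = trial[j];
}
```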

9.
MPI collective communication operations that distribute or gather data are used in many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual-Xeon cluster and a Beowulf cluster with single-processor nodes, respectively. A significant performance improvement can also be obtained on a Cray T3E by a careful selection of the processor structure. The use of these optimized communication operations can significantly reduce the execution time of data-parallel implementations of complex application programs without requiring any other change to the computation and communication structure. We present runtime functions for modeling two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.
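A minimal sketch of a two-phase realization over an orthogonal (row x column) processor structure is shown below, assuming the global root is rank 0: the data is first broadcast along one row and then down every column, replacing a single flat MPI_Bcast.

```c
/* Sketch: two-phase broadcast over a rows x cols processor grid.
 * Assumes the global root is rank 0 and the process count is rows * cols. */
#include <mpi.h>

void bcast_two_phase(void *buf, int count, MPI_Datatype type, int cols) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int row = rank / cols, col = rank % cols;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  /* ranks sharing a row */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  /* ranks sharing a column */

    if (row == 0)                                   /* phase 1: along row 0, root has col 0 */
        MPI_Bcast(buf, count, type, 0, row_comm);
    MPI_Bcast(buf, count, type, 0, col_comm);       /* phase 2: row-0 member of each column is root */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```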

10.
This paper describes the design of a fault-tolerant classification system for medical applications. The design process follows the systems engineering methodology: in the agreement phase, we make the case for fault tolerance in diagnosis systems for biomedical applications. The argument extends the idea that machine diagnosis systems mimic the functionality of human decision-making, but in many cases they do not achieve the fault tolerance of the human brain. After making the case for fault tolerance, both requirements and specification for the fault-tolerant system are introduced before the implementation is discussed. The system is tested with fault and use cases to build up trust in the implemented system. This structured approach aided in the realisation of the fault-tolerant classification system. During the specification phase, we produced a formal model that enabled us to discuss what fault tolerance, reliability and safety mean for this particular classification system. Furthermore, such a formal basis for discussion is extremely useful during the initial stages of the design, because it helps to avoid big mistakes caused by a lack of overview later on in the project. During the implementation, we practiced component reuse by incorporating a reliable classification block, which was developed during a previous project, into the current design. Using a well-structured approach and practicing component reuse we follow best practice for both research and industry projects, which enabled us to realise the fault-tolerant classification system on time and within budget. This system can serve in a wide range of future health care systems.

11.
In this paper we present the design and implementation of a Pluggable Fault-Tolerant CORBA Infrastructure that provides fault tolerance for CORBA applications by utilizing the pluggable protocols framework that most CORBA ORBs provide. Our approach does not require any modification to the CORBA ORB and requires only minimal modification to the application. Moreover, it avoids the difficulty of retrieving and assigning ORB state that arises when the fault tolerance mechanisms are embedded in the ORB. The Pluggable Fault-Tolerant CORBA Infrastructure exhibits similar or better performance than other Fault-Tolerant CORBA systems, while providing strong replica consistency.

12.
Event traces are helpful in understanding the performance behavior of parallel applications since they allow in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective, because inaccurate relative event timings may misrepresent the logical event order, lead to errors when quantifying the impact of certain behaviors, or confuse the users of time-line visualization tools by showing messages flowing backward in time. In our earlier work, we developed a scalable algorithm called the controlled logical clock that eliminates inconsistent inter-process timings postmortem in traces of pure MPI applications, potentially running on large processor configurations. In this paper, we first demonstrate that our algorithm also proves beneficial in computational grids, where a single application is executed using the combined computational power of several geographically dispersed clusters. Second, we present an extended version of the algorithm that, in addition to message-passing event semantics, also preserves and restores shared-memory event semantics, enabling the correction of traces from hybrid applications.
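The underlying repair rule can be sketched as follows (a simplified illustration of the basic clock condition, not the full controlled-logical-clock algorithm, and with hypothetical helper names): a receive event must not appear earlier than its matching send plus a minimum message latency, and any correction is propagated to the later events of the same process.

```c
/* Sketch: postmortem enforcement of the clock condition on one process trace. */
typedef struct {
    double time;          /* recorded timestamp */
    int    is_recv;       /* nonzero for receive events */
    long   matching_send; /* identifier of the matching send event (hypothetical) */
} event_t;

/* Looks up the (already corrected) timestamp of the matching send event. */
extern double send_time(long matching_send);

void correct_trace(event_t *events, long n, double min_latency) {
    double shift = 0.0;
    for (long k = 0; k < n; ++k) {
        events[k].time += shift;                       /* propagate earlier corrections */
        if (events[k].is_recv) {
            double earliest = send_time(events[k].matching_send) + min_latency;
            if (events[k].time < earliest) {           /* clock condition violated */
                shift += earliest - events[k].time;    /* shift this and later events forward */
                events[k].time = earliest;
            }
        }
    }
}
```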

13.
Artificial neural networks (ANNs) are powerful computational tools that are designed to replicate the human brain and are adopted to solve a variety of problems in many different fields. Fault tolerance (FT), an important property of ANNs, ensures their reliability when significant portions of a network are lost. In this paper, a fault/noise injection-based (FIB) genetic algorithm (GA) is proposed to construct fault-tolerant ANNs. The FT performance of the FIB-GA was compared with that of a common genetic algorithm, the back-propagation algorithm, and the modification-of-weights algorithm. The FIB-GA showed a slower fitting speed when solving the exclusive OR (XOR) problem and the overlapping classification problem, but it significantly reduced the errors in cases of single or multiple faults in ANN weights or nodes. Further analysis revealed that the fitted weights showed no correlation with the fitting errors in the ANNs constructed with the FIB-GA, suggesting a relatively even distribution of the various fitting parameters. In contrast, the output weights of ANNs trained with the other three algorithms demonstrated a positive correlation with the errors. Our findings therefore indicate that combining the fault/noise injection-based method with a GA is capable of introducing FT to ANNs, and imply that such distributed ANNs demonstrate superior FT performance.
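As a rough illustration of the fault/noise injection idea (hypothetical network layout and names; not the authors' implementation), the sketch below evaluates the fitness of a 2-2-1 XOR network as the error averaged over trials in which one randomly chosen weight is zeroed, so a GA minimizing this fitness is steered toward fault-tolerant weight configurations.

```c
/* Sketch: fault-injected fitness of a 2-2-1 XOR network (lower is better). */
#include <math.h>
#include <stdlib.h>

#define NW 9  /* 4 hidden weights + 2 hidden biases + 2 output weights + 1 output bias */

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* Forward pass of the 2-2-1 feed-forward network. */
static double forward(const double *w, double x0, double x1) {
    double h0 = sigmoid(w[0] * x0 + w[1] * x1 + w[2]);
    double h1 = sigmoid(w[3] * x0 + w[4] * x1 + w[5]);
    return sigmoid(w[6] * h0 + w[7] * h1 + w[8]);
}

double fib_fitness(const double *w, int trials) {
    static const double X[4][2] = {{0,0},{0,1},{1,0},{1,1}};
    static const double Y[4]    = {0,1,1,0};
    double err = 0.0;
    for (int t = 0; t < trials; ++t) {
        double wf[NW];
        for (int i = 0; i < NW; ++i) wf[i] = w[i];
        wf[rand() % NW] = 0.0;                 /* inject a single-weight fault */
        for (int p = 0; p < 4; ++p) {
            double d = forward(wf, X[p][0], X[p][1]) - Y[p];
            err += d * d;                      /* squared error under the injected fault */
        }
    }
    return err / trials;                       /* average over fault-injection trials */
}
```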

14.
Event Services in High Performance Systems
The Internet and the Grid are changing the face of high performance computing. Rather than tightly-coupled SPMD-style components running in a single cluster, on a parallel machine, or even on the Internet programmed in MPI, applications are evolving into sets of cooperating components scattered across diverse computational elements. These components may run on different operating systems and hardware platforms and may be written by different organizations in different languages. Complete applications are constructed by assembling these components in a plug-and-play fashion. This new vision for high performance computing demands features and characteristics not easily provided by traditional high-performance communications middleware. In response to these needs, we have developed ECho, a high-performance event-delivery middleware that meets the new demands of the Grid environment. ECho provides efficient binary transmission of event data with unique features that support data-type discovery and enterprise-scale application evolution. We present measurements detailing ECho's performance to show that ECho significantly outperforms other systems intended to provide this functionality and provides throughput and latency comparable to the most efficient middleware infrastructures available.

15.
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory and demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves a geometric-mean speedup of 2.3X and 2.1X, respectively, over the best of these MPI implementations. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
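A minimal sketch of the node-aware structure behind such a collective is shown below, using MPI communicators as a stand-in for the paper's direct shared-memory buffers: reduce within each node, allreduce across one leader per node, then broadcast the result inside each node.

```c
/* Sketch: hierarchical (node-aware) allreduce built from standard MPI calls. */
#include <mpi.h>

void hierarchical_allreduce(const double *in, double *out, int n) {
    MPI_Comm node_comm, leader_comm;
    int node_rank;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);            /* ranks sharing a node */
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   0, &leader_comm);                            /* one leader per node */

    MPI_Reduce(in, out, n, MPI_DOUBLE, MPI_SUM, 0, node_comm);  /* intranode phase */
    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, out, n, MPI_DOUBLE, MPI_SUM, leader_comm);
    MPI_Bcast(out, n, MPI_DOUBLE, 0, node_comm);                /* fan out on the node */

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```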

16.
This paper describes an efficient implementation of one-sided communication on top of the GM low-level message-passing library for clusters with Myrinet. The approach is compatible with shared memory and exploits pipelining, nonblocking communication, and the overlap of memory registration with memory copying to maximize the transfer rate. The paper addresses critical design issues faced on commodity clusters and then describes possible solutions for matching the low-level network protocol with user-level interfaces. The performance implications of the design decisions are presented and discussed in the context of a standalone communication benchmark as well as two applications. Finally, the paper offers some indication of what additional features would be desirable in a communication library like GM to better support one-sided communication.
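The GM-level details are not given in the abstract; purely as an illustration of the one-sided model being implemented, the sketch below expresses the equivalent operation with standard MPI RMA: memory is registered (exposed) once as a window, and the origin writes into the target's window without the target posting a receive.

```c
/* Sketch: one-sided put using MPI RMA as an illustration of the model. */
#include <mpi.h>
#include <stdlib.h>

void one_sided_demo(int n, int target) {
    double *expose = malloc(n * sizeof(double));   /* memory exposed to remote puts */
    double *local  = malloc(n * sizeof(double));   /* data to publish (fill as needed) */
    MPI_Win win;
    MPI_Win_create(expose, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);   /* register memory once */

    MPI_Win_fence(0, win);                                 /* open an access epoch */
    MPI_Put(local, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                                 /* remote data now visible */

    MPI_Win_free(&win);
    free(expose);
    free(local);
}
```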

17.
Several systems have been presented in recent years to manage the complexity of large microarray experiments. Although good results have been achieved, most systems fall short in one or more areas. A Grid-based approach may provide a shared, standardized and reliable solution for the storage and analysis of biological data, in order to maximize the results of experimental efforts. A Grid framework has therefore been adopted because of the need to remotely access large amounts of distributed data as well as to scale computational performance to terabyte datasets. Two different biological studies have been planned in order to highlight the benefits that can emerge from our Grid-based platform. The described environment relies on storage and computational services provided by the gLite Grid middleware. The Grid environment is also able to exploit the added value of metadata in order to let users better classify and search experiments. A state-of-the-art Grid portal has been implemented in order to hide the complexity of the framework from end users and to let them easily access the available services and data; the functional architecture of the portal is described. As a first test of the system performance, a gene expression analysis has been performed on an Affymetrix GeneChip Rat Expression Array RAE230A dataset from the ArrayExpress database. The analysis consists of three steps: (i) group opening and image set uploading, (ii) normalization, and (iii) model-based gene expression estimation (based on the PM/MM difference model). Two different Linux versions (sequential and parallel) of the dChip software have been developed to implement the analysis and have been tested on a cluster. The results show that parallelizing the analysis and executing parallel jobs on distributed computational resources does improve performance. Moreover, the Grid environment has been tested both for uploading and accessing distributed datasets through the Grid middleware and for its ability to manage the execution of jobs on distributed computational resources. Results from the Grid tests will be discussed in a further paper.

18.
Ciobanu G. BioSystems, 2003, 70(2): 123-133
This paper presents fundamental distributed algorithms over membrane systems with antiport carriers. We describe distributed algorithms for collecting and dispersing information, leader election in these systems, and the mutual exclusion problem. Finally, we consider membrane systems that produce correct results despite failures of some components or communication links. We show that membrane systems with antiport carriers provide an appropriate model for distributed computing, particularly for message-passing algorithms, which are interpreted here as membrane transport in both directions: two chemicals behave as input and output messages and pass through the membranes in both directions using antiport carriers.

19.
We describe our experience of designing, implementing, and evaluating two generations of high performance communication libraries, Fast Messages (FM) for Myrinet. In FM 1, we designed a simple interface and provided guarantees of reliable, in-order delivery and flow control. While this was a significant improvement over previous systems, it was not enough: layering MPI atop FM 1 showed that only about 35% of the FM 1 bandwidth could be delivered to higher-level communication APIs. Our second-generation communication layer, FM 2, addresses the identified problems, providing gather-scatter, interlayer scheduling, and receiver flow control, as well as some convenient API features that simplify programming. FM 2 can deliver 55–95% of its bandwidth to higher-level APIs such as MPI. This is especially impressive as the absolute bandwidths delivered have increased over fourfold, to 90 MB/s. We describe general issues encountered in matching two communication layers and our solutions as embodied in FM 2.

20.
The study of desiccation tolerance in bryophytes avoids the complications of higher-plant vascular systems and complex leaf structures, but remains a multifaceted problem. Some of the pertinent questions have at least partial analogues in seed biology – events during a drying-rewetting cycle with processes in seed maturation and germination, and the gradual loss of viability on prolonged desiccation, and the relation of this to intensity of desiccation and temperature, with parallel questions in seed storage. Past research on bryophyte desiccation tolerance is briefly reviewed. Evidence is presented from chlorophyll-fluorescence measurements and experiments with metabolic inhibitors that recovery of photosynthesis in bryophytes following desiccation depends mainly on rapid reactivation of pre-existing structures and involves only limited de novo protein synthesis. Following initial recovery, protein synthesis is demonstrably essential to the maintenance of photosynthetic function in the light, but the rate of maintenance turnover in the dark appears to be slow. Factors leading to long-term desiccation damage are diverse; indications are that desiccation-tolerant species often survive best in the range –100 to –200 MPa.
