Similar Literature
20 similar documents found.
1.
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory and demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and long vectors. We compare our shared-memory allreduce with several MPI implementations—Open MPI, MPICH2, and MVAPICH2—that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and an 8-node Opteron cluster, our implementation achieves geometric-mean speedups of 2.3X and 2.1X over the best MPI implementation, respectively. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
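For illustration only (this is not code from the paper), a minimal C/MPI allreduce call; the shared-memory algorithms discussed above would sit behind this same interface:

    /* Minimal MPI allreduce: every rank contributes one value and
     * receives the global sum. Compile with an MPI compiler (mpicc). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank;   /* each rank's contribution */
        double sum = 0.0;

        /* Sum across all ranks; the result is delivered to every rank. */
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("sum = %f\n", sum);
        MPI_Finalize();
        return 0;
    }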

2.
The performance and scalability of communications are key for high performance computing (HPC) applications in the current multi-core era. Despite the significant benefits (e.g., productivity, portability, multithreading) of Java for parallel programming, its poor communications support has hindered its adoption in the HPC community. This paper presents FastMPJ, an efficient message-passing in Java (MPJ) library, boosting Java for HPC by: (1) providing high-performance shared memory communications using Java threads; (2) taking full advantage of high-speed cluster networks (e.g., InfiniBand) to provide low-latency and high-bandwidth communications; (3) including a scalable collective library with topology-aware primitives, automatically selected at runtime; (4) avoiding Java data buffering overheads through zero-copy protocols; and (5) implementing the most widely adopted MPI-like Java bindings for highly productive development. A comprehensive performance evaluation on representative testbeds (InfiniBand, 10 Gigabit Ethernet, Myrinet, and shared memory systems) has shown that FastMPJ communication primitives rival native MPI implementations, significantly improving the efficiency and scalability of Java HPC parallel applications.

3.
Over the last several years, many sequence alignment tools have appeared and become popular, driven by the fast evolution of next-generation sequencing technologies. Researchers who use such tools are interested in getting maximum performance when they execute them on modern infrastructures. Today’s NUMA (Non-uniform memory access) architectures present major challenges in getting such applications to achieve good scalability as more processors/cores are used. The memory system in NUMA machines is highly complex and may be the main cause of an application’s performance loss. The existence of several memory banks in NUMA systems implies an increase in latency for accesses from a given processor to a remote bank. This phenomenon is usually attenuated by applying strategies that increase the locality of memory accesses. However, NUMA systems may also suffer from contention problems that occur when concurrent accesses are concentrated on a small number of banks. Sequence alignment tools use large data structures to hold the reference genomes to which all reads are aligned. Therefore, these tools are very sensitive to performance problems related to the memory system. The main goal of this study is to explore the trade-offs between data locality and data dispersion in NUMA systems. We have performed experiments with several popular sequence alignment tools on two widely available NUMA systems to assess the performance of different memory allocation policies and data partitioning strategies. We find that no single method is best in all cases. However, we conclude that memory interleaving is the memory allocation strategy that provides the best performance when a large number of processors and memory banks are used. For data partitioning, the best results are usually obtained with a larger number of partitions, sometimes combined with an interleave policy.
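As a minimal sketch of the interleaving policy discussed above (illustration only, not taken from any of the evaluated tools; assumes Linux with libnuma, link with -lnuma):

    /* Allocate a large shared index with pages spread round-robin across
     * all NUMA memory banks, trading locality for reduced contention. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }
        size_t bytes = 1UL << 30;          /* e.g., a 1 GiB reference index */
        char *index = numa_alloc_interleaved(bytes);
        if (!index) { perror("numa_alloc_interleaved"); return 1; }

        /* ... build the index and align reads against it here ... */

        numa_free(index, bytes);
        return 0;
    }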

4.
A cost-effective secondary storage architecture for parallel computers is to distribute storage across all processors, which then engage in either computation or I/O, depending on the demands of the moment. A difficulty associated with this architecture is that access to storage on another processor typically requires the cooperation of that processor, which can be hard to arrange if the processor is engaged in other computation. One partial solution to this problem is to require that remote I/O operations occur only via collective calls. In this paper, we describe an alternative approach based on the use of single-sided communication operations such as Active Messages. We present an implementation of this basic approach called Distant I/O and present experimental results that quantify the low-level performance of DIO mechanisms. This technique is exploited to support a non-collective parallel shared-file model for a large out-of-core scientific application with very high I/O bandwidth requirements. The achieved performance exceeds by a wide margin the performance of a well-equipped PIOFS parallel filesystem on the IBM SP.
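As a rough analogue (not the paper's Distant I/O API, which builds on Active Messages), MPI one-sided communication shows the same idea: one process deposits data into another process's memory without the target posting a matching operation:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int n = 1024;
        double *buf = calloc(n, sizeof(double));
        MPI_Win win;
        /* Expose each rank's buffer as a window for one-sided access. */
        MPI_Win_create(buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0 && size > 1) {
            double block[256] = {0};
            /* Write 256 doubles into rank 1's window; rank 1 does not
             * actively cooperate in this transfer. */
            MPI_Put(block, 256, MPI_DOUBLE, 1, 0, 256, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        free(buf);
        MPI_Finalize();
        return 0;
    }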

5.
Data Grids provide an environment for huge, data-intensive applications that produce and process enormous amounts of data. Such environments therefore have to manage data and schedule jobs at the same time, and these two important operations have to be tightly coupled to achieve the best results. Replication techniques are widely used to increase the availability of data, improving query latency and load balancing in Data Grids. Effective resource scheduling is also a challenging research issue. In this paper we propose a job scheduling policy, called Parallel Job Scheduling (PJS), and a dynamic data replication strategy, called Threshold-based Dynamic Data Replication (TDDR), to improve data access efficiency in a hierarchical Data Grid. PJS uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers network characteristics, the number of jobs waiting in the queue, file locations, and the disk read speed of the storage drives at data sources. The main idea of the TDDR strategy is to use a threshold value to determine whether the requested replica needs to be copied to the node. TDDR determines this threshold dynamically based on data request arrival rates and available storage capacities. Then, to overcome the problem of limited storage space at each node, we design an efficient replica replacement strategy, developed as a two-stage process. First, it deletes the files with the minimum transfer time. Second, if space is still insufficient, it considers the last time the replica was requested, the number of accesses, the size of the replica, and the file transfer time. Results from the simulation show that our proposed algorithms perform better than other algorithms in terms of Mean Job Time, Number of Intercommunications, Number of Replications, Computing Resource Usage, and Effective Network Usage.
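A minimal sketch of the threshold idea in C (the names and the threshold formula are hypothetical illustrations, not the actual TDDR rule, which derives its threshold from request arrival rates and available storage):

    /* Decide whether a requested file should be replicated locally. */
    #include <stdbool.h>

    typedef struct {
        double request_rate;    /* recent requests per second for the file */
        double free_storage;    /* bytes free at this node                 */
        double total_storage;   /* bytes total at this node                */
    } node_state_t;

    static double dynamic_threshold(const node_state_t *s) {
        /* Assumption for this sketch: the more free space, the lower the
         * bar for creating a replica. */
        double free_ratio = s->free_storage / s->total_storage;
        return 1.0 / (0.1 + free_ratio);   /* requests per second */
    }

    bool should_replicate(const node_state_t *s) {
        return s->request_rate >= dynamic_threshold(s);
    }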

6.
MPI collective communication operations to distribute or gather data are used in many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and a Beowulf cluster with single-processor nodes. A significant performance improvement can also be obtained on a Cray T3E by a careful selection of the processor structure. The use of these optimized communication operations can reduce the execution time of data-parallel implementations of complex application programs significantly without requiring any other change to the computation and communication structure. We present runtime functions for modeling two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.
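A rough two-phase sketch of the orthogonal-structure idea (illustration only, not the authors' implementation): arrange the p ranks as a logical grid with MPI_Comm_split, broadcast along the root's column first and then along each row:

    #include <mpi.h>

    /* Two-phase broadcast over a logical grid with `cols` columns.
     * Assumes the communicator size is an exact multiple of cols and
     * that ranks are numbered row-major. */
    void grid_bcast(void *buf, int count, MPI_Datatype type,
                    int root, MPI_Comm comm, int cols) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int my_row = rank / cols, my_col = rank % cols;
        int root_row = root / cols, root_col = root % cols;

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(comm, my_row, my_col, &row_comm);
        MPI_Comm_split(comm, my_col, my_row, &col_comm);

        /* Phase 1: broadcast within the root's column (one rank per row). */
        if (my_col == root_col)
            MPI_Bcast(buf, count, type, root_row, col_comm);
        /* Phase 2: each row's leader broadcasts along its row. */
        MPI_Bcast(buf, count, type, root_col, row_comm);

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }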

7.
Reverse computation is presented here as an important future direction in addressing the challenge of fault-tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing with measurements such as cache miss ratios, TLB misses and memory usage indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
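For intuition, a toy forward/reverse event pair in C (not from the paper): each event carries an inverse that exactly undoes it, so rollback is recomputation in reverse rather than restoring a checkpoint:

    /* Toy reverse-computation handlers for a gas-cell update. The reverse
     * handler applies the inverse operations in the opposite order, so no
     * state snapshot is needed (floating-point rounding ignored here). */
    typedef struct {
        double energy;
        long   collisions;
    } cell_state_t;

    void collide_forward(cell_state_t *s, double delta_e) {
        s->energy     += delta_e;   /* constructive, invertible update */
        s->collisions += 1;
    }

    void collide_reverse(cell_state_t *s, double delta_e) {
        s->collisions -= 1;
        s->energy     -= delta_e;
    }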

8.
Performance analysis of MPI collective operations
Previous studies of application usage show that the performance of collective communications is critical for high-performance computing. Despite active research in the field, a solution to the collective communication optimization problem that is both general and feasible is still missing. In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP, to collective operations. We compare the predictions from the models against experimentally gathered data and, using these results, construct an optimal decision function for the broadcast collective. We quantitatively compare the quality of the model-based decision functions to the experimentally optimal one. Additionally, we introduce a new form of optimized tree-based broadcast algorithm, splitted-binary. Our results show that all of the models can provide useful insights into various aspects of the different algorithms as well as their relative performance. Still, based on our findings, we believe that complete reliance on models would not yield optimal results. In addition, our experimental results identify the gap parameter as the most critical for accurate modeling of both the classical point-to-point-based pipeline and our extensions to fan-out topologies.
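As an illustration of the modeling approach, Hockney's alpha-beta model applied to a binomial-tree broadcast (parameter values below are placeholders, not measurements from the paper; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Hockney model: a point-to-point message of m bytes costs
     * alpha + beta*m; a binomial-tree broadcast over p ranks needs
     * ceil(log2(p)) such rounds. */
    double bcast_time_hockney(int p, double m_bytes,
                              double alpha, double beta) {
        return ceil(log2((double)p)) * (alpha + beta * m_bytes);
    }

    int main(void) {
        /* Placeholder parameters: 5 us latency, ~1 GB/s bandwidth. */
        printf("%g s\n", bcast_time_hockney(64, 1 << 20, 5e-6, 1e-9));
        return 0;
    }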

9.
Nowadays, the evolution of information technologies requires fast similarity search tools for analyzing new data types such as audio, video, or images. The usual search by keys or records is not possible, and searching these databases is a compute-intensive problem. For this reason, compute-intensive coprocessors (mainly NVIDIA GPUs) have been studied in recent years as a tool for accelerating sequential processing algorithms. In this work, we implement kNN and range queries on the recently launched Intel Xeon Phi coprocessor. We developed exhaustive and also indexing algorithms using the LC index. This index has been widely studied in sequential computing to accelerate similarity search on multimedia databases. We implement and compare different exhaustive and indexing versions, showing some key factors on the Xeon Phi for dealing with this type of search. For the indexing algorithms, we used a strategy based on cluster distribution among cores (LC MIC Dist-C), obtaining up to 168× speed-up over the sequential exhaustive algorithm. Our algorithms using exhaustive strategies on the Xeon Phi for range queries achieve up to 22× speed-up over the sequential counterpart, compared to the 12× of a 20-core machine, and a similar advantage is achieved for kNN queries. Compared with GPUs, we obtain higher performance with our indexing algorithms on the Intel Xeon Phi. However, the GPU works faster with memory-aligned-access exhaustive algorithms. Our exhaustive approaches on the Xeon Phi can be used on a wide class of databases, for example, non-metric spaces. Finally, we extend our algorithms to be used with large databases that do not fit in the coprocessor memory, showing good scalability with the number of elements.
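A minimal exhaustive range query in C with OpenMP (a brute-force baseline for illustration; the paper's versions are specialized for the Xeon Phi and the LC index):

    #include <math.h>
    #include <stddef.h>

    /* Mark in out[i] every database vector whose Euclidean distance to
     * the query is at most radius. Compile with OpenMP enabled. */
    void range_query(const float *db, size_t n, size_t dim,
                     const float *query, float radius, char *out) {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++) {
            float d2 = 0.0f;
            for (size_t j = 0; j < dim; j++) {
                float diff = db[i * dim + j] - query[j];
                d2 += diff * diff;
            }
            out[i] = (sqrtf(d2) <= radius);
        }
    }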

10.
Clusters of Symmetric Multiprocessors (SMPs) are more commonplace than ever for achieving high performance. Scientific applications running on clusters employ collective communications extensively. Shared-memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches to addressing the increasing demand on intra-node and inter-node communications, and thereby to boosting the performance of collectives in emerging multi-core SMP clusters. In this regard, this paper designs and evaluates two classes of collective communication algorithms directly at the Elan user level over multi-rail Quadrics QsNetII with message striping: 1) RDMA-based traditional multi-port algorithms for gather, all-gather, and all-to-all collectives for medium to large messages, and 2) RDMA-based and SMP-aware multi-port all-gather algorithms for small to medium size messages. The multi-port RDMA-based Direct algorithms for the gather and all-to-all collectives gain an improvement of up to 2.15 for 4 KB messages over elan_gather(), and up to 2.26 for 2 KB messages over elan_alltoall(), respectively. For the all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms, including elan_gather(), for 512 B to 8 KB messages, with a 1.96 improvement factor for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB and outperforms elan_gather() by a factor of 1.49 for 32 KB messages. Experimentation with real applications has shown that up to a 1.47 communication speedup can be achieved using the proposed all-gather algorithms.

11.
Program development environments have enabled graphics processing units (GPUs) to become an attractive high-performance computing platform for the scientific community. A commonly posed problem in computational biology is searching protein databases for functional similarities. The most accurate algorithm for sequence alignment is Smith-Waterman (SW). However, due to its computational complexity and rapidly increasing database sizes, the process becomes more and more time consuming, making cluster-based systems more desirable. Therefore, scalable and highly parallel methods are necessary to make SW a viable solution for life science researchers. In this paper we evaluate how SW fits onto the target GPU architecture by exploring ways to map the program architecture onto the processor architecture. We develop new techniques to reduce the memory footprint of the application while exploiting the memory hierarchy of the GPU. With this implementation, GSW, we overcome the on-chip memory size constraint, achieving 23× speedup compared to a serial implementation. Results show that as the query length increases, our speedup remains nearly stable, indicating the solid scalability of our approach. Additionally, this is a first-of-its-kind implementation that runs purely on the GPU rather than in an integrated CPU-GPU environment, making our design suitable for porting onto a cluster of GPUs.
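For reference, the Smith-Waterman recurrence that GSW parallelizes; this serial C sketch (linear gap penalty, illustrative scoring values) makes visible the left/upper/diagonal dependences that the GPU mapping has to respect:

    #include <stdlib.h>
    #include <string.h>

    static int max4(int a, int b, int c, int d) {
        int m = a > b ? a : b;
        m = m > c ? m : c;
        return m > d ? m : d;
    }

    /* Best local-alignment score between sequences a and b
     * (match +2, mismatch -1, gap -1; values are illustrative). */
    int smith_waterman(const char *a, const char *b) {
        int la = (int)strlen(a), lb = (int)strlen(b);
        int *prev = calloc(lb + 1, sizeof(int));   /* row i-1 */
        int *curr = calloc(lb + 1, sizeof(int));   /* row i   */
        int best = 0;
        for (int i = 1; i <= la; i++) {
            for (int j = 1; j <= lb; j++) {
                int s = (a[i - 1] == b[j - 1]) ? 2 : -1;
                curr[j] = max4(0, prev[j - 1] + s,
                               prev[j] - 1, curr[j - 1] - 1);
                if (curr[j] > best) best = curr[j];
            }
            int *tmp = prev; prev = curr; curr = tmp;
            memset(curr, 0, (size_t)(lb + 1) * sizeof(int));
        }
        free(prev); free(curr);
        return best;
    }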

12.
An algorithm is described which allows Nonequilibrium Molecular Dynamics (NEMD) simulations of a fluid undergoing planar Couette flow (shear flow) to be carried out on a distributed-memory parallel processor using a (spatial) domain decomposition technique. Unlike previous algorithms, this algorithm uses a co-moving, or Lagrangian, simulation box. Also, the shape of the simulation box changes throughout the course of the simulation. The algorithm, which can be used for two- or three-dimensional systems, has been tested on a Fujitsu AP1000 parallel computer with 128 processors.

13.
This paper presents a general methodology for the communication-efficient parallelization of graph algorithms using the divide-and-conquer approach and shows that this class of problems can be solved in cluster environments with good communication efficiency. Specifically, the first practical parallel algorithm, based on a general coarse-grained model, for finding Hamiltonian paths in tournaments is presented. On any such parallel machine, this algorithm uses only 3 log p + 1 communication rounds, where p is the number of processors, a count that is independent of the tournament size, and it can reuse the existing linear-time algorithm in the sequential setting. For theoretical completeness, the algorithm is revised for fine-grained models, where the ratio of computation to communication throughput is low or the local memory size of each individual processor is extremely limited; in that setting the problem is solved with O(log p) communication rounds for any ∊ > 0, while the hidden constant grows with the scalability factor 1/∊. Experiments have been carried out on a Linux cluster of 32 Sun Ultra5 computers and an SGI Origin 2000 with 32 R10000 processors. The algorithm's performance on the Linux cluster reaches 75% of the performance on the SGI Origin 2000 when the tournament size is about one million. Computational resources and technical support were provided by the Center for Computational Research (CCR) at the State University of New York at Buffalo. Chun-Hsi Huang received his Ph.D. degree in Computer Science from the State University of New York at Buffalo in 2001. He is currently an Assistant Professor of Computer Science and Engineering at the University of Connecticut. His interests include High Performance Parallel Computing, Cluster and Grid Computing, Biomedical and Health Informatics, Algorithm Design and Analysis, Experimental Algorithms, and Computational Biology. Sanguthevar Rajasekaran received his Ph.D. degree in Computer Science from Harvard University in 1988. Currently he is the UTC Chair Professor of Computer Science and Engineering at the University of Connecticut and the Director of the Booth Engineering Center for Advanced Technologies (BECAT). His research interests include Parallel Algorithms, Bioinformatics, Data Mining, Randomized Computing, Computer Simulations, and Combinatorial Optimization. Laurence Tianruo Yang received his Ph.D. degree in Computer Science from Oxford University. He is currently a professor of Computer Science at St. Francis Xavier University in Canada. His research interests include high-performance computing, embedded systems, computer architecture, and high-speed networking. Xin He received his Ph.D. degree in Computer Science from the Ohio State University in 1987. He is currently Professor of Computer Science and Engineering at the State University of New York at Buffalo. His research interests include Algorithms, Data Structures, Combinatorics, and Computational Geometry.
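For context, the classical sequential insertion argument that such a parallel algorithm can reuse locally: every tournament contains a Hamiltonian path, found by inserting vertices one at a time (an illustrative O(n^2) C sketch, not the paper's coarse-grained algorithm):

    /* beats[i*n + j] != 0 means vertex i beats vertex j (edge i -> j).
     * Fills path[0..n-1] with a Hamiltonian path of the tournament. */
    void tournament_ham_path(int n, const int *beats, int *path) {
        int len = 0;
        for (int v = 0; v < n; v++) {
            /* Insert v before the first path vertex that v beats;
             * if v beats none of them, append v at the end. */
            int pos = len;
            for (int k = 0; k < len; k++) {
                if (beats[v * n + path[k]]) { pos = k; break; }
            }
            for (int k = len; k > pos; k--) path[k] = path[k - 1];
            path[pos] = v;
            len++;
        }
    }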

14.
We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches. Our goal is to enhance the flexibility of the SPMD-style programming model and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. In this paper we assume perfect intrinsic load balance and focus on extrinsic imbalances caused by OS noise, multiprogramming and mismatches of threads to hardware parallelism. Juggle shows performance improvements of up to 80% over static load balancing for oversubscribed UPC, OpenMP, and pthreads benchmarks. We also show that Juggle is effective in unpredictable, multiprogrammed environments, with up to a 50% performance improvement over the Linux load balancer and a 25% reduction in performance variation. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems.
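A minimal sketch of the user-space mechanism such a balancer relies on: migrating a thread by changing its CPU affinity (Linux/GNU extension; this is an illustration, not Juggle's code):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Restrict the given thread to core `cpu`; a proactive balancer would
     * call something like this periodically based on observed load. */
    int migrate_thread(pthread_t thread, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        int rc = pthread_setaffinity_np(thread, sizeof(set), &set);
        if (rc != 0)
            fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return rc;
    }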

15.
File systems and databases usually make several synchronous disk write accesses in order to make sure that the disk always has a consistent view of their data, so that it can be recovered in the case of a system crash. Since synchronous disk operations are slow, some systems choose to employ asynchronous disk write operations that improve performance at the cost of low reliability: in case of a system crash all data that have not yet been written to disk are lost. In this paper we describe a software-based Non-Volatile RAM system that achieves the high performance of asynchronous write operations without sacrificing the reliability of synchronous write operations. Our system takes a set of volatile main memories residing in independent workstations and transforms it into a non-volatile memory buffer – much like RAIDs do with magnetic disks. It then uses this non-volatile buffer as an intermediate storage space in order to acknowledge synchronous write operations before actually writing the data to magnetic disk, but after writing the data to (intermediate) stable storage. We demonstrate the performance advantages of our system using both simulation and experimental evaluation.
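The underlying trade-off in plain POSIX I/O (illustration only): a synchronous update is a write() made durable with fsync(), and that is the latency the proposed buffer hides by acknowledging the write once the data reaches the intermediate stable storage:

    #include <unistd.h>

    /* Durable (synchronous) update: the data survives a crash once this
     * returns, at the cost of waiting for stable storage. */
    int write_sync(int fd, const void *buf, size_t len) {
        if (write(fd, buf, len) != (ssize_t)len) return -1;
        return fsync(fd);
    }

    /* Fast (asynchronous) update: returns once the page cache holds the
     * data; a crash before writeback loses it. */
    int write_async(int fd, const void *buf, size_t len) {
        return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
    }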

16.

Background

The clinical decision support system can effectively overcome the limitations of individual clinicians' knowledge and reduce the possibility of misdiagnosis, enhancing health care. Traditional genetic data storage and analysis methods based on a stand-alone environment struggle to meet the computational requirements imposed by rapid genetic data growth because of their limited scalability.

Methods

In this paper, we propose a distributed gene clinical decision support system named GCDSS, and we implement a prototype based on cloud computing technology. We also present CloudBWA, a novel distributed read-mapping algorithm that leverages a batch-processing strategy to map reads on Apache Spark.

Results

Experiments show that the distributed gene clinical decision support system GCDSS and the distributed read mapping algorithm CloudBWA have outstanding performance and excellent scalability. Compared with state-of-the-art distributed algorithms, CloudBWA achieves up to 2.63 times speedup over SparkBWA. Compared with stand-alone algorithms, CloudBWA with 16 cores achieves up to 11.59 times speedup over BWA-MEM with 1 core.

Conclusions

GCDSS is a distributed gene clinical decision support system based on cloud computing techniques. In particular, we incorporate a distributed genetic data analysis pipeline framework into the proposed GCDSS system. To boost the data processing of GCDSS, we propose CloudBWA, a novel distributed read-mapping algorithm that leverages a batch-processing technique in the mapping stage on the Apache Spark platform.

17.
Keqin Li. Cluster Computing, 2005, 8(2-3): 119-126
Multihop wireless networks are treated as random symmetric planar point graphs, where all nodes have the same transmission power and radius, and the vertices of a graph are drawn randomly over a certain geographical region. Several basic and important topological properties of random multihop wireless networks are studied, including node degree, connectivity, diameter, bisection width, and biconnectivity. Such a study is believed to have useful implications for real applications. Keqin Li is currently a full professor of computer science at the State University of New York at New Paltz. His research interests are mainly in the design and analysis of algorithms, parallel and distributed computing, and computer networking, with particular interests in approximation algorithms, parallel algorithms, job scheduling, task dispatching, load balancing, performance evaluation, dynamic tree embedding, scalability analysis, parallel computing using optical interconnects, optical networks, and wireless networks. He has published over 190 journal articles, book chapters, and research papers in refereed international conference proceedings. He has also co-edited six international conference proceedings and a book entitled Parallel Computing Using Optical Interconnections, published by Kluwer Academic Publishers in 1998. His current research (2001-2004) is supported by the US National Science Foundation. Dr. Li has served in various capacities for numerous international conferences as a program/steering/advisory committee member, workshop chair, track chair, and special session organizer. He received best paper awards at the 1996 International Conference on Parallel and Distributed Processing Techniques and Applications, the 1997 IEEE National Aerospace and Electronics Conference, and the 2000 IEEE International Parallel and Distributed Processing Symposium. He received a recognition award from the International Association of Science and Technology for Development in October 1998. He is listed in Who's Who in Science and Engineering, 7th edition, 2003-2004; Who's Who in America, 58th edition, 2004; and Who's Who in the World, 20th edition, 2003. Dr. Li is a senior member of IEEE and a member of the IEEE Computer Society and ACM.

18.

Background

Signatures are short sequences that are unique and not similar to any other sequence in a database, and they can be used as the basis for identifying different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entire database to be loaded into memory, restricting the amount of data they can process and making them unable to handle databases with large amounts of data. Moreover, these algorithms are sequential and therefore have slower discovery speeds, meaning that their efficiency can be improved.

Results

In this research, we introduce the use of a divide-and-conquer strategy in signature discovery and propose a parallel signature discovery algorithm on a computer cluster. The algorithm applies the divide-and-conquer strategy to overcome the existing algorithms' inability to process large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases, such as the human whole-genome EST database, which the existing algorithms were previously unable to process.

Conclusions

The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large-database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.

19.
Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications on clusters of commodity systems-on-chip. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high-performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important performance gains with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
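A minimal sketch of the asymmetric-static idea (illustrative C with a made-up throughput ratio; the actual work tunes BLIS micro-kernels and cache blocking): split the n dimension of the gemm between the big and LITTLE clusters in proportion to their relative throughput:

    /* Partition the n columns of C between the big and LITTLE clusters
     * in proportion to their measured gemm throughput. The ratio would
     * be tuned per SoC; no value here comes from the paper. */
    void split_columns(int n, double big_gflops, double little_gflops,
                       int *n_big, int *n_little) {
        double share = big_gflops / (big_gflops + little_gflops);
        *n_big = (int)(n * share + 0.5);
        *n_little = n - *n_big;
    }
    /* Example: if the Cortex-A15 cluster is roughly 3x faster than the
     * Cortex-A7 cluster, about 75% of the columns go to the big cores. */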

20.
The survival of T cells at different stages of development is dependent on extrinsic signals. IL-7 is necessary for the development of memory T cells. IL-7 can induce and maintain the differentiation, survival, and proliferation of CD4+ memory T cells, but the roles of IL-2 and IL-15 in the generation of CD4+ memory T cells are still unclear. A CD4+ memory T cell generation system was established in vitro by adding IL-7. The phenotype of the CD4+ memory T cells was identified by FACS. Cell proliferation was analyzed by CFSE staining. The signal pathways involved were analyzed by Western blot. We found that IL-2, but not IL-15, could inhibit CD4+ memory T cell generation. Western blot showed that IL-7 up-regulated P-STAT5A expression and down-regulated Bax expression, and IL-2 reduced this effect of IL-7. In addition, IL-2 combined with IL-7 slightly up-regulated P-AKT and Foxo3a expression. In conclusion, our data reveal an inhibitory role of IL-2 in CD4+ memory T cell generation and indicate that the PI3K/AKT signaling pathway is involved.
