Similar literature
 20 similar documents found (search time: 234 ms)
1.
As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory and we demonstrate several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and long vectors. We compare our shared-memory allreduce with several MPI implementations (Open MPI, MPICH2, and MVAPICH2) that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and 8-node Opteron cluster, our implementation achieves geometric-mean speedups of 2.3X and 2.1X, respectively, over the best MPI implementation. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.
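The abstract does not describe its algorithms in detail. As a rough illustration of what an allreduce computes, the sketch below simulates recursive doubling, a textbook scheme chosen here as an assumption and not the paper's shared-memory design. The function name `allreduce_recursive_doubling` and the in-process list standing in for MPI ranks are hypothetical.

```python
def allreduce_recursive_doubling(values):
    """Simulate a sum-allreduce over len(values) ranks (count must be a power of two).

    Hypothetical illustration: each list slot plays the role of one MPI rank.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("number of ranks must be a power of two")
    current = list(values)
    step = 1
    while step < p:
        # In each round, rank r combines its partial sum with that of the
        # partner whose rank differs from r in exactly one bit (r XOR step).
        current = [current[r] + current[r ^ step] for r in range(p)]
        step *= 2
    return current


if __name__ == "__main__":
    print(allreduce_recursive_doubling([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

After log2(p) rounds every simulated rank holds the full sum, which is the defining property of an allreduce; a real implementation would exchange messages or shared-memory buffers instead of indexing a list.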

2.
Software Distributed Shared Memory (DSM) systems can be used to provide a coherent shared address space on multicomputers and other parallel systems without support for shared memory in hardware. The coherency software automatically translates shared memory accesses to explicit messages exchanged among the nodes in the system. Many applications perform well on such systems, but it has been shown that, for some applications, performance-critical messages can be delayed behind less important messages because of the enqueuing behavior in the communication libraries used in current systems. We present in this paper a new portable communication library that supports priorities to remedy this situation. We describe an implementation of the communication library and a quantitative model that is used to estimate the performance impact of priorities for a typical situation. Using the model, we show that the use of high-priority communication reduces the latency of performance-critical messages substantially over a wide range of network design parameters. The latency is reduced by up to 10–25% for each delaying low-priority message in the queue ahead.
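The queueing effect described in this abstract can be sketched with a small priority queue. This is only an illustration of the general idea, not the library's actual interface: the class `PrioritySendQueue`, the constants `HIGH` and `LOW`, and the message names are all hypothetical.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # assumed convention: smaller number = more urgent


class PrioritySendQueue:
    """Outgoing message queue that dequeues by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so that messages of equal
        # priority leave the queue in FIFO order.
        self._arrival = itertools.count()

    def enqueue(self, message, priority=LOW):
        heapq.heappush(self._heap, (priority, next(self._arrival), message))

    def dequeue(self):
        _priority, _order, message = heapq.heappop(self._heap)
        return message

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Return all queued messages in the order they would be sent."""
    sent = []
    while len(queue):
        sent.append(queue.dequeue())
    return sent
```

With a plain FIFO queue, a performance-critical message enqueued behind two bulk transfers would wait for both; here it is sent first, while the bulk messages keep their relative order.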

3.
4.
Clusters of Symmetrical Multiprocessors (SMPs) have recently become the norm for high-performance economical computing solutions. Multiple nodes in a cluster can be used for parallel programming using a message passing library. An alternate approach is to use a software Distributed Shared Memory (DSM) to provide a view of shared memory to the application programmer. This paper describes Strings, a high performance distributed shared memory system designed for such SMP clusters. The distinguishing feature of this system is the use of a fully multi-threaded runtime system, using kernel-level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel-level lightweight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system and illustrates the performance improvements with benchmark programs from the SPLASH-2 suite, some computational kernels as well as a full-fledged application. It is found that using multiple processes on SMP nodes provides good speedups only for a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree. This revised version was published online in July 2006 with corrections to the Cover Date.

5.
As local-area workstation networks are widely available, the idea of offering a software distributed shared memory (SDSM) system across interconnects of clusters is quite an attractive alternative for compute-intensive applications. However, the higher cost of sending a message over an inter-cluster link compared to an intra-cluster one can limit applications' performance on a multi-cluster SDSM system. In this paper, we present the extensions that we have added to the SDSM TreadMarks, which provides the lazy release consistency (LRC) memory model, in order to adapt it to a loosely-coupled cluster-based platform. We have implemented a logical per-cluster cache that exploits cluster locality. By accessing the cache of its cluster, a processor can share data previously requested by a second processor of its cluster, thereby minimizing the cost of inter-cluster communication.

6.
Systems biology aims at creating mathematical models, i.e., computational reconstructions of biological systems and processes that will result in a new level of understanding: the elucidation of the basic and presumably conserved “design” and “engineering” principles of biomolecular systems. Thus, systems biology will move biology from a phenomenological to a predictive science. Mathematical modeling of biological networks and processes has already greatly improved our understanding of many cellular processes. However, given the massive amount of qualitative and quantitative data currently produced and the number of burning questions in health care and biotechnology that need to be solved, the field is still in its early phases. The field requires novel approaches for abstraction, for modeling bioprocesses that follow different biochemical and biophysical rules, and for combining different modules into larger models that still allow realistic simulation with the computational power available today. We have identified and discussed the currently most prominent problems in systems biology: (1) how to bridge different scales of modeling abstraction, (2) how to bridge the gap between topological and mechanistic modeling, and (3) how to bridge the wet and dry laboratory gap. The future success of systems biology largely depends on bridging the recognized gaps.

7.
Recent neuropsychological research has begun to reveal that neurons encode information in the timing of spikes. Spiking neural network simulations are a flexible and powerful method for investigating the behaviour of neuronal systems. Software simulation of spiking neural networks cannot rapidly generate output spikes for large-scale networks. An alternative approach, hardware implementation of such systems, makes it possible to generate independent spikes precisely and to output spike waves simultaneously in real time, provided that the spiking neural network can take full advantage of the hardware's inherent parallelism. We introduce a configurable FPGA-oriented hardware platform for spiking neural network simulation in this work. We aim to use this platform to combine the speed of dedicated hardware with the programmability of software so that it might allow neuroscientists to put together sophisticated computational experiments with their own models. A feed-forward hierarchy network is developed as a case study to describe the operation of biological neural systems (such as orientation selectivity of visual cortex) and computational models of such systems. This model demonstrates how a feed-forward neural network constructs the circuitry required for orientation selectivity and provides a platform for reaching a deeper understanding of the primate visual system. In the future, larger scale models based on this framework can be used to replicate the actual architecture in visual cortex, leading to more detailed predictions and insights into visual perception phenomena.

8.
Advancing the size and complexity of neural network models leads to an ever increasing demand for computational resources for their simulation. Neuromorphic devices offer a number of advantages over conventional computing architectures, such as high emulation speed or low power consumption, but this usually comes at the price of reduced configurability and precision. In this article, we investigate the consequences of several such factors that are common to neuromorphic devices, more specifically limited hardware resources, limited parameter configurability and parameter variations due to fixed-pattern noise and trial-to-trial variability. Our final aim is to provide an array of methods for coping with such inevitable distortion mechanisms. As a platform for testing our proposed strategies, we use an executable system specification (ESS) of the BrainScaleS neuromorphic system, which has been designed as a universal emulation back-end for neuroscientific modeling. We address the most essential limitations of this device in detail and study their effects on three prototypical benchmark network models within a well-defined, systematic workflow. For each network model, we start by defining quantifiable functionality measures by which we then assess the effects of typical hardware-specific distortion mechanisms, both in idealized software simulations and on the ESS. For those effects that cause unacceptable deviations from the original network dynamics, we suggest generic compensation mechanisms and demonstrate their effectiveness. Both the suggested workflow and the investigated compensation mechanisms are largely back-end independent and do not require additional hardware configurability beyond the one required to emulate the benchmark networks in the first place. We hereby provide a generic methodological environment for configurable neuromorphic devices that are targeted at emulating large-scale, functional neural networks.

9.
This paper introduces Madeleine II, a new adaptive and portable multi-protocol implementation of the Madeleine communication library. Madeleine II has the ability to control multiple network interfaces (BIP, SISCI, VIA) and multiple network adapters (Ethernet, Myrinet, SCI) within the same application session. We report on performance measurements obtained using BIP/Myrinet and SISCI/SCI and we present preliminary results about our MPICH/Madeleine II and Nexus/Madeleine II ports. We also discuss an extension of Madeleine II for clusters of clusters which is able to handle heterogeneous networks. In particular, we present the fast internal data-forwarding mechanism that is used on gateway nodes to speed up inter-cluster transmissions. Preliminary experiments show that the resulting inter-cluster bandwidth is close to the one delivered by the hardware.

10.
In many types of network, the relationship between structure and function is of great significance. We are particularly interested in community structures, which arise in a wide variety of domains. We apply a simple oscillator model to networks with community structures and show that waves of regular oscillation are caused by synchronised clusters of nodes. Moreover, we show that such global oscillations may arise as a direct result of network topology. We also observe that additional modes of oscillation (as detected through frequency analysis) occur in networks with additional levels of topological hierarchy and that such modes may be directly related to network structure. We apply the method in two specific domains (metabolic networks and metropolitan transport) demonstrating the robustness of our results when applied to real world systems. We conclude that (where the distribution of oscillator frequencies and the interactions between them are known to be unimodal) our observations may be applicable to the detection of underlying community structure in networks, shedding further light on the general relationship between structure and function in complex systems.

11.
Mainstream computing equipment and the advent of affordable multi-Gigabit communication technology permit us to address data acquisition and processing problems with clusters of COTS machinery. Such networks typically contain heterogeneous platforms, real-time partitions and even custom devices. Vital overall system requirements are high efficiency and flexibility. In preceding projects we experienced the difficulty of meeting both requirements at once. Intelligent I/O (I2O) is an industry specification that defines a uniform messaging format and execution environment for hardware and operating system independent device drivers in systems with processor-based communication equipment. Mapping this concept to a distributed computing environment and encapsulating the details of the specification into an application-programming framework allow us to provide architectural support for (i) efficient and (ii) extensible cluster operation. This paper portrays our view of applying I2O to high-performance clusters. We demonstrate the feasibility of this approach and report on the efficiency of our XDAQ software framework for distributed data acquisition systems.

12.
Genomics, 2020, 112(6): 4288-4296
We posit the likely architecture of complex diseases is that subgroups of patients share variants in genes in specific networks which are sufficient to give rise to a shared phenotype. We developed Proteinarium, a multi-sample protein-protein interaction (PPI) tool, to identify clusters of patients with shared gene networks. Proteinarium converts user defined seed genes to protein symbols and maps them onto the STRING interactome. A PPI network is built for each sample using Dijkstra's algorithm. Pairwise similarity scores are calculated to compare the networks and cluster the samples. A layered graph of PPI networks for the samples in any cluster can be visualized. To test this newly developed analysis pipeline, we reanalyzed publicly available data sets, from which modest outcomes had previously been achieved. We found significant clusters of patients with unique genes which enhanced the findings in the original study.
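The abstract names Dijkstra's algorithm but does not say how Proteinarium weights edges or selects paths. The sketch below shows only the algorithm itself on a toy weighted graph; the node labels, the weights, and the function name `dijkstra` are hypothetical and are not taken from the tool or from the STRING interactome.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    graph maps each node to a dict {neighbor: non-negative edge weight}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry: a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Toy undirected interaction graph with made-up weights (lower = closer).
TOY_GRAPH = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}

if __name__ == "__main__":
    print(dijkstra(TOY_GRAPH, "A"))
```

In this toy graph the shortest route from A to D runs through B and C (total weight 4) and beats the more direct A-C-D route (total weight 5).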

13.
Wu K, Taki Y, Sato K, Sassa Y, Inoue K, Goto R, Okada K, Kawashima R, He Y, Evans AC, Fukuda H. PLoS ONE, 2011, 6(5): e19608
Community structure is a universal and significant feature of many complex networks in biology, society, and economics. Community structure has also been revealed in human brain structural and functional networks in previous studies. However, communities overlap and share many edges and nodes, and the overlapping community structure of human brain networks remains largely unknown. Here, using regional gray matter volume, we investigated the structural brain network among 90 brain regions (according to a predefined anatomical atlas) in 462 young, healthy individuals. Overlapped nodes between communities were defined by assuming that nodes (brain regions) can belong to more than one community. We demonstrated that 90 brain regions were organized into 5 overlapping communities associated with several well-known brain systems, such as the auditory/language, visuospatial, emotion, decision-making, social, control of action, memory/learning, and visual systems. The overlapped nodes were mostly involved in an inferior-posterior pattern and were primarily related to auditory and visual perception. The overlapped nodes were mainly attributed to brain regions with higher node degrees and nodal efficiency and played a pivotal role in the flow of information through the structural brain network. Our results revealed fuzzy boundaries between communities by identifying overlapped nodes and provided new insights into the understanding of the relationship between the structure and function of the human brain. This study provides the first report of the overlapping community structure of the structural network of the human brain.

14.
Networks are employed to represent many nonlinear complex systems in the real world. The topological aspects and relationships between the structure and function of biological networks have been widely studied in the past few decades. However, dynamic and control features of complex networks have not been widely researched in comparison to topological network features. In this study, we explore the relationship between network controllability, topological parameters, and network medicine (metabolic drug targets). Considering the assumption that targets of approved anticancer metabolic drugs are driver nodes (which control cancer metabolic networks), we have applied topological analysis to genome-scale metabolic models of 15 normal and corresponding cancer cell types. The results show that besides primary network parameters, more complex network metrics such as motifs and clusters may also be appropriate for controlling the systems, providing the controllability relationship between topological parameters and drug targets. Consequently, this study reveals the possibilities of following a set of driver nodes in network clusters instead of considering them individually according to their centralities. This outcome suggests considering distributed control systems instead of nodal control for cancer metabolic networks, leading to a new strategy in the field of network medicine.

15.
A collection of virtual machines (VMs) interconnected with an overlay network with a layer 2 abstraction has proven to be a powerful, unifying abstraction for adaptive distributed and parallel computing on loosely-coupled environments. It is now feasible to allow VMs hosting high performance computing (HPC) applications to seamlessly bridge distributed cloud resources and tightly-coupled supercomputing and cluster resources. However, to achieve the application performance that the tightly-coupled resources are capable of, it is important that the overlay network not introduce significant overhead relative to the native hardware, which is not the case for current user-level tools, including our own existing VNET/U system. In response, we describe the design, implementation, and evaluation of a virtual networking system that has negligible latency and bandwidth overheads in 1–10 Gbps networks. Our system, VNET/P, is directly embedded into our publicly available Palacios virtual machine monitor (VMM). VNET/P achieves native performance on 1 Gbps Ethernet networks and very high performance on 10 Gbps Ethernet networks. The NAS benchmarks generally achieve over 95% of their native performance on both 1 and 10 Gbps. We have further demonstrated that VNET/P can operate successfully over more specialized tightly-coupled networks, such as InfiniBand and Cray Gemini. Our results suggest it is feasible to extend a software-based overlay network designed for computing at wide-area scales into tightly-coupled environments.

16.
It is extremely important to minimize network access time in constructing a high-performance PC cluster system. For an SCI-based PC cluster, it is possible to reduce the network access time by maintaining network cache in each cluster node. This paper presents a Network-Cache-Coherent-NUMA (NCC-NUMA) card that utilizes network cache for SCI-based PC clustering. The NCC-NUMA card is directly plugged into the PCI slot of each node, and contains shared memory, network cache, and interconnection modules. The network cache is maintained for the shared memory on the PCI bus of cluster nodes. The coherency mechanism between the network cache and the shared memory is based on the IEEE SCI standard. Both a simulator and an NCC-NUMA prototype card are developed to evaluate the performance of the system. According to the experiments, the cluster system with the NCC-NUMA card showed considerable improvements compared with an SCI-based cluster without network cache.

17.
In this paper, we report on our “Iridis-Pi” cluster, which consists of 64 Raspberry Pi Model B nodes each equipped with a 700 MHz ARM processor, 256 Mbit of RAM and a 16 GiB SD card for local storage. The cluster has a number of advantages which are not shared with conventional data-centre-based clusters, including its low total power consumption, easy portability due to its small size and weight, affordability, and passive, ambient cooling. We propose that these attributes make Iridis-Pi ideally suited to educational applications, where it provides a low-cost starting point to inspire and enable students to understand and apply high-performance computing and data handling to tackle complex engineering and scientific challenges. We present the results of benchmarking both the computational power and network performance of the “Iridis-Pi.” We also argue that such systems should be considered in some additional specialist application areas where these unique attributes may prove advantageous. We believe that the choice of an ARM CPU foreshadows a trend towards the increasing adoption of low-power, non-PC-compatible architectures in high performance clusters.

18.
Several MPI systems for Grid environments, in which clusters are connected by wide-area networks, have been proposed. However, the collective communication algorithms in such MPI systems assume relatively low-bandwidth wide-area networks, and they are not designed for the fast wide-area networks that are becoming available. On the other hand, for cluster MPI systems, a bcast algorithm by van de Geijn et al. and an allreduce algorithm by Rabenseifner have been proposed, which are efficient in a high bi-section bandwidth environment. We modify those algorithms so as to effectively utilize fast wide-area inter-cluster networks and to control the number of nodes which can transfer data simultaneously through wide-area networks to avoid congestion. We confirmed the effectiveness of the modified algorithms by experiments using a 10 Gbps emulated WAN environment. The environment consists of two clusters, where each cluster consists of nodes with 1 Gbps Ethernet links and a switch with a 10 Gbps upper link. The two clusters are connected through a 10 Gbps WAN emulator which can insert latency. In a 10 millisecond latency environment, when the message size is 32 MB, the proposed bcast and allreduce are 1.6 and 3.2 times faster, respectively, than the algorithms used in existing MPI systems for Grid environments.

19.
20.
Dynamical processes in many engineered and living systems take place on complex networks of discrete dynamical units. We present laboratory experiments with a networked chemical system of nickel electrodissolution in which synchronization patterns are recorded in systems with smooth periodic, relaxation periodic, and chaotic oscillators organized in networks composed of up to twenty dynamical units and 140 connections. The reaction system formed domains of synchronization patterns that are strongly affected by the architecture of the network. Spatially organized partial synchronization could be observed either due to densely connected network nodes or through the ‘chimera’ symmetry breaking mechanism. Relaxation periodic and chaotic oscillators formed structures by dynamical differentiation. We have identified effects of network structure on pattern selection (through permutation symmetry and coupling directness) and on formation of hierarchical and ‘fuzzy’ clusters. With chaotic oscillators we provide experimental evidence that critical coupling strengths at which transition to identical synchronization occurs can be interpreted by experiments with a pair of oscillators and analysis of the eigenvalues of the Laplacian connectivity matrix. The experiments thus provide an insight into the extent of the impact of the architecture of a network on self-organized synchronization patterns.
