Similar literature
 20 similar documents found
1.
We present the design, implementation, and evaluation of single-assignment data structures and of a software-controlled cache in an existing multi-threaded architecture platform, the Efficient Architecture for Running Threads (EARTH). The I-Structure Software-Controlled Cache (ISSC) exploits temporal and spatial locality of EARTH split-phased memory transactions for single-assignment memory references. Our experimental evaluation indicates that the caching mechanism for single-assignment storage makes the EARTH memory system more robust to variations in the latency of memory operations. As a consequence, the system can be ported to a wider range of machine platforms and deliver speedup for both regular and irregular applications.
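The single-assignment storage mentioned in this abstract can be illustrated with a minimal sketch of general I-structure semantics: each slot is written at most once, and a read that arrives before the write is deferred until the value exists. This is not the ISSC cache or the EARTH runtime; the class name `IStructure` and its callback-based `read` are assumptions made for illustration.

```python
class IStructure:
    """Array of write-once slots; reads of empty slots are deferred.

    Hypothetical illustration of I-structure semantics, not EARTH code.
    """

    def __init__(self, size):
        self._values = [None] * size
        self._full = [False] * size
        # One queue of waiting readers (callbacks) per slot.
        self._deferred = [[] for _ in range(size)]

    def read(self, index, callback):
        """Deliver the slot's value to callback now, or once it is written."""
        if self._full[index]:
            callback(self._values[index])
        else:
            self._deferred[index].append(callback)

    def write(self, index, value):
        """Fill a slot exactly once and wake every deferred reader."""
        if self._full[index]:
            raise RuntimeError(f"slot {index} already written")
        self._values[index] = value
        self._full[index] = True
        pending, self._deferred[index] = self._deferred[index], []
        for waiting_reader in pending:
            waiting_reader(value)
```

Because a filled slot never changes, a cached copy of it can never become stale, which is the property that makes caching single-assignment references attractive.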

2.
Clusters of Symmetrical Multiprocessors (SMPs) have recently become the norm for high-performance economical computing solutions. Multiple nodes in a cluster can be used for parallel programming using a message passing library. An alternate approach is to use a software Distributed Shared Memory (DSM) to provide a view of shared memory to the application programmer. This paper describes Strings, a high-performance distributed shared memory system designed for such SMP clusters. The distinguishing feature of this system is the use of a fully multi-threaded runtime system, using kernel-level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel-level lightweight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system and illustrates the performance improvements with benchmark programs from the SPLASH-2 suite, some computational kernels, and a full-fledged application. It is found that using multiple processes on SMP nodes provides good speedups only for a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree. This revised version was published online in July 2006 with corrections to the Cover Date.

3.
MPI collective communication operations to distribute or gather data are used in many parallel applications from scientific computing, but they may lead to scalability problems since their execution times increase with the number of participating processors. In this article, we show how the execution time of collective communication operations can be improved significantly by an internal restructuring based on orthogonal processor structures with two or more levels. The execution time of operations like MPI_Bcast() or MPI_Allgather() can be reduced by 40% and 70% on a dual Xeon cluster and a Beowulf cluster with single-processor nodes, respectively. A significant performance improvement can also be obtained on a Cray T3E by a careful selection of the processor structure. The use of these optimized communication operations can significantly reduce the execution time of data parallel implementations of complex application programs without requiring any other change to the computation and communication structure. We present runtime functions for the modeling of two-phase realizations and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.
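A two-level orthogonal structure of the kind this abstract describes can be sketched as follows. The sketch only computes which ranks take part in each phase of a two-phase broadcast; it performs no communication and makes no MPI calls. The row-major grid layout and the function name `two_phase_groups` are assumptions, since the abstract does not specify how ranks are arranged.

```python
def two_phase_groups(num_procs, row_len, root=0):
    """Return (phase1_group, phase2_groups) for a two-phase broadcast.

    Assumption: ranks are laid out row-major on a grid with row_len
    ranks per row. Phase 1 broadcasts down the root's column, reaching
    one rank in every row. Phase 2 broadcasts within every row, with
    that rank acting as the local root.
    """
    if num_procs % row_len != 0:
        raise ValueError("num_procs must be a multiple of row_len")
    num_rows = num_procs // row_len
    root_col = root % row_len
    # Phase 1: the column that contains the root.
    phase1 = [row * row_len + root_col for row in range(num_rows)]
    # Phase 2: every row forms its own group.
    phase2 = [list(range(row * row_len, (row + 1) * row_len))
              for row in range(num_rows)]
    return phase1, phase2
```

For example, `two_phase_groups(6, 3, root=4)` returns `([1, 4], [[0, 1, 2], [3, 4, 5]])`: rank 4 first reaches rank 1, then ranks 1 and 4 each broadcast within their own row. Each phase involves fewer participants than one flat broadcast over all six ranks.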

4.
Delta Execution is a preemptive and transparent thread migration mechanism for supporting load distribution and balancing in a cluster of workstations. The design of Delta Execution allows the execution system to migrate threads of a Java application to different nodes of a cluster so as to achieve parallel execution. The approach is to break down and group the execution context of a migrating thread into sets of consecutive machine-dependent and machine-independent execution sub-contexts. Each set of machine-independent sub-contexts, also known as a delta set, is then migrated to a remote node in a regulated manner to continue the execution. Since Delta Execution is implemented at the virtual machine level, all the migration-related activities are conducted transparently with respect to the applications. No new migration-related instructions need to be added to the programs, and existing applications can immediately benefit from the parallel execution capability of Delta Execution without any code modification. Furthermore, because the Delta Execution approach identifies and migrates only the machine-independent part of a thread's execution context, the implementation is reasonably manageable and the resulting software is portable. This revised version was published online in July 2006 with corrections to the Cover Date.

5.
We present a modular approach to implementing dynamic algorithm switching for parallel scientific software. By using a compositional framework based on function call interception techniques, our proposed method transparently integrates algorithm switching code with a given program without directly modifying the original code structure. Through fine-grained control of the algorithmic behavior of an application at the level of functions, our approach supports the design and implementation of application-specific switching scenarios in a modular way. Our approach encourages algorithm switching to be performed dynamically at the loop end of a parallel simulation, where cooperating processes in concurrent execution typically synchronize and intermediate computation results are consistent. In this way, newly added switching operations do not cause race conditions that may produce unreliable computation results in parallel simulations. By applying our method to a real-world scientific application and adapting its algorithmic behavior to the properties of input problems, we demonstrate the applicability and effectiveness of our approach to constructing efficient parallel simulations.

6.
MOTIVATION: Due to the steadily growing computational demands in bioinformatics and related scientific disciplines, one is forced to make optimal use of the available resources. A straightforward solution is to build a network of idle computers and let each of them work on a small piece of a scientific challenge, as done by Seti@Home (http://setiathome.berkeley.edu), the world's largest distributed computing project. RESULTS: We developed a generally applicable distributed computing solution that uses a screensaver system similar to Seti@Home. The software exploits the coarse-grained nature of typical bioinformatics projects. Three major considerations for the design were: (1) often, many different programs are needed, while the time is lacking to parallelize them. Models@Home can run any program in parallel without modifications to the source code; (2) in contrast to the Seti project, bioinformatics applications are normally more sensitive to lost jobs. Models@Home therefore includes stringent control over job scheduling; (3) to allow use in heterogeneous environments, Linux- and Windows-based workstations can be combined with dedicated PCs to build a homogeneous cluster. We present three practical applications of Models@Home, running the modeling programs WHAT IF and YASARA on 30 PCs: force field parameterization, molecular dynamics docking, and database maintenance.

7.
TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders (cited 5 times in total: 0 self-citations, 5 citations by others)
We describe two new Generalized Hidden Markov Model implementations for ab initio eukaryotic gene prediction. The C/C++ source code for both is available as open source and is highly reusable due to their modular and extensible architectures. Unlike most of the currently available gene-finders, the programs are re-trainable by the end user. They are also re-configurable and include several types of probabilistic submodels which can be independently combined, such as Maximal Dependence Decomposition trees and interpolated Markov models. Both programs have been used at TIGR for the annotation of the Aspergillus fumigatus and Toxoplasma gondii genomes. AVAILABILITY: Source code and documentation are available under the open source Artistic License from http://www.tigr.org/software/pirate

8.
Situs is a modular and widely used software package for the integration of biophysical data across the spatial resolution scales. It has been developed over the last decade with a focus on bridging the resolution gap between atomic structures, coarse-grained models, and volumetric data from low-resolution biophysical origins, such as electron microscopy, tomography, or small-angle scattering. Structural models can be created and refined with various flexible and rigid body docking strategies. The software consists of multiple, stand-alone programs for the format conversion, analysis, visualization, manipulation, and assembly of 3D data sets. The programs have been ported to numerous platforms in both serial and shared memory parallel architectures and can be combined in various ways for specific modeling applications. The modular design facilitates the updating of individual programs and the development of novel application workflows. This review provides an overview of the Situs package as it exists today with an emphasis on functionality and workflows supported by version 2.5.

9.
Previous studies have revealed that paravirtualization imposes minimal performance overhead on High Performance Computing (HPC) workloads, while exposing numerous benefits for this field. In this study, we investigate the impact of paravirtualization on the performance of automatically tuned software systems. We compare peak performance, performance degradation in constrained memory situations, performance degradation in multi-threaded applications, and inter-VM shared memory performance. For comparison purposes, we examine the proficiency of ATLAS, a quintessential example of an autotuning software system, in tuning the BLAS library routines for paravirtualized systems. Our results show that the combination of ATLAS and Xen paravirtualization delivers native execution performance and nearly identical memory hierarchy performance profiles in both single- and multi-threaded scenarios. Furthermore, we show that it is possible to achieve memory sharing among OS instances at native speeds. These results expose new benefits to memory-intensive applications arising from the ability to slim down the guest OS without influencing the system performance. In addition, our findings support a novel and very attractive deployment scenario for computational science and engineering codes on virtual clusters and computational clouds.

10.
For several applications and algorithms used in applied bioinformatics, a bottleneck in terms of computational time may arise when they are scaled up to facilitate analyses of large datasets and databases. Re-codification, algorithm modification, or sacrifices in sensitivity and accuracy may be necessary to accommodate the limited computational capacity of single workstations. Grid computing offers an alternative model for solving massive computational problems by parallel execution of existing algorithms and software implementations. We present the implementation of a Grid-aware model for solving computationally intensive bioinformatic analyses, exemplified by a blastp sliding window algorithm for whole proteome sequence similarity analysis, and evaluate the performance in comparison with a local cluster and a single workstation. Our strategy involves temporary installations of the BLAST executable and databases on remote nodes at submission, which accommodates dynamic Grid environments because it avoids the need for predefined runtime environments (preinstalled software and databases at specific Grid nodes). Importantly, the implementation is generic in that the BLAST executable can be replaced by other software tools to facilitate analyses suitable for parallelisation. This model should be of general interest in applied bioinformatics. Scripts and procedures are freely available from the authors.

11.
With the increasing interest in large-scale, high-resolution, and real-time geographic information system (GIS) applications and spatial big data processing, traditional GIS is not efficient enough to handle the required loads due to limited computational capabilities. Various attempts have been made to adopt high-performance computation techniques from different applications, such as designs of advanced architectures, data partition strategies, and direct parallelization of spatial analysis algorithms, to address such challenges. This paper surveys the current state of parallel GIS with respect to parallel GIS architectures, parallel processing strategies, and relevant topics. We present the general evolution of the GIS architecture, which includes two main parallel GIS architectures based on a high-performance computing cluster and a Hadoop cluster. We then summarize the current spatial data partition strategies, the key methods for realizing parallel GIS from the viewpoint of data decomposition, and the progress of specialized parallel GIS algorithms. We use the parallel processing of GRASS as a case study. We also identify key problems and potential future research directions for parallel GIS.

12.
We consider parallel computing on a network of workstations using a connection-oriented protocol (e.g., Asynchronous Transfer Mode) for data communication. In a connection-oriented protocol, a virtual circuit of guaranteed bandwidth is established for each pair of communicating workstations. Since not all virtual circuits have the same guaranteed bandwidth, a parallel application must deal with the unequal bandwidths between workstations. Since most work on the design of parallel algorithms assumes equal bandwidths on all the communication links, these algorithms often do not perform well when executed on networks of workstations using connection-oriented protocols. In this paper, we first evaluate the performance degradation caused by unequal bandwidths on the execution of conventional parallel algorithms such as the fast Fourier transform and bitonic sort. We then present a strategy based on dynamic redistribution of data points to reduce the bottlenecks caused by unequal bandwidths. We also extend this strategy to deal with processor heterogeneity. Using analysis and simulation, we show that there is a considerable reduction in the runtime if the proposed redistribution strategy is adopted. The basic idea presented in this paper can also be used to improve the runtimes of other parallel applications in connection-oriented environments. This revised version was published online in July 2006 with corrections to the Cover Date.
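The idea of matching data placement to unequal link bandwidths can be sketched with a simple proportional split. The abstract does not state the paper's actual redistribution rule, so the proportional policy below and the function name `split_by_bandwidth` are assumptions chosen only to make the idea concrete.

```python
def split_by_bandwidth(num_points, bandwidths):
    """Assign data points to workstations in proportion to link bandwidth.

    Assumption: bandwidths are positive numbers in any common unit.
    Returns one share per workstation; the shares sum to num_points.
    """
    total = sum(bandwidths)
    # Floor of each proportional share.
    shares = [int(num_points * b // total) for b in bandwidths]
    # Points lost to rounding go to the fastest links first.
    leftover = num_points - sum(shares)
    fastest_first = sorted(range(len(bandwidths)),
                           key=lambda i: bandwidths[i], reverse=True)
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares
```

With `split_by_bandwidth(100, [10, 30, 60])` the result is `[10, 30, 60]`, so the workstation behind the slowest circuit exchanges the least data. A dynamic scheme would recompute such a split whenever the measured bandwidths change.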

13.
Developing an efficient parallel application is not an easy task, and achieving good performance requires a thorough understanding of the program’s behavior. Careful performance analysis and optimization are crucial. To help developers or users of these applications analyze the program’s behavior, it is necessary to provide them with an abstraction of the application performance. In this paper, we propose a dynamic performance abstraction technique, which enables the automated discovery of causal execution paths, composed of communication and computational activities, in MPI parallel programs. This approach enables autonomous and low-overhead execution monitoring that generates performance knowledge about application behavior for the purpose of online performance diagnosis. Our performance abstraction technique reflects an application's behavior and is made up of elements correlated with high-level program structures, such as loops and communication operations. Moreover, it characterizes all elements with statistical execution profiles. We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance abstraction technique proved effective for low-overhead capturing of the program’s behavior and facilitated performance understanding.

14.
The first aim of simulation in a virtual environment is to help biologists gain a better understanding of the simulated system. The cost of such simulation is significantly reduced compared to that of in vivo simulation. However, the inherent complexity of biological systems makes it hard to simulate them on non-parallel architectures: models might be made of sub-models and take several scales into account, and the number of simulated entities may be quite large. Today, graphics cards are used for general-purpose computing, which has been made easier thanks to frameworks like CUDA or OpenCL. Parallelization of models may, however, not be easy: parallel computer programming skills are often required, and several hardware architectures may be used to execute models. In this paper, we present the software architecture we built in order to implement various models able to simulate multi-cellular systems. This architecture is modular and implements data structures adapted to graphics processing unit architectures. It allows efficient simulation of biological mechanisms.

15.
Pencil beam algorithms are still considered standard photon dose calculation methods in radiotherapy treatment planning for many clinical applications. Despite their established role in radiotherapy planning, their performance and clinical applicability have to be continuously adapted to evolving complex treatment techniques such as adaptive radiation therapy (ART). We herewith report on a new, highly efficient version of a well-established pencil beam convolution algorithm which relies purely on measured input data. A method was developed that improves raytracing efficiency by exploiting the capability of modern CPU architectures to reduce the runtime. Since most current desktop computers provide more than one calculation unit, we used symmetric multiprocessing extensively to parallelize the workload and thus decrease the algorithmic runtime. To maximize the advantage of code parallelization, we present two implementation strategies: one for the dose calculation in inverse planning software, and one for traditional forward planning. As a result, we achieved a superlinear speedup factor of approx. 18 on a 16-core personal computer with AMD processors when calculating the dose distribution of typical forward IMRT treatment plans.

16.
This paper surveys the computational strategies followed to parallelise the most widely used software in the bioinformatics arena. The studied algorithms are computationally expensive and their computational patterns range from regular, such as database-searching applications, to very irregularly structured patterns (phylogenetic trees). Fine- and coarse-grained parallel strategies are discussed for these very diverse sets of applications. This overview outlines computational issues related to parallelism, physical machine models, parallel programming approaches and scheduling strategies for a broad range of computer architectures. In particular, it deals with shared, distributed and shared/distributed memory architectures.

17.
Over the past few years, cluster/distributed computing has been gaining popularity. The proliferation of cluster/distributed computing is due to the improved performance and increased reliability of these systems. Many parallel programming languages and related parallel programming models have become widely accepted. However, one of the major shortcomings of running parallel applications in cluster/distributed computing environments is the high communication overhead incurred. To reduce the communication overhead, and thus the completion time of a parallel application, this paper describes a simple, efficient, and portable Key Message (KM) approach to support parallel computing in cluster/distributed computing environments. To demonstrate the advantage of the KM approach, a prototype runtime system has been implemented and evaluated. Our preliminary experimental results show that the KM approach yields a larger improvement in the communication performance of a parallel application when the network background load increases or the computation-to-communication ratio of the application decreases.

18.
Biological applications, from genomics to ecology, deal with graphs that represent the structure of interactions. Analyzing such data requires searching for subgraphs in collections of graphs. This task is computationally expensive. Even though multicore architectures, from commodity computers to more advanced symmetric multiprocessing (SMP), offer scalable computing power, currently published software implementations for indexing and graph matching are fundamentally sequential. As a consequence, such software implementations (i) do not fully exploit available parallel computing power and (ii) do not scale with respect to the size of graphs in the database. We present GRAPES, software for parallel searching on databases of large biological graphs. GRAPES implements a parallel version of well-established graph searching algorithms, and introduces new strategies which naturally lead to a faster parallel searching system, especially for large graphs. GRAPES decomposes graphs into subcomponents that can be efficiently searched in parallel. We show the performance of GRAPES on representative biological datasets containing antiviral chemical compounds, DNA, RNA, proteins, protein contact maps and protein interaction networks.

19.
We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches. Our goal is to enhance the flexibility of the SPMD-style programming model and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. In this paper we assume perfect intrinsic load balance and focus on extrinsic imbalances caused by OS noise, multiprogramming and mismatches of threads to hardware parallelism. Juggle shows performance improvements of up to 80 % over static load balancing for oversubscribed UPC, OpenMP, and pthreads benchmarks. We also show that Juggle is effective in unpredictable, multiprogrammed environments, with up to a 50 % performance improvement over the Linux load balancer and a 25 % reduction in performance variation. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems.

20.
The discovery of higher-order epistatic interactions is an important task in the field of genome-wide association studies, as it allows for the identification of complex interaction patterns between multiple genetic markers. Some existing brute-force approaches explore the whole space of k-interactions in an exhaustive manner, resulting in almost intractable execution times. Computational cost can be reduced drastically by restricting the search space with suitable preprocessing filters which prune unpromising candidates. Other approaches mitigate the execution time by employing massively parallel accelerators in order to benefit from the vast computational resources of these architectures. In this paper, we combine a novel preprocessing filter, namely SingleMI, with massively parallel computation on modern GPUs to further accelerate epistasis discovery. Our implementation improves both the runtime and accuracy when compared to a previous GPU counterpart that employs mutual information clustering for prefiltering. SingleMI is open source software and publicly available at: https://github.com/sleeepyjack/singlemi/.
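A preprocessing filter that prunes unpromising markers can be sketched by scoring each marker on its own and keeping only the best-scoring ones. The abstract does not describe how SingleMI computes its score; the per-marker mutual information with the phenotype used below is an assumption suggested by the filter's name, and the function names `mutual_information` and `top_snps` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(genotypes, phenotypes):
    """Mutual information (in bits) between two equal-length label lists."""
    n = len(genotypes)
    joint = Counter(zip(genotypes, phenotypes))
    g_counts = Counter(genotypes)
    p_counts = Counter(phenotypes)
    mi = 0.0
    for (g, p), count in joint.items():
        # p(g,p) * log2( p(g,p) / (p(g) * p(p)) ), written with raw counts.
        mi += (count / n) * log2(count * n / (g_counts[g] * p_counts[p]))
    return mi


def top_snps(snp_matrix, phenotypes, keep):
    """Return indices of the `keep` markers with the highest score.

    Assumption: snp_matrix holds one genotype list per marker.
    """
    scores = [(mutual_information(column, phenotypes), index)
              for index, column in enumerate(snp_matrix)]
    scores.sort(reverse=True)
    return [index for _, index in scores[:keep]]
```

Only the markers that survive such a filter would then be passed to the exhaustive k-interaction search, which is where the reduction in the search space comes from.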
