Similar Documents
1.
In this paper, we propose a novel patient-specific method of modelling pulmonary airflow using graphics processing unit (GPU) computation that can be applied in medical practice. To overcome the barriers that computation speed, installation price and footprint impose on the application of computational fluid dynamics, we focused on GPU computation and the lattice Boltzmann method (LBM). GPU computation and the LBM are well matched because of the GPU's architectural characteristics. As optimisation of data access is essential for GPU performance, we developed an adaptive meshing method in which an airway model is covered by isotropic subdomains consisting of a uniform Cartesian mesh. We found that subdomains of size 4³ gave the best performance. The code was also tested on a small GPU cluster to confirm its performance and applicability, as the price and footprint are reasonable for medical applications.
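The subdomain tiling described above maps naturally onto CUDA's execution model. The sketch below is our own illustration, not the paper's code: one thread block per 4³ subdomain, one thread per cell, with a simple 7-point stencil standing in for the full D3Q19 LBM update, since the point here is the data layout that keeps each block's accesses local.

    #include <cuda_runtime.h>

    #define SUB 4   // subdomain edge length: 4*4*4 cells per block

    // One CUDA block per 4^3 subdomain, one thread per cell. A 7-point
    // Jacobi stencil stands in for the D3Q19 LBM update; the tiling is
    // the point: each block touches a compact, cache-friendly region.
    __global__ void subdomain_step(const float* in, float* out,
                                   int nx, int ny, int nz)
    {
        int x = blockIdx.x * SUB + threadIdx.x;   // global cell coordinates
        int y = blockIdx.y * SUB + threadIdx.y;
        int z = blockIdx.z * SUB + threadIdx.z;
        if (x < 1 || y < 1 || z < 1 || x >= nx - 1 || y >= ny - 1 || z >= nz - 1)
            return;                               // skip domain boundary

        int i = (z * ny + y) * nx + x;            // uniform Cartesian mesh
        out[i] = (in[i - 1] + in[i + 1] +
                  in[i - nx] + in[i + nx] +
                  in[i - nx * ny] + in[i + nx * ny]) / 6.0f;
    }

    // Launch (nx, ny, nz assumed multiples of SUB):
    //   dim3 block(SUB, SUB, SUB), grid(nx / SUB, ny / SUB, nz / SUB);
    //   subdomain_step<<<grid, block>>>(d_in, d_out, nx, ny, nz);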

2.
Optical coherence tomography (OCT) imaging involves an extremely large volume of data and computation, and traditional computing platforms based on the central processing unit (CPU) can hardly meet the demands of real-time OCT imaging. The graphics processing unit (GPU) offers powerful parallel processing and numerical computing capabilities for general-purpose computation and can break through the bottleneck of real-time OCT imaging. This paper gives a brief introduction to the GPU and reviews the applications of GPUs and the research progress in real-time OCT imaging and functional OCT imaging.
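For context, spectral-domain OCT reconstruction is dominated by one Fourier transform per A-scan, exactly the batched workload GPUs excel at. A minimal sketch using cuFFT follows; the wrapper name and sizes are illustrative assumptions of ours, while the cuFFT calls are the library's standard API.

    #include <cuda_runtime.h>
    #include <cufft.h>

    // Reconstruct one B-scan: a single batched FFT call transforms every
    // A-scan at once, which is what makes real-time frame rates feasible.
    void reconstruct_bscan(cufftComplex* d_spectra,   // device input
                           cufftComplex* d_lines,     // device output
                           int samples_per_ascan,     // e.g. 2048 pixels
                           int ascans_per_bscan)      // e.g. 1000 lines
    {
        cufftHandle plan;
        cufftPlan1d(&plan, samples_per_ascan, CUFFT_C2C, ascans_per_bscan);
        cufftExecC2C(plan, d_spectra, d_lines, CUFFT_FORWARD);
        cufftDestroy(plan);
        // Magnitudes of d_lines give depth profiles; log compression and
        // display mapping would follow in further kernels.
    }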

3.
The graphics processing unit (GPU), which originally was used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. Meanwhile, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) depend on the particular application, as the GPU architecture gives the best performance for algorithms that exhibit high data parallelism and high arithmetic intensity. Here, we evaluate the performance of the GPU on a number of common algorithms used for three-dimensional image processing. The algorithms were developed on a new software platform called "CUDA", which allows a direct translation from C code to the GPU. The implemented algorithms include spatial transformations, real-space and Fourier operations, as well as pattern recognition procedures, reconstruction algorithms and classification procedures. In our implementation, the direct porting of C code to the GPU achieves typical acceleration values on the order of 10-20 times compared to a state-of-the-art conventional processor, although they vary depending on the type of algorithm. The gained speed-up comes at no additional cost, since the software runs on the GPU of the graphics card of common workstations.

4.
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection on various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphics processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that, for random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
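The division of labour described here can be sketched in a few lines. The code below is our simplified illustration of the idea, not the actual HPMA: the CPU flags packets whose bytes hit a precomputed two-byte prefix table, and only flagged packets are shipped to the GPU, where one block per packet and one thread per pattern perform exact verification.

    #include <cuda_runtime.h>
    #include <stdint.h>

    #define MAXPAT 32   // patterns padded to a fixed stride (illustrative)

    // CPU pre-filter: flag a packet if any 2-byte window hits a table of
    // signature prefixes built offline. False positives are acceptable;
    // they only cost one GPU verification.
    bool maybe_malicious(const uint8_t* pkt, int len,
                         const uint8_t prefix_set[65536])
    {
        for (int i = 0; i + 1 < len; ++i)
            if (prefix_set[(pkt[i] << 8) | pkt[i + 1]]) return true;
        return false;
    }

    // GPU verification: one block per suspicious packet, one thread per
    // pattern, doing a naive exact substring check.
    __global__ void verify(const uint8_t* pkts, const int* off, const int* len,
                           const uint8_t* pats, const int* plen, int npat,
                           int* hit)
    {
        int p = blockIdx.x;
        int t = threadIdx.x;
        if (t >= npat) return;
        const uint8_t* pkt = pkts + off[p];
        const uint8_t* pat = pats + t * MAXPAT;
        for (int i = 0; i + plen[t] <= len[p]; ++i) {
            int j = 0;
            while (j < plen[t] && pkt[i + j] == pat[j]) ++j;
            if (j == plen[t]) { hit[p] = 1; return; }
        }
    }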

5.
The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computations. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. Thus, it is the ideal candidate to accelerate a wide variety of data-parallel tasks in many fields, such as Machine Learning (ML). As problems become increasingly demanding, parallel implementations of learning algorithms are crucial for useful applications. In particular, the implementation of Neural Networks (NNs) on GPUs can significantly reduce the long training times of the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms, and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared to the implementation on traditional hardware, due to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed which aims at automatically finding high-quality NN-based solutions for a given problem.
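To make the data-parallel nature of back-propagation concrete, here is a minimal sketch of our own (not the paper's kernels) of the output-layer step for a sigmoid network: one thread owns one weight, and the per-weight gradient for a single example is computed and applied independently.

    #include <cuda_runtime.h>

    // Output-layer back-propagation step for a sigmoid network, one thread
    // per weight: delta = (t - y) * y * (1 - y) is the squared-error
    // gradient through the sigmoid, and each weight update is independent.
    __global__ void bp_output_layer(const float* h,  // hidden activations [nh]
                                    const float* y,  // outputs            [no]
                                    const float* t,  // targets            [no]
                                    float* w,        // weights       [no * nh]
                                    float lr, int nh, int no)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= nh * no) return;
        int o = idx / nh;                  // which output neuron
        int i = idx % nh;                  // which incoming hidden unit
        float delta = (t[o] - y[o]) * y[o] * (1.0f - y[o]);
        w[idx] += lr * delta * h[i];       // gradient-descent step
    }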

6.
The graphics processing unit (GPU) is becoming a powerful computational tool in science and engineering. In this paper, in contrast to previous molecular dynamics (MD) simulations with pair potentials and generic many-body potentials, two MD simulation algorithms implemented on a single GPU are presented for a special category of many-body potentials, the bond-order potentials frequently used for covalent solids, such as the Tersoff potentials for silicon crystals. The simulation results reveal that the performance of the GPU implementations is clearly superior to that of their CPU counterparts. Furthermore, the proposed algorithms are generalised, transferable and scalable, and can be extended to simulations with general many-body interactions such as the Stillinger-Weber potential.

7.
High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task, and architectural limitations raise the question of whether investing effort in this direction is worthwhile. In this work, we use GPU programming to simulate a two-layer network of Integrate-and-Fire neurons with varying degrees of recurrent connectivity and investigate its ability to learn a simplified navigation task using a policy-gradient learning rule from Reinforcement Learning. The purpose of this paper is twofold. First, we want to support the use of GPUs in the field of Computational Neuroscience. Second, using GPU computing power, we investigate the conditions under which this architecture and learning rule demonstrate the best performance. Our work indicates that networks featuring strong Mexican-hat-shaped recurrent connections in the top layer, where decision making is governed by the formation of a stable activity bump in the neural population (a "non-democratic" mechanism), achieve mediocre learning results at best. In the absence of recurrent connections, where all neurons "vote" independently ("democratically") for a decision via population vector readout, the task is generally learned better and more robustly. Our study would have been extremely difficult to carry out on a desktop computer without GPU programming. We present the routines developed for this purpose and show that they provide a speed improvement of 5× to 42× over optimised Python code. The highest speedup is achieved when we exploit the parallelism of the GPU in the search for learning parameters. This suggests that efficient GPU programming can significantly reduce the time needed for simulating networks of spiking neurons, particularly when multiple parameter configurations are investigated.
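The one-thread-per-neuron mapping that makes such simulations fast is easy to illustrate. Below is a minimal sketch of our own, with illustrative constants, of a forward-Euler update for leaky Integrate-and-Fire neurons; the actual routines would add synapses, recurrence and learning on top of a loop of this shape.

    #include <cuda_runtime.h>

    // Forward-Euler update of leaky Integrate-and-Fire neurons, one thread
    // per neuron. Constants are illustrative (mV, ms).
    __global__ void lif_step(float* v, const float* i_syn, int* spiked,
                             int n, float dt)
    {
        const float tau = 20.0f, v_rest = -70.0f, v_th = -54.0f, v_reset = -70.0f;
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k >= n) return;
        // dv/dt = (-(v - v_rest) + i_syn) / tau
        v[k] += dt * (-(v[k] - v_rest) + i_syn[k]) / tau;
        spiked[k] = (v[k] >= v_th);
        if (spiked[k]) v[k] = v_reset;     // fire and reset
    }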

8.
An efficient somatic embryogenesis and plant regeneration system was developed from shoot apex explants of finger millet, Eleusine coracana. Eight genotypes, CO 7, CO 9, CO 13, CO 14, GPU 26, GPU 28, GPU 45, and GPU 48, were assessed in this study. The maximum somatic embryo induction, at 98.6%, was obtained from explants cultured on Murashige and Skoog medium supplemented with 18.0 μM dichlorophenoxyacetic acid and 2.3 μM kinetin. The highest number of shoots (26) was obtained after transfer of embryogenic callus to regeneration medium supplemented with 4.5 μM thidiazuron and 4.6 μM kinetin. Significant differences were observed between genotypes for somatic embryogenesis and plant regeneration. GPU 45 gave the best response, while CO 7 was the least responsive under the culture conditions tested in this study. Regenerated plants were successfully rooted and grown to maturity after hardening in soil.

9.
High-order schemes have attracted increasing attention in computational fluid dynamics (CFD) simulations. As one class of high-order schemes, weighted compact nonlinear schemes (WCNSs) have been widely applied in large eddy simulations, direct numerical simulations and so on. However, due to their computational complexity, WCNSs require high-performance platforms. In recent years, the highly parallel graphics processing unit (GPU) has been rapidly gaining maturity as a powerful engine for high-performance computing. In this paper, we present a high-order, double-precision solver for three-dimensional compressible viscous flow using multi-block structured grids on GPU clusters. The solver utilizes the high-order WCNS scheme for spatial discretization and the Jacobi iteration method for time discretization. To exploit the computational capability of both CPUs and GPUs, we present a workload-balancing model for distributing the workload among them, and we design two strategies to overlap computation with communication. The performance analyses show that the single-GPU solver achieves about 8× speedup relative to a serial computation on a CPU core. The performance results validate the workload distribution scheme. The strong and weak scaling analyses show that GPU clusters offer a significant performance advantage.
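The two ideas highlighted here, static CPU/GPU workload splitting and overlapping communication with computation, can be sketched as follows. This is our simplified illustration, not the paper's solver: the kernel and CPU smoother are assumed stand-ins, the split ratio is illustrative, and real overlap also requires the host buffer to be pinned (cudaMallocHost).

    #include <cuda_runtime.h>

    __global__ void relax_gpu(float* u, int n);   // assumed GPU smoother
    void relax_cpu(float* u, int n);              // assumed CPU smoother

    // One iteration with a static split: the GPU gets gpu_share of the
    // cells, the CPU the rest; the async copies and kernel on a stream
    // overlap with the CPU work below.
    void step(float* h_u, float* d_u, int n_cells, float gpu_share)
    {
        int n_gpu = (int)(n_cells * gpu_share);   // e.g. 0.8 on a fast GPU
        int n_cpu = n_cells - n_gpu;

        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d_u, h_u, n_gpu * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        relax_gpu<<<(n_gpu + 255) / 256, 256, 0, s>>>(d_u, n_gpu);
        cudaMemcpyAsync(h_u, d_u, n_gpu * sizeof(float),
                        cudaMemcpyDeviceToHost, s);

        relax_cpu(h_u + n_gpu, n_cpu);            // runs while the GPU works
        cudaStreamSynchronize(s);                 // join before next iteration
        cudaStreamDestroy(s);
    }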

10.
Gene co-expression networks are one type of valuable biological network. Many methods and tools have been published for constructing gene co-expression networks; however, most of them are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphics Processing Unit) architectures. In the raw-data preprocessing step, gene entropies are exploited to filter out genes with no or only small expression changes. Pearson correlation coefficients are then calculated, normalized, and subjected to False Discovery Rate control for multiple testing. Finally, module identification is performed to construct the co-expression networks. All of these calculations are implemented on a GPU, and the coefficient matrix is compressed to save space. We compared the performance of the GPU implementation with a multi-core CPU implementation using 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and R implementations, and outperforms the multi-core CPU implementation as the number of genes increases. On a test dataset containing 16,000 genes and 590 individuals, the GPU implementation achieves a speedup of more than 63× over the single-thread R implementation when 50 percent of the genes are filtered out, and about 80× when no genes are filtered out.
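The simplification that makes this GPU-friendly is worth spelling out: once each gene's expression row is z-scored, the Pearson coefficient of a pair reduces to a dot product divided by n. A minimal sketch of the pairwise kernel (ours, not the published tool) is:

    #include <cuda_runtime.h>

    // Pairwise Pearson correlation for z-scored expression rows: with each
    // gene row standardized to mean 0 and (population) std 1, r_ij is just
    // dot(z_i, z_j) / n. One thread per (i, j) pair, upper triangle only.
    __global__ void pearson_pairs(const float* z,   // [genes * n], z-scored
                                  float* r,         // [genes * genes]
                                  int genes, int n)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= genes || j >= genes || j < i) return;
        float s = 0.0f;
        for (int k = 0; k < n; ++k)
            s += z[i * n + k] * z[j * n + k];
        r[i * genes + j] = s / n;
    }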

11.
Smoothed Particle Hydrodynamics (SPH) is a mesh-free particle method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processing Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam-break flow impacting an obstacle, where good agreement with the experimental results is observed. Both the achieved speedups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability.
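As a flavour of what such an SPH code computes per particle, here is a minimal density-summation kernel (our sketch, not the paper's code) using the standard poly6 smoothing kernel; production codes replace the brute-force O(N²) loop with cell-based neighbour lists.

    #include <cuda_runtime.h>
    #include <math.h>

    // SPH density summation with the standard poly6 smoothing kernel, one
    // thread per particle. Brute-force O(N^2) for clarity only.
    __global__ void sph_density(const float4* pos, float* rho,
                                int n, float h, float mass)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        const float h2 = h * h;
        const float c = 315.0f / (64.0f * 3.14159265f * powf(h, 9.0f));
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) {
            float dx = pos[i].x - pos[j].x;
            float dy = pos[i].y - pos[j].y;
            float dz = pos[i].z - pos[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < h2) {
                float d = h2 - r2;
                sum += d * d * d;          // (h^2 - r^2)^3
            }
        }
        rho[i] = mass * c * sum;
    }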

12.
The timescales of biological processes, primarily those inherent to the molecular mechanisms of disease, are long (>μs) and involve complex interactions of systems consisting of many atoms (>10⁶). Simulating these systems requires an advanced computational approach, and as such, coarse-grained (CG) models have been developed and highly optimised for accelerator hardware, primarily graphics processing units (GPUs). In this review, I discuss the implementation of CG models for biologically relevant systems, and show how such models can be optimised and perform well on GPU-accelerated hardware. Several examples of GPU implementations of CG models for both molecular dynamics and Monte Carlo simulations on purely GPU and hybrid CPU/GPU architectures are presented. Both the hardware and algorithmic limitations of various models, which depend greatly on the application of interest, are discussed.

13.
Graphics processing units (GPUs) have been widely used to accelerate discrete optimization problems. In this paper, we introduce a novel hybrid parallel algorithm to generate a shortest addition chain for a positive integer e. The main idea of the proposed algorithm is to divide the search tree into a sequence of three subtrees: top, middle, and bottom. The top subtree is searched with a branch-and-bound depth-first strategy, the middle subtree with a branch-and-bound breadth-first strategy, and the bottom subtree with a parallel (GPU) branch-and-bound depth-first strategy. Our experimental results show that, compared to the fastest sequential algorithm for generating a shortest addition chain, we improve the generation time by about 70% using one GPU (NVIDIA GeForce GTX 770). For generating all shortest addition chains, the improvement is about 50%.
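To fix ideas, the sequential baseline being parallelised is a depth-first branch and bound of the kind sketched below (host-side code of our own; the paper's tree splitting and GPU stage are not shown). The bound comes from iterative deepening, and the prune uses the fact that each addition step can at most double the largest chain element.

    #include <vector>

    // Depth-limited DFS: extend the ascending chain c (starting from {1})
    // with sums of pairs of existing elements, within `bound` additions.
    static bool dfs(std::vector<unsigned>& c, unsigned e, int bound)
    {
        unsigned last = c.back();
        if (last == e) return true;
        int left = bound - (int)c.size() + 1;      // additions still allowed
        // Prune: each addition can at most double the largest element.
        if (left <= 0 || ((unsigned long long)last << left) < e) return false;
        for (int i = (int)c.size() - 1; i >= 0; --i)
            for (int j = i; j >= 0; --j) {
                unsigned s = c[i] + c[j];
                if (s > last && s <= e) {
                    c.push_back(s);
                    if (dfs(c, e, bound)) return true;
                    c.pop_back();
                }
            }
        return false;
    }

    // Iterative deepening on the chain length yields a shortest chain.
    int shortest_chain_length(unsigned e)
    {
        for (int bound = 0;; ++bound) {
            std::vector<unsigned> c{1};
            if (dfs(c, e, bound)) return bound;
        }
    }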

14.
Hybrid functional Petri nets are a widespread tool for representing and simulating biological models. Due to their potential to provide virtual drug-testing environments, biological simulations have a growing impact on pharmaceutical research. Continuous research advancements in biology and medicine lead to exponentially increasing simulation times, raising the demand for performance acceleration through efficient and inexpensive parallel computation solutions. Recent developments in the field of general-purpose computation on graphics processing units (GPGPU) have enabled the scientific community to port a variety of compute-intensive algorithms onto the graphics processing unit (GPU). This work presents the first scheme for mapping biological hybrid functional Petri net models, which can handle both discrete and continuous entities, onto compute unified device architecture (CUDA) enabled GPUs. GPU-accelerated simulations are observed to run up to 18 times faster than sequential implementations. Simulating cell boundary formation by Delta-Notch signaling on a CUDA-enabled GPU results in a speedup of approximately 7× for a model containing 1,600 cells.

15.
Han KyungHyun, Lee Wai-Kong, Hwang Seong Oun. Cluster Computing, 2022, 25(1): 433-450.

Recently, the National Institute of Standards and Technology (NIST) in the U.S. initiated a global-scale competition to standardize lightweight authenticated encryption with associated data (AEAD) and hash functions. Gimli is one of the Round 2 candidates, designed to be efficiently implemented across various platforms, including hardware (VLSI and FPGA), microprocessors, and microcontrollers. However, the performance of Gimli on massively parallel architectures like Graphics Processing Units (GPUs) is still unknown. A high-performance Gimli implementation on GPUs can be especially useful for Internet of Things (IoT) applications, wherein gateway devices and cloud servers need to handle a massive number of communications protected by AEAD. In this paper, we show that with careful optimization, Gimli can be efficiently implemented on desktop and embedded GPUs to achieve extremely high throughput. Our experiments show that the proposed Gimli implementation can achieve 661.44 KB/s (encryption), 892.24 KB/s (decryption), and 4344.46 KB/s (hashing) on state-of-the-art GPUs.
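The Gimli permutation itself is small and register-friendly, which is why a one-state-per-thread mapping is the natural first GPU implementation. The sketch below is ours, not the paper's optimized code: it applies the published 24-round permutation, with its column SP-box, small/big swaps and round constants, to many independent 384-bit states.

    #include <cuda_runtime.h>
    #include <stdint.h>

    __device__ __forceinline__ uint32_t rotl(uint32_t x, int b)
    {
        return (x << b) | (x >> (32 - b));
    }

    // The published 24-round Gimli permutation applied independently to
    // many 384-bit states (12 x 32-bit words), one state per thread.
    __global__ void gimli_many(uint32_t* states, int nstates)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nstates) return;
        uint32_t* s = states + 12 * t;

        for (int round = 24; round > 0; --round) {
            for (int col = 0; col < 4; ++col) {    // column SP-box
                uint32_t x = rotl(s[col], 24);
                uint32_t y = rotl(s[4 + col], 9);
                uint32_t z = s[8 + col];
                s[8 + col] = x ^ (z << 1) ^ ((y & z) << 2);
                s[4 + col] = y ^ x ^ ((x | z) << 1);
                s[col]     = z ^ y ^ ((x & y) << 3);
            }
            uint32_t tmp;
            if ((round & 3) == 0) {                // small swap
                tmp = s[0]; s[0] = s[1]; s[1] = tmp;
                tmp = s[2]; s[2] = s[3]; s[3] = tmp;
            }
            if ((round & 3) == 2) {                // big swap
                tmp = s[0]; s[0] = s[2]; s[2] = tmp;
                tmp = s[1]; s[1] = s[3]; s[3] = tmp;
            }
            if ((round & 3) == 0)                  // add round constant
                s[0] ^= 0x9e377900u ^ (uint32_t)round;
        }
    }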

16.
Neuroimage registration is crucial for brain morphometric analysis and treatment efficacy evaluation. However, existing advanced registration algorithms such as FLIRT and ANTs are not efficient enough for clinical use. In this paper, a GPU implementation of FLIRT with the correlation ratio (CR) as the similarity metric, and a GPU-accelerated correlation coefficient (CC) calculation for the symmetric diffeomorphic registration of ANTs, have been developed. Comparison with the corresponding original tools shows that our accelerated algorithms greatly outperform them in computational efficiency. This paper demonstrates the great potential of applying these registration tools in clinical practice.

17.
Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension sizes that reduce thread-block utilization. Moreover, to apply the same optimizations to various expressions, a code generation tool is needed. In this paper, we present our approach to automatically generating CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled-cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate a speedup of over 8.4× using one GPU compared to one CPU core, and over 2.6× when utilizing the entire system with a hybrid CPU+GPU solution of 2 GPUs and 5 cores (instead of 7 cores per node). We further investigate tensor contraction code on a new series of GPUs, the Fermi GPUs, and provide several effective optimization algorithms. For the same CCSD(T) computation, on a cluster with Fermi GPUs we achieve a speedup of 3.4× over a cluster with T10 GPUs. With a single Fermi GPU on each node, we achieve a speedup of 43× over the sequential CPU version.
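Index permutation, one of the challenges named above, is itself a simple data-parallel kernel once the layouts are fixed. The sketch below is ours, not the generated NWChem code: it reorders a row-major 4-D tensor A(i,j,k,l) into B(k,i,l,j) with one thread per element; a real generator would fuse such permutations with the contraction itself.

    #include <cuda_runtime.h>

    // Reorder a row-major 4-D tensor A(i,j,k,l) into B(k,i,l,j), one
    // thread per element. The target permutation is illustrative.
    __global__ void permute_ijkl_to_kilj(const double* a, double* b,
                                         int ni, int nj, int nk, int nl)
    {
        long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        long long total = (long long)ni * nj * nk * nl;
        if (idx >= total) return;
        int l = (int)(idx % nl);                       // unpack (i,j,k,l)
        int k = (int)((idx / nl) % nk);
        int j = (int)((idx / ((long long)nl * nk)) % nj);
        int i = (int)(idx / ((long long)nl * nk * nj));
        long long out = ((((long long)k * ni + i) * nl + l) * nj) + j;
        b[out] = a[idx];
    }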

18.
OLAP (On-Line Analytical Processing) is an approach to efficiently evaluating multidimensional data for business intelligence applications. OLAP contributes to business decision-making by identifying, extracting, and analyzing multidimensional data. The fundamental structure of OLAP is a data cube that enables users to interactively explore the distinct data dimensions. Processing cost depends on the complexity of the queries, the dimensionality, and the growing size of the data cube. As data volumes and the demands of business users keep increasing, ever higher processing speed is needed, since faster processing means faster decisions and more profit to industry. In this paper, we propose an Adaptive Hybrid OLAP Architecture that takes advantage of heterogeneous systems with GPUs and CPUs and leverages the different characteristics of their memory subsystems to minimize response time. Our approach (a) exploits both types of hardware rather than using the CPU only as a frontend for the GPU; (b) uses two different data formats (a multidimensional cube and a relational cube) to match the GPU and CPU memory access patterns, and adaptively diverts queries to the best resource for the problem at hand; (c) exploits the data locality of multidimensional OLAP on NUMA multicore systems through intelligent thread placement; and (d) guides its adaptation and choices with an architectural model that captures the memory access patterns and the underlying data characteristics. Results show a roughly fourfold performance increase over the best-known related approach. There is also an important economic factor: the proposed hybrid system costs only 10% more than the same system without a GPU, and for this small extra cost the added GPU almost doubles query processing speed.

19.
We discuss an implementation of molecular dynamics (MD) simulations on a graphics processing unit (GPU) in the NVIDIA CUDA language. We tested our code on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms suitable for short-ranged and long-ranged interactions, and for a congruential shift random number generator, are presented. The performance of the GPUs is compared to that of their main-processor counterparts. We achieve speedups of up to 40-, 80- and 150-fold, respectively. With the latest generation of GPUs one can run standard MD simulations at 10⁷ flops/$.
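A generator of this family costs one multiply-add per thread per number, which is what makes it attractive on GPUs. Below is a minimal per-thread sketch of our own, a close cousin of the congruential-shift generator tested in the paper, using the well-known Numerical Recipes LCG constants rather than the authors' exact parameters; each thread carries an independent stream whose state persists between kernel launches.

    #include <cuda_runtime.h>
    #include <stdint.h>

    // One independent linear congruential stream per thread (Numerical
    // Recipes constants, modulus 2^32 via unsigned overflow). Seeds must
    // differ per thread; the state persists across launches.
    __global__ void lcg_fill(uint32_t* seeds, float* out,
                             int nthreads, int per_thread)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nthreads) return;
        uint32_t s = seeds[t];
        for (int k = 0; k < per_thread; ++k) {
            s = 1664525u * s + 1013904223u;
            out[t * per_thread + k] = s * 2.3283064e-10f;  // scale by 2^-32
        }
        seeds[t] = s;
    }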
