
1.

Performance evaluation of image processing algorithms on the GPU

Castaño-Díez D Moser D Schoenegger A Pruggnaller S Frangakis AS 《Journal of structural biology》2008,164(1):153-160

The graphics processing unit (GPU), which originally was used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. Meanwhile, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) depend on the particular application, as the GPU architecture gives the best performance for algorithms that exhibit high data parallelism and high arithmetic intensity. Here, we evaluate the performance of the GPU on a number of common algorithms used for three-dimensional image processing. The algorithms were developed on a new software platform called "CUDA", which allows a direct translation from C code to the GPU. The implemented algorithms include spatial transformations, real-space and Fourier operations, as well as pattern recognition procedures, reconstruction algorithms and classification procedures. In our implementation, the direct porting of C code to the GPU achieves typical acceleration values on the order of 10-20 times compared to a state-of-the-art conventional processor, but they vary depending on the type of algorithm. The gained speed-up comes at no additional cost, since the software runs on the GPU of the graphics card of common workstations.

2.

Cluster optimization algorithm based on CPU and GPU hybrid architecture

Yin Fei Shi Feng 《Cluster computing》2022,25(4):2601-2611

With the rapid development of network technology and parallel computing, clusters formed by connecting large numbers of PCs over high-speed networks have gradually displaced supercomputers in scientific research, production, and high-performance computing because of their cost-effectiveness. This paper aims to integrate the Kriging proxy model method and an energy efficiency modeling method into a cluster optimization algorithm for a CPU and GPU hybrid architecture. It proposes a parallel computing model for large-scale CPU/GPU heterogeneous high-performance computing systems, which can effectively describe the computing capabilities and various communication behaviors of such systems and ultimately supports algorithm optimization for CPU/GPU heterogeneous clusters. Based on the GPU architecture, an efficient method of constructing a Kriging proxy model and an optimized search algorithm are designed. The experimental results show that construction of the Kriging proxy model achieves a 220-fold speedup and the search algorithm achieves an 8-fold speedup, indicating that this heterogeneous cluster optimization algorithm is highly feasible.


3.

Parallel high-dimensional multi-objective feature selection for EEG classification with dynamic workload balancing on CPU–GPU architectures

Juan José Escobar Julio Ortega Jesús González Miguel Damas Antonio F. Díaz 《Cluster computing》2017,20(3):1881-1897

Many bioinformatics applications that analyse large volumes of high-dimensional data comprise complex problems requiring metaheuristic approaches with different types of implicit parallelism. For example, although functional parallelism would be used to accelerate evolutionary algorithms, the fitness evaluation of the population could imply the computation of cost functions with data parallelism. Heterogeneous parallel architectures, including central processing unit (CPU) microprocessors with multiple superscalar cores and accelerators such as graphics processing units (GPUs), could therefore be very useful. This paper aims to take advantage of such CPU–GPU heterogeneous architectures to accelerate electroencephalogram classification and feature selection problems by evolutionary multi-objective optimization, in the context of brain-computer interface tasks. We have used the OpenCL framework to develop parallel master-worker codes implementing an evolutionary multi-objective feature selection procedure in which the individuals of the population are dynamically distributed among the available CPU and GPU cores.

4.

Implementation and performance evaluation of reconstruction algorithms on graphics processors

Castaño Díez D Mueller H Frangakis AS 《Journal of structural biology》2007,157(1):288-295

The high-throughput needs in electron tomography and in single particle analysis have driven the parallel implementation of several reconstruction algorithms and software packages on computing clusters. Here, we report on the implementation of popular reconstruction algorithms such as weighted backprojection, simultaneous iterative reconstruction technique (SIRT) and simultaneous algebraic reconstruction technique (SART) on common graphics processors (GPUs). The speed gain achieved on the GPUs is on the order of sixty (60x) to eighty (80x) times, compared to the performance of a single central processing unit (CPU), which is comparable to the acceleration achieved on a medium-range computing cluster. This acceleration of the reconstruction is caused by the highly specialized architecture of the GPU. Further, we show that the quality of the reconstruction on the GPU is comparable to that on the CPU. We present detailed flow-chart diagrams of the implementation. The reconstruction software does not require special hardware apart from commercially available graphics cards and could be easily integrated in software packages like SPIDER, XMIPP, TOM-Package and others.
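To make the SIRT step named in this abstract concrete, here is a minimal CPU sketch in plain Python. It is not the authors' GPU implementation: the dense list-of-lists projection matrix `A`, the function name `sirt`, and the tiny 2x2 test image are all assumptions chosen for illustration. Each iteration forward-projects the current estimate, normalizes the residual by the row sums, backprojects it, and normalizes by the column sums.

```python
def sirt(A, b, iterations=100):
    """Simultaneous iterative reconstruction for A x = b (dense, illustrative)."""
    n_rows, n_cols = len(A), len(A[0])
    # Row and column sums act as the usual SIRT normalization weights.
    row_sum = [sum(abs(v) for v in row) for row in A]
    col_sum = [sum(abs(A[i][j]) for i in range(n_rows)) for j in range(n_cols)]
    x = [0.0] * n_cols
    for _ in range(iterations):
        # Forward projection and row-normalized residual, using the old estimate.
        residual = []
        for i in range(n_rows):
            projected = sum(A[i][j] * x[j] for j in range(n_cols))
            residual.append((b[i] - projected) / row_sum[i] if row_sum[i] else 0.0)
        # Backprojection: every pixel is updated from the same residual,
        # which is why the pixel loop parallelizes well.
        for j in range(n_cols):
            if col_sum[j]:
                backprojected = sum(A[i][j] * residual[i] for i in range(n_rows))
                x[j] += backprojected / col_sum[j]
    return x


# Hypothetical example: a 2x2 image flattened row by row, measured by
# two horizontal and two vertical ray sums.
A = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]
b = [3.0, 7.0, 4.0, 6.0]
image = sirt(A, b)
```

Both inner loops are independent across rays and across pixels, which is the data parallelism a GPU port would exploit.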

5.

FastGCN: A GPU Accelerated Tool for Fast Gene Co-Expression Networks

Meimei Liang Futao Zhang Gulei Jin Jun Zhu 《PloS one》2015,10(1)

Gene co-expression networks comprise one type of valuable biological networks. Many methods and tools have been published to construct gene co-expression networks; however, most of these tools and methods are inconvenient and time consuming for large datasets. We have developed a user-friendly, accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphics Processing Unit) architectures. Genetic entropies were exploited to filter out genes with no or small expression changes in the raw data preprocessing step. Pearson correlation coefficients were then calculated. After that, we normalized these coefficients and employed the False Discovery Rate to control for multiple testing. Finally, module identification was conducted to construct the co-expression networks. All of these calculations were implemented on a GPU. We also compressed the coefficient matrix to save space. We compared the performance of the GPU implementation with those of a multi-core CPU implementation with 16 CPU threads, a single-thread C/C++ implementation and a single-thread R implementation. Our results show that the GPU implementation largely outperforms the single-thread C/C++ and single-thread R implementations, and outperforms the multi-core CPU implementation as the number of genes increases. With the test dataset containing 16,000 genes and 590 individuals, the GPU implementation ran more than 63 times faster than the single-thread R implementation when 50 percent of genes were filtered out and about 80 times faster when no genes were filtered out.
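The two statistical steps named in this abstract, pairwise Pearson correlation and False Discovery Rate control, can be sketched in plain Python as follows. This is an illustration only, not FastGCN's code: the function names and the dictionary input layout are hypothetical, and the Benjamini-Hochberg procedure is assumed as the FDR method because the abstract does not say which one the tool uses.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (constant genes would be
    removed by a filtering step beforehand).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_pairs(expression):
    """Correlate every gene pair; expression maps gene name -> list of values."""
    genes = sorted(expression)
    return {
        (g, h): pearson(expression[g], expression[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which tests are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= alpha * k / m; all smaller ranks are rejected.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected
```

Each gene pair is computed independently of every other pair, so the correlation step maps naturally onto one GPU thread (or thread block) per pair.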

6.

Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA

Dariusz Mrozek Miłosz Brożek Bożena Małysiak-Mrozek 《Journal of molecular modeling》2014,20(2):1-17

Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560Ti: 384 cores, 2GB RAM) implementation of CASSERT (“GPU-CASSERT”) parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at: http://zti.polsl.pl/dmrozek/science/gpucassert/cassert.htm.

7.

Accelerating Neuroimage Registration through Parallel Computation of Similarity Metric

Yun-gang Luo Ping Liu Lin Shi Yishan Luo Lei Yi Ang Li Jing Qin Pheng-Ann Heng Defeng Wang 《PloS one》2015,10(9)

Neuroimage registration is crucial for brain morphometric analysis and treatment efficacy evaluation. However, existing advanced registration algorithms such as FLIRT and ANTs are not efficient enough for clinical use. In this paper, a GPU implementation of FLIRT with the correlation ratio (CR) as the similarity metric and a GPU-accelerated correlation coefficient (CC) calculation for the symmetric diffeomorphic registration of ANTs have been developed. The comparison with the corresponding original tools shows that our accelerated algorithms greatly outperform the original algorithms in terms of computational efficiency. This paper demonstrates the great potential of applying these registration tools in clinical applications.

8.

High performance hybrid functional Petri net simulations of biological pathway models on CUDA

Chalkidis G Nagasaki M Miyano S 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2011,8(6):1545-1556

Hybrid functional Petri nets are a wide-spread tool for representing and simulating biological models. Due to their potential of providing virtual drug testing environments, biological simulations have a growing impact on pharmaceutical research. Continuous research advancements in biology and medicine lead to exponentially increasing simulation times, thus raising the demand for performance accelerations by efficient and inexpensive parallel computation solutions. Recent developments in the field of general-purpose computation on graphics processing units (GPGPU) enabled the scientific community to port a variety of compute intensive algorithms onto the graphics processing unit (GPU). This work presents the first scheme for mapping biological hybrid functional Petri net models, which can handle both discrete and continuous entities, onto compute unified device architecture (CUDA) enabled GPUs. GPU accelerated simulations are observed to run up to 18 times faster than sequential implementations. Simulating the cell boundary formation by Delta-Notch signaling on a CUDA enabled GPU results in a speedup of approximately 7x for a model containing 1,600 cells.

9.

GPU-accelerated molecular dynamics simulation of solid covalent crystals

Chaofeng Hou Wei Ge 《Molecular simulation》2013,39(1):8-15

The graphics processing unit (GPU) is becoming a powerful computational tool in science and engineering. In this paper, in contrast to previous molecular dynamics (MD) simulations with pair potentials and many-body potentials, two MD simulation algorithms implemented on a single GPU are presented to describe a special category of many-body potentials: bond order potentials used frequently in solid covalent materials, such as the Tersoff potentials for silicon crystals. The simulation results reveal that the performance of the GPU implementations is clearly superior to that of their CPU counterpart. Furthermore, the proposed algorithms are generalised, transferable and scalable, and can be extended to simulations with general many-body interactions such as the Stillinger–Weber potential.

10.

SHEsis PCA: A GPU-Based Software to Correct for Population Stratification that Efficiently Accelerates the Process for Handling Genome-Wide Datasets

《遗传学报》2015,(8)

Population stratification is a problem in genetic association studies because it is likely to highlight loci that underlie the population structure rather than disease-related loci. At present, principal component analysis (PCA) has been proven to be an effective way to correct for population stratification. However, the conventional PCA algorithm is time-consuming when dealing with large datasets. We developed a graphics processing unit (GPU)-based PCA software named SHEsis PCA (http://analysis.bio-x.cn/SHEsis Main.htm) that is highly parallel, with a maximum speedup greater than 100 compared with its CPU version. A cluster algorithm based on X-means was also implemented as a way to detect population subgroups and to obtain matched cases and controls in order to reduce the genomic inflation and increase the power. A study of both simulated and real datasets showed that SHEsis PCA ran at an extremely high speed while the accuracy was hardly reduced. Therefore, SHEsis PCA can help correct for population stratification much more efficiently than the conventional CPU-based algorithms.

11.

An Adaptive Hybrid OLAP Architecture with optimized memory access patterns

Lubomir Riha Maria Malik Tarek El-Ghazawi 《Cluster computing》2013,16(4):663-677

OLAP (On-Line Analytical Processing) is an approach to efficiently evaluate multidimensional data for business intelligence applications. OLAP contributes to business decision-making by identifying, extracting, and analyzing multidimensional data. The fundamental structure of OLAP is a data cube that enables users to interactively explore the distinct data dimensions. Processing depends on the complexity of queries, dimensionality, and growing size of the data cube. As data volumes and the demands of business users keep increasing, higher processing speed than ever is needed, as faster processing means faster decisions and more profit to industry. In this paper, we propose an Adaptive Hybrid OLAP Architecture that takes advantage of heterogeneous systems with GPUs and CPUs and leverages the different characteristics of their memory subsystems to minimize response time. Thus, our approach (a) exploits both types of hardware rather than using the CPU only as a frontend for the GPU; (b) uses two different data formats (multidimensional cube and relational cube) to match the GPU and CPU memory access patterns and diverts queries adaptively to the best resource for solving the problem at hand; (c) exploits data locality of multidimensional OLAP on NUMA multicore systems through intelligent thread placement; and (d) guides its adaptation and choices by an architectural model that captures the memory access patterns and the underlying data characteristics. Results show a roughly fourfold increase in performance over the best known related approach. There is also an important economic factor: the proposed hybrid system costs only 10% more than the same system without a GPU. With this small extra cost, the added GPU speeds up query processing by almost 2 times.

12.

Advancing simulations of biological materials: applications of coarse-grained models on graphics processing unit hardware

David N. LeBard 《Molecular simulation》2014,40(10-11):802-820

The timescales of biological processes, primarily those inherent to the molecular mechanisms of disease, are long (>μs) and involve complex interactions of systems consisting of many atoms (>10⁶). Simulating these systems requires an advanced computational approach, and as such, coarse-grained (CG) models have been developed and highly optimised for accelerator hardware, primarily graphics processing units (GPUs). In this review, I discuss the implementation of CG models for biologically relevant systems, and show how such models can be optimised and perform well on GPU-accelerated hardware. Several examples of GPU implementations of CG models for both molecular dynamics and Monte Carlo simulations on purely GPU and hybrid CPU/GPU architectures are presented. Both the hardware and algorithmic limitations of various models, which depend greatly on the application of interest, are discussed.

13.

GPUs, a new tool of acceleration in CFD: efficiency and reliability on smoothed particle hydrodynamics methods

Crespo AC Dominguez JM Barreiro A Gómez-Gesteira M Rogers BD 《PloS one》2011,6(6):e20685

Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processor Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over using a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam break flow impacting on an obstacle where good agreement with the experimental results is observed. Both the achieved speed-ups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability.

14.

Fast parallel Markov clustering in bioinformatics using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format

Bustamam A Burrage K Hamilton NA 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2012,9(3):679-692

Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However, with the vast and increasing amount of data on biological networks, performance and scalability issues are becoming a critical limiting factor in applications. Meanwhile, GPU computing, which uses the CUDA tool to implement a massively parallel computing environment on the GPU card, is becoming a very powerful, efficient, and low-cost option to achieve substantial performance gains over CPU approaches. The use of on-chip memory on the GPU efficiently lowers latency, thus circumventing a major issue in other parallel computing environments, such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized the ELLPACK-R sparse format to allow effective, fine-grained, massively parallel processing that copes with the sparse nature of interaction network data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on a CPU. Thus, large-scale parallel computation on off-the-shelf desktop machines, which was previously only possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.
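For readers unfamiliar with the ELLPACK-R layout mentioned in this abstract, the sketch below builds it from a small dense matrix and uses it for a sparse matrix-vector product. It is a plain-Python illustration under stated assumptions, not CUDA-MCL code: the function names are hypothetical, the arrays are stored row by row for readability (a GPU implementation would typically store them column-major for coalesced access), and matrix-vector multiplication stands in for the sparse matrix-matrix products that MCL actually needs.

```python
def to_ellpack_r(dense):
    """Convert a dense matrix (list of rows) into ELLPACK-R arrays.

    Returns (values, cols, row_len): values and cols are padded to the
    length of the longest row; row_len records how many entries of each
    row are real, which is the extra array that distinguishes ELLPACK-R
    from plain ELLPACK.
    """
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    row_len = [len(r) for r in rows]
    width = max(row_len, default=0)
    values = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    return values, cols, row_len


def spmv(values, cols, row_len, x):
    """Multiply an ELLPACK-R matrix by the vector x."""
    y = []
    for i in range(len(row_len)):      # on a GPU: one thread per row
        acc = 0.0
        for k in range(row_len[i]):    # row_len lets each row stop at its own end
            acc += values[i][k] * x[cols[i][k]]
        y.append(acc)
    return y


# Hypothetical 3x3 example.
values, cols, row_len = to_ellpack_r([[1, 0, 2],
                                      [0, 0, 3],
                                      [4, 5, 6]])
result = spmv(values, cols, row_len, [1, 1, 1])
```

Because `row_len` tells each row where its real entries end, no work is spent multiplying padding zeros, and rows of uneven length do not need a conditional test inside the inner loop.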

15.

High-performance data mining with intelligent SSD

Yong-Yeon Jo Sang-Wook Kim Sung-Woo Cho Duck-Ho Bae Hyunok Oh 《Cluster computing》2017,20(2):1155-1166

An intuitive way to process big data efficiently is to reduce the volume of data transferred over the storage interface to a host system. This is why the notion of an intelligent SSD (iSSD), which adds processing power to the SSD, was proposed. There is a rich literature on iSSD; however, a real implementation has not yet been made available to the public. Most prior work aims to quantify the benefits of iSSD with analytical modeling. In this paper, we first develop an iSSD simulator and use it to present the potential of iSSD in data mining. Our iSSD simulator runs on top of the gem5 simulator and fully simulates all the processes of data mining algorithms running in iSSD with cycle-level accuracy. Then, we further address how to exploit all the computing resources for efficient processing of data mining algorithms. These days, CPU, GPU, and SSD are found together in most computing environments. If the SSD is later replaced with an iSSD, we have a new computing environment where the three computing resources collaborate with one another to process big data effectively. For this, scheduling is required to decide which computing resource runs which function at which time. Our heterogeneous scheduling considers the types of computing resources, the memory sizes of the computing resources, and inter-processor communication times, including I/O time in the SSD. Our scheduling results show that processing in the collaborative environment outperforms that in the traditional one by up to about 10 times.

16.

Double enhanced residual network for biological image denoising

《Gene expression patterns : GEP》2022

With the achievements of deep learning, applications of deep convolutional neural networks for the image denoising problem have been widely studied. However, these methods are typically limited by the GPU in terms of network layers and other aspects. This paper proposes a multi-level network that can efficiently utilize GPU memory, named Double Enhanced Residual Network (DERNet), for biological-image denoising. The network consists of two sub-networks whose basic structure is inspired by U-Net. For each sub-network, an encoder-decoder hierarchical structure is used for down-scaling and up-scaling feature maps so that large receptive fields can be obtained on the GPU. In the encoder, convolution layers are used for down-sampling to obtain image information, and residual blocks are stacked for preliminary feature extraction. In the decoder, transposed convolution layers perform up-sampling and are combined with the Residual Dense Instance Normalization (RDIN) block that we propose to extract deep features and restore image details. Finally, both qualitative experiments and visual effects demonstrate the effectiveness of our proposed algorithm.

17.

An artificial hysteresis binary neuron: a model suppressing the oscillatory behaviors of neural dynamics

Y. Takefuji K. C. Lee 《Biological cybernetics》1991,64(5):353-356

A hysteresis binary McCulloch-Pitts neuron model is proposed in order to suppress the complicated oscillatory behaviors of neural dynamics. The artificial hysteresis binary neural network is used for scheduling time-multiplex crossbar switches in order to demonstrate the effects of hysteresis. Time-multiplex crossbar switching systems must control traffic on demand such that packet blocking probability and packet waiting time are minimized. The system using n×n processing elements solves an n×n crossbar-control problem in O(1) time, while the best existing parallel algorithm requires O(n) time. The hysteresis binary neural network maximizes the throughput of packets through a crossbar switch. The solution quality of our system does not degrade with the problem size.
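A short sketch may help show how hysteresis suppresses oscillation in a binary neuron. The abstract does not give the update rule, so the form below is an assumption: the output switches to 1 only above an upper trip point, to 0 only below a lower trip point, and otherwise keeps its previous value. The function names and the default thresholds of 1.0 and -1.0 are hypothetical.

```python
def hysteresis_neuron(u, previous_output, upper=1.0, lower=-1.0):
    """Binary neuron with a dead band between the two trip points."""
    if u > upper:
        return 1
    if u < lower:
        return 0
    return previous_output  # inside the band: hold the previous state


def plain_neuron(u):
    """Ordinary McCulloch-Pitts threshold unit, for comparison."""
    return 1 if u > 0 else 0


def run(inputs, upper=1.0, lower=-1.0, initial=0):
    """Feed a sequence of input potentials through one hysteresis neuron."""
    out = initial
    trace = []
    for u in inputs:
        out = hysteresis_neuron(u, out, upper, lower)
        trace.append(out)
    return trace


# Hypothetical input potential that wobbles around zero.
potentials = [0.5, 1.5, 0.5, -0.5, -1.5, 0.5]
with_hysteresis = run(potentials)
without_hysteresis = [plain_neuron(u) for u in potentials]
```

On this input the plain unit changes state whenever the potential crosses zero, while the hysteresis unit changes only when the potential leaves the band, so small fluctuations no longer flip the output.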

18.

Improved calculations of compactness and a reevaluation of continuous compact units

Micheal H. Zehfus 《Proteins》1993,16(3):293-300

A new method for calculating compactness (Z) that uses look-up table-based algorithms for area and volume computations is introduced. These algorithms can be used in any iterative area or volume calculation, are more than 1000 times faster than the original algorithms, and have equal or better precision. With the faster algorithms it is now possible to calculate the compactness of all continuous units in a protein, and to precisely locate the optimal compact units without the screening functions and limited resolution used previously. These methods have been incorporated into a fully automatic domain finding algorithm, and this method has been applied to the 21 proteins originally analyzed as well as 12 additional proteins. This method is robust, and yields similar units even when applied to coordinates of protein crystals grown under different experimental conditions. © 1993 Wiley-Liss, Inc.

19.

Using GPUs for the exact alignment of short-read genetic sequences by means of the Burrows-Wheeler transform

Salavert Torres J Blanquer Espert I Domínguez AT Hernández García V Medina Castelló I Tárraga Giménez J Dopazo Blázquez J 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2012,9(4):1245-1256

General Purpose Graphic Processing Units (GPGPUs) constitute an inexpensive resource for computing-intensive applications that could exploit an intrinsic fine-grain parallelism. This paper presents the design and implementation on GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler Transform. We compare this implementation with state-of-the-art implementations of the same algorithm on standard CPUs under the same I/O conditions. Excluding disk transfers, the GPU implementation shows a speedup larger than 12 compared to CPU execution. This implementation exploits parallelism by concurrently searching different sequences on the same reference search tree, maximizing memory locality and ensuring symmetric access to the data. The paper describes the behavior of the algorithm on the GPU, showing good scalability in performance, limited only by the size of the GPU's internal memory.
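Exact matching over a Burrows-Wheeler Transform is usually done with backward search, sketched below in plain Python. This is a textbook illustration rather than the paper's GPU code: the function names are hypothetical, the suffix array is built by naive sorting (fine only for tiny inputs), and the text is assumed not to contain the `$` terminator, which must sort before every other character.

```python
def bwt_index(text):
    """Build a small BWT index: suffix array, C table, and occurrence counts."""
    text += "$"  # terminator, assumed absent from the input
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to "$"
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[c]: number of characters in the text that sort before c.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def exact_match(pattern, index):
    """Return the sorted start positions of pattern in the indexed text."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


# Hypothetical reference and reads.
index = bwt_index("ACGTACGA")
hits = exact_match("ACG", index)
```

Every read is searched against the same read-only index, so many reads can be processed concurrently, which matches the abstract's description of searching different sequences on one shared reference structure.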

20.

Multi-dimensional, mesoscopic Monte Carlo simulations of inhomogeneous reaction-drift-diffusion systems on graphics-processing units

Vigelius M Meyer B 《PloS one》2012,7(4):e33384

For many biological applications, a macroscopic (deterministic) treatment of reaction-drift-diffusion systems is insufficient. Instead, one has to properly handle the stochastic nature of the problem and generate true sample paths of the underlying probability distribution. Unfortunately, stochastic algorithms are computationally expensive and, in most cases, the large number of participating particles renders the relevant parameter regimes inaccessible. In an attempt to address this problem we present a genuinely stochastic, multi-dimensional algorithm that solves the inhomogeneous, non-linear, drift-diffusion problem on a mesoscopic level. Our method improves on existing implementations in being multi-dimensional and handling inhomogeneous drift and diffusion. The algorithm is well suited for an implementation on data-parallel hardware architectures such as general-purpose graphics processing units (GPUs). We integrate the method into an operator-splitting approach that decouples chemical reactions from the spatial evolution. We demonstrate the validity and applicability of our algorithm with a comprehensive suite of standard test problems that also serve to quantify the numerical accuracy of the method. We provide a freely available, fully functional GPU implementation. Integration into Inchman, a user-friendly web service that allows researchers to perform parallel simulations of reaction-drift-diffusion systems on GPU clusters, is underway.