Similar literature (20 results)
1.

Background

We present a performance per watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal pairwise alignment of huge DNA sequences in multi-GPU platforms using the exact Smith-Waterman method.

Results

Our study includes acceleration factors, performance, scalability, power efficiency and energy costs. We also quantify the influence of the contents of the compared sequences, identify potential scenarios for energy savings on speculative executions, and calculate performance and energy usage differences among distinct GPU generations and models. For a sequence alignment on chromosome-wide scale (around 2 Petacells), we are able to reduce execution times from 9.5 h on a Kepler GPU to just 2.5 h on a Pascal counterpart, with energy costs cut by 60%.

Conclusions

We find GPUs to be an order of magnitude ahead of Xeon Phis in performance per watt. Finally, compared with typical low-power devices such as FPGAs, GPUs maintained similar GFLOPS/W ratios in 2017 while executing five times faster.

2.
ABSTRACT: BACKGROUND: Aligning short DNA reads to a reference sequence alignment is a prerequisite for detecting their biological origin and analyzing them in a phylogenetic context. With the PaPaRa tool we introduced a dedicated dynamic programming algorithm for simultaneously aligning short reads to reference alignments and corresponding evolutionary reference trees. The algorithm aligns short reads to phylogenetic profiles that correspond to the branches of such a reference tree. The algorithm needs to perform an immense number of pairwise alignments. Therefore, we explore vector intrinsics and GPUs to accelerate the PaPaRa alignment kernel. RESULTS: We optimized and parallelized PaPaRa on CPUs and GPUs. Via SSE 4.1 SIMD (Single Instruction, Multiple Data) intrinsics for x86 SIMD architectures and multi-threading, we obtained a 9-fold acceleration on a single core as well as linear speedups with respect to the number of cores. The peak CPU performance amounts to 18.1 GCUPS (Giga Cell Updates per Second) using all four physical cores on an Intel i7 2600 CPU running at 3.4 GHz. The average CPU performance (averaged over all test runs) is 12.33 GCUPS. We also used OpenCL to execute PaPaRa on a GPU SIMT (Single Instruction, Multiple Threads) architecture. An NVIDIA GeForce 560 GPU delivered peak and average performance of 22.1 and 18.4 GCUPS, respectively. Finally, we combined the SIMD and SIMT implementations into a hybrid CPU-GPU system that achieved an accumulated peak performance of 33.8 GCUPS. CONCLUSIONS: This accelerated version of PaPaRa (available at www.exelixis-lab.org/software.html) provides a significant performance improvement that allows for analyzing larger datasets in less time. We observe that state-of-the-art SIMD and SIMT architectures deliver comparable performance for this dynamic programming kernel when the "competing programmer approach" is deployed. Finally, we show that overall performance can be substantially increased by designing a hybrid CPU-GPU system with appropriate load distribution mechanisms.
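As a rough illustration of the kind of pairwise-alignment workload discussed above (not PaPaRa's actual SSE or OpenCL kernel), the CUDA sketch below shows the simplest GPU mapping — inter-task parallelism, with one thread scoring one read against one reference slice using a linear-gap, space-saving dynamic-programming row. All names, the scoring scheme, and the MAX_REF limit are assumptions.

```cuda
// Minimal sketch: one thread scores one short read against one reference slice
// using a linear-gap local-alignment recurrence kept in a single DP row.
// MAX_REF, the match/mismatch/gap scores, and all names are illustrative.
#include <cuda_runtime.h>

#define MAX_REF 256   // longest reference slice handled per thread (assumption)

__global__ void batch_align(const char* reads, const int* read_len,
                            const char* refs,  const int* ref_len,
                            int n_pairs, int read_stride, int ref_stride,
                            int* best_score)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_pairs) return;

    const char* read = reads + p * read_stride;
    const char* ref  = refs  + p * ref_stride;
    int m = read_len[p], n = ref_len[p];
    if (n > MAX_REF) n = MAX_REF;          // clamp to the per-thread buffer (assumption)

    int prev[MAX_REF + 1];                 // previous DP row, kept in local memory
    for (int j = 0; j <= n; ++j) prev[j] = 0;

    int best = 0;
    for (int i = 1; i <= m; ++i) {
        int diag = prev[0];                // H(i-1, j-1)
        prev[0] = 0;                       // H(i, 0)
        for (int j = 1; j <= n; ++j) {
            int up = prev[j];              // H(i-1, j)
            int h  = diag + (read[i - 1] == ref[j - 1] ? 2 : -1);
            h = max(h, up - 2);            // gap in the read
            h = max(h, prev[j - 1] - 2);   // gap in the reference
            h = max(h, 0);                 // local-alignment floor
            diag = up;
            prev[j] = h;
            best = max(best, h);
        }
    }
    best_score[p] = best;
}
```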

3.
Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general-purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560 Ti: 384 cores, 2 GB RAM) implementation of CASSERT (“GPU-CASSERT”) parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40 GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at: http://zti.polsl.pl/dmrozek/science/gpucassert/cassert.htm.

4.
Han KyungHyun, Lee Wai-Kong, Hwang Seong Oun. Cluster Computing, 2022, 25(1): 433-450.

Recently, the National Institute of Standards and Technology (NIST) in the U.S. initiated a global-scale competition to standardize lightweight authenticated encryption with associated data (AEAD) and hash functions. Gimli is one of the Round 2 candidates that is designed to be efficiently implemented across various platforms, including hardware (VLSI and FPGA), microprocessors, and microcontrollers. However, the performance of Gimli on massively parallel architectures like Graphics Processing Units (GPUs) is still unknown. A high-performance Gimli implementation on GPUs can be especially useful to Internet of Things (IoT) applications, wherein the gateway devices and cloud servers need to handle a massive number of communications protected by AEAD. In this paper, we show that with careful optimization, Gimli can be efficiently implemented on desktop and embedded GPUs to achieve extremely high throughput. Our experiments show that the proposed Gimli implementation can achieve 661.44 KB/s (encryption), 892.24 KB/s (decryption), and 4344.46 KB/s (hashing) on state-of-the-art GPUs.


5.
The high-throughput needs in electron tomography and in single particle analysis have driven the parallel implementation of several reconstruction algorithms and software packages on computing clusters. Here, we report on the implementation of popular reconstruction algorithms such as weighted backprojection, the simultaneous iterative reconstruction technique (SIRT) and the simultaneous algebraic reconstruction technique (SART) on common graphics processors (GPUs). The speed gain achieved on the GPUs is on the order of sixty (60×) to eighty (80×) times compared to the performance of a single central processing unit (CPU), which is comparable to the acceleration achieved on a medium-range computing cluster. This acceleration of the reconstruction is caused by the highly specialized architecture of the GPU. Further, we show that the quality of the reconstruction on the GPU is comparable to that on the CPU. We present detailed flow-chart diagrams of the implementation. The reconstruction software does not require special hardware apart from commercially available graphics cards and could easily be integrated in software packages like SPIDER, XMIPP, TOM-Package and others.
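The data-parallel core of GPU backprojection can be sketched as follows (a minimal illustration, not the paper's implementation): each thread owns one voxel of a slice and accumulates linearly interpolated projection values over all tilt angles. The parallel-beam geometry, array layout, and names are assumptions, and the weighting filter is assumed to have been applied to the projections beforehand.

```cuda
// One thread per (x, z) voxel of a 2D slice; parallel-beam backprojection
// over n_angles projections. Geometry, layout, and names are illustrative.
#include <cuda_runtime.h>
#include <math.h>

__global__ void backproject_slice(float* slice, int nx, int nz,
                                  const float* proj,   // [n_angles][n_det], pre-weighted
                                  int n_det, const float* angles, int n_angles)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int z = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || z >= nz) return;

    float cx = x - 0.5f * nx;              // coordinates relative to the rotation axis
    float cz = z - 0.5f * nz;
    float acc = 0.0f;

    for (int a = 0; a < n_angles; ++a) {
        float t  = cx * cosf(angles[a]) + cz * sinf(angles[a]) + 0.5f * n_det;
        int   t0 = (int)floorf(t);
        float w  = t - t0;
        if (t0 >= 0 && t0 + 1 < n_det)     // linear interpolation between detector bins
            acc += (1.0f - w) * proj[a * n_det + t0] + w * proj[a * n_det + t0 + 1];
    }
    slice[z * nx + x] = acc;
}
```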

6.
Pairwise optimal alignments between three or more sequences are not necessarily consistent as a whole, but consistent and inconsistent residues are usually distributed in clusters. An efficient method has been developed for locating consistent regions when each pairwise alignment is given in the form of a “skeletal representation” (Bull. math. Biol. 52, 359–373). This method is further extended so that the combination of pairwise alignments that gives the greatest consistency is found when possibly many alignments are equally optimal for each pairwise comparison. A method for acceleration of simultaneous multiple sequence alignment is proposed in which consistent regions serve as “anchor points” limiting application of direct multi-way alignment to the rest of “inconsistent” regions. Dedicated to Prof. Akiyoshi Wada on the occasion of his 60th birthday.

7.
General Purpose Graphic Processing Units (GPGPUs) constitute an inexpensive resource for computing-intensive applications that can exploit an intrinsic fine-grain parallelism. This paper presents the design and implementation in GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler Transform. We compare this algorithm with state-of-the-art implementations of the same algorithm over standard CPUs, considering the same conditions in terms of I/O. Excluding disk transfers, the implementation of the algorithm on GPUs shows a speedup larger than 12× when compared to CPU execution. This implementation exploits the parallelism by concurrently searching different sequences on the same reference search tree, maximizing memory locality and ensuring symmetric access to the data. The paper describes the behavior of the algorithm on the GPU, showing good performance scalability, limited only by the size of the GPU's internal memory.
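The core of BWT-based exact matching is FM-index backward search, which a GPU parallelizes naturally by giving each thread its own query while all threads share the same reference index. A minimal CUDA sketch of that pattern follows; the dense occurrence-table layout and all names are assumptions rather than the paper's data structures.

```cuda
// Exact matching via FM-index backward search: one thread per query.
// C[c]  = number of reference symbols lexicographically smaller than c.
// occ[] = dense occurrence table, occ[k * 4 + c] = #occurrences of c in BWT[0..k).
// Alphabet is {0,1,2,3} = {A,C,G,T}; layouts and names are illustrative.
#include <cuda_runtime.h>

__global__ void fm_backward_search(const unsigned char* queries, int qlen, int n_queries,
                                   const unsigned int* occ, const unsigned int* C,
                                   unsigned int bwt_len,
                                   unsigned int* lo_out, unsigned int* hi_out)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= n_queries) return;

    const unsigned char* pat = queries + q * qlen;
    unsigned int lo = 0, hi = bwt_len;      // current suffix-array interval [lo, hi)

    for (int i = qlen - 1; i >= 0 && lo < hi; --i) {
        unsigned char c = pat[i];
        lo = C[c] + occ[lo * 4 + c];
        hi = C[c] + occ[hi * 4 + c];
    }
    lo_out[q] = lo;                         // hi - lo = number of exact occurrences
    hi_out[q] = hi;
}
```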

8.
Recent developments in modern computational accelerators like Graphics Processing Units (GPUs) and coprocessors provide great opportunities for making scientific applications run faster than ever before. However, efficient parallelization of scientific code using new programming tools like CUDA requires a high level of expertise that is not available to many scientists. This, plus the fact that parallelized code is usually not portable to different architectures, creates major challenges for exploiting the full capabilities of modern computational accelerators. In this work, we sought to overcome these challenges by studying how to achieve both automated parallelization using OpenACC and enhanced portability using OpenCL. We applied our parallelization schemes using GPUs as well as Intel Many Integrated Core (MIC) coprocessor to reduce the run time of wave propagation simulations. We used a well-established 2D cardiac action potential model as a specific case-study. To the best of our knowledge, we are the first to study auto-parallelization of 2D cardiac wave propagation simulations using OpenACC. Our results identify several approaches that provide substantial speedups. The OpenACC-generated GPU code achieved more than speedup above the sequential implementation and required the addition of only a few OpenACC pragmas to the code. An OpenCL implementation provided speedups on GPUs of at least faster than the sequential implementation and faster than a parallelized OpenMP implementation. An implementation of OpenMP on Intel MIC coprocessor provided speedups of with only a few code changes to the sequential implementation. We highlight that OpenACC provides an automatic, efficient, and portable approach to achieve parallelization of 2D cardiac wave simulations on GPUs. Our approach of using OpenACC, OpenCL, and OpenMP to parallelize this particular model on modern computational accelerators should be applicable to other computational models of wave propagation in multi-dimensional media.
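For readers unfamiliar with the workload, what the OpenACC pragmas ultimately parallelize is a data-parallel stencil update. The sketch below is written directly in CUDA rather than OpenACC, and it is not the paper's specific action potential model: one explicit time step combining a five-point Laplacian for voltage diffusion with a local two-variable reaction term, with FitzHugh-Nagumo-like kinetics and placeholder constants.

```cuda
// One explicit time step of a generic 2D reaction-diffusion (cardiac-style) model:
// five-point Laplacian for voltage diffusion plus a local two-variable reaction term.
// Kinetics and constants are placeholders; boundary handling is omitted.
#include <cuda_runtime.h>

__global__ void step_tissue(const float* v, const float* w, float* v_new, float* w_new,
                            int nx, int ny, float dt, float dx, float D)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;   // skip boundary cells

    int id = j * nx + i;
    float lap = (v[id - 1] + v[id + 1] + v[id - nx] + v[id + nx] - 4.0f * v[id]) / (dx * dx);

    // Local membrane kinetics (illustrative two-variable model, not the paper's).
    float dv = D * lap + v[id] * (1.0f - v[id]) * (v[id] - 0.1f) - w[id];
    float dw = 0.01f * (v[id] - 0.5f * w[id]);

    v_new[id] = v[id] + dt * dv;
    w_new[id] = w[id] + dt * dw;
}
```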

9.

Background

Large-scale protein structure alignment, an indispensable tool in structural bioinformatics, poses a tremendous challenge to computational resources. To ensure structure alignment accuracy and efficiency, efforts have been made to parallelize traditional alignment algorithms in grid environments. However, these solutions are costly and of limited accessibility. Others trade alignment quality for speedup by using high-level characteristics of structure fragments for structure comparisons.

Findings

We present ppsAlign, a parallel protein structure Alignment framework designed and optimized to exploit the parallelism of Graphics Processing Units (GPUs). As a general-purpose GPU platform, ppsAlign can incorporate many existing methods, such as TM-align and Fr-TM-align, into its parallelized algorithm design. We evaluated ppsAlign on an NVIDIA Tesla C2050 GPU card and compared it with existing software solutions running on an AMD dual-core CPU. We observed a 36-fold speedup over TM-align, a 65-fold speedup over Fr-TM-align, and a 40-fold speedup over MAMMOTH.

Conclusions

ppsAlign is a high-performance protein structure alignment tool designed to tackle the computational complexity issues of protein structural data. The solution presented in this paper allows large-scale structure comparisons to be performed using the massively parallel computing power of GPUs.

10.
Graphics processors evolve rapidly and promise power-efficient, cost-effective, differentiated price-performance, and scalable high-performance computing. MapReduce is a well-known distributed programming model that eases the development of applications for large-scale data processing on a large number of commodity CPUs. When compared to CPUs, GPUs are an order of magnitude faster in terms of computation power and memory bandwidth, but they are harder to program. Although several studies have implemented the MapReduce model on GPUs, most of them are based on a single-GPU model, bounded by GPU memory and relying on inefficient atomic operations. This paper focuses on the development of MGMR, a standalone MapReduce system that utilizes multiple GPUs to manage large-scale data processing beyond the GPU memory limitation, and also to eliminate serial atomic operations. Experimental results have demonstrated the effectiveness of MGMR in handling large data sets.
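The multi-GPU pattern such a system builds on can be sketched as follows (an illustration of the general idea, not MGMR's design): the host partitions the input across all visible devices, so no single GPU's memory has to hold the whole data set, and runs the map kernel on each device's chunk. The map function and all names are placeholders.

```cuda
// Split a large input across all visible GPUs and run a map kernel on each chunk.
// map_kernel and the record type are illustrative stand-ins; error checks omitted.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

__global__ void map_kernel(const int* in, int* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];          // placeholder "map" function
}

void multi_gpu_map(const int* h_in, int* h_out, size_t n) {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    if (n_gpus == 0) return;
    size_t chunk = (n + n_gpus - 1) / n_gpus;
    std::vector<int*> d_in(n_gpus), d_out(n_gpus);  // value-initialized to nullptr

    for (int g = 0; g < n_gpus; ++g) {
        size_t off = (size_t)g * chunk;
        size_t len = (off < n) ? std::min(chunk, n - off) : 0;
        if (len == 0) continue;
        cudaSetDevice(g);
        cudaMalloc(&d_in[g],  len * sizeof(int));
        cudaMalloc(&d_out[g], len * sizeof(int));
        cudaMemcpyAsync(d_in[g], h_in + off, len * sizeof(int), cudaMemcpyHostToDevice);
        unsigned grid = (unsigned)((len + 255) / 256);
        map_kernel<<<grid, 256>>>(d_in[g], d_out[g], len);
        cudaMemcpyAsync(h_out + off, d_out[g], len * sizeof(int), cudaMemcpyDeviceToHost);
    }
    for (int g = 0; g < n_gpus; ++g) {          // wait for every device, then release buffers
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(d_in[g]);
        cudaFree(d_out[g]);
    }
}
```

A full MapReduce system would additionally shuffle and reduce the per-device outputs, which is where inter-GPU data movement and the avoidance of serial atomic operations become the dominant concerns.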

11.
The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computations. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. Thus, it is the ideal candidate to accelerate a wide variety of data-parallel tasks in many fields such as Machine Learning (ML). As problems become more and more demanding, parallel implementations of learning algorithms are crucial for a useful application. In particular, the implementation of Neural Networks (NNs) on GPUs can significantly reduce the long training times during the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms, and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared to the implementation on traditional hardware, due to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed which aims at automatically finding high-quality NN-based solutions for a given problem.

12.
Recently, graphics processing units (GPUs) have become increasingly popular for high-performance computing applications. Although GPUs provide high peak performance, exploiting their full performance potential for application programs remains a challenging task for programmers. When launching a parallel kernel of an application on the GPU, the programmer needs to carefully select the number of blocks (grid size) and the number of threads per block (block size). These values determine the degree of SIMD parallelism and multithreading, and greatly influence performance. With a huge range of possible combinations of these values, choosing the right grid size and block size is not straightforward. In this paper, we propose a mathematical model for tuning the grid size and the block size based on the GPU architecture parameters. Using our model, we first calculate a small set of candidate grid size and block size values, and then search for the optimal values among the candidates through experiments. Our approach significantly reduces the potential search space compared with the exhaustive search approaches of previous research. Thus, our approach can be practically applied to real applications.
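As a point of comparison (this is the CUDA runtime's built-in occupancy heuristic, not the analytic model proposed in the paper), a candidate block size can also be obtained programmatically and then refined experimentally, much as the abstract describes:

```cuda
// Use the CUDA occupancy API to get a candidate block size for a kernel,
// then derive a grid size that covers n elements.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void pick_launch_config(int n) {
    int min_grid = 0, block = 0;
    // Suggested block size maximizing occupancy for this kernel on the current device.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, saxpy,
                                       0 /*dynamic smem*/, 0 /*no block-size limit*/);
    int grid = (n + block - 1) / block;      // enough blocks to cover all n elements
    printf("candidate launch: grid=%d block=%d (occupancy-suggested min grid=%d)\n",
           grid, block, min_grid);
}
```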

13.
The timescales of biological processes, primarily those inherent to the molecular mechanisms of disease, are long (>μs) and involve complex interactions of systems consisting of many atoms (>10^6). Simulating these systems requires an advanced computational approach, and as such, coarse-grained (CG) models have been developed and highly optimised for accelerator hardware, primarily graphics processing units (GPUs). In this review, I discuss the implementation of CG models for biologically relevant systems, and show how such models can be optimised and perform well on GPU-accelerated hardware. Several examples of GPU implementations of CG models for both molecular dynamics and Monte Carlo simulations on purely GPU and hybrid CPU/GPU architectures are presented. Both the hardware and algorithmic limitations of various models, which depend greatly on the application of interest, are discussed.

14.
Hybrid functional Petri nets are a widespread tool for representing and simulating biological models. Due to their potential for providing virtual drug testing environments, biological simulations have a growing impact on pharmaceutical research. Continuous research advancements in biology and medicine lead to exponentially increasing simulation times, thus raising the demand for performance acceleration through efficient and inexpensive parallel computation solutions. Recent developments in the field of general-purpose computation on graphics processing units (GPGPU) have enabled the scientific community to port a variety of compute-intensive algorithms onto the graphics processing unit (GPU). This work presents the first scheme for mapping biological hybrid functional Petri net models, which can handle both discrete and continuous entities, onto compute unified device architecture (CUDA) enabled GPUs. GPU-accelerated simulations are observed to run up to 18 times faster than sequential implementations. Simulating cell boundary formation by Delta-Notch signaling on a CUDA-enabled GPU results in a speedup of approximately 7x for a model containing 1,600 cells.
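One plausible way to map the continuous part of such a net onto CUDA (an assumption on my part, not the paper's published scheme) is to integrate all continuous places in parallel, one thread per place, summing the firing speeds of the transitions connected to each place through a flat incidence list:

```cuda
// Euler step for the continuous part of a hybrid functional Petri net:
// one thread per continuous place; each place's marking changes by the
// signed sum of the firing speeds of its connected continuous transitions.
// The CSR-style incidence layout and all names are assumptions.
#include <cuda_runtime.h>

__global__ void integrate_places(float* marking, int n_places,
                                 const int*   arc_offset,   // [n_places + 1]
                                 const int*   arc_trans,    // transition index per arc
                                 const float* arc_weight,   // +w for input arcs, -w for output arcs
                                 const float* trans_speed,  // current firing speed per transition
                                 float dt)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_places) return;

    float rate = 0.0f;
    for (int a = arc_offset[p]; a < arc_offset[p + 1]; ++a)
        rate += arc_weight[a] * trans_speed[arc_trans[a]];

    marking[p] = fmaxf(0.0f, marking[p] + dt * rate);   // markings stay non-negative
}
```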

15.
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
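The distinction the abstract draws between local and semi-global ("glocal") alignment comes down to the boundary conditions of the dynamic program: the domain must be traversed end to end, while unaligned sequence ends are free. A standard score-based formulation (not GLOBAL's exact probabilistic model) is:

```latex
% Semi-global ("glocal") DP: a complete domain model d_1..d_m is aligned to a
% sequence x_1..x_n with free end gaps on the sequence only (linear gap cost g).
\[
  H_{i,j} \;=\; \max
  \begin{cases}
    H_{i-1,\,j-1} + s(d_i, x_j) & \text{(match/mismatch)}\\[2pt]
    H_{i-1,\,j} - g             & \text{(domain position left unmatched)}\\[2pt]
    H_{i,\,j-1} - g             & \text{(sequence residue unmatched inside the domain)}
  \end{cases}
\]
\[
  H_{0,j} = 0, \qquad H_{i,0} = -\,i\,g, \qquad
  \mathrm{score} = \max_{1 \le j \le n} H_{m,j}.
\]
```

The zero initialization of row 0 and the maximum over the last row make sequence overhangs free, while the traversal from i = 0 to i = m forces the whole domain to be aligned — exactly the semi-global behaviour contrasted with local alignment above.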

16.
The large quantities of data now being transferred via high-speed networks have made deep packet inspection indispensable for security purposes. Scalable and low-cost signature-based network intrusion detection systems have been developed for deep packet inspection for various software platforms. Traditional approaches that only involve central processing units (CPUs) are now considered inadequate in terms of inspection speed. Graphic processing units (GPUs) have superior parallel processing power, but transmission bottlenecks can reduce optimal GPU efficiency. In this paper we describe our proposal for a hybrid CPU/GPU pattern-matching algorithm (HPMA) that divides and distributes the packet-inspecting workload between a CPU and GPU. All packets are initially inspected by the CPU and filtered using a simple pre-filtering algorithm, and packets that might contain malicious content are sent to the GPU for further inspection. Test results indicate that in terms of random payload traffic, the matching speed of our proposed algorithm was 3.4 times and 2.7 times faster than those of the AC-CPU and AC-GPU algorithms, respectively. Further, HPMA achieved higher energy efficiency than the other tested algorithms.
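The dispatch pattern described above — a cheap CPU pre-filter followed by full GPU matching of only the suspicious packets — can be sketched as below. The prefix-bitmap pre-filter and the naive GPU matcher are placeholders, not HPMA's actual algorithms.

```cuda
// Hybrid dispatch sketch: the CPU applies a cheap prefix-bitmap test to every packet
// and ships only the suspicious ones to the GPU for exhaustive matching.
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// GPU side: one thread per suspicious packet, naive scan against all patterns
// (a stand-in for a real multi-pattern matcher such as Aho-Corasick).
__global__ void full_match(const uint8_t* pkts, int pkt_len, int n_pkts,
                           const uint8_t* pats, const int* pat_len, int n_pats,
                           int* verdict)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_pkts) return;
    const uint8_t* pkt = pkts + p * pkt_len;
    int hit = 0;
    for (int s = 0; s < n_pats && !hit; ++s)
        for (int off = 0; off + pat_len[s] <= pkt_len && !hit; ++off) {
            int k = 0;
            while (k < pat_len[s] && pkt[off + k] == pats[s * 64 + k]) ++k;  // patterns padded to 64 B
            hit = (k == pat_len[s]);
        }
    verdict[p] = hit;
}

// CPU side: 65536-entry bitmap keyed on the first two bytes of every signature.
bool prefilter(const uint8_t* pkt, int len, const std::vector<bool>& bitmap) {
    for (int i = 0; i + 1 < len; ++i)
        if (bitmap[(pkt[i] << 8) | pkt[i + 1]]) return true;  // possible signature start
    return false;                                             // definitely clean
}
```

The host would gather the packets for which prefilter() returns true into a contiguous buffer and launch full_match only on that subset, keeping PCIe transfers and GPU work proportional to the suspicious fraction of the traffic.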

17.
Program development environments have enabled graphics processing units (GPUs) to become an attractive high-performance computing platform for the scientific community. A commonly posed problem in computational biology is protein database searching for functional similarities. The most accurate algorithm for sequence alignments is Smith-Waterman (SW). However, due to its computational complexity and rapidly increasing database sizes, the process becomes more and more time consuming, making cluster-based systems more desirable. Therefore, scalable and highly parallel methods are necessary to make SW a viable solution for life science researchers. In this paper we evaluate how SW fits onto the target GPU architecture by exploring ways to map the program architecture onto the processor architecture. We develop new techniques to reduce the memory footprint of the application while exploiting the memory hierarchy of the GPU. With this implementation, GSW, we overcome the on-chip memory size constraint, achieving a 23× speedup compared to a serial implementation. Results show that as the query length increases, our speedup stays almost stable, indicating the solid scalability of our approach. Additionally, this is a first-of-its-kind implementation which runs purely on the GPU instead of a CPU-GPU integrated environment, making our design suitable for porting onto a cluster of GPUs.
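The memory-footprint reduction mentioned above typically amounts to keeping only one or two DP rows per alignment. A per-thread affine-gap (Gotoh) Smith-Waterman sketch along those lines is shown below; it is illustrative only — the scoring scheme, MAX_SUBJ limit, and names are assumptions, and GSW's actual kernel organization is not reproduced here.

```cuda
// Per-thread affine-gap (Gotoh) Smith-Waterman with a two-array footprint:
// only the previous H row and F row are kept; E is carried as a scalar along the row.
// One thread scores one query/subject pair; constants and names are illustrative.
#include <cuda_runtime.h>

#define MAX_SUBJ 512
#define NEG      (-(1 << 29))

__device__ __forceinline__ int sub_score(char a, char b) { return a == b ? 2 : -1; }  // toy scores

__global__ void sw_affine(const char* queries, const int* qlen,
                          const char* subjects, const int* slen,
                          int n_pairs, int q_stride, int s_stride,
                          int gap_open, int gap_ext, int* best_out)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_pairs) return;

    const char* q = queries  + p * q_stride;
    const char* s = subjects + p * s_stride;
    int m = qlen[p], n = slen[p];
    if (n > MAX_SUBJ) n = MAX_SUBJ;                       // clamp to the per-thread buffers

    int Hprev[MAX_SUBJ + 1], Fprev[MAX_SUBJ + 1];
    for (int j = 0; j <= n; ++j) { Hprev[j] = 0; Fprev[j] = NEG; }

    int best = 0;
    for (int i = 1; i <= m; ++i) {
        int diag = 0, h_left = 0, e = NEG;                // H(i-1,0), H(i,0), E(i,0)
        for (int j = 1; j <= n; ++j) {
            int f = max(Fprev[j] - gap_ext, Hprev[j] - gap_open);  // gap extended down column j
            e     = max(e       - gap_ext, h_left   - gap_open);   // gap extended along row i
            int h = max(max(0, diag + sub_score(q[i - 1], s[j - 1])), max(e, f));
            diag = Hprev[j];                              // becomes H(i-1, j-1) for column j+1
            Hprev[j] = h; Fprev[j] = f; h_left = h;
            best = max(best, h);
        }
    }
    best_out[p] = best;
}
```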

18.
A method for multiple sequence alignment with gaps
A method that performs multiple sequence alignment by cyclical use of the standard pairwise Needleman-Wunsch algorithm is presented. The required central processing unit time is of the same order of magnitude as the standard Needleman-Wunsch pairwise implementation. Comparison with the one known case where the optimal multiple sequence alignment has been rigorously determined shows that in practice the proposed method finds the mathematically optimal solution. The more interesting question of the biological usefulness of such multiple sequence alignment over pairwise approaches is assessed using protein families whose X-ray structures are known. The two such cases studied, the subdomains of the ricin B-chain and the S-domains of virus coat proteins, have low pairwise similarity and thus fail to align correctly under standard pairwise sequence comparison. In both cases the multiple sequence alignment produced by the proposed technique, apart from minor deviations at loop regions, correctly predicts the true structural alignment. Thus, given many sequences of low pairwise similarity, the proposed multiple sequence method can extract any familial similarity and so produce a sequence alignment consistent with the underlying structural homology.

19.
Moon Myungjin, Nakai Kenta. BMC Genomics, 2016, 17(13): 65-74.
Background

Lately, biomarker discovery has become one of the most significant research issues in the biomedical field. Owing to the presence of high-throughput technologies, genomic data, such as microarray data and RNA-seq, have become widely available. Many kinds of feature selection techniques have been applied to retrieve significant biomarkers from these kinds of data. However, such data tend to be noisy, with high-dimensional features and a small number of samples; thus, conventional feature selection approaches might be problematic in terms of reproducibility.

Results

In this article, we propose a stable feature selection method for high-dimensional datasets. We apply an ensemble L1-norm support vector machine to efficiently reduce irrelevant features, considering the stability of features. We define the stability score for each feature by aggregating the ensemble results, and utilize backward feature elimination on a purified feature set based on this score; therefore, it is possible to acquire an optimal set of features for performance without the need to set a specific threshold. The proposed methodology is evaluated by classifying the binary stage of renal clear cell carcinoma with RNA-seq data.

Conclusion

A comparison with established algorithms, i.e., a fast correlation-based filter, random forest, and an ensemble version of an L2-norm support vector machine-based recursive feature elimination, enabled us to prove the superior performance of our method in terms of classification as well as stability in general. It is also shown that the proposed approach performs moderately on high-dimensional datasets consisting of a very large number of features and a smaller number of samples. The proposed approach is expected to be applicable to many other studies aimed at biomarker discovery.
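One simple way to formalize the stability score described above (my reading; the authors' exact definition may differ) is the selection frequency of each feature across B bootstrap L1-SVM fits:

```latex
% Selection-frequency stability score over B bootstrap L1-norm SVM fits
% (one plausible formalization; the paper's exact definition may differ).
\[
  s_j \;=\; \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\!\left[\, w_j^{(b)} \neq 0 \,\right],
  \qquad 0 \le s_j \le 1,
\]
```

where $w^{(b)}$ is the sparse weight vector of the $b$-th L1-norm SVM. Features are ranked by $s_j$, low-scoring ones are pruned, and backward feature elimination then runs on the purified set, as described in the Results above.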


20.
Iterative reconstruction algorithms are becoming increasingly important in electron tomography of biological samples. These algorithms, however, impose major computational demands. Parallelization must be employed to maintain acceptable running times. Graphics Processing Units (GPUs) have been demonstrated to be highly cost-effective for carrying out these computations with a high degree of parallelism. In a recent paper by Xu et al. (2010), a GPU implementation strategy was presented that obtains a speedup of an order of magnitude over a previously proposed GPU-based electron tomography implementation. In this technical note, we demonstrate that by making alternative design decisions in the GPU implementation, an additional speedup can be obtained, again of an order of magnitude. By carefully considering memory access locality when dividing the workload among blocks of threads, the GPU’s cache is used more efficiently, making more effective use of the available memory bandwidth.
