Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors |
| |
Authors: | Sandra Catalán, Francisco D. Igual, Rafael Mayo, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí |
| |
Affiliation: | 1. Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón de la Plana, Spain; 2. Depto. de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain |
| |
Abstract: | Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances, where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high-performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications on clusters of commodity systems-on-chip. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high-performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation’s micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency. |
| |
This article is indexed in SpringerLink and other databases. |
|