TR: A Comparison of the Performance of SaP::GPU and Intel's Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR-0-07

A Comparison of the Performance of SaP::GPU and Intel's Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut

Abstract

SaP::GPU is a solver developed in the Simulation Based Engineering Lab (SBEL) [1] to solve large banded and sparse linear systems on the GPU. This report contributes a performance comparison of the banded solver in SaP::GPU against Intel's Math Kernel Library [2] on a large set of synthetic problems. The results of several numerical experiments indicate that, when used on large dense banded matrices, SaP::GPU is two to five times faster than the latest version of the MKL dense banded solver when the latter is run on the Haswell, Ivy Bridge, or Phi architectures.

Contents

1 Introduction
2 Test Methods
  2.1 Parameters
  2.2 Matrix generation method
  2.3 Apple-to-apple guarantee
3 Test Environment
4 Test Results

1 Introduction

SaP::GPU is a solver developed by the Simulation Based Engineering Lab (SBEL) [1] to solve large banded and sparse linear systems on the GPU. The solver uses a preconditioned Krylov-subspace method, with a truncated version of the SPIKE algorithm [5-7] serving as its preconditioner. By taking advantage of the floating-point processing power and high memory bandwidth of the GPU, and by leveraging the parallelism inherent in truncated SPIKE, SaP::GPU is claimed to achieve high speedups over Intel's Math Kernel Library on a large set of synthetic banded matrices [3].

Intel's Math Kernel Library (MKL) [2] includes a wealth of routines to accelerate application performance and reduce development time. It includes highly vectorized and threaded linear algebra, Fast Fourier Transform (FFT), vector math, and statistics functions. Through a single C or Fortran API call, these functions automatically scale across previous, current, and future processor architectures by selecting the best code path for each. We used the routine dgbtrf for the factorization of banded matrices and the routine dgbtrs for the solution. The routine dgbtrf computes the LU factorization of a general banded m-by-n matrix A with lower half-bandwidth K_l and upper half-bandwidth K_u (in our tests, however, m = n and K_l = K_u; see Section 2.1 for details). Following dgbtrf, a call to the routine dgbtrs yields the solution of the factorized banded system.

The remainder of this report is organized as follows. In Sections 2 and 3, we introduce our test methods and test environment, respectively. In Section 4, we display plots and briefly summarize the observations drawn from them.

2 Test Methods

This section briefly outlines the parameters of the synthetic banded matrices, the method used to generate them, and how we make sure the comparison is apple-to-apple.

2.1 Parameters

As described in the Introduction, we intend to compare SaP::GPU with MKL on matrices of various sizes.
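To make the division of labor between the two routines concrete, the following is a minimal pure-Python sketch of what the dgbtrf/dgbtrs pair computes: an in-band LU factorization followed by forward and backward sweeps. This is an illustration only, not MKL code; the real dgbtrf also performs partial pivoting (which can widen the upper band to K_l + K_u) and operates on a packed band-storage array rather than the dense lists used here.

```python
def banded_lu(A, kl, ku):
    """In-place LU (no pivoting) of a dense n x n matrix known to be banded.

    Multipliers (the strictly lower part of L) are stored below the diagonal;
    U overwrites the upper part. Loops are limited to the band, which is what
    makes banded factorization cost O(n * kl * ku) instead of O(n^3).
    """
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, min(k + kl + 1, n)):
            m = A[i][k] / A[k][k]          # multiplier for row i
            A[i][k] = m
            for j in range(k + 1, min(k + ku + 1, n)):
                A[i][j] -= m * A[k][j]
    return A

def banded_solve(LU, kl, ku, b):
    """Forward sweep (L y = b), then backward sweep (U x = y)."""
    n = len(LU)
    y = b[:]
    for i in range(n):
        for j in range(max(0, i - kl), i):
            y[i] -= LU[i][j] * y[j]
    x = y[:]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(i + ku + 1, n)):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x

# Tridiagonal, diagonally dominant test system (kl = ku = 1), RHS of all ones,
# mirroring the right-hand-side choice used in this report.
n = 6
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0
    if i > 0:
        A[i][i - 1] = 1.0
    if i < n - 1:
        A[i][i + 1] = 1.0
A_orig = [row[:] for row in A]
x = banded_solve(banded_lu(A, 1, 1), 1, 1, [1.0] * n)
residual = max(abs(sum(A_orig[i][j] * x[j] for j in range(n)) - 1.0)
               for i in range(n))
print(residual < 1e-12)
```

Skipping pivoting is safe here only because the test matrix is diagonally dominant; the MKL routines make no such assumption.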
Three parameters were used to control a generated matrix {a_ij}, i, j = 1, ..., n:

Dimension. The dimension n directly indicates how large a matrix is. In Section 4, we use N to denote the dimension.

Half-bandwidth. The half-bandwidth is defined as K = max { |i - j| : a_ij != 0 }. This parameter influences the number of floating-point operations required in each iteration of both the factorization and the forward/backward sweeps. In Section 4, we use K to denote the half-bandwidth.

Degree of diagonal dominance. The degree of diagonal dominance is defined as

    d = min_i sup{ d_i : |a_ii| >= d_i * sum_{j != i} |a_ij| },

that is, d = min_i |a_ii| / sum_{j != i} |a_ij|. This parameter intuitively shows how large the diagonal entries are compared to all other entries in their rows. As SaP::GPU uses the truncated version of the SPIKE algorithm as a preconditioner, and [5, 6] claim that the truncated version is theoretically applicable only to matrices with d >= 1, we are interested in observing how SaP::GPU behaves on matrices with d < 1. In this report, we will sometimes use the term diagonal dominance for simplicity.

In our tests, we generated sets of matrices varying in these three parameters. The comparison is not an exhaustive four-dimensional one, but one with two parameters fixed while the third is varied.

2.2 Matrix generation method

First, we let the user specify the three parameters described in Section 2.1. Then, each non-zero off-diagonal entry within the band is assigned a random value from a fixed range. Finally, once all non-zero entries have been generated, for each row we sum the absolute values of all entries except the diagonal one and replace the corresponding diagonal entry with the value d * sum_{j != i} |a_ij|, where d is the user-specified degree of diagonal dominance.

2.3 Apple-to-apple guarantee

To make sure SaP::GPU and MKL solve exactly the same set of problems, the matrices were all generated in SaP::GPU and stored in a temporary file. After SaP::GPU completed its job, MKL directly read the file, formed its matrix, and then solved the system. The right-hand sides were set to all ones. For details of the banded matrix storage formats used by SaP::GPU and MKL, interested readers can refer to [3] and [2], respectively.

3 Test Environment

All numerical experiments for SaP::GPU were performed on a machine with an Intel Nehalem Xeon E-60 processor and an NVIDIA Tesla K0 GPU accelerator [4]. We ran MKL on two machines for comparison: the Intel Nehalem Xeon E-60 and an Intel Xeon E-690v. The solution times reported are inclusive times that account for the time needed to move data to and from the device.
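The generation procedure of Section 2.2 can be sketched as follows. This is our own illustrative code, not the SaP::GPU generator; in particular, the sampling interval for the off-diagonal entries is an assumption, as the original range did not survive in this copy of the report.

```python
import random

def generate_banded(n, k, d, seed=0):
    """Generate an n x n banded matrix with half-bandwidth k whose degree of
    diagonal dominance equals d, following the procedure in Section 2.2:
    random off-diagonal entries inside the band, then each diagonal entry
    overwritten with d * sum_{j != i} |a_ij|."""
    rng = random.Random(seed)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - k), min(n, i + k + 1)):
            if j != i:
                A[i][j] = rng.uniform(-10.0, 10.0)  # range is our assumption
        off = sum(abs(A[i][j]) for j in range(n) if j != i)
        A[i][i] = d * off
    return A

def dominance(A):
    """Degree of diagonal dominance: min_i |a_ii| / sum_{j != i} |a_ij|."""
    n = len(A)
    return min(abs(A[i][i]) / sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))

A = generate_banded(200, 10, 1.2)
print(abs(dominance(A) - 1.2) < 1e-12)
```

By construction every row attains the ratio d exactly, so the minimum over rows equals d; choosing d < 1 yields the non-diagonally-dominant test cases discussed above.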
Also, where applicable, the solution times include the time required to perform the necessary matrix reorderings. The GPU code was built with the nvcc compiler of CUDA version 6.0, while all CPU code was built with Intel's ICPC compiler. In all cases, the same optimization level was used.

4 Test Results

Varied N, fixed K and d. In a first set of experiments, we investigate the performance and scalability of the solvers on groups of matrices with the same half-bandwidth and diagonal

dominance (in this set of tests, we take d = 1 throughout) while the dimension varies. We can see that SaP::GPU outperforms MKL on one of the two CPUs over the whole range of the parameter space, and outperforms MKL on the other over the whole range except for a single point. Regarding scalability, SaP::GPU scales better than MKL (on both CPUs) for the three smaller half-bandwidths tested, but scales somewhat worse for the largest half-bandwidth.

Varied K, fixed N and d. In the second set of experiments, we test groups of matrices with the same dimension and diagonal dominance while the half-bandwidth varies. We can see that SaP::GPU outperforms MKL on one of the two CPUs over the whole range of the parameter space, and outperforms MKL on the other over the whole range except for a single point. Regarding scalability, SaP::GPU scales better than MKL (on both CPUs) for smaller half-bandwidths, but scales worse for larger half-bandwidths.

Varied d, fixed N and K. Finally, we study how the degree of diagonal dominance impacts the efficiency of both SaP::GPU and MKL. As expected, the MKL banded solver is insensitive to changes in d, if we discard the variations introduced by the varying load on the CPU. Somewhat surprising is the fact that the SaP::GPU solver demonstrated uniform performance over the whole range of degrees of diagonal dominance on the matrices in Figure 3; SaP::GPU required no more than one outer-loop iteration for all values of d considered there. Our further studies show that, as the degree of diagonal dominance decreases further, the number of iterations, and hence the time to solution, increases significantly, since the assumptions supporting the truncated SPIKE approximation are strongly violated. What did not surprise us is that we still see speedups over the whole range of the space.

References

[1] Simulation Based Engineering Lab. http://sbel.wisc.edu, 2006.

[2] Intel Math Kernel Library (Intel MKL). http://software.intel.com/en-us/intel-mkl.

[3] SPIKE GPU: A SPIKE-based preconditioned GPU solver for sparse linear systems.
http://sbel.wisc.edu/documents/TR-0-0.pdf.

[4] Upgrade to the world's fastest GPU accelerator: NVIDIA Tesla K0. http://www.nvidia.com/content/tesla/pdf/nvidia-tesla-k0-0mar-lr.pdf.

[5] E. Polizzi and A. Sameh. A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Computing, 32(2):177-194, 2006.

[6] E. Polizzi and A. Sameh. SPIKE: A parallel environment for solving banded linear systems. Computers & Fluids, 36(1):113-120, 2007.

[7] A. Sameh and D. Kuck. On stable parallel linear system solvers. Journal of the ACM, 25(1):81-91, 1978.

[Figure omitted: four panels, each plotting speedup and execution time (s) against the problem dimension N.]

Figure 1: Parametric studies on synthetic banded matrices. Timing results for SaP::GPU and MKL are provided when N (problem dimension) is varied, while K (half-bandwidth) and d (degree of diagonal dominance) are kept constant. Panels (a)-(d) show the influence of N for different fixed values of K and d.

[Figure omitted: four panels, each plotting speedup and execution time (s) against the half-bandwidth K.]

Figure 2: Parametric studies on synthetic banded matrices. Timing results for SaP::GPU and MKL are provided when K (half-bandwidth) is varied, while N (problem dimension) and d (degree of diagonal dominance) are kept constant. Panels (a)-(d) show the influence of K for different fixed values of N and d.

[Figure omitted: three panels, each plotting speedup and execution time (s) against the degree of diagonal dominance d.]

Figure 3: Parametric studies on synthetic banded matrices. Timing results for SaP::GPU and MKL are provided when d (degree of diagonal dominance) is varied, while N (problem dimension) and K (half-bandwidth) are kept constant. Panels (a)-(c) show the influence of d for different fixed values of K at fixed N.