Accelerating linear algebra computations with hybrid GPU-multicore systems.


Accelerating linear algebra computations with hybrid GPU-multicore systems
Marc Baboulin, INRIA / Université Paris-Sud
Joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory) and Stanimire Tomov (University of Tennessee)
4th Workshop INRIA-Illinois, 11/23/2010

General framework
How to speed up numerical simulations?
Exploit advances in hardware (e.g. multicore, GPUs, FPGAs, Cell) and manage to use the hardware efficiently for HPC applications
Better numerical methods
Impact on numerical libraries: LAPACK, ScaLAPACK, sparse solvers, iterative solvers... have to be rethought and rewritten
Need for fast Dense Linear Algebra (DLA) kernels in scientific calculations

Outline
1 Taking advantage of new parallel architectures
2
3

Hardware to software trends
Processor speed improves 59% / year, but memory bandwidth only by 23% and latency by 5.5%

Motivation for heterogeneity-aware algorithms
GPU evolution: applications far beyond graphics, high bandwidth, programmability (CUDA), memory hierarchy, double precision arithmetic...
Architectural trends have moved towards heterogeneous (CPU+GPU) designs
Fully exploit the computational power that each of the hybrid components offers
Need for linear algebra routines for hybrid systems: there is no self-contained library like LAPACK

Designing algorithms for Multicore+GPU
Represent LAPACK algorithms as a collection of BLAS-based tasks and the dependencies among them; rely on the high performance of (CU)BLAS
Abstract away the specifics of programming a GPU
Properly schedule task execution over the multicore and the GPU
MAGMA: Matrix Algebra on GPU and Multicore Architectures
DLA library for heterogeneous/hybrid architectures, starting with current Multicore+GPU systems
LAPACK-style interface
U. Tennessee, U. California Berkeley, INRIA

Task splitting and scheduling
Algorithms are expressed as a Directed Acyclic Graph (DAG) of tasks (small tasks/tiles for multicore)
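To make the DAG concrete, here is a minimal Python sketch (an illustration under assumed conventions, not MAGMA's scheduler) that enumerates the tasks of a right-looking tiled Cholesky factorization and the dependencies a runtime would honor. The kernel names POTRF, TRSM, SYRK and GEMM follow the usual tile-algorithm convention, and the helper `tiled_cholesky_dag` is hypothetical.

```python
# Sketch: enumerate the task DAG of a right-looking tiled Cholesky
# factorization on an nt x nt grid of tiles.  A runtime would dispatch
# each task (to a CPU core or the GPU) once its parents have completed.

def tiled_cholesky_dag(nt):
    """Return (tasks, deps); deps maps each task to the tasks it waits on."""
    tasks, deps = [], {}

    def add(task, *parents):
        tasks.append(task)
        deps[task] = list(parents)

    for k in range(nt):
        # Factor the diagonal tile after all its updates from earlier steps.
        add(("POTRF", k), *[("SYRK", k, j) for j in range(k)])
        for i in range(k + 1, nt):
            # Triangular solves on the tiles below the diagonal block.
            add(("TRSM", i, k), ("POTRF", k),
                *[("GEMM", i, k, j) for j in range(k)])
        for i in range(k + 1, nt):
            # Updates of the trailing submatrix; accumulations onto the
            # same tile are serialized by depending on the earlier ones.
            add(("SYRK", i, k), ("TRSM", i, k),
                *[("SYRK", i, j) for j in range(k)])
            for j in range(k + 1, i):
                add(("GEMM", i, j, k), ("TRSM", i, k), ("TRSM", j, k),
                    *[("GEMM", i, j, l) for l in range(k)])
    return tasks, deps

tasks, deps = tiled_cholesky_dag(3)
for t in tasks:
    print(t, "<-", deps[t])
```

Walking the printed graph for nt = 3 already shows the available parallelism: all TRSM tasks of a step are independent of each other, and GEMM tasks deep in the trailing matrix can overlap with the POTRF of the next step.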

Task splitting and scheduling
DAGs for hybrid systems (both small and large tasks)

Principles of the hybrid implementation
BLAS-level parallelism, where the matrix resides on the GPU (BLAS calls replaced by CUBLAS)
Offload to the CPU small kernels that are inefficient for the GPU
Use asynchronism between CPU and GPU whenever possible
More details in [Dongarra, Tomov, Baboulin, 2010]

Example: Cholesky factorization (figure)
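As a minimal sketch of this hybrid splitting (NumPy/SciPy standing in for the GPU kernels; an assumed illustration, not the MAGMA code), here is a blocked right-looking Cholesky in which the small diagonal factorization is the kind of latency-bound kernel kept on the CPU, while the panel solve and trailing update are the large BLAS-3 calls that CUBLAS would execute on the GPU:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, nb=2):
    """Right-looking blocked Cholesky of an SPD matrix; returns L (lower)."""
    n = A.shape[0]
    A = A.copy()
    L = np.zeros_like(A)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Small nb-by-nb diagonal factorization: stays on the CPU.
        L[k:e, k:e] = cholesky(A[k:e, k:e], lower=True)
        if e < n:
            # Panel TRSM and trailing SYRK/GEMM update: large BLAS-3
            # operations, mapped to CUBLAS on the GPU in the hybrid code.
            L[e:, k:e] = solve_triangular(L[k:e, k:e], A[e:, k:e].T,
                                          lower=True).T
            A[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T
    return L

rng = np.random.default_rng(0)        # seed chosen arbitrarily
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)           # make the test matrix SPD
L = blocked_cholesky(A)
print(np.allclose(L @ L.T, A))        # True
```

In the real hybrid code the CPU can already factor the next diagonal block while the GPU is still updating the trailing matrix, which is the asynchronism mentioned above.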

LU factorization in double precision
(Performance figure: GFlop/s versus matrix size, from 1024 to 9088, for FERMI MAGMA, FERMI ASM, ISTANBUL PLASMA, ISTANBUL MKL, and GTX280 MAGMA.)
FERMI: Tesla C2050, 448 CUDA cores @ 1.15 GHz, SP/DP peak 1030 / 515 GFlop/s
ISTANBUL: AMD 8-socket 6-core (48 cores) @ 2.8 GHz, SP/DP peak 1075 / 538 GFlop/s

Time breakdown for MAGMA QR (single precision)
Intel Xeon (2 x 4 cores @ 2.33 GHz) - GeForce GTX 280 (240 cores @ 1.30 GHz)


Mixed precision iterative refinement
Bulk of the computation in 32-bit arithmetic
Postprocess the 32-bit solution by refining it into a solution that is 64-bit accurate
Can be performed on the GPU
The problem must not be ill-conditioned
Software details in: M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, S. Tomov, Accelerating scientific computations with mixed precision algorithms. Computer Physics Communications, Vol. 180, No. 12, pp. 2526-2533 (2009).

Example of the Cholesky factorization
1: L L^T <- A (eps_s) O(n^3)
2: solve L y = b (eps_s) O(n^2)
3: solve L^T x_0 = y (eps_s) O(n^2)
do k = 1, 2, ...
4: r_k <- b - A x_{k-1} (eps_d)
5: solve L y = r_k (eps_s)
6: solve L^T z_k = y (eps_s)
7: x_k <- x_{k-1} + z_k (eps_d)
check convergence
done
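This scheme transcribes directly into NumPy/SciPy. In the sketch below (an illustration with assumed names and tolerance; `cho_factor`/`cho_solve` stand in for the single-precision GPU kernels), steps 1-3 run in eps_s and the loop forms residuals and updates in eps_d:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def mixed_precision_solve(A, b, tol=1e-12, maxiter=30):
    """Solve Ax = b (A SPD): SP Cholesky factors, DP refinement."""
    c = cho_factor(A.astype(np.float32), lower=True)            # 1: eps_s, O(n^3)
    x = cho_solve(c, b.astype(np.float32)).astype(np.float64)   # 2-3: x_0
    for _ in range(maxiter):
        r = b - A @ x                                           # 4: eps_d, O(n^2)
        z = cho_solve(c, r.astype(np.float32)).astype(np.float64)  # 5-6: eps_s
        x += z                                                  # 7: eps_d
        if np.linalg.norm(z) <= tol * np.linalg.norm(x):        # check convergence
            break
    return x

rng = np.random.default_rng(0)            # seed chosen arbitrarily
n = 500
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # well-conditioned SPD matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # near double precision
```

The O(n^3) work runs at single-precision speed while each refinement step costs only O(n^2), which is why the mixed precision solver approaches the SP factorization rate on well-conditioned problems.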

Mixed precision Cholesky factorization
Solving Ax = b to DP accuracy, A SPD
(Performance figure on an Intel Xeon @ 2.33 GHz + NVIDIA GeForce GTX 280: Gflop/s versus matrix size up to 7,000 for the SP factorization, the mixed precision solver with CPU iterations, the mixed precision solver with GPU iterations, and the DP solver.)

The issue of pivoting in linear systems
General square system Ax = b, solved by Gaussian elimination (GE)
We interchange rows: partial pivoting (PP) ensures stability
Factorization PA = LU, where P is a permutation matrix
Partial pivoting is implemented in LAPACK, Matlab...
Pivoting performs no floating-point operations, but it involves irregular movement of data
Communication overhead due to pivoting: O(n^2) comparisons; on some architectures (multicore, GPUs), up to 50% of the global computational time
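For reference, the factored form in a few lines of SciPy (an illustrative check; `scipy.linalg.lu` returns the permutation explicitly):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)       # seed chosen arbitrarily
A = rng.standard_normal((5, 5))
P, L, U = lu(A)                      # partial pivoting: A = P @ L @ U
print(np.allclose(P.T @ A, L @ U))   # True: row interchanges recorded in P
# The pivot search costs O(n^2) comparisons and no flops, but every
# interchange moves an entire row: the irregular data movement above.
```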

Other approaches
Communication avoiding algorithms:
L. Grigori, J. Demmel, and H. Xiang, Communication avoiding Gaussian elimination, Supercomputing 2008 proceedings.
J. Demmel, L. Grigori, M. F. Hoemmen, and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, in review, SISC.
These minimize the number of messages exchanged during the panel factorization and are stable in practice.
GPU algorithms:
V. Volkov, J. Demmel, LU, QR, Cholesky factorizations using vector capabilities of GPUs, LAPACK Working Note 204.
Reduces the pivoting overhead from 56% to 1-10% by using an innovative data structure.

Random butterfly transformation (RBT)
[Parker, 1995] proposed to make the matrix sufficiently random so that, with probability close to 1, pivoting is not needed
Precondition A with random matrices: U A V; to solve Ax = b, we instead solve (U A V) y = U b, followed by x = V y
Random matrices U and V are chosen among a particular class of matrices called butterfly matrices, which are of the form

    ( P  Q )
    ( R  S )

where P, Q, R and S are diagonal n/2 x n/2 matrices.

Random butterfly transformation (RBT)
Method: LU with no pivoting on a preconditioned matrix
The preconditioning is cheap (O(n^2) operations)
We do not pivot (RBT NP), or pivot only within the first few rows of the panel (RBT LP), so we have a fully BLAS-3 algorithm
RBT may require some steps of iterative refinement in the working precision
We take advantage of the GPU for all these calculations (preconditioning, factorization in SP, iterative refinement)
More details in [Baboulin, Dongarra, Tomov, 2008]
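A self-contained NumPy sketch of the whole pipeline, under stated assumptions: depth-1 butterflies (the method typically uses a small recursion depth, e.g. 2) with diagonal entries exp(r/10) for r uniform in [-1/2, 1/2] and a 1/sqrt(2) normalization (choices from the RBT literature, not fixed by the slide), a textbook no-pivoting LU standing in for the BLAS-3 GPU kernel, and two refinement steps in the working precision:

```python
import numpy as np
from scipy.linalg import solve_triangular

def random_butterfly(n, rng):
    """Butterfly matrix [[P, Q], [R, S]] with diagonal n/2 x n/2 blocks."""
    d = lambda: np.diag(np.exp(rng.uniform(-0.5, 0.5, n // 2) / 10))
    return np.block([[d(), d()], [d(), d()]]) / np.sqrt(2)

def lu_nopiv(A):
    """Textbook LU without pivoting (stand-in for the GPU kernel)."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

def rbt_solve(A, b, rng, refine=2):
    """Solve Ax = b by factoring U A V with no pivoting, then x = V y."""
    U, V = random_butterfly(len(A), rng), random_butterfly(len(A), rng)
    L, R = lu_nopiv(U @ A @ V)            # O(n^2) preconditioning, then LU
    lower = lambda c: solve_triangular(L, c, lower=True)
    upper = lambda c: solve_triangular(R, c, lower=False)
    x = V @ upper(lower(U @ b))
    for _ in range(refine):               # refinement in working precision
        r = b - A @ x
        x += V @ upper(lower(U @ r))
    return x

rng = np.random.default_rng(0)            # seed chosen arbitrarily
n = 256
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rbt_solve(A, b, rng)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # small residual
```

Because U and V are butterflies built from diagonal blocks, applying them costs only O(n^2), so the randomization adds almost nothing to the O(n^3) factorization.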

Accuracy of RBT (figure)

Hybrid RBT LU factorization
Load splitting for a hybrid LU factorization (8 cores + GPU)

Performance (DP)
Performance of the RBT LU factorization (double precision)
Intel Xeon (2 x 4 cores @ 2.33 GHz), GeForce GTX 280 (240 cores @ 1.30 GHz)

Conclusions and ongoing work
Hybrid CPU/GPU library available (MAGMA 0.2) with the main linear system solvers, including mixed precision iterative refinement; more details at http://icl.cs.utk.edu/magma/
Randomization is very promising for accelerating linear algebra computations on multicore and/or GPU architectures
Some ongoing work:
- Hybrid implementation for SVD and eigensolvers
- Apply statistical techniques to estimate condition numbers for very large problems
Starting collaboration with Wen-mei Hwu (UIUC) and student Liwen Chang

Some references for this talk
[1] S. Tomov, J. Dongarra, M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing, Vol. 36, No. 5&6, pp. 232-240 (2010).
[2] M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, S. Tomov, Accelerating scientific computations with mixed precision algorithms. Computer Physics Communications, Vol. 180, No. 12, pp. 2526-2533 (2009).
[3] S. Tomov, J. Dongarra, Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing. LAPACK Working Note 219 (2009).
[4] M. Baboulin, J. Demmel, J. Dongarra, S. Tomov, V. Volkov, Enhancing the performance of dense linear algebra on GPUs. Supercomputing (SC 08), Austin, USA, Nov. 15-21, 2008.
[5] M. Baboulin, J. Dongarra, S. Tomov, Some issues in dense linear algebra for multicore and special purpose architectures. Springer LNCS series, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA 08).