MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012


1 MAGMA: Matrix Algebra on GPU and Multicore Architectures
  Mark Gates, February 2012

2 Hardware trends
  Scale the number of cores instead of clock speed
  Hardware issue became software issue
  Multicore, hybrid
  [Figure: transistors (in thousands), frequency (MHz), and number of cores, plotted on a logarithmic scale.]
  Figure from Kathy Yelick, Ten Ways to Waste a Parallel Computer. Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.

3 Future systems
  Most likely a hybrid design: multicore + GPU accelerators
  Today: accelerators attached
  Future: accelerators integrated
    Intel's MIC Knights Corner
    AMD's Fusion
    Nvidia's Project Denver

4 Challenges of GPUs
  High levels of parallelism
  Hybrid architecture
    Small, non-parallelizable tasks on CPU
    Large, parallel tasks on GPU
  Compute vs. communication gap growing
    Tesla C2070 has 515 Gflop/s, 8 GB/s PCIe
  [Figure: block diagram of a CPU (control, cache, ALUs, DRAM) and a GPU (ALUs, DRAM) connected by PCIe.]
  Figure from NVIDIA CUDA C Programming Guide.
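
The compute vs. communication gap can be made concrete with a back-of-envelope calculation. The sketch below uses only the two figures quoted on the slide (515 Gflop/s and 8 GB/s). It assumes double-precision data and peak rates on both sides, which real codes do not reach; the function names are illustrative and not part of any library.

```python
# Back-of-envelope model. The two rates are the ones quoted on the slide;
# treating them as sustained rates is an assumption made for illustration.
PEAK_GFLOPS = 515.0       # Tesla C2070 peak, Gflop/s
PCIE_GBYTES_PER_S = 8.0   # PCIe bandwidth, GB/s
BYTES_PER_DOUBLE = 8


def transfer_seconds(n):
    """Time to copy one n x n double-precision matrix across PCIe."""
    return n * n * BYTES_PER_DOUBLE / (PCIE_GBYTES_PER_S * 1e9)


def gemm_seconds(n):
    """Time for an n x n matrix multiply (2*n^3 flops) at the peak rate."""
    return 2.0 * n ** 3 / (PEAK_GFLOPS * 1e9)


def flops_per_transferred_double():
    """Flops the GPU can execute in the time PCIe delivers one double."""
    doubles_per_second = PCIE_GBYTES_PER_S * 1e9 / BYTES_PER_DOUBLE
    return PEAK_GFLOPS * 1e9 / doubles_per_second


n = 4000
ratio = gemm_seconds(n) / transfer_seconds(n)  # compute time / transfer time
```

Under these assumptions the GPU can perform 515 flops in the time one double crosses PCIe. For n = 4000, the transfer takes 0.016 s and the multiply about 0.25 s, so an O(n^3) Level 3 operation can amortize the copy, while an O(n^2) operation on freshly transferred data cannot.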

5 Software generations
  Library (era)       Approach                  Relies on
  LINPACK (70s)       vector operations         Level-1 BLAS
  LAPACK (80s)        blocked, cache friendly   Level-3 BLAS
  ScaLAPACK (90s)     distributed memory        PBLAS, message passing
  PLASMA              tiled, multicore          DAG + scheduler
  MAGMA               hybrid                    hybrid kernels, critical path

6 Nvidia's CUBLAS
  Level 1, 2, 3 BLAS kernels: cublasDgemv, cublasDgemm
  Copy between CPU and GPU: cublasSetMatrix, cublasGetMatrix
  Stream support: cublasSetKernelStream, magmablasSetKernelStream

7 MAGMA 1.1
  LAPACK column-wise layout
  LAPACK-like C and Fortran interfaces
  CPU and GPU interfaces
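
LAPACK's column-wise (column-major) layout stores each column contiguously, so element (i, j) of a matrix with leading dimension lda sits at offset i + j*lda. The following is a minimal sketch of that indexing rule with 0-based indices; the helper names are hypothetical and only illustrate the layout.

```python
def colmajor_index(i, j, lda):
    """Offset of element (i, j), 0-based, in a column-major array.

    lda is the leading dimension: the stride between consecutive columns.
    """
    assert 0 <= i < lda
    return i + j * lda


def to_colmajor(rows):
    """Flatten a list of rows into a column-major list (lda = number of rows)."""
    m, n = len(rows), len(rows[0])
    flat = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            flat[colmajor_index(i, j, m)] = rows[i][j]
    return flat


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
flat = to_colmajor(A)  # columns are laid out one after another
```

For the 2 x 3 example, the flattened array is 1, 4, 2, 5, 3, 6: the first column, then the second, then the third.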

8 MAGMA
  Hybrid LAPACK algorithms
  4 precisions (single, double, single complex, double complex)
  3 mixed precision algorithms
  MAGMA BLAS
    Supplements CUBLAS
    Improves some routines

9 Naming: magma_zgesv_gpu
  Prefix: magma_ or magmablas_
  Precision: s single, d double, c single complex, z double complex
    Mixed precision: ds and zc
  Matrix type: ge general, sy symmetric, he Hermitian, po positive definite, or orthogonal, un unitary, tr triangular
  Operation: sv solve, trf triangular factorization, ev eigenvalue problem, gv generalized eigenvalue problem
  Interface: _gpu suffix
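
The naming scheme can be read mechanically. The sketch below splits a routine name into prefix, precision, matrix type, operation, and interface. The decode function and its tables are hypothetical helpers, not part of MAGMA; the two-letter matrix-type codes are the standard LAPACK ones, and the wording used for the mixed-precision labels is an assumption.

```python
PRECISIONS = {"s": "single", "d": "double",
              "c": "single complex", "z": "double complex"}
MIXED = {"ds": "mixed double/single",
         "zc": "mixed double complex/single complex"}
MATRIX_TYPES = {"ge": "general", "sy": "symmetric", "he": "Hermitian",
                "po": "positive definite", "or": "orthogonal",
                "un": "unitary", "tr": "triangular"}


def decode(name):
    """Split a name such as 'magma_zgesv_gpu' into its naming fields."""
    parts = name.split("_")
    prefix, core = parts[0], parts[1]
    interface = "GPU" if parts[2:] == ["gpu"] else "CPU"
    # Try the two-letter mixed-precision prefix first, then fall back to a
    # single precision letter. 'dsyevd' starts with 'ds' but 'ye' is not a
    # matrix type, so it is correctly read as d + sy + evd.
    for width, table in ((2, MIXED), (1, PRECISIONS)):
        precision, rest = core[:width], core[width:]
        if precision in table and rest[:2] in MATRIX_TYPES:
            return {"prefix": prefix,
                    "precision": table[precision],
                    "matrix": MATRIX_TYPES[rest[:2]],
                    "operation": rest[2:],
                    "interface": interface}
    raise ValueError("unrecognized routine name: " + name)


example = decode("magma_zgesv_gpu")
```

Checking the matrix-type code before accepting a precision prefix is what resolves the ambiguity between mixed-precision names (dsgesv) and double-precision symmetric names (dsyevd).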

10 Driver routines
  Solve entire problem
  Matrix type      Operation                 Routine
  General          Solve                     dgesv, dsgesv
  SPD              Solve                     dposv, dsposv
  Least squares    Solve                     dgeqrs, dsgeqrsv
  General          SVD                       dgesvd
  General          Eigenvalues               dgeev
  Symmetric        Eigenvalues               dsyevd* / zheevd*
  Symmetric        Generalized eigenvalues   dsygvd* / zhegvd*
  [Interface columns (CPU, GPU): check marks not preserved.]
  * Additional variants; complete list at

11 Computational routines
  Solve one part of problem
  Matrix type   Operation              Routine
  General       LU                     dgetrf
  General       Solve (given LU)       dgetrs
  General       Inverse                dgetri
  SPD           Cholesky               dpotrf
  SPD           Solve (given $LL^T$)   dpotrs
  SPD           Inverse                dpotri
  General       QR                     dgeqrf
  General       Generate Q             dorgqr / zungqr
  General       Multiply by Q          dormqr / zunmqr
  [Interface columns (CPU, GPU): check marks not preserved.]
  Selected routines; complete list at

12 One-sided factorization
  LU, Cholesky, QR factorizations for solving linear systems
  Level 2 BLAS on CPU
  Level 3 BLAS on GPU
  [Figure: DAG of the factorization, built up over several animation steps, showing the look-ahead and trailing-matrix updates (A = QA) and the critical path.]
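
The split between a small factorization step and a large trailing-matrix update can be sketched with a blocked, right-looking Cholesky factorization. This is a minimal pure-Python illustration, not MAGMA's implementation: everything runs sequentially on the CPU, there is no look-ahead, and the comments only indicate which part a hybrid code would typically place on which device. The function names and the block size nb are assumptions.

```python
import math


def factor_panel(A, lo, hi):
    """Factor columns lo..hi-1 in place (the small step a hybrid code
    would typically keep on the CPU)."""
    n = len(A)
    for j in range(lo, hi):
        A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(lo, j)))
        for i in range(j + 1, n):
            s = sum(A[i][p] * A[j][p] for p in range(lo, j))
            A[i][j] = (A[i][j] - s) / A[j][j]


def update_trailing(A, lo, hi):
    """Update the trailing matrix with the finished panel (the large,
    Level 3 style step a hybrid code would typically run on the GPU)."""
    n = len(A)
    for i in range(hi, n):
        for j in range(hi, i + 1):  # lower triangle only
            A[i][j] -= sum(A[i][p] * A[j][p] for p in range(lo, hi))


def blocked_cholesky(A, nb):
    """Return lower-triangular L with A = L L^T, processing nb columns at a time."""
    L = [row[:] for row in A]
    n = len(L)
    for lo in range(0, n, nb):
        hi = min(lo + nb, n)
        factor_panel(L, lo, hi)
        update_trailing(L, lo, hi)
    for i in range(n):              # clear the unused upper triangle
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L


A = [[4.0, 2.0, 2.0, 0.0],
     [2.0, 5.0, 3.0, 2.0],
     [2.0, 3.0, 6.0, 3.0],
     [0.0, 2.0, 3.0, 6.0]]
L = blocked_cholesky(A, nb=2)
```

With look-ahead, the next panel would be factored as soon as its columns have been updated, while the rest of the trailing update is still in progress; this sketch omits that overlap for clarity.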


19 Keeneland system, using one node
  3 Nvidia GPUs (1.1 GHz, 5.4 GB)
  2 x 6 Intel cores (2.8 GHz, 23 GB)

20 Mixed-precision solvers
  Factor in single precision
  Iterative refinement yields double precision accuracy
  [Figure: MAGMA solve performance in Gflop/s vs. matrix size for single precision, double precision, and iterative refinement.]
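
The idea of factoring in single precision and recovering double precision accuracy by iterative refinement can be sketched as follows. This is an illustration under stated assumptions, not MAGMA's dsgesv: single precision is emulated by rounding every intermediate result through a 4-byte float, the matrix is a small well-conditioned example chosen for this sketch, and all function names are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_single(A):
    """LU with partial pivoting; every operation is rounded to single precision."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))
        LU[k], LU[p] = LU[p], LU[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, piv


def solve_single(LU, piv, b):
    """Solve with the single-precision factors, also in single precision."""
    n = len(LU)
    y = [f32(b[p]) for p in piv]
    for i in range(n):                 # forward substitution (unit lower L)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):       # back substitution (upper U)
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y


def refine(A, b, iterations=5):
    """Single-precision factorization plus double-precision refinement."""
    LU, piv = lu_single(A)
    x = solve_single(LU, piv, b)
    for _ in range(iterations):
        # Residual and update are computed in double precision.
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        d = solve_single(LU, piv, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x


A = [[4.0, 1.0, 0.3],
     [1.0, 3.0, 0.1],
     [0.3, 0.1, 2.0]]
x_true = [1.0 / 3.0, -2.0 / 7.0, 0.9]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]

x_single = solve_single(*lu_single(A), b)
x_refined = refine(A, b)


def max_error(x):
    return max(abs(xi - ti) for xi, ti in zip(x, x_true))
```

The expensive O(n^3) factorization is done once in the faster precision, and each refinement step costs only O(n^2). The approach relies on the matrix being reasonably well conditioned in single precision.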

21 Two-sided factorization
  Hessenberg, tridiagonal factorizations for eigenvalue problems
  Level 2 BLAS on CPU: column $a_i$
  Level 2 BLAS on GPU: $y_i = A v_i$
  Level 3 BLAS on GPU: trailing matrix $A = Q^T A Q$
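
To show where the matrix-vector product $y_i = A v_i$ arises, here is a minimal sketch of Householder tridiagonalization for a symmetric matrix. It is unblocked and runs entirely on the CPU, so it does not reflect the hybrid split; the function name is hypothetical. With a unit vector $v$ and $H = I - 2vv^T$, the update $HAH = A - vw^T - wv^T$ holds for $w = 2(y - (v^T y)v)$ and $y = Av$.

```python
import math


def tridiagonalize(A):
    """Reduce symmetric A to tridiagonal T = Q^T A Q with Householder reflectors."""
    n = len(A)
    T = [row[:] for row in A]
    for k in range(n - 2):
        # Column k below the diagonal is the part to be reduced.
        x = [T[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        if norm_x == 0.0:
            continue
        alpha = -math.copysign(norm_x, x[0])
        v = [0.0] * n
        v[k + 1] = x[0] - alpha
        for i in range(k + 2, n):
            v[i] = T[i][k]
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm_v for vi in v]
        # y = T v: the matrix-vector product that dominates each step.
        y = [sum(T[i][j] * v[j] for j in range(n)) for i in range(n)]
        vty = sum(vi * yi for vi, yi in zip(v, y))
        w = [2.0 * (yi - vty * vi) for vi, yi in zip(v, y)]
        # Two-sided update: T = H T H = T - v w^T - w v^T.
        for i in range(n):
            for j in range(n):
                T[i][j] -= v[i] * w[j] + w[i] * v[j]
    return T


A = [[4.0, 1.0, 2.0, 2.0],
     [1.0, 2.0, 0.0, 1.0],
     [2.0, 0.0, 3.0, 2.0],
     [2.0, 1.0, 2.0, 1.0]]
T = tridiagonalize(A)
```

Because the reflector is applied from both sides, every step needs a product with the whole remaining matrix before the update can be formed. That memory-bound product is why two-sided factorizations are harder to accelerate than one-sided ones.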

22 MAGMA Hessenberg in double precision
  [Figure: Gflop/s vs. matrix size; legend: GPUs, 2 GPUs, 1 GPU (MAGMA 1.1), CPU (MKL).]
  Keeneland system, using one node
    3 Nvidia GPUs (1.1 GHz, 5.4 GB)
    2 x 6 Intel cores (2.8 GHz, 23 GB)

23 Autotuning kernels
  MAGMA BLAS matrix multiply
  Parameterize code, e.g., blocksize
  Test kernels
  Improved zgemm from 308 to 341 Gflop/s
  Improved up to 2x on specific rectangular shapes
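
The autotuning workflow (parameterize the code, test the candidate kernels, keep the fastest) can be sketched in a few lines. This illustrates only the search loop: the kernel is a pure-Python blocked matrix multiply, so its timings say nothing about GPU kernels, and the function names, problem size, and candidate block sizes are assumptions.

```python
import time


def blocked_matmul(A, B, nb):
    """C = A * B for square matrices, computed in nb x nb blocks."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + nb, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune(kernel, args, candidates, repeats=3):
    """Time kernel(*args, nb) for each candidate nb; return the fastest and all timings."""
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):  # keep the best of several runs to reduce noise
            start = time.perf_counter()
            kernel(*args, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best
    return min(timings, key=timings.get), timings


n = 32
A = [[float((i + j) % 7) for j in range(n)] for i in range(n)]
B = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
candidates = [4, 8, 16, 32]
best_nb, timings = autotune(blocked_matmul, (A, B), candidates)
```

A real autotuner would search more parameters than the block size and would repeat the search for each matrix shape, which is consistent with the slide's note that the gains were largest on specific rectangular shapes.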

24 Scheduling DAGs
  48 cores
  Matrix is 4000 x 4000, tile is 200 x

25 Dynamic scheduling
  Parallelism using DAG-based runtime scheduler
  One-sided factorizations and solvers using StarPU

  // Sequential Tile Cholesky
  for k = 1 .. ntiles
      dpotrf( Akk )
      for i = k+1 .. ntiles
          dtrsm( Akk, Aik )
      for i = k+1 .. ntiles
          dsyrk( Aik, Aii )
          for j = i+1 .. ntiles
              dgemm( Ajk, Aik, Aij )

  // Hybrid Tile Cholesky
  for k = 1 .. ntiles
      Insert_Task( dpotrf, ... )
      for i = k+1 .. ntiles
          Insert_Task( dtrsm, ... )
      for i = k+1 .. ntiles
          Insert_Task( dsyrk, ... )
          for j = i+1 .. ntiles
              Insert_Task( dgemm, ... )
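
A task-insertion runtime can infer the DAG from the tiles each task reads and writes. The sketch below enumerates the tile Cholesky tasks in the order of the sequential loops and computes the length of the critical path, counting every task as one unit of work. It is not StarPU's API: the task tuples and function names are hypothetical, only read-after-write and write-after-write dependencies are tracked, and the gemm output is taken to be the lower-triangle tile (j, i), which is an assumption about how the slide's Aij is stored.

```python
def tile_cholesky_tasks(ntiles):
    """List (name, tiles_read, tile_written) in sequential loop order, 0-based."""
    tasks = []
    for k in range(ntiles):
        tasks.append(("potrf", [], (k, k)))
        for i in range(k + 1, ntiles):
            tasks.append(("trsm", [(k, k)], (i, k)))
        for i in range(k + 1, ntiles):
            tasks.append(("syrk", [(i, k)], (i, i)))
            for j in range(i + 1, ntiles):
                # Assumed lower-triangle storage: the updated tile is (j, i).
                tasks.append(("gemm", [(j, k), (i, k)], (j, i)))
    return tasks


def critical_path_length(tasks):
    """Longest chain of dependent tasks, with each task counted as one step."""
    depth_of_last_writer = {}
    longest = 0
    for name, reads, write in tasks:
        # A task depends on the last writer of every tile it reads, and of
        # the tile it updates in place.
        deps = [depth_of_last_writer.get(tile, 0) for tile in reads + [write]]
        depth = 1 + max(deps)
        depth_of_last_writer[write] = depth
        longest = max(longest, depth)
    return longest


ntiles = 20
tasks = tile_cholesky_tasks(ntiles)
path = critical_path_length(tasks)
average_parallelism = len(tasks) / path
```

With 20 tiles per dimension, this model yields 1540 tasks and a critical path of 58 tasks (the potrf, trsm, syrk chain down the diagonal), so on average about 26 tasks are available per step. Real task costs differ by kernel, so these unit-weight figures only indicate how much parallelism the scheduler has to work with.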

26 Documentation
  Nvidia Guides
  MAGMA
  PLASMA
  BLAS and LAPACK index

27 Collaborators / Support
  University of Tennessee, Knoxville
  University of California, Berkeley
  University of Colorado, Denver
  INRIA, France
  KAUST, Saudi Arabia


A model leading to self-consistent iteration computation with need for HP LA (e.g., diagonalization and orthogonalization). Schrödinger equation: $H\psi = E\psi$. Choose a basis set of wave functions. Two cases: Orthonormal