ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU


1 ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO TIM DAVIS, PROFESSOR, CSE, TEXAS A&M UNIVERSITY

2 SPARSE MATRIX FACTORIZATION ON GPUS Objective: find methods for GPU acceleration of sparse Cholesky factorization. Experiments use SuiteSparse 4.4 / CHOLMOD. Outline: sparse Cholesky factorization; previous work / issues; the branches approach.

3 DIRECT SPARSE FACTORIZATION Dense block Cholesky applied to supernodes (stored in compressed-column form):

$$\begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & A_{22}^{*} \end{bmatrix} \begin{bmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{bmatrix}$$

$L_{11} L_{11}^T = A_{11}$   (POTRF: dense Cholesky)
$L_{11} L_{21}^T = A_{21}^T$   (TRSM: triangular solve)
$A_{22}^{*} = A_{22} - L_{21} L_{21}^T$   (GEMM: matrix multiplication, the Schur complement)
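
These three kernels map directly onto standard dense routines. Below is a minimal sketch of one block step using LAPACKE / CBLAS on the CPU (the GPU path uses the same POTRF / TRSM / SYRK pattern through cuSOLVER / cuBLAS); the in-place column-major layout, the partition size n1, and the function name are illustrative assumptions, not CHOLMOD's actual interface.

```cpp
/* Block Cholesky step on a dense SPD matrix A = [A11 A21^T; A21 A22],
 * stored column-major with leading dimension n.  Mirrors slide 3:
 * POTRF on A11, TRSM for L21, SYRK for the Schur complement A22*. */
#include <cblas.h>
#include <lapacke.h>

void block_cholesky_step(double *A, int n, int n1)   /* n1 + n2 = n */
{
    int n2 = n - n1;
    double *A11 = A;                          /* n1 x n1 diagonal block     */
    double *A21 = A + n1;                     /* n2 x n1 sub-diagonal block */
    double *A22 = A + (size_t)n1 * n + n1;    /* n2 x n2 trailing block     */

    /* L11 * L11^T = A11   (POTRF: dense Cholesky, lower triangle) */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n1, A11, n);

    /* L11 * L21^T = A21^T  =>  L21 = A21 * L11^{-T}   (TRSM) */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, n2, n1, 1.0, A11, n, A21, n);

    /* A22* = A22 - L21 * L21^T   (Schur complement update) */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                n2, n1, -1.0, A21, n, 1.0, A22, n);
}
```

The update is written here with DSYRK, the symmetric special case of the GEMM named on the slide; the supernodal code uses both SYRK and GEMM depending on whether the target is the diagonal or the off-diagonal part of a supernode.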

4-11 DIRECT SPARSE FACTORIZATION (animation build) Apply block Cholesky to supernodes: left-looking supernodal factorization over the elimination tree. Working up the tree, each supernode is first assembled from the GEMM/SYRK updates of its descendants (this is where fill appears), then its diagonal block is factored with POTRF and its sub-diagonal block solved with TRSM. The bulk of the work is in assembling supernodes, which involve a wide range of descendant sizes.
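
In a dense setting the same left-looking pattern can be written out directly, which may make the sweep easier to see: each block column (the analogue of a supernode) is first updated by everything already factored to its left, then factored in place. A minimal dense sketch, assuming column-major storage and a fixed block size nb; the real supernodal code applies only the updates dictated by the elimination tree and scatters them into sparse supernode storage.

```cpp
/* Dense left-looking blocked Cholesky (lower triangle), column-major, lda = n.
 * For each block column j: accumulate updates from the already-factored
 * columns on its left (the "assembly" step), then POTRF + TRSM.
 * Sparsity, supernode assembly and the elimination tree are ignored here. */
#include <cblas.h>
#include <lapacke.h>

void left_looking_cholesky(double *A, int n, int nb)
{
    for (int j = 0; j < n; j += nb) {
        int jb = (j + nb <= n) ? nb : n - j;    /* width of block column j   */
        int mr = n - j - jb;                    /* rows below diagonal block */
        double *Ajj = A + (size_t)j * n + j;    /* jb x jb diagonal block    */
        double *Aij = Ajj + jb;                 /* mr x jb sub-diagonal part */

        if (j > 0) {                            /* "look left": apply updates */
            double *Lj0 = A + j;                /* rows j.. of columns 0..j-1 */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        jb, j, -1.0, Lj0, n, 1.0, Ajj, n);
            if (mr > 0)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            mr, jb, j, -1.0, Lj0 + jb, n, Lj0, n, 1.0, Aij, n);
        }

        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, Ajj, n);    /* POTRF */
        if (mr > 0)                                           /* TRSM  */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, mr, jb, 1.0, Ajj, n, Aij, n);
    }
}
```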

12 DIRECT SPARSE FACTORIZATION Lots of small math; irregular access patterns. Larger matrices -> more dense math; greater connectivity -> more dense math. Factors can be large (> 128 GB).

13 PREVIOUS WORK Just send large BLAS-3 calls to the GPU. Works! For large, dense matrices. Not so good for: small matrices; large matrices with low connectivity (shells / beams in FEA). Goal: find methods for further GPU acceleration of sparse factorization.

14 PREVIOUS WORK Send appropriately-sized BLAS calls to the GPU and hide PCIe communication; assemble supernodes on the GPU; hybrid computing. Supernodes are ranked by a score (decreasing cost to assemble); those meeting the GPU row/column threshold (ndrow >= 25, ndcol >= 2) are assembled on the GPU, the rest on the CPU. [Chart: GFlop/s of SuiteSparse (CHOLMOD) 4.4, CPU vs. CPU + GPU, over the Florida Sparse Matrix Collection; CPU + GPU is roughly 1.5x faster, annotated "why so low?" and "why not higher?"] 2 x Xeon E5-298 v + K40 (max boost, ECC=off). http://faculty.cse.tamu.edu/davis/suitesparse.html

15 ISSUES PCIe communication limits which BLAS operations can be accelerated on the GPU. Small BLAS calls suffer from low occupancy and launch overhead, so most BLAS calls don't get sent to the GPU. Seek methods which better accelerate the factorization of small / minimally-connected matrices. [Chart: audikw_1.mtx, % of work remaining on the CPU]

16 PROPOSED SOLUTION Factor branches of the elimination tree entirely on the GPU; use the previous (hybrid) methods only for the root. No use of the CPU within a branch, which eliminates PCIe communication but requires POTRF, TRSM and GEMM on the GPU. Batch and stream the BLAS operations within each level: batching amortizes launch overhead, streaming improves occupancy, and there is no size restriction. Maps well to multi-GPU / hybrid computing (a sketch of the per-level grouping follows below). [Figure: elimination tree with levels 0, 1, 2 split into branch 1 ... branch 4.]
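
One way to read "batch within levels" is sketched here: the supernodes of a branch are grouped by their depth in the elimination tree and the deepest group is processed first, so every descendant finishes before its ancestor starts, while the supernodes inside one group are mutually independent and their BLAS calls can be batched together. The parent-array representation and the helper name are illustrative assumptions, not CHOLMOD's internal data structure.

```cpp
// Group supernodes of a branch by elimination-tree depth, deepest group first.
// parent[i] is the parent supernode of i, or -1 for the branch root.
#include <vector>

std::vector<std::vector<int>> levels_bottom_up(const std::vector<int> &parent)
{
    int n = (int)parent.size();
    std::vector<int> depth(n, 0);

    // depth = number of ancestors (the root has depth 0)
    for (int i = 0; i < n; ++i)
        for (int p = parent[i]; p != -1; p = parent[p])
            ++depth[i];

    int maxd = 0;
    for (int i = 0; i < n; ++i)
        if (depth[i] > maxd) maxd = depth[i];

    // Bucket by depth so that levels[0] holds the deepest supernodes:
    // a parent always lands in a later bucket than any of its descendants,
    // and no two supernodes in the same bucket depend on each other.
    std::vector<std::vector<int>> levels(maxd + 1);
    for (int i = 0; i < n; ++i)
        levels[maxd - depth[i]].push_back(i);
    return levels;
}
```

Each returned group would then be handed, as one batch, to the streamed BLAS wrapper sketched under the next slide.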

17 BATCHED / STREAMED BLAS Batch all BLAS calls to amortize kernel launch latency; stream multiple batches to increase occupancy. Simply wrap the cublas subroutine with a batch loop and a stream rotation. [Timeline figure, DGEMM example with m,n,k=1: host <-> device transfers and kernel execution over time; data on host 100 Mflops, data on device 500 Mflops, batched 1.2 Gflops, streamed 4.8 Gflops; DGEMM w/ m,n,k=1 -> 40 GF.]
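
A minimal sketch of that wrapping, assuming the small GEMMs arrive as plain lists of device pointers and sizes (all names here are illustrative): the loop issues every call back to back, which amortizes launch latency, and rotates over a handful of streams so that the small kernels from different streams can overlap and raise occupancy.

```cpp
// Batched / streamed DGEMM by wrapping the ordinary cuBLAS call in a loop
// over a list of small, independent GEMMs, cycling across a few streams.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void batched_streamed_dgemm(const std::vector<const double*> &A,
                            const std::vector<const double*> &B,
                            const std::vector<double*>       &C,
                            const std::vector<int> &m,
                            const std::vector<int> &n,
                            const std::vector<int> &k,
                            int nstreams = 4)
{
    std::vector<cudaStream_t>   streams(nstreams);
    std::vector<cublasHandle_t> handles(nstreams);
    for (int s = 0; s < nstreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cublasCreate(&handles[s]);
        cublasSetStream(handles[s], streams[s]);   // one handle per stream
    }

    const double one = 1.0;
    for (size_t i = 0; i < A.size(); ++i) {        // the "batch" loop
        int s = int(i % nstreams);                 // the "stream" rotation
        // C_i += A_i * B_i ; sizes may differ from call to call, which is
        // why a plain loop is used rather than cublasDgemmBatched
        cublasDgemm(handles[s], CUBLAS_OP_N, CUBLAS_OP_N,
                    m[i], n[i], k[i], &one,
                    A[i], m[i], B[i], k[i], &one, C[i], m[i]);
    }

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamSynchronize(streams[s]);         // wait for all batches
        cublasDestroy(handles[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

For batches of uniformly sized GEMMs, cuBLAS also offers cublasDgemmBatched, which takes the pointer lists directly.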

18 BATCHED / STREAMED DGEMM Batched / streamed cublas performance matches MKL for small sizes. Created by wrapping the existing, non-batched routines and passing lists. [Chart: Gflop/s vs. m,n,k for square DGEMM; curves for GPU batched/streamed, GPU streamed, and CPU (4 streams/threads).] 2 x Xeon E5-298 v + K40 (max boost, ECC=off)

19 PLENTY OF PARALLELISM Lower levels of the elimination tree: many supernodes, few descendants. Upper levels: few supernodes, many descendants (large GEMMs). [Chart: audikw_1.mtx, number of supernodes and number of GEMM + SYRK operations per level.]

20 BRANCHES [Table of branch statistics, values not reproduced here: columns are Matrix, # branches, # levels, # supernodes, # root levels, # root supernodes; rows are Fault_, nd24k, inline_, Emilia_, bones, ldoor, bone, Hook_, Geo_, Serena, audikw_, Flan_.] [Figure: elimination tree split into branch 1 ... branch 4 plus the root.]

21 CHOLMOD RESULTS GPU Branches gives a 1.8x average speedup vs. the previous CPU + GPU code and a 2x average speedup vs. CPU; poorly performing matrices see the greatest speedup. [Chart: GFlop/s over the Florida Sparse Matrix Collection for CHOLMOD 4.4 CPU, CPU + GPU, and GPU Branches.] 2 x Xeon E5-298 v + K40 (max boost, ECC=off). http://faculty.cse.tamu.edu/davis/suitesparse.html

22 PCIE DEPENDENCE Dropping from PCIe gen3 to gen1 (12 GB/s -> 3 GB/s, a 75% bandwidth loss): CPU+GPU 2% loss, Branches 17% loss. [Chart: Gflop/s over the Florida Sparse Matrix Collection for 4.4 CPU+GPU and GPU Branches, each at PCIe gen1 and PCIe gen3.] 1 x i7 90K + K40 (max boost, ECC=on)

23 SHELL MODEL PERFORMANCE Numerical factorization rate (GF/s) for a PCB shell model ( million degrees of freedom; 50,082 supernodes; 40 branches of 114 1,70 supernodes and 8-20 levels; 49 levels and 7 supernodes in the root branch). [Chart: 4.4 CPU, 4.4 CPU+GPU, Branches 1xK40.] socket x 1 core HSW E Ghz w/ 25 GB + 2xK40 (ECC=ON, full boost). PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd

24 SHELL MODEL PERFORMANCE The Branches algorithm is well-suited for multi-GPU, and we've ported the previous algorithm to multi-GPU. With 4 x K40: overall 1.5x speedup, Branches .1x speedup. [Timeline figure: compute kernels and host <-> device transfers, 1 x K40 vs. 4 x K40.] 50,082 supernodes; 40 branches; 114 1,70 supernodes; 8-20 levels; 49 levels in root branch; 7 supernodes.
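
Because the branches are independent until the root is reached, a natural multi-GPU mapping is to hand whole branches to devices. The sketch below shows only that device and handle management; process_branch_on_device is a hypothetical placeholder for the per-branch level / batch / stream work described on the earlier slides.

```cpp
// Round-robin assignment of independent elimination-tree branches to GPUs.
// Each device gets its own cuBLAS handle, created while that device is current.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 1) return 0;

    std::vector<cublasHandle_t> handles(ngpus);
    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);              // handle is bound to device g
        cublasCreate(&handles[g]);
    }

    const int nbranches = 4;           // e.g. branch 1 .. branch 4
    for (int b = 0; b < nbranches; ++b) {
        int g = b % ngpus;             // branch -> GPU mapping
        cudaSetDevice(g);
        // process_branch_on_device(b, handles[g]);  // hypothetical: the
        // per-branch level-by-level, batched/streamed factorization
    }

    for (int g = 0; g < ngpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();       // all branches finish before the root
        cublasDestroy(handles[g]);
    }
    return 0;
}
```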

25 SHELL MODEL PERFORMANCE Numerical factorization rate (GF/s) for the PCB shell model ( million degrees of freedom) on 1x, 2x and 4x K40. [Chart: 4.4 CPU, 4.4 CPU+GPU, Branches 1xK40, Branches 2xK40, Branches 2xK40-Proj., Branches 4xK40, Branches 4xK40-Proj.; projections assume 87.5% parallel efficiency.] socket x 1 core HSW E Ghz w/ 25 GB + 2xK40 (ECC=ON, full boost). PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd

26 CONCLUSIONS Factoring branches on the GPU avoids the PCIe bottleneck. Batching and streaming permits higher performance on small matrices. Universally beneficial; aspects apply to other factorization methods. Future: improved performance of batched routines, support for hybrid computing, complete multi-GPU support.

27 RELATED WORK S522 - GPU Acceleration of WSMP (Watson Sparse Matrix Package), Natalia Gimelshein, Anshul Gupta; S51 - DAG-Scheduled Linear Algebra Using Template-Based Building Blocks, Jonathan Hogg; S547 - Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs, Azzam Haidar, Stanimire Tomov; S - Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement, Tim Davis; S527 - Jacobi-Davidson Eigensolver in Cusolver Library, Lung-Sheng Chien

28 THANK YOU
