Jacobi-Based Eigenvalue Solver on GPU
Lung-Sheng Chien, NVIDIA
lchien@nvidia.com
Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions
Symmetric eigenvalue solver
The standard form is A x = λ x with A = A^T (Hermitian in the complex case).
The QR algorithm is state-of-the-art, the most popular method, and is implemented in LAPACK.
The Jacobi method was introduced in 1846, prior to the QR algorithm. The parallelism of the Jacobi method makes a GPU implementation appealing.
Naming convention:
- syevd (heevd): QR algorithm
- syevj (heevj): Jacobi method
QR algorithm (syevd)
sytrd : tridiagonalization, A = Q T Q^T
stedc : spectrum of the tridiagonal, T = Z Σ Z^T
ormtr : form eigenvectors, V = Q Z
Example of QR algorithm
sytrd:
A = [ 1  2  3  4
      2  5  6  7
      3  6  8  9
      4  7  9 10 ]
reduces to the tridiagonal
T = [ 1     5.39   0      0
      5.39  22.48  2.768  0
      0     2.768  0.286  0.18
      0     0      0.18   0.232 ]
with an orthogonal Q (slide shows entries ≈ 0.37, 0.74, 0.56, 0.31, 0.77, 0.6, 0.3).
stedc:
T = Z Σ Z^T with Σ = diag(-0.81, 0.185, 0.558, 24.1)
and an orthogonal Z collecting the eigenvectors of T (slide shows entries ≈ 0.73, 0.33, 0.24, 0.05, ...).
Pros and cons
Step 1 [GPU]: sytrd is about 60% BLAS2 and 40% BLAS3.
Step 2 [CPU]: stedc performs the QR algorithm sequentially.
- The runtime of stedc is of the same order as sytrd.
- The CPU is occupied during the eigenvalue solve.
- The performance therefore depends on the CPU as well.
Step 3 [GPU]: ormtr applies a sequence of Householder products.

n = 4096, double precision:
routine | time (sec)
sytrd   | 4.59
stedc   | 14.55
ormtr   | 3.65

Goal: find an alternative to replace stedc.
Jacobi method (syevj)
A series of plane rotations; each rotation eliminates one off-diagonal entry:
  A(k+1) = J(k)^T A(k) J(k),  A(0) = A
Monotone property:
  off(A(k+1)) < off(A(k)),  where off(A)^2 = sum_{i != j} a(i,j)^2
With a proper termination condition,
  lim_{k -> inf} A(k) = Σ  and  V = J(0) J(1) J(2) ...
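The single rotation above can be made concrete with a small pure-Python sketch (illustrative only, not the cuSOLVER kernel): compute the rotation for one (p, q) pair with the classical formulas from Golub & Van Loan, apply it to the 4-by-4 example used on the following slides, and observe that off(A) drops.

```python
import math

def off_norm(A):
    """sqrt of the sum of squares of all off-diagonal entries."""
    n = len(A)
    return math.sqrt(sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j))

def jacobi_rotate(A, p, q):
    """One symmetric Jacobi rotation A <- J^T A J that zeroes A[p][q] (in place)."""
    if A[p][q] == 0.0:
        return
    theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
    t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
    c = 1.0 / math.sqrt(t * t + 1.0)
    s = t * c
    n = len(A)
    for k in range(n):              # column rotation of A
        akp, akq = A[k][p], A[k][q]
        A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
    for k in range(n):              # row rotation of A
        apk, aqk = A[p][k], A[q][k]
        A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk

# The 4x4 example from the slides:
A = [[1, 2, 3, 4], [2, 5, 6, 7], [3, 6, 8, 9], [4, 7, 9, 10]]
print(round(off_norm(A), 5))   # 19.74842
jacobi_rotate(A, 0, 1)         # eliminate (1,2)
print(round(off_norm(A), 5))   # 19.54482
```

Each rotation lowers off(A)^2 by exactly 2 a(p,q)^2, which is the monotone property.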
Example of Jacobi method
Eliminate (1,2):
A = [ 1  2  3  4
      2  5  6  7
      3  6  8  9
      4  7  9 10 ],  off(A) = 19.74842
-> [ 0.17  0     0.48  1.02
     0     5.83  6.70  8.00
     0.48  6.70  8     9
     1.02  8.00  9    10 ],  off = 19.54482
Eliminate (1,3):
-> [ 0.14  0.4   0     0.47
     0.4   5.83  6.70  8.00
     0     6.70  8.03  9.05
     0.47  8.00  9.05 10 ],  off = 19.53325
The monotone property holds: off decreases after every rotation.
Eliminate (1,4):
-> [ 0.12  0.8   0.4   0
     0.8   5.83  6.70  7.97
     0.4   6.70  8.03  9.03
     0     7.97  9.03 10.02 ],  off = 19.52188
Eliminate (2,3):
-> [ 0.12  0.32  0.84  0
     0.32  0.16  0     0.23
     0.84  0    13.7  12
     0     0.23 12    10.02 ],  off = 17.08458
(1,4) and (2,3) operate on non-overlapping rows and columns.
Cyclic Jacobi
(1)  V = I,  eps = tol * off(A)      // tol controls accuracy
(2)  while off(A) > eps
(3)    for p = 1 : n-1
(4)      for q = p+1 : n
(5)        compute the Jacobi rotation J(p, q, theta)
(6)        A = J^T A J               // column and row rotation of A
(7)        V = V J                   // column rotation of V
(8)      end // for q
(9)    end // for p
(10) end // while
A sweep consists of n(n-1)/2 rotations.
Quadratic convergence is measured in # of sweeps.
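The pseudocode above translates almost line for line into a minimal serial sketch (pure Python for illustration; the real syevj executes the rotations on the GPU). Run on the slides' 4-by-4 example, the converged diagonal matches the spectrum quoted earlier, smallest eigenvalue ≈ -0.81 and largest ≈ 24.1.

```python
import math

def off_norm(A):
    n = len(A)
    return math.sqrt(sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j))

def syevj_cyclic(A, tol=1e-12):
    """Cyclic Jacobi eigensolver: returns (sorted eigenvalues, V) with A = V diag(w) V^T."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    eps = tol * off_norm(A)
    sweeps = 0
    while off_norm(A) > eps and sweeps < 50:
        sweeps += 1
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta ** 2 + 1.0))
                c = 1.0 / math.sqrt(t ** 2 + 1.0)
                s = t * c
                for k in range(n):      # column rotation of A
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):      # row rotation of A
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):      # column rotation of V
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    w = sorted(A[i][i] for i in range(n))
    return w, V

A = [[1, 2, 3, 4], [2, 5, 6, 7], [3, 6, 8, 9], [4, 7, 9, 10]]
w, V = syevj_cyclic(A)
print([round(x, 3) for x in w])   # smallest ≈ -0.81, largest ≈ 24.06
```

The rotations preserve trace and determinant, which gives a cheap sanity check: the diagonal must sum to 24 and its product must equal det(A) = -2.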
Parallel Jacobi
n/2 pairs of non-overlapping (p, q) can be eliminated in parallel.
Eliminate (1,2) and (3,4) simultaneously:
A = [ 1  2  3  4
      2  5  6  7
      3  6  8  9
      4  7  9 10 ]
-> [ 0.17  0     0.32  1.07
     0     5.83  0.35 10.4
     0.32  0.35  0.05  0
     1.07 10.4   0    18.06 ]

sweep | off(A)
1     | 1.195820
2     | 2.849376E-1
3     | 2.355936E-4
4     | 1.279163E-15
Block Jacobi
Partition A into blocks:
A = [ A(1,1) A(1,2) ... ; A(2,1) A(2,2) ... ; ... ]
Eliminate the off-diagonal block (p,q) by a block Jacobi rotation, i.e. diagonalize the small subproblem
  [ A(p,p)  A(p,q)
    A(q,p)  A(q,q) ]
The basic building block is batched syevj.
Column and row rotations are applied by efficient GEMM.
The proper termination condition is propagated to the subproblems.
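A serial pure-Python sketch of one block-Jacobi sweep may help make the structure concrete. It is illustrative only: `small_eig` stands in for the batched syevj kernel, `matmul` stands in for the GEMM calls, and the 6-by-6 Hilbert matrix and block size 2 are made-up test values, not from the slides.

```python
import math

def matmul(X, Y):
    """Plain matrix multiply (stands in for GEMM on the GPU)."""
    n, m, k = len(X), len(Y[0]), len(Y)
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(m)] for i in range(n)]

def transpose(X):
    return [list(row) for row in zip(*X)]

def off_norm(A):
    n = len(A)
    return math.sqrt(sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j))

def small_eig(S, sweeps=10):
    """Diagonalize the small 2b-by-2b subproblem with cyclic Jacobi
    (stands in for the batched syevj kernel). Returns the orthogonal W."""
    m = len(S)
    S = [row[:] for row in S]
    W = [[float(i == j) for j in range(m)] for i in range(m)]
    for _ in range(sweeps):
        for p in range(m - 1):
            for q in range(p + 1, m):
                if abs(S[p][q]) < 1e-15:
                    continue
                theta = (S[q][q] - S[p][p]) / (2.0 * S[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta ** 2 + 1.0))
                c = 1.0 / math.sqrt(t ** 2 + 1.0)
                s = t * c
                for k in range(m):
                    skp, skq = S[k][p], S[k][q]
                    S[k][p], S[k][q] = c * skp - s * skq, s * skp + c * skq
                for k in range(m):
                    spk, sqk = S[p][k], S[q][k]
                    S[p][k], S[q][k] = c * spk - s * sqk, s * spk + c * sqk
                for k in range(m):
                    wkp, wkq = W[k][p], W[k][q]
                    W[k][p], W[k][q] = c * wkp - s * wkq, s * wkp + c * wkq
    return W

def block_jacobi_sweep(A, b):
    """One block-Jacobi sweep with block size b: for each off-diagonal block
    (p, q), solve the 2b-by-2b subproblem, then apply the rotation to the
    whole matrix with matrix multiplies (the GEMM part)."""
    n = len(A)
    nb = n // b
    for p in range(nb - 1):
        for q in range(p + 1, nb):
            idx = list(range(p * b, p * b + b)) + list(range(q * b, q * b + b))
            S = [[A[i][j] for j in idx] for i in idx]
            W = small_eig(S)
            # Embed W into an n-by-n rotation J and apply A <- J^T A J.
            J = [[float(i == j) for j in range(n)] for i in range(n)]
            for a, i in enumerate(idx):
                for c, j in enumerate(idx):
                    J[i][j] = W[a][c]
            A[:] = matmul(transpose(J), matmul(A, J))
    return A

# Made-up test case: 6-by-6 Hilbert matrix, block size 2.
n = 6
A = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
for _ in range(3):
    block_jacobi_sweep(A, 2)
print(off_norm(A))   # shrinks toward machine zero sweep by sweep
```

The monotone property carries over: each block rotation zeroes one off-diagonal block, so off(A) decreases every sweep.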
Comparison
                            QR algorithm (syevd)                      Jacobi method (syevj)
Basic routines              sytrd, stedc, ormtr                       batched syevj, GEMM
GPU friendly?               sytrd (yes), stedc (no: CPU with a        batched syevj (yes), GEMM (yes)
                            single thread), ormtr (yes)
Scalable for next-gen GPU   sytrd (yes), stedc (no), ormtr (yes)      batched syevj (yes), GEMM (yes)
Computational complexity    low                                       high
Good for small matrices     No                                        Yes, with batched syevj
Approximate eigenvalues     No, it computes exact eigenvalues         Yes, accuracy is controlled by tol
Support for s, d, c, z      Yes                                       Yes
Stable algorithm            Yes                                       Yes
Quadratic convergence       Yes                                       Yes
Complexity Analysis
The QR algorithm costs about as much as 2 sweeps of the Jacobi method.
To reach machine zero, the Jacobi method needs about 7 sweeps in single precision and 15 sweeps in double precision.
Although the complexity of the Jacobi method is higher, its parallelism makes it faster on small matrices. Once the matrix gets bigger, the Jacobi method suffers from its higher complexity.
Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions
Experimental Setup
CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, dual socket
GPU: Tesla K40
Comparison against MKL 11.0.4
Ngflops = 2*n^3 / time, i.e. gflops normalized w.r.t. GEMM

                    K40           E5-2690 v2
# processor cores   2880          10
Core clock          754 MHz       3 GHz
Bandwidth           180 GB/s      12 GB/s
SGEMM               2768 Gflops   386 Gflops
DGEMM               1221 Gflops   185 Gflops

http://ark.intel.com/products/75279/intel-xeon-processor-e5-2690-v2-25m-cache-3_00-ghz
https://www.nvidia.com/content/pdf/kepler/tesla-k40-active-board-spec-bd-06949-001_v03.pdf
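The Ngflops numbers in the tables that follow can be reproduced from raw runtimes with a one-line helper; the 23-second timing below is a hypothetical example, not a measured value from the slides.

```python
def ngflops(n, seconds):
    """Normalized Gflops for an n-by-n solve: 2*n^3 flops / runtime / 1e9."""
    return 2.0 * n ** 3 / seconds / 1.0e9

# Hypothetical example: a 4096 x 4096 solve taking 23 seconds.
print(round(ngflops(4096, 23.0), 2))   # 5.98
```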
Performance of syevj
The Jacobi method is faster than the QR algorithm for small matrices, up to size 256.
The Jacobi method falls behind for large matrices due to its higher complexity.

Ngflops, double precision:
n    | cusolver syevd | cusolver syevj | MKL syevd
32   | 0.008          | 0.04           | 0.004
64   | 0.03           | 0.15           | 0.04
128  | 0.11           | 0.46           | 0.19
256  | 0.50           | 1.05           | 1.30
512  | 1.42           | 2.88           | 5.19
1024 | 3.00           | 4.56           | 15.26
2048 | 5.56           | 5.96           | 32.66
4096 | 5.93           | 7.35           | 43.66
8192 | 4.85           | 9.80           | 48.35
Performance of batched syevj
Batched syevj relies on shared memory; the matrix dimension is limited to 32.
The performance stabilizes once the GPU is fully utilized.
Batched syevj is faster than MKL with 16 threads for s, d and c.

n = 32:
data type | MKL, 16 threads (Ngflops) | speedup of batched syevj
S         | 0.73                      | 7.3
D         | 1.59                      | 1.4
C         | 3.13                      | 3.0
Z         | 4.55                      | 0.8
Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions
Application 1: SVD
SVD computes the singular values and the left/right singular vectors U/V:
  A = U Σ V^T
LAPACK uses the QR algorithm.
The Jacobi method applies to the SVD as well, because the monotone property still holds.
Naming convention:
- gesvd : QR algorithm
- gesvdj : Jacobi method
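One common way the Jacobi method is carried over to the SVD is the one-sided (Hestenes) variant, sketched below in pure Python for illustration (the slides do not spell out which variant gesvdj uses). Each rotation orthogonalizes a pair of columns of A, which is exactly a Jacobi step on the implicit Gram matrix A^T A, so the monotone property carries over; the column norms converge to the singular values.

```python
import math

def one_sided_jacobi_svd(A, max_sweeps=30, tol=1e-12):
    """One-sided (Hestenes) Jacobi SVD sketch: rotate column pairs of A until
    all columns are mutually orthogonal. Returns (singular values, V)."""
    m, n = len(A), len(A[0])
    U = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        done = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Entries of the implicit Gram matrix A^T A for columns p, q.
                app = sum(U[k][p] * U[k][p] for k in range(m))
                aqq = sum(U[k][q] * U[k][q] for k in range(m))
                apq = sum(U[k][p] * U[k][q] for k in range(m))
                if abs(apq) > tol * math.sqrt(app * aqq):
                    done = False
                    theta = (aqq - app) / (2.0 * apq)
                    t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta ** 2 + 1.0))
                    c = 1.0 / math.sqrt(t ** 2 + 1.0)
                    s = t * c
                    for k in range(m):      # rotate columns p, q of A
                        ukp, ukq = U[k][p], U[k][q]
                        U[k][p], U[k][q] = c * ukp - s * ukq, s * ukp + c * ukq
                    for k in range(n):      # accumulate right singular vectors
                        vkp, vkq = V[k][p], V[k][q]
                        V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
        if done:
            break
    sigma = [math.sqrt(sum(U[k][j] ** 2 for k in range(m))) for j in range(n)]
    return sigma, V

# Small made-up example:
sigma, V = one_sided_jacobi_svd([[1.0, 2.0], [3.0, 4.0]])
print(sorted(round(x, 3) for x in sigma))   # [0.366, 5.465]
```

The product of the singular values equals |det A| (here 2), which gives a quick correctness check.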
Performance of gesvdj
The Jacobi method is faster than the QR algorithm for small matrices, up to size 512.
For large matrices, the Jacobi method holds up reasonably well for s and c compared with MKL.

Ngflops, double precision:
n    | cusolver gesvd | cusolver gesvdj | MKL gesvd
32   | 0.004          | 0.036           | 0.089
64   | 0.014          | 0.118           | 0.034
128  | 0.064          | 0.395           | 0.118
256  | 0.203          | 1.089           | 0.768
512  | 0.628          | 1.872           | 1.734
1024 | 1.688          | 2.489           | 5.487
2048 | 3.944          | 2.749           | 13.083
4096 | 6.178          | 2.714           | 18.111
8192 | 8.242          | 2.65            | 11.061
Performance of batched gesvdj
The matrix size is limited to 32-by-32.
The performance stabilizes once the GPU is fully utilized.
Batched gesvdj is faster than MKL with 16 threads for s and d.

data type | MKL, 16 threads (Ngflops) | speedup of batched gesvdj
S         | 2.044                     | 1.5
D         | 1.527                     | 1.1
C         | 6.486                     | 0.8
Z         | 5.176                     | 0.3
Application 2: multi-GPU syevj
syevj runs on four K40s.
syevj is competitive against MKL for single precision.
syevj is ½ to … of MKL for double precision.
Application 3: approximate eigensolver
Use case: the full (inaccurate) spectrum is needed quickly, or one cannot afford a large cluster for a dense eigensolver.
Example: the hydrogen atom, with energy levels E_n = -13.6 eV / n^2.
Naming convention: syevij (ij stands for incomplete Jacobi)
Accuracy of syevij
The resolution is 16 grid points per dimension; the matrix is 4096-by-4096 with 27136 nonzeros.
There are 5 bound states but syevij reports 0 bound states.
The error bound is about 0.01 per eigenvalue on average.
(Slide figure: computed spectrum ranging from -12.43 eV to 20.36 eV.)
Performance of syevij
The complexity is still high, but syevij is much faster than a dense eigensolver.
Strong scaling across GPUs is not significant in this case:
- 2 GPUs: 1.4x speedup
- 4 GPUs: 1.7x speedup
Single precision keeps the same accuracy but is 2x faster.

Double precision, runtime (seconds); n is the number of grid points per dimension:
n  | matrix size | nnz       | 1 K40  | 2 K40  | 4 K40
16 | 4,096       | 27,136    | 0.6    | 0.6    | 0.7
32 | 32,768      | 223,232   | 27.2   | 18.8   | 16.0
64 | 262,144     | 1,810,432 | 1919.6 | 1320.7 | 1016.6
Conclusions
- Optimal complexity may not be the best for parallel computing.
- The Jacobi method is faster than MKL for small matrices, as well as for batched operations.
- The Jacobi method can be applied to both the symmetric eigenvalue problem and the SVD.
- The Jacobi method uses limited CPU resources.
- CUDA 9 will have: syevj, batched syevj, gesvdj, batched gesvdj, multi-GPU syevj, multi-GPU syevij.
Thank you!
[1] Gene H. Golub, Charles F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press.
[2] LAPACK: Symmetric Eigenproblems, http://www.netlib.org/lapack/lug/node48.html, http://www.netlib.org/lapack/lug/node30.html