Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

1 Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA

2 Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions

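3 Symmetric eigenvalue solver
- The standard form is $A V = V \Sigma$, where $A$ is symmetric (Hermitian), the columns of $V$ are the eigenvectors, and $\Sigma$ is the diagonal matrix of eigenvalues
- The QR algorithm is state-of-the-art, the most popular method, and implemented in LAPACK
- The Jacobi method was introduced in 1846, prior to the QR algorithm
- The parallelism of the Jacobi method makes a GPU implementation appealing
- Naming convention
  - syevd (heevd): QR algorithm
  - syevj (heevj): Jacobi method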

4 QR algorithm (syevd)
- sytrd: tridiagonalization, $Q^H A Q = T$
- stedc: spectrum of the tridiagonal matrix, $T = U \Sigma U^H$
- ormtr: form the eigenvectors, $V = Q U$

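5 Example of QR algorithm
- sytrd reduces A to a tridiagonal matrix [numeric matrices omitted]
- stedc computes Σ from the tridiagonal matrix [numeric matrices omitted]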

6 Pros and Cons
- Step 1: [GPU] sytrd has 60% BLAS2 and 40% BLAS3
- Step 2: [CPU] stedc performs the QR algorithm sequentially
  - the runtime of stedc is about the same as sytrd
  - the CPU is occupied during the eigenvalue solve
  - the performance depends on the CPU as well
- Step 3: [GPU] ormtr is a sequence of Householder products
- Find an alternative to replace stedc

  n = 4096, double precision
  routine   time (sec)
  sytrd     4.59
  stedc
  ormtr     3.65

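7 Jacobi method (syevj)
- A series of rotations, where each rotation eliminates one off-diagonal entry:
  $A_0 = A$, $A_{k+1} = J_k^H A_k J_k$, $V_0 = I$, $V_{k+1} = V_k J_k$
- Monotone property: $\mathrm{off}(A_k) > \mathrm{off}(A_{k+1})$, where $\mathrm{off}(A) = \sqrt{\sum_{i \ne j} |a_{ij}|^2}$
- With a proper termination condition, $\Sigma = \lim_{k \to \infty} A_k$ and $V = \lim_{k \to \infty} V_k$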
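
The monotone property can be checked on a single rotation. The following is a sketch for the real symmetric case (where $J^H = J^T$), in the notation of Golub and Van Loan; the symbols $c$, $s$, $t$ and $\tau$ are not taken from the slide.

```latex
% Rotation J = J(p,q,\theta): identity except for four entries
J_{pp} = J_{qq} = c, \qquad J_{pq} = s, \qquad J_{qp} = -s, \qquad c^2 + s^2 = 1

% Require the (p,q) entry of B = J^T A J to vanish
b_{pq} = c s \,(a_{pp} - a_{qq}) + (c^2 - s^2)\, a_{pq} = 0

% With t = s/c this is t^2 + 2 \tau t - 1 = 0; take the root of smaller magnitude
\tau = \frac{a_{qq} - a_{pp}}{2 a_{pq}}, \qquad
t = \frac{\operatorname{sign}(\tau)}{|\tau| + \sqrt{1 + \tau^2}}, \qquad
c = \frac{1}{\sqrt{1 + t^2}}, \qquad s = t c

% The Frobenius norm is preserved and only rows/columns p and q change, so
\mathrm{off}(B)^2 = \mathrm{off}(A)^2 - 2 a_{pq}^2
```

Each rotation with $a_{pq} \ne 0$ therefore strictly reduces $\mathrm{off}(A)$.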

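8 Example of Jacobi method
- Eliminate (1,2), then eliminate (1,3) [numeric matrices omitted]
- The monotone property holds after each rotation

9 Example of Jacobi method (continued)
- Eliminate (1,4), then eliminate (2,3) [numeric matrices omitted]
- (1,4) and (2,3) operate on non-overlapping rows and columns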

10 Cyclic Jacobi

    V = I
    eps = tol * ||A||_F
    while off(A) > eps              // tol controls accuracy
      for p = 1 : n-1
        for q = p+1 : n
          compute J = J(p, q, θ)
          A = A * J                 // column rotation of A
          A = J^H * A               // row rotation of A
          V = V * J                 // column rotation of V
        end // for q
      end // for p
    end // while

- A sweep consists of n(n-1)/2 rotations
- Quadratic convergence is measured in the number of sweeps
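
The loop above can be sketched in plain Python for a small real symmetric matrix. This is an illustrative CPU version, not the cuSOLVER implementation; the names `off_norm`, `cyclic_jacobi` and `max_sweeps` are hypothetical, and the rotation formula is the standard symmetric Schur step rather than anything stated on the slide.

```python
import math

def off_norm(A):
    """Frobenius norm of the off-diagonal part of A."""
    n = len(A)
    return math.sqrt(sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j))

def cyclic_jacobi(A, tol=1e-12, max_sweeps=30):
    """Return (eigenvalues, V, sweeps) for a real symmetric A given as a list of lists."""
    n = len(A)
    A = [row[:] for row in A]                          # work on a copy
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    eps = tol * math.sqrt(sum(x * x for row in A for x in row))
    sweeps = 0
    while off_norm(A) > eps and sweeps < max_sweeps:   # max_sweeps is a safety cap (assumption)
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue                           # nothing to eliminate
                # Rotation (c, s) that zeroes A[p][q].
                tau = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = t * c
                for k in range(n):                     # column rotation of A: A = A * J
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):                     # row rotation of A: A = J^T * A
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):                     # column rotation of V: V = V * J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
        sweeps += 1
    return [A[i][i] for i in range(n)], V, sweeps
```

For example, `cyclic_jacobi([[2.0, 1.0], [1.0, 2.0]])` returns eigenvalues close to 1 and 3, in no particular order. The eigenvalues are not sorted, and column j of V is the eigenvector for the j-th returned eigenvalue.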

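11 Parallel Jacobi
- There are n/2 pairs of non-overlapping (p, q), which can be eliminated in parallel
- Example: eliminate (1,2) and (3,4) at the same time [numeric matrices omitted]
- [table: Off(A) after each sweep, decreasing to the order of E-15; values omitted]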
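
One common way to produce the n/2 non-overlapping pairs is a round-robin (tournament) ordering. The sketch below assumes even n and 1-based indices; the function name `parallel_jacobi_schedule` is hypothetical, and the slides do not say which ordering cuSOLVER uses.

```python
def parallel_jacobi_schedule(n):
    """Yield n-1 rounds; each round is a list of n/2 disjoint (p, q) pairs with p < q."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    players = list(range(1, n + 1))
    for _ in range(n - 1):
        # Pair position i with position n-1-i; the pairs in a round share no index.
        yield [tuple(sorted((players[i], players[n - 1 - i]))) for i in range(n // 2)]
        # Keep the first index fixed and rotate the others by one position.
        players = [players[0]] + [players[-1]] + players[1:-1]

for round_pairs in parallel_jacobi_schedule(4):
    print(round_pairs)
```

For n = 4 the three rounds are (1,4),(2,3), then (1,3),(2,4), then (1,2),(3,4). Together the n-1 rounds cover all n(n-1)/2 pairs, so they form one sweep, and the rotations within a round touch disjoint rows and columns.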

12 Block Jacobi
- Partition A into blocks, $A = (A_{ij})$
- Eliminate the off-diagonal block (p,q) by a Jacobi rotation [equation omitted]
- The basic block is batched syevj
- Column and row rotations are done by efficient GEMM
- Propagate a proper termination condition

13 Comparison

                                  | QR algorithm (syevd)                                    | Jacobi method (syevj)
Basic routines                    | sytrd, stedc, ormtr                                     | batched syevj, GEMM
GPU friendly?                     | sytrd (yes); stedc (no), CPU with single thread; ormtr (yes) | batched syevj (yes); GEMM (yes)
Scalable for next generation GPU  | sytrd (yes); stedc (no); ormtr (yes)                    | batched syevj (yes); GEMM (yes)
Computational complexity          | low                                                     | high
Good for small matrix             | No                                                      | Yes, with batched syevj
Approximate eigenvalue            | No, it computes exact eigenvalues                       | Yes, accuracy is controlled by tol
Support for s, d, c, and z        | Yes                                                     | Yes
Stable algorithm                  | Yes                                                     | Yes
Quadratic convergence             | Yes                                                     | Yes

14 Complexity Analysis
- The QR algorithm costs about as much as 2 sweeps of the Jacobi method
- To reach machine zero, the Jacobi method needs 7 sweeps for single precision and 15 sweeps for double precision
- Although the complexity of the Jacobi method is higher, its parallelism makes it faster on small matrices
- Once the matrix gets bigger, the Jacobi method suffers from its higher complexity

15 Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions

16 Experimental Setup
- CPU: Intel(R) Xeon(R) CPU E… v2, 3 GHz, dual socket
- GPU: K40
- Comparison against MKL
- Ngflops = 2*n^3 / time, i.e. gflops normalized w.r.t. GEMM

                              K40           Xeon E… v2
  Number of processor cores
  Core clock                  754 MHz       3 GHz
  Bandwidth                   180 GB/s      12 GB/s
  SGEMM                       2768 Gflops   386 Gflops
  DGEMM                       1221 Gflops   185 Gflops
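
The normalized metric can be written as a one-line helper. This is a sketch: the function name `ngflops` is hypothetical, and the division by 1e9 (to express the result in Gflop/s) is an assumption, since the slide gives only 2*n^3 / time.

```python
def ngflops(n, seconds):
    """Normalized Gflop/s: charge 2*n^3 flops (one n-by-n GEMM), whatever the solver really does."""
    return 2.0 * n ** 3 / seconds / 1e9   # 1e9 converts flop/s to Gflop/s (assumption)

# A routine that takes 4.59 s at n = 4096 scores roughly 29.9 Ngflops.
print(round(ngflops(4096, 4.59), 1))
```

Because every solver is charged the same 2*n^3 flops, a higher Ngflops value means a shorter runtime at that n, and the result can be compared directly against the SGEMM/DGEMM peaks in the table.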

17 Performance of syevj
- The Jacobi method is faster than the QR algorithm for small matrices, up to size 256
- The Jacobi method falls behind for large matrices due to its high complexity
- [table: Ngflops, double precision; columns n, cusolver syevd, cusolver syevj, MKL syevd; values omitted]

18 Performance of batched syevj
- Batched syevj relies on shared memory, so the matrix dimension is limited to 32
- The performance stabilizes when the GPU is fully utilized
- Batched syevj is faster than MKL with 16 threads for s, d and c
- [table, n=32: Ngflops and speedup versus MKL with 16 threads for data types S, D, C, Z; values omitted]

19 Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions

20 Application 1: SVD
- SVD computes the singular values and the left/right singular vectors U/V: $A = U \Sigma V^H$
- LAPACK uses the QR algorithm
- The Jacobi method can be applied to SVD because the monotone property still holds
- Naming convention
  - gesvd: QR algorithm
  - gesvdj: Jacobi method
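
One textbook way to carry the Jacobi idea over to SVD is the one-sided (Hestenes) variant, which rotates pairs of columns until they are mutually orthogonal. The sketch below handles a real m-by-n matrix with m >= n; the name `jacobi_svd` and its parameters are hypothetical, and the slides do not state that gesvdj is implemented this way.

```python
import math

def jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi: return (U, sigma, V) with A ~= U * diag(sigma) * V^T."""
    m, n = len(A), len(A[0])
    U = [row[:] for row in A]                          # columns are orthogonalized in place
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(U[k][p] ** 2 for k in range(m))
                beta = sum(U[k][q] ** 2 for k in range(m))
                gamma = sum(U[k][p] * U[k][q] for k in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue                           # columns p and q already orthogonal
                rotated = True
                # Symmetric 2x2 rotation applied to the Gram matrix
                # [[alpha, gamma], [gamma, beta]] of columns p and q.
                tau = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(1.0 + tau * tau))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = t * c
                for M in (U, V):                       # column rotation of U and of V
                    for row in M:
                        xp, xq = row[p], row[q]
                        row[p], row[q] = c * xp - s * xq, s * xp + c * xq
        if not rotated:
            break                                      # a full sweep without rotations: converged
    sigma = [math.sqrt(sum(U[k][j] ** 2 for k in range(m))) for j in range(n)]
    for j in range(n):                                 # normalize the columns of U
        if sigma[j] > 0.0:
            for k in range(m):
                U[k][j] /= sigma[j]
    return U, sigma, V
```

For `A = [[3.0, 0.0], [4.0, 5.0]]` the singular values come out close to sqrt(5) and sqrt(45). They are not sorted, and a column of U belonging to a zero singular value is left as zeros in this sketch.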

21 Performance of gesvdj
- The Jacobi method is faster than the QR algorithm for small matrices, up to size 512
- For large matrices, the Jacobi method remains competitive with MKL for s and c
- [table: Ngflops, double precision; columns n, cusolver gesvd, cusolver gesvdj, MKL gesvd; values omitted]

22 Performance of batched gesvdj
- The matrix size is limited to 32-by-32
- The performance stabilizes when the GPU is fully utilized
- Batched gesvdj is faster than MKL with 16 threads for s and d
- [table: Ngflops and speedup versus MKL with 16 threads for data types S, D, C, Z; values omitted]

23 Application 2: multigpu syevj
- syevj runs on four K40 GPUs
- syevj is competitive against MKL for single precision
- syevj reaches ½ to … of MKL performance for double precision

24 Application 3: approximate eigensolver
- Use case: obtaining the full spectrum quickly at reduced accuracy, or when a large cluster for a dense eigensolver is not affordable
- Example: hydrogen atom energy [formulas omitted]
- Naming convention: syevij (ij stands for incomplete Jacobi)

25 Accuracy of syevij
- The resolution is 16 grid points in each dimension
- The matrix is 4096-by-4096 with … nonzeros
- There are 5 bound states, but syevij reports 0 bound states
- The error bound is 0.01 per eigenvalue on average
- [eigenvalue comparison in eV omitted]

26 Performance of syevij
- The complexity is still … but syevij is much faster than a dense eigensolver
- Strong scaling of multigpu is not significant in this case
  - 2 GPU: 1.4x speedup
  - 4 GPU: 1.7x speedup
- Single precision keeps the same accuracy but is 2x faster
- [table: double precision runtime in seconds; columns n, matrix size, nnz, 1 K40, 2 K40, 4 K40; values omitted]

27 Conclusions
- Optimal complexity may not be the best for parallel computing
- The Jacobi method is faster than MKL for small matrices, as well as for batched operations
- The Jacobi method can be applied to the symmetric eigenvalue solver and to SVD
- The Jacobi method uses limited CPU resources
- CUDA 9 will have
  - syevj, batched syevj
  - gesvdj, batched gesvdj
  - multigpu syevj
  - multigpu syevij

28 Thank you!
[1] Gene H. Golub, Charles F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins
[2] LAPACK: Symmetric Eigenproblems
