Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA, lchien@nvidia.com


Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions

Symmetric eigenvalue solver
The standard form is A x = lambda x with A = A^T (A = A^H in the Hermitian case). The QR algorithm is the state of the art, the most popular method, and the one implemented in LAPACK. The Jacobi method was introduced in 1846, prior to the QR algorithm. The parallelism of the Jacobi method makes a GPU implementation appealing.
Naming convention:
- syevd (heevd): QR algorithm
- syevj (heevj): Jacobi method

QR algorithm (syevd)
Step 1, sytrd: tridiagonalization, A = Q T Q^T
Step 2, stedc: spectrum of the tridiagonal matrix, T = Z Sigma Z^T
Step 3, ormtr: form the eigenvectors, V = Q Z

Example of QR algorithm
Start from

    A = [ 1  2  3  4
          2  5  6  7
          3  6  8  9
          4  7  9 10 ]

sytrd reduces A to the tridiagonal matrix

    T = [ 1      5.39   0      0
          5.39   22.48  2.768  0
          0      2.768  0.286  0.18
          0      0      0.18   0.232 ]

together with the orthogonal factor Q. stedc then computes the spectrum of T,

    Sigma ~ diag(-0.81, 0.185, 0.558, 24.1),

and its eigenvector matrix Z, and ormtr forms the eigenvectors of A as V = Q Z.

Pros and cons
Step 1 [GPU]: sytrd is about 60% BLAS2 and 40% BLAS3.
Step 2 [CPU]: stedc performs the QR algorithm sequentially:
- stedc dominates the runtime (about three times sytrd at n = 4096)
- the CPU is occupied during the eigenvalue solve
- performance depends on the CPU as well
Step 3 [GPU]: ormtr is a sequence of Householder products.

n = 4096, double precision:

    routine   time (sec)
    sytrd     4.59
    stedc     14.55
    ormtr     3.65

Goal: find an alternative to replace stedc.

Jacobi method (syevj)
A series of rotations, where each rotation eliminates one off-diagonal entry:

    A_{k+1} = J_k^T A_k J_k,   A_0 = A

Monotone property:

    off(A_{k+1})^2 = off(A_k)^2 - 2 a_pq^2 < off(A_k)^2,

where off(A)^2 is the sum of a_ij^2 over i != j, and a_pq is the entry eliminated at step k. With a proper termination condition,

    Sigma = lim A_k,   V = lim J_0 J_1 ... J_k
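The rotation and the monotone property can be checked numerically. A minimal NumPy sketch (not cuSOLVER's implementation), using the 4x4 matrix from the examples:

```python
import numpy as np

def off(A):
    """Frobenius norm of the off-diagonal part of A."""
    return np.sqrt(np.sum(A * A) - np.sum(np.diag(A) ** 2))

def jacobi_rotate(A, p, q):
    """Return J^T A J, where the rotation J is chosen to zero A[p, q]."""
    A = A.copy()
    if A[p, q] != 0.0:
        tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
        # smaller root of t^2 + 2*tau*t - 1 = 0, for numerical stability
        t = (1.0 if tau >= 0 else -1.0) / (abs(tau) + np.hypot(1.0, tau))
        c = 1.0 / np.hypot(1.0, t)
        s = t * c
        J = np.eye(A.shape[0])
        J[p, p] = J[q, q] = c
        J[p, q], J[q, p] = s, -s
        A = J.T @ A @ J
    return A

A = np.array([[1., 2, 3, 4],
              [2., 5, 6, 7],
              [3., 6, 8, 9],
              [4., 7, 9, 10]])
A1 = jacobi_rotate(A, 0, 1)  # eliminate entry (1,2)
print(round(off(A), 5), round(off(A1), 5))  # 19.74842 19.54482
```

Each elimination reduces off(A)^2 by exactly 2 a_pq^2, which is the quantity that drives convergence.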

Example of Jacobi method
Start from

    A = [ 1  2  3  4
          2  5  6  7
          3  6  8  9
          4  7  9 10 ],   off(A) = 19.74842

(entries below are shown to the precision of the slides, with signs omitted).

Eliminate (1,2):

    A1 = [ 0.17  0     0.48  1.02
           0     5.83  6.70  8.00
           0.48  6.70  8     9
           1.02  8.00  9     10 ],   off(A1) = 19.54482

Eliminate (1,3):

    A2 = [ 0.14  0.40  0     0.47
           0.40  5.83  6.70  8.00
           0     6.70  8.03  9.05
           0.47  8.00  9.05  10 ],   off(A2) = 19.53325

Eliminate (1,4):

    A3 = [ 0.12  0.80  0.40  0
           0.80  5.83  6.70  7.97
           0.40  6.70  8.03  9.03
           0     7.97  9.03  10.02 ],   off(A3) = 19.52188

Eliminate (2,3):

    A4 = [ 0.12  0.32  0.84  0
           0.32  0.16  0     0.23
           0.84  0     13.7  12
           0     0.23  12    10.02 ],   off(A4) = 17.08458

The monotone property holds: off decreases at every step. Note that (1,4) and (2,3) operate on non-overlapping rows and columns, so those two rotations are independent.

Cyclic Jacobi

    (1)  V = I
    (2)  while off(A) > tol
    (3)    for p = 1 : n-1
    (4)      for q = p+1 : n
    (5)        compute the rotation J(p, q, theta) that zeroes a_pq
    (6)        A = J^T A J      // row and column rotation of A
    (7)        V = V J          // column rotation of V
    (8)      end // for q
    (9)    end // for p
    (10) end // while

tol controls the accuracy. A sweep consists of n(n-1)/2 rotations. Quadratic convergence is measured in the number of sweeps.
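Spelled out as runnable code, a minimal NumPy sketch of this loop (again illustrative only, not the cuSOLVER kernel):

```python
import numpy as np

def cyclic_jacobi(A, tol=1e-12, max_sweeps=30):
    """Cyclic Jacobi: sweep over all pairs (p, q), p < q, until off(A) is tiny.
    Returns eigenvalues, eigenvectors, and the number of sweeps used."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    V = np.eye(n)
    off = lambda M: np.sqrt(np.sum(M * M) - np.sum(np.diag(M) ** 2))
    eps = tol * np.linalg.norm(A)          # tol controls the accuracy
    sweeps = 0
    while off(A) > eps and sweeps < max_sweeps:
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
                t = (1.0 if tau >= 0 else -1.0) / (abs(tau) + np.hypot(1.0, tau))
                c = 1.0 / np.hypot(1.0, t)
                s = t * c
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J            # row and column rotation of A
                V = V @ J                  # column rotation of V
        sweeps += 1
    return np.diag(A), V, sweeps

A = np.array([[1., 2, 3, 4], [2., 5, 6, 7], [3., 6, 8, 9], [4., 7, 9, 10]])
w, V, sweeps = cyclic_jacobi(A)
print(np.sort(w))  # eigenvalues, approx -0.81, 0.185, 0.558, 24.1
```

On the 4x4 example it converges in a handful of sweeps and reproduces the eigenvalues found by the QR path.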

Parallel Jacobi
There are n/2 pairs of non-overlapping (p, q), which can be processed in parallel. Eliminate (1,2) and (3,4) simultaneously:

    A' = [ 0.17  0     0.32  1.07
           0     5.83  0.35  10.4
           0.32  0.35  0.05  0
           1.07  10.4  0     18.06 ]

Convergence on this example:

    sweep   off(A)
    1       1.195820
    2       2.849376E-1
    3       2.355936E-4
    4       1.279163E-15
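A standard way to generate the n/2 disjoint pairs per parallel step is a round-robin ("chess tournament") ordering. A minimal sketch (the actual schedule used inside syevj is not specified in the slides):

```python
def tournament_rounds(n):
    """Round-robin ordering: n-1 rounds, each pairing all n indices into
    n/2 disjoint (p, q) pairs, so all rotations in a round touch
    non-overlapping rows and columns and can run in parallel (n even)."""
    players = list(range(n))
    for _ in range(n - 1):
        yield [tuple(sorted((players[i], players[n - 1 - i]))) for i in range(n // 2)]
        players = [players[0], players[-1]] + players[1:-1]  # rotate all but index 0

for rnd in tournament_rounds(4):
    print(rnd)
# [(0, 3), (1, 2)]
# [(0, 2), (1, 3)]
# [(0, 1), (2, 3)]
```

For n = 4 this reproduces the slide's grouping: (1,4) with (2,3) in one round, (1,2) with (3,4) in another, covering all n(n-1)/2 pairs in n-1 parallel rounds.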

Block Jacobi
Partition A into blocks A = (A_pq). Eliminate the off-diagonal block (p, q) with a block Jacobi rotation that diagonalizes the pivot

    [ A_pp  A_pq
      A_qp  A_qq ]

The basic building block is batched syevj; the column and row rotations are done by efficient GEMMs; the termination condition is propagated properly.
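A hedged NumPy sketch of one block-Jacobi sweep, where np.linalg.eigh on each pivot block stands in for the batched syevj kernel and the updates are plain GEMMs (the real kernel's blocking and scheduling are not described here):

```python
import numpy as np

def block_jacobi_sweep(A, b):
    """One block-Jacobi sweep on symmetric A with block size b (b divides n).
    Diagonalizing each 2b-by-2b pivot zeroes the off-diagonal block (p, q)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    V = np.eye(n)
    for p in range(n // b - 1):
        for q in range(p + 1, n // b):
            idx = np.r_[p * b:(p + 1) * b, q * b:(q + 1) * b]
            _, U = np.linalg.eigh(A[np.ix_(idx, idx)])  # stand-in for batched syevj
            A[:, idx] = A[:, idx] @ U                   # column rotation (GEMM)
            A[idx, :] = U.T @ A[idx, :]                 # row rotation (GEMM)
            V[:, idx] = V[:, idx] @ U
    return A, V

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B + B.T
A1, V = block_jacobi_sweep(A, 2)  # off-diagonal mass shrinks every sweep
```

Each block elimination is an orthogonal similarity, so off(A)^2 decreases by the off-diagonal mass of the pivot block, mirroring the scalar monotone property.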

Comparison

                               QR algorithm (syevd)                    Jacobi method (syevj)
    Basic routines             sytrd, stedc, ormtr                     batched syevj, GEMM
    GPU friendly?              sytrd (yes), ormtr (yes),               batched syevj (yes), GEMM (yes)
                               stedc (no: CPU, single thread)
    Scalable for next-gen GPU  sytrd (yes), stedc (no), ormtr (yes)    batched syevj (yes), GEMM (yes)
    Computational complexity   low                                     high
    Good for small matrices    no                                      yes, with batched syevj
    Approximate eigenvalues    no, it computes exact eigenvalues       yes, accuracy controlled by tol
    Support for s, d, c, z     yes                                     yes
    Stable algorithm           yes                                     yes
    Quadratic convergence      yes                                     yes

Complexity analysis
The cost of the QR algorithm is roughly that of 2 sweeps of the Jacobi method. To reach machine zero, the Jacobi method needs about 7 sweeps in single precision and 15 sweeps in double precision. Although the complexity of the Jacobi method is higher, its parallelism makes it faster on small matrices. As the matrix grows, the Jacobi method suffers from its higher complexity.

Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions

Experimental setup
CPU: Intel(R) Xeon(R) E5-2690 v2 @ 3.00 GHz, dual socket
GPU: Tesla K40
Comparison against MKL 11.0.4
Metric: Ngflops = 2*n^3 / time, i.e. gflops normalized with respect to the GEMM flop count

                       K40           E5-2690 v2
    processor cores    2880          10
    core clock         745 MHz       3 GHz
    bandwidth          180 GB/s      12 GB/s
    SGEMM              2768 Gflops   386 Gflops
    DGEMM              1221 Gflops   185 Gflops

http://ark.intel.com/products/75279/intel-xeon-processor-e5-2690-v2-25m-cache-3_00-ghz
https://www.nvidia.com/content/pdf/kepler/tesla-k40-active-board-spec-bd-06949-001_v03.pdf
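The metric charges every solver the GEMM flop count 2*n^3 regardless of its true flop count, so algorithms are compared on wall-clock time alone. A one-line sketch, applied to the QR pipeline timings above (4.59 + 14.55 + 3.65 s at n = 4096):

```python
def ngflops(n, seconds):
    """Normalized Gflop/s: charge every solver 2*n^3 flops (the GEMM count),
    so different algorithms are compared purely on time."""
    return 2.0 * n ** 3 / seconds / 1e9

print(round(ngflops(4096, 4.59 + 14.55 + 3.65), 2))  # 6.03
```

The result is close to the ~5.9 Ngflops reported for cusolver syevd at n = 4096 in the next slide (presumably a different run).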

Performance of syevj
The Jacobi method is faster than the QR algorithm for small matrices, up to about size 256. For large matrices the Jacobi method falls behind because of its higher complexity.

Ngflops, double precision:

    n      cusolver syevd   cusolver syevj   MKL syevd
    32     0.008            0.04             0.004
    64     0.03             0.15             0.04
    128    0.11             0.46             0.19
    256    0.50             1.05             1.30
    512    1.42             2.88             5.19
    1024   3.00             4.56             15.26
    2048   5.56             5.96             32.66
    4096   5.93             7.35             43.66
    8192   4.85             9.80             48.35

Performance of batched syevj
Batched syevj relies on shared memory, so the matrix dimension is limited to 32. Performance stabilizes once the GPU is fully utilized. Batched syevj is faster than MKL with 16 threads for s, d, and c.

n = 32:

    data type   MKL, 16 threads (Ngflops)   speedup of batched syevj
    S           0.73                        7.3
    D           1.59                        1.4
    C           3.13                        3.0
    Z           4.55                        0.8

Outline
- Symmetric eigenvalue solver
- Experiment
- Applications
- Conclusions

Application 1: SVD
The SVD computes the singular values and the left/right singular vectors U/V: A = U Sigma V^T. LAPACK uses the QR algorithm. The Jacobi method also applies to the SVD because the monotone property still holds.
Naming convention:
- gesvd: QR algorithm
- gesvdj: Jacobi method
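One common Jacobi formulation for the SVD is the one-sided variant, which rotates pairs of columns until they are mutually orthogonal; the slides do not state which formulation gesvdj uses, so this NumPy sketch is only illustrative:

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-14, max_sweeps=30):
    """One-sided Jacobi SVD: orthogonalize columns of A pair by pair;
    the final column norms are the singular values."""
    U = np.array(A, dtype=float)
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue
                converged = False
                tau = (beta - alpha) / (2.0 * gamma)
                t = (1.0 if tau >= 0 else -1.0) / (abs(tau) + np.hypot(1.0, tau))
                c = 1.0 / np.hypot(1.0, t)
                s = t * c
                R = np.array([[c, s], [-s, c]])
                U[:, [p, q]] = U[:, [p, q]] @ R  # orthogonalize columns p, q
                V[:, [p, q]] = V[:, [p, q]] @ R
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    return U / sigma, sigma, V                   # A = U * sigma @ V.T

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
U, s, V = one_sided_jacobi_svd(A)
```

Each rotation is implicitly a two-sided Jacobi step on A^T A, so the same monotone decrease of the off-diagonal mass applies.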

Performance of gesvdj
The Jacobi method is faster than the QR algorithm for small matrices, up to about size 512. For large matrices the Jacobi method remains reasonable for s and c compared with MKL.

Ngflops, double precision:

    n      cusolver gesvd   cusolver gesvdj   MKL gesvd
    32     0.004            0.036             0.089
    64     0.014            0.118             0.034
    128    0.064            0.395             0.118
    256    0.203            1.089             0.768
    512    0.628            1.872             1.734
    1024   1.688            2.489             5.487
    2048   3.944            2.749             13.083
    4096   6.178            2.714             18.111
    8192   8.242            2.65              11.061

Performance of batched gesvdj
The matrix size is limited to 32-by-32. Performance stabilizes once the GPU is fully utilized. Batched gesvdj is faster than MKL with 16 threads for s and d.

    data type   MKL, 16 threads (Ngflops)   speedup of batched gesvdj
    S           2.044                       1.5
    D           1.527                       1.1
    C           6.486                       0.8
    Z           5.176                       0.3

Application 2: multi-GPU syevj
syevj runs on four K40s. syevj is competitive with MKL for single precision; for double precision, syevj reaches roughly half the performance of MKL.

Application 3: approximate eigensolver
Use case: obtaining the full, inexact spectrum quickly, or when one cannot afford a large cluster for a dense eigensolver. Model problem: the hydrogen atom, whose bound-state energies are E_n = -13.6 eV / n^2.
Naming convention: syevij ("ij" stands for incomplete Jacobi).

Accuracy of syevij
The resolution is 16 grid points per dimension; the matrix is 4096-by-4096 with 27,136 nonzeros. There are 5 bound states, but syevij reports 0 bound states. The error bound is 0.01 per eigenvalue on average; the computed spectrum lies between -12.43 eV and 20.36 eV.

Performance of syevij
The complexity is still high, but syevij is much faster than a dense eigensolver. Strong scaling across multiple GPUs is limited in this case: 2 GPUs give a 1.4x speedup, 4 GPUs 1.7x. Single precision keeps the same accuracy but is 2x faster.

Double precision, runtime (seconds):

    n    matrix size   nnz         1 K40    2 K40    4 K40
    16   4,096         27,136      0.6      0.6      0.7
    32   32,768        223,232     27.2     18.8     16.0
    64   262,144       1,810,432   1919.6   1320.7   1016.6

Conclusions
- Optimal complexity may not be the best choice for parallel computing.
- The Jacobi method is faster than MKL for small matrices, as well as for batched operations.
- The Jacobi method applies to both the symmetric eigenvalue problem and the SVD.
- The Jacobi method uses few CPU resources.
- CUDA 9 will include: syevj and batched syevj; gesvdj and batched gesvdj; multi-GPU syevj; multi-GPU syevij.

Thank you!

References
[1] Gene H. Golub and Charles F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press.
[2] LAPACK Users' Guide, Symmetric Eigenproblems: http://www.netlib.org/lapack/lug/node48.html, http://www.netlib.org/lapack/lug/node30.html