Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Slide 1: Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning
Edmond Chow, School of Computational Science and Engineering, Georgia Institute of Technology, USA
SPPEXA Symposium, TU München, Leibniz Rechenzentrum, 25-27 January 2016

Slide 2: Incomplete factorization preconditioning

Given sparse A, compute LU ≈ A with sparsity pattern S = {(i,j) : l_{ij} or u_{ij} can be nonzero}.

Sparse triangular solves: many existing parallel algorithms; all generally use level scheduling.
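As an aside, level scheduling itself is simple to express. The sketch below (Python/SciPy; function and variable names are illustrative, not code from the talk) groups the rows of a lower triangular factor into levels that can each be processed in parallel, which is the baseline the fine-grained approach is compared against.

```python
import numpy as np
from scipy.sparse import csr_matrix

def level_schedule(L):
    """Group rows of a sparse lower triangular CSR factor L into levels.

    Rows within a level have no dependencies on each other, so each level
    can be solved in parallel; levels must be processed in order.
    """
    n = L.shape[0]
    level = np.zeros(n, dtype=np.int64)
    for i in range(n):
        cols = L.indices[L.indptr[i]:L.indptr[i + 1]]
        deps = cols[cols < i]                       # rows that row i depends on
        level[i] = 1 + (level[deps].max() if deps.size else 0)
    return [np.where(level == k)[0] for k in range(1, level.max() + 1)]
```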

Slide 3: Fine-grained parallel ILU factorization

An ILU factorization, A ≈ LU, with sparsity pattern S has the property

    (LU)_{ij} = a_{ij},   (i,j) ∈ S.

Instead of Gaussian elimination, we compute the unknowns

    l_{ij},  i > j,  (i,j) ∈ S
    u_{ij},  i ≤ j,  (i,j) ∈ S

using the constraints

    \sum_{k=1}^{\min(i,j)} l_{ik} u_{kj} = a_{ij},   (i,j) ∈ S.

If the diagonal of L is fixed, then there are |S| unknowns and |S| constraints.

Slide 4: Solving the constraint equations

The equation corresponding to (i,j) gives

    l_{ij} = \frac{1}{u_{jj}} \Big( a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \Big),   i > j
    u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj},   i ≤ j.

The equations have the form x = G(x). It is natural to try to solve these equations via a fixed-point iteration with an initial guess x^{(0)}:

    x^{(k+1)} = G(x^{(k)}).

Parallelism: can use one thread per equation for computing x^{(k+1)}.
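For concreteness, here is a minimal sketch of one synchronous sweep of this fixed-point iteration, written in Python with dense arrays for clarity (in the actual algorithm L and U are stored sparsely and one thread handles each (i,j) ∈ S; all names are illustrative):

```python
import numpy as np

def ilu_sweep(A, L, U, S):
    """One synchronous sweep of the fine-grained ILU fixed-point iteration.

    A    : n x n matrix (dense here; sparse in practice)
    L, U : current iterates; L has unit diagonal, U is upper triangular
    S    : iterable of (i, j) pairs in the sparsity pattern (0-based)
    Every update reads only the previous iterates L and U, so all (i, j)
    updates are independent and could run in parallel.
    """
    L_new, U_new = L.copy(), U.copy()
    for i, j in S:
        if i > j:                                   # unknown l_ij
            L_new[i, j] = (A[i, j] - L[i, :j] @ U[:j, j]) / U[j, j]
        else:                                       # unknown u_ij (i <= j)
            U_new[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
    return L_new, U_new
```

A common choice of initial guess (not stated on this slide) is to take the entries of L and U from the corresponding entries of a symmetrically scaled A; a few sweeps of this update then give the approximate factors.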

Slide 5: Convergence is related to the Jacobian of G(x)

[Figure: Jacobian 1-norm vs. number of sweeps (0-9) for the 5-point (2D) and 7-point (3D) centered-difference approximations of the Laplacian. The values are independent of the matrix size.]

Slide 6: Measuring convergence of the nonlinear iterations

Nonlinear residual (or some other norm):

    \|(A - LU)_S\|_F = \Big( \sum_{(i,j) \in S} \Big( a_{ij} - \sum_{k=1}^{\min(i,j)} l_{ik} u_{kj} \Big)^2 \Big)^{1/2}

ILU residual:

    \|A - LU\|_F

Convergence of the preconditioned linear iterations is known to be strongly related to the ILU residual.
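A short sketch of computing both measures with scipy.sparse (assuming A, L, U are CSR matrices and taking S as the pattern of L + U; names are illustrative):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import norm as spnorm

def residual_norms(A, L, U):
    """Return (nonlinear residual ||(A - LU)_S||_F, ILU residual ||A - LU||_F)."""
    R = A - L @ U                           # sparse residual matrix
    ilu_resid = spnorm(R, 'fro')
    pattern = (abs(L) + abs(U)) != 0        # boolean mask of the pattern S
    nonlin_resid = spnorm(R.multiply(pattern), 'fro')
    return nonlin_resid, ilu_resid
```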

Slide 7: Laplacian problem

[Figure: left, relative nonlinear residual Frobenius norm vs. number of sweeps (0-9) for the 2D and 3D Laplacians; right, ILU residual Frobenius norm vs. number of sweeps for the 15x15 and 6x6x6 Laplacians.]

Slide 8: Asynchronous updates for the fixed-point iteration

In the general case, x = G(x) is

    x_1 = g_1(x_1, x_2, x_3, x_4, ..., x_m)
    x_2 = g_2(x_1, x_2, x_3, x_4, ..., x_m)
    x_3 = g_3(x_1, x_2, x_3, x_4, ..., x_m)
    x_4 = g_4(x_1, x_2, x_3, x_4, ..., x_m)
    ...
    x_m = g_m(x_1, x_2, x_3, x_4, ..., x_m)

Synchronous update: all updates use components of x at the same iteration.

Asynchronous update: updates use components of x that are currently available.
- Easier to implement than synchronous updates (no extra vector).
- Convergence can be faster if overdecomposition is used: more like Gauss-Seidel than Jacobi (the practical case on GPUs and Intel Xeon Phi).
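To make the distinction concrete, here is a small serial sketch of the two update styles for a generic x = G(x), where G is a list of functions g_i (my own illustration; "asynchronous" is modeled simply as updates that read whatever values are currently in x, which is what in-place updates do):

```python
import numpy as np

def synchronous_sweep(G, x):
    """Jacobi-style: every g_i reads the same previous iterate (extra vector)."""
    x_prev = x.copy()
    return np.array([g(x_prev) for g in G])

def asynchronous_sweep(G, x):
    """Gauss-Seidel-style: updates are written in place, so later updates
    see components already refreshed during this sweep."""
    for i, g in enumerate(G):
        x[i] = g(x)
    return x
```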

Slide 9: Overdecomposition for asynchronous methods

Consider the case where there are p (block) equations and p threads (no overdecomposition).
- Convergence of asynchronous iterative methods is often believed to be worse than that of the synchronous version: the latest information is likely from the previous iteration (like Jacobi), but could be even older.
- However, if we overdecompose the problem (more tasks than threads), then convergence can be better than that of the synchronous version. Note: on GPUs, we always have to decompose the problem into more thread blocks than multiprocessors (to hide memory latency).
- Updates tend to use fresher information than when all updates are simultaneous; the asynchronous iteration becomes more like Gauss-Seidel than like Jacobi.
- Not all updates happen at the same time (more uniform bandwidth usage, not bursty).

Slide 10: x = G(x) can be ordered into strictly lower triangular form

    x_1 = g_1
    x_2 = g_2(x_1)
    x_3 = g_3(x_1, x_2)
    x_4 = g_4(x_1, x_2, x_3)
    ...
    x_m = g_m(x_1, ..., x_{m-1})

Slide 11: Structure of the Jacobian

[Figure: sparsity structure of the Jacobian for the 5-point matrix on a 4x4 grid.]

For all problems, the Jacobian is sparse and its diagonal is all zeros.

Slide 12: Update order for triangular matrices with overdecomposition

- Consider solving with a triangular matrix using asynchronous iterations.
- The order in which variables are updated can affect the convergence rate.
- We want to perform updates in top-to-bottom order for a lower triangular system.
- On GPUs, we don't know how thread blocks are scheduled; but on the NVIDIA K40, the ordering appears to be deterministic (on earlier GPUs, the ordering appears non-deterministic).

(Joint work with Hartwig Anzt)

Slide 13: 2D FEM Laplacian, n = 203841, RCM ordering, 240 threads on Intel Xeon Phi

         Level 0                        Level 1                        Level 2
Sweeps   PCG iter / nonlin / ILU resid  PCG iter / nonlin / ILU resid  PCG iter / nonlin / ILU resid
0        404 / 1.7e+04 / 41.1350        404 / 2.3e+04 / 41.1350        404 / 2.3e+04 / 41.1350
1        318 / 3.8e+03 / 32.7491        256 / 5.7e+03 / 18.7110        206 / 7.0e+03 / 17.3239
2        301 / 9.7e+02 / 32.1707        207 / 8.6e+02 / 12.4703        158 / 1.5e+03 / 6.7618
3        298 / 1.7e+02 / 32.1117        193 / 1.8e+02 / 12.3845        132 / 4.8e+02 / 5.8985
4        297 / 2.8e+01 / 32.1524        187 / 4.6e+01 / 12.4139        127 / 1.6e+02 / 5.8555
5        297 / 4.4e+00 / 32.1613        186 / 1.4e+01 / 12.4230        126 / 6.5e+01 / 5.8706
IC       297 / 0       / 32.1629        185 / 0       / 12.4272        126 / 0       / 5.8894

A very small number of sweeps is required.

Slide 14: Univ. Florida sparse matrices (SPD cases), 240 threads on Intel Xeon Phi

Matrix          Sweeps   Nonlin resid   PCG iter
af_shell3       0        1.58e+05       852.0
                1        1.66e+04       798.3
                2        2.17e+03       701.0
                3        4.67e+02       687.3
                IC       0              685.0
thermal2        0        1.13e+05       1876.0
                1        2.75e+04       1422.3
                2        1.74e+03       1314.7
                3        8.03e+01       1308.0
                IC       0              1308.0
ecology2        0        5.55e+04       2000+
                1        1.55e+04       1776.3
                2        9.46e+02       1711.0
                3        5.55e+01       1707.0
                IC       0              1706.0
G3_circuit      0        1.06e+05       1048.0
                1        4.39e+04       981.0
                2        2.17e+03       869.3
                3        1.43e+02       871.7
                IC       0              871.0
offshore        0        3.23e+04       401.0
                1        4.37e+03       349.0
                2        2.48e+02       299.0
                3        1.46e+01       297.0
                IC       0              297.0
parabolic_fem   0        5.84e+04       790.0
                1        1.61e+04       495.3
                2        2.46e+03       426.3
                3        2.28e+02       405.7
                IC       0              405.0
apache2         0        5.13e+04       1409.0
                1        3.66e+04       1281.3
                2        1.08e+04       923.3
                3        1.47e+03       873.0
                IC       0              869.0

Slide 15: Timing comparison, ILU(2) on a 100x100 grid (5-point stencil)

[Figure: time (s) vs. number of threads, comparing level-scheduled ILU with iterative ILU (3 sweeps); left, Intel Xeon Phi (1 to 240 threads); right, Intel Xeon E5-2680v2, 20 cores (1 to 20 threads).]

Slide 16: Results for NVIDIA Tesla K40c

PCG iteration counts for a given number of sweeps, and timings [ms]:

               PCG iterations (IC, then sweeps 0-5)           Timings [ms]
Matrix         IC     0      1      2      3      4      5    IC     5 swps   speedup
apache2        958    1430   1363   1038   965    960    958   61.    8.8      6.9
ecology2       1705   2014   1765   1719   1708   1707   1706  107.   6.7     16.0
G3_circuit     997    1254   961    968    993    997    997   110.   12.1     9.1
offshore       330    428    556    373    396    357    332   219.   25.1     8.7
parabolic_fem  393    763    636    541    494    454    435   131.   6.1     21.6
thermal2       1398   1913   1613   1483   1341   1411   1403  454.   15.7    28.9

IC denotes the exact factorization computed using the NVIDIA cusparse library. (Joint work with Hartwig Anzt)

Slide 17: Sparse triangular solves with ILU factors

Slide 18: Iterative and approximate triangular solves

Trade accuracy for parallelism: approximately solve the triangular system Rx = b with

    x_{k+1} = (I - D^{-1} R) x_k + D^{-1} b,

where D is the diagonal part of R. In general, x_k ≈ p(R) b for a polynomial p.

- Implementations depend on SpMV.
- The iteration matrix G = I - D^{-1} R is strictly triangular and has spectral radius 0 (trivial asymptotic convergence); for fast convergence, we want the norm of G to be small.
- Factors R from stable ILU factorizations of physical problems are often close to being diagonally dominant.
- The preconditioner is a fixed linear operator in the non-asynchronous case.
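A sketch of these Jacobi sweeps for an approximate triangular solve (R a sparse triangular factor, D its diagonal; the function name and arguments are illustrative):

```python
import numpy as np

def jacobi_trisolve(R, b, num_sweeps, x0=None):
    """Approximate solve of R x = b for triangular R using
    x_{k+1} = (I - D^{-1} R) x_k + D^{-1} b, with D = diag(R)."""
    d_inv = 1.0 / R.diagonal()
    x = np.zeros_like(b) if x0 is None else x0.copy()
    for _ in range(num_sweeps):
        x = x + d_inv * (b - R @ x)     # one sweep = one SpMV plus vector ops
    return x
```

Starting from x0 = 0, k sweeps yield the first k terms of the Neumann series for R^{-1} b, which is consistent with the related work noted on the next slide.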

Slide 19: Related work

- The above Jacobi method is almost equivalent to approximating the ILU factors with a truncated Neumann series (e.g., van der Vorst 1982).
- Stationary iterations for solving with ILU factors using x_{k+1} = x_k + M r_k, where M is a sparse approximate inverse of the triangular factors (Bräckle and Huckle 2015).

Slide 20: Hook_1498

[Figure: PCG iterations and cost estimate (matvec loads) vs. number of Jacobi sweeps.]

FEM; equations: 1,498,023; nonzeros: 60,917,445.

Slide 21: IC-PCG with exact and iterative triangular solves on Intel Xeon Phi

                IC      PCG iterations         Timing (seconds)       Num.
Matrix          level   Exact    Iterative     Exact     Iterative    sweeps
af_shell3       1       375      592           79.59     23.05        6
thermal2        0       1860     2540          120.06    48.13        1
ecology2        1       1042     1395          114.58    34.20        4
apache2         0       653      742           24.68     12.98        3
G3_circuit      1       329      627           52.30     32.98        5
offshore        0       341      401           42.70     9.62         5
parabolic_fem   0       984      1201          15.74     16.46        1

Table: Results using 60 threads on Intel Xeon Phi. Exact solves used level scheduling.

Slide 22: ILU(0)-GMRES(50) with exact and iterative triangular solves on a Tesla K40c GPU

             Exact     Jacobi sweeps
             solve     1        2        3        4        5        6
chipcool0    75        201      108      95       78       86       84
             1.2132    0.2777   0.1710   0.1761   0.1561   0.1965   0.2127
stomach      10        22       13       12       11       11       10
             0.4789    0.0857   0.0749   0.0952   0.1110   0.1351   0.1443
venkat01     15        49       32       24       20       18       17
             0.5021    0.1129   0.0909   0.0855   0.0874   0.0940   0.1034

Table: Iteration counts (first line of each entry) and timings [s] (second line). Exact solves used cusparse. RCM ordering was used. Performance will depend on the performance of the sparse matvec. Results from Hartwig Anzt.

Slide 23: Geo_1438

[Figure: PCG iterations and cost estimate (matvec loads) vs. number of Jacobi sweeps (0-20).]

FEM; equations: 1,437,960; nonzeros: 63,156,690.

Slide 24: BCSSTK24, solve with L

[Figure: relative residual norm vs. number of Jacobi sweeps (0-200) for one triangular solve.]

Slide 25: BCSSTK24, solve with L

[Figure: relative residual norm in the triangular solve with the lower triangular incomplete Cholesky factor (level 1), without blocking (left) and with blocking, block sizes 6, 12, and 24 (right).]

Slide 26: Block Jacobi for triangular solves

    x_{k+1} = (I - D^{-1} R) x_k + D^{-1} b,

where D is now the block diagonal part of R.

Blocking strategy:
- Group variables associated with a grid point or element.
- Further group these variables if necessary.
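A sketch of the block variant under these stated assumptions: D is the block diagonal of R, and blocks is a user-supplied list of index arrays (e.g., one per grid point or element). Names are illustrative.

```python
import numpy as np
import scipy.linalg as la

def block_jacobi_trisolve(R, b, blocks, num_sweeps):
    """Block Jacobi sweeps for triangular R x = b with D = block diag(R).

    blocks : list of index arrays partitioning the unknowns.
    """
    # factor the small diagonal blocks once, up front
    factors = [la.lu_factor(R[idx, :][:, idx].toarray()) for idx in blocks]
    x = np.zeros_like(b)
    for _ in range(num_sweeps):
        r = b - R @ x                               # one global residual (SpMV)
        for idx, f in zip(blocks, factors):         # independent block solves
            x[idx] += la.lu_solve(f, r[idx])
    return x
```

Computing the residual r once per sweep, before any block is updated, is what makes this block Jacobi rather than block Gauss-Seidel, so the block solves within a sweep remain independent and parallel.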

Slide 27: BCSSTK24

[Figure: left, PCG iterations vs. number of Jacobi sweeps; right, cost estimate (matvec loads) vs. number of Jacobi sweeps; both for block sizes 6, 12, and 24.]

Slide 28: Practical use

Will Jacobi sweeps work for my triangular factor?
- Easy to tell if it won't work: try a few iterations and check the reduction in the residual norm. Example: use 30 iterations and check whether the relative residual norm goes below 10^{-2}.

How many Jacobi sweeps to use?
- Hard to tell beforehand. There is no way to accurately measure the residual norm in asynchronous iterations (i.e., the residual at which iteration?).
- The number of Jacobi sweeps could be dynamically adapted (restart the Krylov method with a preconditioner using a different number of sweeps).

For an arbitrary problem, could Jacobi work? If so, how many sweeps are needed? (What is a bound?)
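The quick feasibility check described above might look like the following sketch (the 30-sweep and 10^{-2} thresholds are taken from the slide; jacobi_trisolve is the earlier sketch, and the random test right-hand side is my own assumption):

```python
import numpy as np

def jacobi_sweeps_look_viable(R, num_sweeps=30, tol=1e-2, rng=None):
    """Heuristic check: run a few Jacobi sweeps on a test right-hand side
    and see whether the relative residual norm drops below tol."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.standard_normal(R.shape[0])
    x = jacobi_trisolve(R, b, num_sweeps)
    return np.linalg.norm(b - R @ x) < tol * np.linalg.norm(b)
```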

Slide 29: Comprehensive tests on SPD problems

Test all SPD matrices in the University of Florida Sparse Matrix Collection with nrows ≥ 1000, excluding diagonal matrices: 171 matrices. Among these, find all problems that can be solved with IC(0) or IC(1).

                                       IC(0)       IC(1)
Total                                  73          86
Num. solved using iter. trisolves      54 (74%)    52 (60%)
Num. solved using block iter. trisolves 68 (93%)   70 (81%)

Block iterative triangular solves used supervariable amalgamation with block size 12.
Caveat: IC may not be the best preconditioner for all these problems.

Slide 30: Number of Jacobi sweeps needed to reach the same number of PCG iterations as when exact solves are used

[Figure: histograms of the number of problems vs. Jacobi sweeps required. Top: iterative triangular solves, IC(0) (left) and IC(1) (right). Bottom: block iterative triangular solves with max block size 12, IC(0) (left) and IC(1) (right).]

Slide 31: Conclusions

Fine-grained algorithm for ILU factorization:
- More parallel work; can use lots of threads.
- Dependencies between threads are obeyed asynchronously.
- Does not use level scheduling or reordering of the matrix.
- Each entry of L and U is updated iteratively in parallel.
- We do not need to solve the bilinear equations exactly.
- The method can be used to update a factorization given a slightly changed A.
- Fixed-point iterations can be performed asynchronously (helps tolerate latency and performance irregularities).

Solving sparse triangular systems from ILU via iteration:
- Blocking can add robustness by reducing the non-normality of the iteration matrices.
- Must try to reduce the DRAM bandwidth used by threads during the iterations.

Acknowledgments: An Intel Parallel Computing Center grant is gratefully acknowledged. This material is based upon work supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Number DE-SC-0012538.