Parallel Numerics, WT 2016/2017
5. Iterative Methods for Sparse Linear Systems of Equations

Contents
1 Introduction
  1.1 Computer Science Aspects
  1.2 Numerical Problems
  1.3 Graphs
  1.4 Loop Manipulations
2 Elementary Linear Algebra Problems
  2.1 BLAS: Basic Linear Algebra Subroutines
  2.2 Matrix-Vector Operations
  2.3 Matrix-Matrix-Product
3 Linear Systems of Equations with Dense Matrices
  3.1 Gaussian Elimination
  3.2 Parallelization
  3.3 QR-Decomposition with Householder matrices
4 Sparse Matrices
  4.1 General Properties, Storage
  4.2 Sparse Matrices and Graphs
  4.3 Reordering
  4.4 Gaussian Elimination for Sparse Matrices
5 Iterative Methods for Sparse Linear Systems of Equations
  5.1 Stationary Methods
  5.2 Nonstationary Methods
  5.3 Preconditioning
6 Domain Decomposition
  6.1 Overlapping Domain Decomposition
  6.2 Non-overlapping Domain Decomposition
  6.3 Schur Complements

Disadvantages of direct methods (in parallel):
- strongly sequential
- may lead to dense matrices: the sparsity pattern changes, additional entries become necessary
- indirect addressing
- storage and computational effort

Iterative solver:
- choose an initial guess = starting vector x^(0), e.g., x^(0) = 0
- iteration function x^(k+1) := Φ(x^(k))

Applied to solving a linear system:
- the main part of Φ should be a matrix-vector multiplication Ax (matrix-free!?)
- easy to parallelize, no change in the sparsity pattern of A
- x^(k) → x = A^{-1} b for k → ∞
- main problem: fast convergence!

5.1 Stationary Methods
5.1.1 Richardson Iteration

Construct from Ax = b an iteration process:
b = Ax = (A - I + I)x = x - (I - A)x
x = b + (I - A)x = b + Nx   (artificial splitting of A, N := I - A)

This leads to a fixed-point equation x = Φ(x) with Φ(x) := b + Nx:
start: x^(0);   x^(k+1) := Φ(x^(k)) = b + N x^(k) = b + (I - A) x^(k)

Richardson Iteration (cont.)
start: x^(0);   x^(k+1) := Φ(x^(k)) = b + N x^(k) = b + (I - A) x^(k)

If x^(k) is convergent, x^(k) → x̄, then
x̄ = Φ(x̄) = b + N x̄ = b + (I - A) x̄   ⟺   A x̄ = b,
and therefore x^(k) → x̄ = x := A^{-1} b.

Residual-based formulation:
x^(k+1) = Φ(x^(k)) = b + (I - A) x^(k) = b + x^(k) - A x^(k) = x^(k) + (b - A x^(k)) = x^(k) + r(x^(k)),
with r(x) := b - Ax the residual.
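
A minimal NumPy sketch of this residual-based Richardson iteration (my illustration, not part of the slides; the test matrix, right-hand side, and tolerance are made-up example values):

    import numpy as np

    def richardson(A, b, x0=None, tol=1e-10, max_iter=1000):
        # x^(k+1) = x^(k) + r(x^(k)),  r(x) = b - A x
        x = np.zeros_like(b) if x0 is None else x0.copy()
        for k in range(max_iter):
            r = b - A @ x                  # residual
            if np.linalg.norm(r) < tol:
                return x, k
            x = x + r                      # Richardson update (N = I - A)
        return x, max_iter

    # toy example: A close to I, so rho(I - A) < 1 and the iteration converges
    rng = np.random.default_rng(0)
    A = np.eye(5) + 0.1 * rng.random((5, 5))
    b = np.ones(5)
    x, iters = richardson(A, b)
    print(iters, np.linalg.norm(b - A @ x))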

Convergence Analysis via Neumann Series
x^(k) = b + N x^(k-1) = b + N(b + N x^(k-2)) = b + Nb + N^2 x^(k-2) = ...
      = b + Nb + N^2 b + ... + N^{k-1} b + N^k x^(0)
      = (sum_{j=0}^{k-1} N^j) b + N^k x^(0)

Special case x^(0) = 0:
x^(k) = (sum_{j=0}^{k-1} N^j) b
x^(k) ∈ span{b, Nb, N^2 b, ..., N^{k-1} b} = span{b, Ab, A^2 b, ..., A^{k-1} b} =: K_k(A, b),
which is called the Krylov space of A and b.

For ||N|| < 1 it holds:
sum_{j=0}^{k-1} N^j → sum_{j=0}^{∞} N^j = (I - N)^{-1} = (I - (I - A))^{-1} = A^{-1}

Convergence Analysis via Neumann Series (cont.)
x^(k) → (sum_{j=0}^{∞} N^j) b = (I - N)^{-1} b = A^{-1} b = x

The Richardson iteration is convergent for ||N|| < 1, i.e., A ≈ I.

Error analysis for e^(k) := x^(k) - x̄:
e^(k+1) = x^(k+1) - x̄ = Φ(x^(k)) - Φ(x̄) = (b + N x^(k)) - (b + N x̄) = N(x^(k) - x̄) = N e^(k)
||e^(k)|| ≤ ||N|| ||e^(k-1)|| ≤ ||N||^2 ||e^(k-2)|| ≤ ... ≤ ||N||^k ||e^(0)||
||N|| < 1  ⟹  ||N||^k → 0  ⟹  ||e^(k)|| → 0 for k → ∞

Convergence if ρ(N) = ρ(I - A) < 1, where ρ is the spectral radius
ρ(N) = |λ_max| = max_i |λ_i|   (λ_i eigenvalue of N).
The eigenvalues of A all have to lie in the circle around 1 with radius 1.

Splittings of A
Convergence of Richardson only in very special cases! Try to improve the iteration for better convergence.
Write A in the form A := M - N:
b = Ax = (M - N)x = Mx - Nx
x = M^{-1} b + M^{-1} N x =: Φ(x)
Φ(x) = M^{-1} b + M^{-1} N x = M^{-1} b + M^{-1}(M - A) x = M^{-1}(b - Ax) + x = x + M^{-1} r(x)

N should be such that Ny can be evaluated efficiently.
M should be such that M^{-1} y can be evaluated efficiently.
x^(k+1) = x^(k) + M^{-1} r^(k)
The iteration with splitting M - N is equivalent to Richardson applied to M^{-1} A x = M^{-1} b.

Convergence
The iteration with splitting A = M - N is convergent if
ρ(M^{-1} N) = ρ(I - M^{-1} A) < 1
For fast convergence it should hold M^{-1} A ≈ I:
M^{-1} A should be better conditioned than A itself.

Such a matrix M is called a preconditioner for A.
It is also used in other iterative methods to accelerate convergence.
Condition number: κ(A) = ||A^{-1}|| ||A||,  λ_max/λ_min,  or  σ_max/σ_min.

5.1.2 Jacobi (Diagonal) Splitting
Choose A = M - N = D - (L + U) with
D = diag(A), and L and U the (negated) strict lower and upper triangular parts of A, so that A = D - L - U.
x^(k+1) = D^{-1} b + D^{-1}(L + U) x^(k) = D^{-1} b + D^{-1}(D - A) x^(k) = x^(k) + D^{-1} r^(k)

Convergent for A ≈ diag(A), e.g., for diagonally dominant matrices:
ρ(D^{-1} N) = ρ(I - D^{-1} A) < 1

Jacobi (Diagonal) Splitting (cont.)
Iteration process written elementwise: x^(k+1) = D^{-1}(b - (A - D) x^(k)), i.e.,
a_jj x_j^(k+1) = b_j - sum_{m=1}^{j-1} a_{j,m} x_m^(k) - sum_{m=j+1}^{n} a_{j,m} x_m^(k)
x_j^(k+1) = (1 / a_jj) (b_j - sum_{m=1, m≠j}^{n} a_{j,m} x_m^(k))

Damping or relaxation for improving convergence.
Idea: view the iterative method as a correction of the last iterate in a search direction,
and introduce a step length for this correction step:
x^(k+1) = x^(k) + D^{-1} r^(k)   →   x^(k+1) = x^(k) + ω D^{-1} r^(k)
with an additional damping parameter ω.
Damped Jacobi iteration:
x^(k+1)_damped = x^(k) + ω D^{-1} r^(k) = ω x^(k+1) + (1 - ω) x^(k)

Damped Jacobi Iteration
x^(k+1) = x^(k) + ω D^{-1} r^(k) = x^(k) + ω D^{-1}(b - A x^(k)) = ... = ω D^{-1} b + [(1 - ω) I + ω D^{-1}(L + U)] x^(k)
is convergent for
ρ((1 - ω) I + ω D^{-1}(L + U)) < 1;
note that the iteration matrix tends to I for ω → 0.
Look for the optimal ω with the best convergence (additional degree of freedom).
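
As a small illustration of the damped Jacobi update x^(k+1) = x^(k) + ω D^{-1} r^(k) (my sketch, with an arbitrary diagonally dominant test matrix; ω = 1 gives plain Jacobi):

    import numpy as np

    def damped_jacobi(A, b, omega=1.0, tol=1e-10, max_iter=5000):
        # x^(k+1) = x^(k) + omega * D^{-1} r^(k);  omega = 1 is plain Jacobi
        D = np.diag(A)                       # diagonal of A as a vector
        x = np.zeros_like(b)
        for k in range(max_iter):
            r = b - A @ x                    # residual
            if np.linalg.norm(r) < tol:
                return x, k
            x = x + omega * r / D            # componentwise D^{-1} r
        return x, max_iter

    # diagonally dominant test matrix, so the Jacobi iteration converges
    A = np.array([[4.0, -1.0, 0.0],
                  [-1.0, 4.0, -1.0],
                  [0.0, -1.0, 4.0]])
    b = np.array([1.0, 2.0, 3.0])
    for omega in (1.0, 0.8):
        x, iters = damped_jacobi(A, b, omega)
        print(omega, iters, np.linalg.norm(b - A @ x))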

Parallelism in the Jacobi Iteration
The Jacobi method is easy to parallelize: it needs only the products Ax and D^{-1}x.
But convergence is often too slow!
Improvement: block Jacobi iteration.

5.1.3 Gauss-Seidel Iteration
Always use the newest information available!
Jacobi iteration:
a_jj x_j^(k+1) = b_j - sum_{m=1}^{j-1} a_{j,m} x_m^(k) - sum_{m=j+1}^{n} a_{j,m} x_m^(k)
(the values x_m^(k+1) for m < j are already computed, but not used)
Gauss-Seidel iteration:
a_jj x_j^(k+1) = b_j - sum_{m=1}^{j-1} a_{j,m} x_m^(k+1) - sum_{m=j+1}^{n} a_{j,m} x_m^(k)
(uses the already computed values x_m^(k+1), m < j)

Gauss-Seidel Iteration (cont.)
Compare the dependency graphs for general iterative algorithms.
Jacobi: x = f(x) = D^{-1}(b + (D - A)x) = D^{-1}(b + (L + U)x).
Gauss-Seidel corresponds to the splitting A = (D - L) - U = M - N:
x^(k+1) = (D - L)^{-1} b + (D - L)^{-1} U x^(k)
        = (D - L)^{-1} b + (D - L)^{-1} (D - L - A) x^(k)
        = x^(k) + (D - L)^{-1} r^(k)
Convergence depends on the spectral radius: ρ(I - (D - L)^{-1} A) < 1

Parallelism in the Gauss-Seidel Iteration
The linear system with D - L is easy to solve because D - L is lower triangular,
but this triangular solve is strongly sequential!
Use red-black ordering or graph colouring as a compromise between parallelism and convergence.

Successive Over-Relaxation (SOR)
Damping or relaxation:
x^(k+1) = x^(k) + ω (D - L)^{-1} r^(k) = ω (D - L)^{-1} b + [(1 - ω) I + ω (D - L)^{-1} U] x^(k)
Convergence depends on the spectral radius of the iteration matrix
(1 - ω) I + ω (D - L)^{-1} U.
Parallelization of SOR == parallelization of Gauss-Seidel.
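
A sketch of this relaxed Gauss-Seidel update x^(k+1) = x^(k) + ω (D - L)^{-1} r^(k) (my illustration; ω = 1 gives plain Gauss-Seidel, and the triangular solve is written with a generic solver for brevity rather than an explicit forward substitution):

    import numpy as np

    def sor(A, b, omega=1.0, tol=1e-10, max_iter=5000):
        # x^(k+1) = x^(k) + omega * (D - L)^{-1} r^(k);  omega = 1 is Gauss-Seidel
        M = np.tril(A)                       # D - L = lower triangle of A incl. diagonal
        x = np.zeros_like(b)
        for k in range(max_iter):
            r = b - A @ x
            if np.linalg.norm(r) < tol:
                return x, k
            x = x + omega * np.linalg.solve(M, r)   # in practice: forward substitution
        return x, max_iter

    A = np.array([[4.0, -1.0, 0.0],
                  [-1.0, 4.0, -1.0],
                  [0.0, -1.0, 4.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(sor(A, b, omega=1.0)[1], sor(A, b, omega=1.1)[1])   # iteration counts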

Stationary Methods (in General)
They can always be written in the two normal forms
x^(k+1) = c + B x^(k)
with convergence depending on the spectral radius ρ(B) of the iteration matrix B, and
x^(k+1) = x^(k) + F r^(k)
with preconditioner F and B = I - F A.

For x^(0) = 0: x^(k) ∈ K_k(B, c), the Krylov space with respect to the matrix B and the vector c.
Slow convergence (but good smoothing properties! → multigrid)

MATLAB Example
clear; n=100; k=10; omega=1; stationary
For the tridiagonal test matrix tridiag(.5, 1, .5):
Jacobi iteration matrix: norm ≈ cos(π/n),  Gauss-Seidel: ≈ cos(π/n)^2.
Both are < 1, so the iterations converge, but only slowly.
To improve convergence: nonstationary methods (or multigrid).
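
As a rough cross-check of these numbers (my own NumPy sketch, not the 'stationary' script used on the slide), one can compute the spectral radii of the Jacobi and Gauss-Seidel iteration matrices for this tridiagonal matrix directly:

    import numpy as np

    n = 100
    # tridiag(.5, 1, .5) as in the example
    A = np.eye(n) + 0.5 * (np.eye(n, k=1) + np.eye(n, k=-1))
    D = np.diag(np.diag(A))
    DL = np.tril(A)                          # D - L

    rho = lambda B: np.max(np.abs(np.linalg.eigvals(B)))
    rho_jacobi = rho(np.eye(n) - np.linalg.solve(D, A))    # I - D^{-1} A
    rho_gs = rho(np.eye(n) - np.linalg.solve(DL, A))       # I - (D - L)^{-1} A

    print(rho_jacobi, np.cos(np.pi / n))     # close to cos(pi/n)
    print(rho_gs, np.cos(np.pi / n) ** 2)    # close to cos(pi/n)^2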

Chair of Informatics V (SCCS): Efficient Numerical Algorithms, Parallel & HPC
- High-dimensional numerics (sparse grids)
- Fast iterative solvers (multi-level methods, preconditioners, eigenvalue solvers)
- Uncertainty Quantification
- Space-filling curves
- Numerical linear algebra
- Numerical algorithms for image processing
- HW-aware numerical programming
Fields of application in simulation:
- CFD (incl. fluid-structure interaction)
- Plasma physics
- Molecular dynamics
- Quantum chemistry
Further info: www5.in.tum.de. Feel free to come around and ask for thesis topics!

5.2 Nonstationary Methods
5.2.1 Gradient Method
Consider A = A^T > 0 (A symmetric positive definite, SPD).
The function Φ(x) = (1/2) x^T A x - b^T x is an n-dimensional paraboloid R^n → R.
Gradient: ∇Φ(x) = Ax - b.
The position with ∇Φ(x) = 0 is exactly the minimum of the paraboloid.

Instead of solving Ax = b, consider min_x Φ(x).
Local descent direction at x: ∇Φ(x)^T y is minimal for the direction y = -∇Φ(x).

Gradient Method (cont.)
Optimization: start with x^(0) and iterate
x^(k+1) := x^(k) + α_k d^(k)
with search direction d^(k) and step size α_k.
In view of the previous results, the locally optimal search direction is d^(k) := -∇Φ(x^(k)).

To define α_k, minimize along this direction:
min_α g(α) := min_α Φ(x^(k) + α (b - A x^(k)))
  = min_α [ (1/2) (x^(k) + α d^(k))^T A (x^(k) + α d^(k)) - b^T (x^(k) + α d^(k)) ]
  = min_α [ (1/2) α^2 d^(k)T A d^(k) - α d^(k)T d^(k) + (1/2) x^(k)T A x^(k) - x^(k)T b ]
⟹  α_k = (d^(k)T d^(k)) / (d^(k)T A d^(k)),   d^(k) = -∇Φ(x^(k)) = b - A x^(k)

Gradient Method (cont. 2)
x^(k+1) = x^(k) + [ ||b - A x^(k)||_2^2 / ((b - A x^(k))^T A (b - A x^(k))) ] (b - A x^(k))
Method of steepest descent.

Disadvantages:
- distorted contour lines
- slow convergence (zig-zag path)
- the local descent direction is not globally optimal
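
A compact NumPy sketch of this steepest-descent update (illustration only; the small SPD test system is made up):

    import numpy as np

    def steepest_descent(A, b, tol=1e-10, max_iter=10000):
        # x^(k+1) = x^(k) + alpha_k d^(k)  with  d^(k) = b - A x^(k)
        x = np.zeros_like(b)
        for k in range(max_iter):
            d = b - A @ x                       # negative gradient = residual
            if np.linalg.norm(d) < tol:
                return x, k
            alpha = (d @ d) / (d @ (A @ d))     # exact 1-dimensional line search
            x = x + alpha * d
        return x, max_iter

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])                  # SPD test matrix
    b = np.array([1.0, 1.0])
    x, iters = steepest_descent(A, b)
    print(iters, x, np.linalg.solve(A, b))      # compare with the direct solution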

Analysis of the Gradient Method
Definition of the A-norm: ||x||_A := sqrt(x^T A x)
Consider the error:
||x - x̄||_A^2 = ||x - A^{-1} b||_A^2 = (x^T - b^T A^{-1}) A (x - A^{-1} b)
             = x^T A x - 2 b^T x + b^T A^{-1} b = 2 Φ(x) + b^T A^{-1} b

Therefore, minimizing Φ is equivalent to minimizing the error in the A-norm!
A more detailed analysis reveals:
||x^(k+1) - x̄||_A^2 ≤ (1 - 1/κ(A)) ||x^(k) - x̄||_A^2
Therefore, for κ(A) ≫ 1 the convergence is very slow!

5.2.2 The Conjugate Gradient Method
Improve the descent direction so that it is globally optimal:
x^(k+1) := x^(k) + α_k p^(k)
with a search direction that is not the negative gradient, but the projection of the gradient that is A-conjugate to all previous search directions:
p^(k) ⊥ A p^(j),  i.e.  p^(k) ⊥_A p^(j)  or  p^(k)T A p^(j) = 0   for all j < k.

We choose the new search direction as the component of the last residual that is A-conjugate to all previous search directions.
α_k is again obtained by 1-dimensional minimization as before (for the chosen p^(k)).

The Conjugate Gradient Algorithm
p^(0) = r^(0) = b - A x^(0)
for k = 0, 1, 2, ... do
    α_k = <r^(k), r^(k)> / <p^(k), A p^(k)>
    x^(k+1) = x^(k) + α_k p^(k)
    r^(k+1) = r^(k) - α_k A p^(k)
    if ||r^(k+1)||_2^2 ≤ ε then break
    β_k = <r^(k+1), r^(k+1)> / <r^(k), r^(k)>
    p^(k+1) = r^(k+1) + β_k p^(k)
end for
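
A direct transcription of this algorithm into NumPy (my sketch; the stopping tolerance and the SPD test problem are arbitrary choices):

    import numpy as np

    def cg(A, b, x0=None, tol=1e-12, max_iter=None):
        # plain conjugate gradients for SPD A, following the notation above
        n = len(b)
        max_iter = n if max_iter is None else max_iter
        x = np.zeros_like(b) if x0 is None else x0.copy()
        r = b - A @ x
        p = r.copy()
        rr = r @ r
        for k in range(max_iter):
            Ap = A @ p
            alpha = rr / (p @ Ap)              # alpha_k = <r,r> / <p,Ap>
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r
            if rr_new <= tol:                  # ||r^(k+1)||_2^2 <= eps
                return x, k + 1
            beta = rr_new / rr                 # beta_k = <r_new,r_new> / <r,r>
            p = r + beta * p
            rr = rr_new
        return x, max_iter

    # SPD test problem: 1D Laplacian-like tridiagonal matrix
    n = 50
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x, iters = cg(A, b)
    print(iters, np.linalg.norm(b - A @ x))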

Properties of Conjugate Gradients
It holds p^(j)T A p^(k) = 0 = r^(j)T r^(k) for j ≠ k, and
span(p^(1), ..., p^(j)) = span(r^(0), ..., r^(j-1)) = span(r^(0), A r^(0), ..., A^{j-1} r^(0)) = K_j(A, r^(0))
Especially for x^(0) = 0 it holds K_j(A, r^(0)) = span(b, Ab, ..., A^{j-1} b).

x^(k) is the best approximate solution to Ax = b in the subspace K_k(A, r^(0)).
For x^(0) = 0:  x^(k) ∈ K_k(A, b), with error
||x^(k) - x̄||_A = min_{x ∈ K_k(A,b)} ||x - x̄||_A
The cheap 1-dimensional minimization gives the optimal k-dimensional solution for free!

Properties of Conjugate Gradients (cont.)
Consequence: After n steps, K_n(A, b) = R^n, and therefore x^(n) = A^{-1} b is the exact solution in exact arithmetic.
Unfortunately, this is not true in floating-point arithmetic.
Furthermore, O(n) iteration steps would be too costly:
costs ≈ (#iterations) × (cost of one matrix-vector product)

The matrix-vector product is easy to parallelize.
But how do we get fast convergence and reduce the number of iterations?

Error Estimation (x^(0) = 0)
||e^(k)||_A = ||x^(k) - x̄||_A = min_{x ∈ K_k(A,b)} ||x - x̄||_A
  = min_{α_0,...,α_{k-1}} || sum_{j=0}^{k-1} α_j A^j b - x̄ ||_A
  = min_{P^(k-1)} || P^(k-1)(A) b - x̄ ||_A
  = min_{P^(k-1)} || P^(k-1)(A) A x̄ - x̄ ||_A
  = min_{P^(k-1)} || (P^(k-1)(A) A - I)(x̄ - x^(0)) ||_A
  = min_{Q^(k), Q^(k)(0)=1} || Q^(k)(A) e^(0) ||_A
for polynomials P^(k-1) of degree k - 1 and Q^(k) of degree k with Q^(k)(0) = 1.

Error Estimation (cont.)
The matrix A has an orthonormal basis of eigenvectors u_j, j = 1,...,n, with eigenvalues λ_j.
It holds A u_j = λ_j u_j, j = 1,...,n, and u_j^T u_k = 0 for j ≠ k and 1 for j = k.
Start error in this ONB: e^(0) = sum_{j=1}^{n} ρ_j u_j

||e^(k)||_A = min_{Q^(k)(0)=1} || Q^(k)(A) sum_{j=1}^{n} ρ_j u_j ||_A
  = min_{Q^(k)(0)=1} || sum_{j=1}^{n} ρ_j Q^(k)(A) u_j ||_A
  = min_{Q^(k)(0)=1} || sum_{j=1}^{n} ρ_j Q^(k)(λ_j) u_j ||_A
  ≤ min_{Q^(k)(0)=1} { max_{j=1,...,n} |Q^(k)(λ_j)| } || sum_{j=1}^{n} ρ_j u_j ||_A
  = min_{Q^(k)(0)=1} { max_{j=1,...,n} |Q^(k)(λ_j)| } ||e^(0)||_A

Error Estimates
By choosing particular polynomials with Q^(k)(0) = 1 we can derive error estimates for the error after the k-th step. Choose, e.g.,
Q^(k)(x) := (1 - 2x / (λ_max + λ_min))^k

||e^(k)||_A ≤ max_{j=1,...,n} |Q^(k)(λ_j)| ||e^(0)||_A
  = |1 - 2 λ_max / (λ_max + λ_min)|^k ||e^(0)||_A
  = ((λ_max - λ_min) / (λ_max + λ_min))^k ||e^(0)||_A
  = ((κ(A) - 1) / (κ(A) + 1))^k ||e^(0)||_A

Better Estimates
Choose normalized Chebyshev polynomials T_n(x) = cos(n arccos(x)):
||e^(k)||_A ≤ (1 / T_k((κ(A) + 1) / (κ(A) - 1))) ||e^(0)||_A ≤ 2 ((sqrt(κ(A)) - 1) / (sqrt(κ(A)) + 1))^k ||e^(0)||_A

For clustered eigenvalues choose a special polynomial, e.g., assume that A has only two distinct eigenvalues λ_1 and λ_2:
Q^(2)(x) := (λ_1 - x)(λ_2 - x) / (λ_1 λ_2)
||e^(2)||_A ≤ max_{j=1,2} |Q^(2)(λ_j)| ||e^(0)||_A = 0
Convergence of CG after two steps!

Outliers/Cluster
Assume the matrix has one eigenvalue λ_1 > 1 and all other eigenvalues are contained in an ε-neighborhood of 1: |λ - 1| < ε for λ ≠ λ_1.
Q^(2)(x) := (λ_1 - x)(1 - x) / λ_1
||e^(2)||_A ≤ max_{|λ-1|<ε} (|λ_1 - λ| |1 - λ| / λ_1) ||e^(0)||_A ≤ ((λ_1 - 1 + ε) ε / λ_1) ||e^(0)||_A = O(ε) ||e^(0)||_A
Very good approximation by CG after only two steps!
Important: a small number of outliers combined with a cluster.
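
This behaviour is easy to observe numerically. The following sketch (my illustration; the outlier value, cluster width, and problem size are arbitrary) runs plain CG on a diagonal SPD matrix with one outlier eigenvalue and the rest clustered around 1, printing the residual norm per step; it drops to roughly the ε level after two steps.

    import numpy as np

    rng = np.random.default_rng(0)
    n, eps = 200, 1e-3
    # one outlier eigenvalue 10, the remaining eigenvalues in (1 - eps, 1 + eps)
    eigs = np.concatenate(([10.0], 1.0 + eps * (2 * rng.random(n - 1) - 1)))
    A = np.diag(eigs)                          # diagonal SPD matrix with these eigenvalues
    b = rng.random(n)

    # plain CG, recording the residual norm after each step
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    for k in range(5):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        print(k + 1, np.linalg.norm(r))        # ~O(eps) already after two steps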

Conjugate Gradients Summary
To get fast convergence and reduce the number of iterations: find a preconditioner M such that M^{-1} A x = M^{-1} b has clustered eigenvalues.
Conjugate gradients (CG) is in general the method of choice for symmetric positive definite A.
To improve convergence, include preconditioning (PCG).
CG has two important properties: it is optimal and cheap.
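
As a sketch of the preconditioning idea (my illustration, using the simplest possible choice M = diag(A), i.e., Jacobi scaling; practical PCG variants differ in the choice of M and in implementation details):

    import numpy as np

    def pcg_jacobi(A, b, tol=1e-10, max_iter=None):
        # preconditioned CG with M = diag(A); applying M^{-1} is a cheap division
        n = len(b)
        max_iter = n if max_iter is None else max_iter
        d = np.diag(A)                     # Jacobi preconditioner
        x = np.zeros_like(b)
        r = b - A @ x
        z = r / d
        p = z.copy()
        rz = r @ z
        for k in range(max_iter):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            if np.linalg.norm(r) < tol:
                return x, k + 1
            z = r / d                      # apply M^{-1}
            rz_new = r @ z
            beta = rz_new / rz
            p = z + beta * p
            rz = rz_new
        return x, max_iter

    # badly scaled SPD test matrix: diagonal scaling clusters the eigenvalues
    n = 100
    A = np.diag(np.logspace(0, 4, n)) + 0.1 * (np.eye(n, k=1) + np.eye(n, k=-1))
    b = np.ones(n)
    x, iters = pcg_jacobi(A, b)
    print(iters, np.linalg.norm(b - A @ x))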

Parallel Conjugate Gradients Algorithm

Non-Blocking Collective Operations
Used to hide communication: they allow the overlap of numerical computation and communication.
In MPI-1/MPI-2 this is only possible for point-to-point communication: MPI_Isend and MPI_Irecv.
Additional libraries are necessary for non-blocking collective operations, for example LibNBC (non-blocking collectives).
They are included in the new MPI-3 standard.

Example: Pseudocode for a Non-Blocking Reduction

    MPI_Request req;
    MPI_Status stat;
    int sbuf1[SIZE], rbuf1[SIZE], buf2[SIZE];

    // compute sbuf1
    compute(sbuf1, SIZE);

    // start non-blocking allreduce of sbuf1;
    // computation and communication overlap
    MPI_Iallreduce(sbuf1, rbuf1, SIZE, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);

    // compute buf2 (independent of sbuf1)
    compute(buf2, SIZE);

    // synchronization
    MPI_Wait(&req, &stat);

    // use data in rbuf1; final computation
    evaluate(rbuf1, buf2, SIZE);

Iterative Methods for general (nonsymmetric) A: BiCGSTAB
- Applicable if little memory is at hand and A^T is not available.
- Computational costs per iteration are similar to BiCG and CGS.
- An alternative to CGS that avoids the irregular convergence patterns of CGS while maintaining a similar convergence speed.
- Less loss of accuracy in the updated residual.

Iterative Methods for general (nonsymmetric) A: GMRES
- Leads to the smallest residual for a fixed number of iteration steps, but these steps become increasingly expensive.
- To limit the increasing storage requirements and work per iteration step, restarting is necessary. When to do so depends on A and b; it requires skill and experience.
- Requires only matrix-vector products with the coefficient matrix.
- The number of inner products grows linearly with the iteration number (up to the restart point).
- In an implementation based on classical Gram-Schmidt the inner products are independent: only one synchronization point. Using modified Gram-Schmidt there is one synchronization point per inner product.
We consider GMRES in the following.

5.2.3 GMRES
Iterative solution method for general A.
Consider a small subspace U_m and determine the optimal approximate solution of Ax = b in U_m. Hence x is of the form x := U_m y:
min_{x ∈ U_m} ||Ax - b||_2 = min_y ||A U_m y - b||_2
This can be solved via the normal equations:
(U_m^T A^T A U_m) y = U_m^T A^T b

Try to find a sequence of good subspaces U_1 ⊂ U_2 ⊂ U_3 ⊂ ... such that the optimal solutions x_1, x_2, x_3, ... → A^{-1} b can be updated iteratively, using mainly matrix-vector products.

GMRES: Subspace
What subspace gives fast convergence and easy computations?
U_m := K_m(A, b) = span(b, Ab, ..., A^{m-1} b)   (Krylov space)
Problem: b, Ab, A^2 b, ... is a bad basis for this subspace!
So we need a first step to compute a good (orthonormal) basis for U_m:
Start with u_1 := b / ||b||_2 and do for j = 2 : m:
ũ_j := A u_{j-1} - sum_{k=1}^{j-1} (u_k^T A u_{j-1}) u_k = A u_{j-1} - sum_{k=1}^{j-1} h_{k,j-1} u_k
u_j := ũ_j / ||ũ_j||_2 = ũ_j / h_{j,j-1}
This is the standard orthogonalization method (Arnoldi method).
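
A compact sketch of this Arnoldi orthogonalization (my illustration; it uses the modified Gram-Schmidt variant, which is what one would typically use in practice):

    import numpy as np

    def arnoldi(A, b, m):
        # orthonormal basis U of K_m(A, b) and the (m+1) x m Hessenberg matrix H
        n = len(b)
        U = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        U[:, 0] = b / np.linalg.norm(b)
        for j in range(m):
            w = A @ U[:, j]
            for k in range(j + 1):             # orthogonalize against u_1, ..., u_{j+1}
                H[k, j] = U[:, k] @ w
                w = w - H[k, j] * U[:, k]
            H[j + 1, j] = np.linalg.norm(w)
            U[:, j + 1] = w / H[j + 1, j]      # assumes no breakdown (H[j+1, j] != 0)
        return U, H

    # quick check of the relation A U_m = U_{m+1} H (up to round-off)
    rng = np.random.default_rng(0)
    A = rng.random((30, 30))
    b = rng.random(30)
    U, H = arnoldi(A, b, 10)
    print(np.linalg.norm(A @ U[:, :10] - U @ H))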

Matrix Form of the Arnoldi ONB
U_m = span(b, Ab, ..., A^{m-1} b) = span(u_1, u_2, ..., u_m)   (ONB)
Write this orthogonalization method in matrix form:
A u_{j-1} = sum_{k=1}^{j-1} h_{k,j-1} u_k + ũ_j = sum_{k=1}^{j} h_{k,j-1} u_k
A U_m = A (u_1 ... u_m) = (u_1 ... u_{m+1}) H̄_{m+1,m} = U_{m+1} H̄_{m+1,m}
with the (m+1) × m upper Hessenberg matrix

H̄_{m+1,m} =
  [ h_11  h_12  ...  h_1m    ]
  [ h_21  h_22  ...  h_2m    ]
  [  0    h_32  ...  h_3m    ]
  [  ...       ...   ...     ]
  [  0    ...   0   h_m+1,m  ]

i.e., h_{k,j} = 0 for k > j + 1.

GMRES: Minimization
This leads to the minimization problem
min_{x ∈ U_m} ||Ax - b||_2 = min_y ||A U_m y - b||_2
  = min_y || U_{m+1} H̄_{m+1,m} y - ||b||_2 u_1 ||_2
  = min_y || U_{m+1} (H̄_{m+1,m} y - ||b||_2 e_1) ||_2
  = min_y || H̄_{m+1,m} y - ||b||_2 e_1 ||_2
because U_{m+1} is part of an orthogonal matrix (its columns are orthonormal).
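
Combining the Arnoldi basis with this small least-squares problem gives a bare-bones (unrestarted, unpreconditioned) GMRES sketch; this is my illustration under those simplifying assumptions, not a production implementation:

    import numpy as np

    def gmres_simple(A, b, m):
        # m Arnoldi steps, then solve min_y || H y - ||b|| e_1 ||_2 and set x = U_m y
        n = len(b)
        U = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        beta = np.linalg.norm(b)
        U[:, 0] = b / beta
        for j in range(m):
            w = A @ U[:, j]
            for k in range(j + 1):
                H[k, j] = U[:, k] @ w
                w = w - H[k, j] * U[:, k]
            H[j + 1, j] = np.linalg.norm(w)
            U[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(m + 1)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H, e1, rcond=None)     # small (m+1) x m least-squares problem
        return U[:, :m] @ y

    rng = np.random.default_rng(0)
    A = np.eye(50) + 0.1 * rng.random((50, 50))        # nonsymmetric test matrix
    b = rng.random(50)
    x = gmres_simple(A, b, m=20)
    print(np.linalg.norm(b - A @ x))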