INTRODUCTION TO SPARSE MATRICES

Size: px

Start display at page:

Download "INTRODUCTION TO SPARSE MATRICES"

Madison Jones
5 years ago
Views:

1 Outline A tutorial on: Iterative methods for Sparse Linear Systems Yousef Saad University of Minnesota Computer Science and Engineering CIMPA - Tlemcen May Part 1 Part 2 Sparse matrices and sparsity Preconditioned iterations Basic iterative techniques Preconditioning techniques Projection methods Multigrid methods Krylov subspace methods Part 3 Parallel implementations Parallel Preconditioners Software CIMPA - Tlemcen May 28April 26, 28 2 Typical Problem: Physical Model Nonlinear PDEs INTRODUCTION TO SPARSE MATRICES Discretization Linearization (Newton) Sequence of Sparse Linear Systems Ax = b CIMPA - Tlemcen May 28April 26, 28 4

2 Introduction: Linear System Solvers Problem considered: Linear systems Ax = b Can view the problem from somewhat different angles: Discretized problem coming from a PDE An algebraic system of equations [ignore origin] Sometimes a system of equations where A is not explicitly available Direct sparse Solvers Fast Poisson Solvers A x = b u = f + bc Iterative Methods Preconditioned Krylov General Purpose Specialized Multigrid Methods CIMPA - Tlemcen May 28April 26, 28 5 CIMPA - Tlemcen May 28April 26, 28 6 Introduction: Linear System Solvers A few observations Much of recent work on solvers has focussed on: (1) Parallel implementation scalable performance (2) Improving Robustness, developing more general preconditioners Problems are getting harder for Sparse Direct methods (more 3-D models, much bigger problems,..) Problems are also getting difficult for iterative methods Cause: more complex models - away from Poisson Researchers in iterative methods are borrowing techniques from direct methods: preconditioners The inverse is also happening: Direct methods are being adapted for use as preconditioners CIMPA - Tlemcen May 28April 26, 28 7 CIMPA - Tlemcen May 28April 26, 28 8

3 What are sparse matrices? Nonzero patterns of a few sparse matrices Common definition:..matrices that allow special techniques to take advantage of the large number of zero elements and the structure. A few applications of sparse matrices: Structural Engineering, Reservoir simulation, Electrical Networks, optimization problems,... Goals: Much less storage and work than dense computations. Observation: A 1 is usually dense, but L and U in the LU factorization may be reasonably sparse (if a good technique is used). ARC13: Unsymmetric matrix from laser problem. a.r.curtis, oct 1974 SHERMAN5: fully implicit black oil simulator 16 by 23 by 3 grid, 3 unk CIMPA - Tlemcen May 28April 26, 28 9 Two types of matrices: structured (e.g. Sherman5) and unstructured (e.g. BP 1) Main goal of Sparse Matrix Techniques: To perform standard matrix computations economically i.e., without storing the zeros of the matrix. Example: To add two square dense matrices of size n requires O(n 2 ) operations. To add two sparse matrices A and B requires O(nnz(A) + nnz(b)) where nnz(x) = number of nonzero elements of a matrix X. PORES3: Unsymmetric MATRIX FROM PORES BP_1: UNSYMMETRIC BASIS FROM LP PROBLEM BP For typical Finite Element /Finite difference matrices, number of nonzero elements is O(n). CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 12

4 Graph Representations of Sparse Matrices Graph theory is a fundamental tool in sparse matrix techniques. Graph G = (V, E) of an n n matrix A defined by Vertices V = {1, 2,..., N}. Edges E = {(i, j) a ij }. Graph is undirected if matrix has symmetric structure: a ij iff a ji. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Example: Adjacency graph of: A =. Example: For any matrix A, what is the graph of A 2? [interpret in terms of paths in the graph of A] Direct versus iterative methods Background. Two types of methods: Direct methods : based on sparse Gaussian eimination, sparse Cholesky,.. Iterative methods: compute a sequence of iterates which converge to the solution - preconditioned Krylov methods.. Remark: These two classes of methods have always been in competition. 4 years ago solving a system with n = 1, was a challenge Now you can solve this in < 1 sec. on a laptop. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 16

5 Sparse direct methods made huge gains in efficiency. As a result they are very competitive for 2-D problems. 3-D problems lead to more challenging systems [inherent to the underlying graph] Problems with many unknowns per grid point similar to 3-D problems Direct Sparse Matrix Techniques Principle of sparse matrix techniques: Store only the nonzero elements of A. Try to minimize computations and (perhaps more importantly) storage. Difficulty in Gaussian elimination: Fill-in Remarks: No robust black-box iterative solvers. Robustness often conflicts with efficiency However, situation improved in last decade Line between direct and iterative solvers blurring Trivial Example: A = CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Reorder equations and unknowns in order N, N 1,..., 1 A stays sparse during Gaussian eliminatin i.e., no fill-in A = Finding the best ordering to minimize fill-in is NP-complete. A number of heuristics developed. Among the best known: Minimum degree ordering (Tinney Scheme 2) Nested Dissection Ordering. Approximate Minimal Degree... Reorderings and graphs Let π = {i 1,, i n } a permutation A π, = { a π(i),j = matrix A with its i-th row replaced by row }i,j=1,...,n number π(i). A,π = matrix A with its j-th column replaced by column π(j). Define P π = I π, = Permutation matrix Then: (1) Each row (column) of P π consists of zeros and exactly one 1 (2) A π, = P π A (3) P π P T π = I (4) A,π = AP T π CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 2

6 Consider now: A = A π,π = P π AP T π Example A 9 9 arrow matrix and its adjacency graph. Entry (i, j) in matrix A is exactly entry in position (π(i), π(j)) in A, i.e., (a ij = a π(i),π(j)) (i, j) E A (π(i), π(j)) E A General picture : i j π(i) π(j) CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Graph and matrix after permuting the nodes in reverse order Cuthill-McKee & reverse Cuthill-McKee A class of reordering techniques proceeds by levels in the graph. Related to Breadth First Search (BFS) traversal in graph theory. Idea of BFS is to visit the nodes by levels. Level = level of starting node. Start with a node, visit its neighbors, then the (unmarked) neighbors of its neighbors, etc... CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 24

7 Example: A B H I * *-----* * / / / / BFS from node A: / Level : A C* * D / Level 1: B, C; \ / Level 2: E, D, H; \ * K Level 3: I, K, E, F, G, H. F \ E * *----* G \ / \ / \ / * H Implementation using levels Algorithm BFS(G, v) by level sets Initialize S = {v}, seen = 1; Mark v; While seen < n Do S new = ; For each node v in S do For each unmarked w in adj(v) do Add w to S new ; Mark w; seen + +; S := S new CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, A few properties of Breadth-First-Search If G is a connected undirected graph then each vertex will be visited once each edge will be inspected at least once Therefore, for a connected undirected graph, The cost of BFS is O( V + E ) Cuthill McKee ordering Algorithm proceeds by levels. Same as BFS except: in each level, nodes are ordered by increasing degree Example B D A G E Distance = level number; For each node v we have: min dist(s, v) = level number(v) = depth T (v) Several reordering algorithms are based on variants of Breadth-First- Search CIMPA - Tlemcen May 28April 26, C F Level Nodes Deg. Order A 2 A 1 B, C 4, 3 C, B 2 D, E, F 3, 4, 2 F, D, E 3 G 2 G CIMPA - Tlemcen May 28April 26, 28 28

8 ALGORITHM : 1 Cuthill Mc Kee ordering. Find an intial node for the traversal 1. Initialize S = {v}, seen = 1, π(seen) = v; Mark v; 2. While seen < n Do 3. S new = ; 4. For each node v, going from lowest to highest degree, Do: 5. π(+ + seen) = v; 6. For each unmarked w in adj(v) do 7. Add w to S new ; 8. Mark w; 9. EndDo Reverse Cuthill McKee ordering The Cuthill - Mc Kee ordering has a tendency to create small arrow matrices (going the wrong way): Origimal matrix CM ordering S := S new 11. EndDo EndWhile CIMPA - Tlemcen May 28April 26, nz = nz = 377 CIMPA - Tlemcen May 28April 26, 28 3 Idea: Take the reverse ordering RCM ordering 1 2 Nested Dissection ordering The idea of divide and conquer recursively divide graph in two using a separator nz = Reverse Cuthill M Kee ordering (RCM). CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 32

9 Nested dissection for a small mesh Second Dissection Third Dissection Original Grid First dissection CIMPA - Tlemcen May 28April 26, Nested dissection: cost for a regular mesh Ordering techniques for direct methods in practice In 2-D consider an n n problem, N = n 2 In 3-D consider an n n n problem, N = n 3 2-D 3-D space (fill) O(N log N) O(N 4/3 ) time (flops) O(N 3/2 ) O(N 2 ) Significant difference in complexity between 2-D and 3-D In practice: Nested dissection (+ variants) is preferred for parallel processing Good implementations of Min. Degree algorithm work well in practice. Currently AMD and AMF are best known implementations/variants/ Best practical reordering algorithms usually combine Nested dissection and min. degree algorithms. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 36

10 BASIC RELAXATION SCHEMES Relaxation schemes: based on the decomposition A = D E F BASIC RELAXATION METHODS D - E - F D = diag(a), E = strict lower part of A and F its strict upper part. Gauss-Seidel iteration for solving Ax = b: (D E)x (k+1) = Fx (k) + b idea: correct the j-th component of the current approximate solution, j = 1, 2,..n, to zero the j th component of residual. CIMPA - Tlemcen May 28April 26, Can also define a backward Gauss-Seidel Iteration: (D F)x (k+1) = Ex (k) + b and a Symmetric Gauss-Seidel Iteration: forward sweep followed by backward sweep. Over-relaxation is based on the decomposition: ωa = (D ωe) (ωf + (1 ω)d) successive overrelaxation, (SOR): (D ωe)x (k+1) = [ωf + (1 ω)d]x (k) + ωb Iteration matrices Jacobi, Gauss-Seidel, SOR, & SSOR iterations are of the form x (k+1) = Mx (k) + f M Jac = D 1 (E + F) = I D 1 A M GS (A) = (D E) 1 F == I (D E) 1 A M SOR (A) = (D ωe) 1 (ωf +(1 ω)d) = I (ω 1 D E) 1 A M SSOR (A) = I (2ω 1 1)(ω 1 D F) 1 D(ω 1 D E) 1 A = I ω(2ω 1)(D ωf) 1 D(D ωe) 1 A CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 4

11 General convergence result Consider the iteration: x (k+1) = Gx (k) + f (1) Assume that ρ(a) < 1. Then I G is non-singular and G has a fixed point. Iteration converges to a fixed point for any f and x (). (2) If iteration converges for any f and x () then ρ(g) < 1. Example: Richardson s iteration x (k+1) = x (k) + α(b A (k) ) Assume Λ(A) R. When does the iteration converge? Jacobi and Gauss-Seidel converge for diagonal dominant A An observation. Introduction to Preconditioning The iteration x (k+1) = Mx (k) + f is attempting to solve (I M)x = f. Since M is of the form M = I P 1 A this system can be rewritten as P 1 Ax = P 1 b where for SSOR, we have P SSOR = (D ωe)d 1 (D ωf) referred to as the SSOR preconditioning matrix. In other words: Relaxation Scheme Preconditioned Fixed Point Iteration SOR converges for < ω < 2 for SPD matrices CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, The Problem We consider the linear system Ax = b PROJECTION METHODS FOR LINEAR SYSTEMS where A is N N and can be Real symmetric positive definite Real nonsymmetric Complex Focus: A is large and sparse, possibly with an irregular structure CIMPA - Tlemcen May 28April 26, 28 44

12 Projection Methods Initial Problem: b Ax = Given two subspaces K and L of R N define the approximate problem: Find x K such that b A x L With a nonzero initial guess x, the approximate problem is Find x x + K such that b A x L Write x = x + δ and r = b Ax. Leads to a system for δ: Find δ K such that r Aδ L Leads to a small linear system ( projected problems ) This is a basic projection step. Typically: sequence of such steps are applied CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Matrix representation: V = [v 1,..., v m ] a basis of K & Let W = [w 1,..., w m ] a basis of L Then letting x be the approximate solution x = x + δ x + V y where y is a vector of R m, the Petrov-Galerkin condition yields, W T (r AV y) = and therefore x = x + V [W T AV ] 1 W T r Remark: In practice W T AV is known from algorithm and has a simple structure [tridiagonal, Hessenberg,..] Prototype Projection Method Until Convergence Do: 1. Select a pair of subspaces K, and L; 2. Choose bases V = [v 1,..., v m ] for K and W = [w 1,..., w m ] for L. 3. Compute r b Ax, y (W T AV ) 1 W T r, x x + V y. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 48

13 Operator Form Representation Approximate problem amounts to solving Q(b Ax) =, x K Let P be the orthogonal projector onto K and Q the (oblique) projector onto K and orthogonally to L. or in operator form Q(b APx) = Px K, x Px K Qx K, x Qx L L Qx Px x The P and Q projectors K Question: what accuracy can one expect? Let x be the exact solution. Then 1) We cannot get better accuracy than (I P)x 2, i.e., x x 2 (I P)x 2 2) The residual of the exact solution for the approximate problem satisfies: b QAPx 2 QA(I P) 2 (I P)x 2 CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 5 Two important particular cases. 1. L = AK. then b A x 2 = min z K b Az 2 class of minimal residual methods: CR, GCR, ORTHOMIN, GM- RES, CGNR, L = K class of Galerkin or orthogonal projection methods. When A is SPD then x x A = min z K x z A. One-dimensional projection processes K = span{d} and L = span{e} Then x x + αd and Petrov-Galerkin condition r Aδ e yields Three popular choices: (I) Steepest descent. α = (r,e) (Ad,e) (II) Residual norm steepest descent. (III) Minimal residual iteration. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 52

14 (I) Steepest descent. A is SPD. Take at each step d = r and e = r. Iteration: Each step minimizes r b Ax, α (r, r)/(ar, r) x x + αr (II) Residual norm steepest descent. A is arbitrary (nonsingular). Take at each step d = A T r and e = Ad. Iteration: r b Ax, d = A T r α d 2 2 / Ad 2 2 x x + αd Each step minimizes f(x) = b Ax 2 2 in direction f. f(x) = x x 2 A = (A(x x ), (x x )) in direction f. Convergence guaranteed if A is SPD. Important Note: equivalent to usual steepest descent applied to normal equations A T Ax = A T b. Converges under the condition that A is nonsingular. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, (III) Minimal residual iteration. A positive definite (A+A T is SPD). Take at each step d = r and e = Ar. Krylov Subspace Methods Iteration: r b Ax, α (Ar, r)/(ar, Ar) x x + αr Principle: Projection methods on Krylov subspaces: K m (A, v 1 ) = span{v 1, Av 1,, A m 1 v 1 } Each step minimizes f(x) = b Ax 2 2 in direction r. Converges under the condition that A + A T is SPD. probably the most important class of iterative methods. many variants exist depending on the subspace L. Simple properties of K m. Let µ = deg. of minimal polynomial of v K m = {p(a)v p = polynomial of degree m 1} K m = K µ for all m µ. Moreover, K µ is invariant under A. dim(k m ) = m iff µ m. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 56

15 A little review: Gram-Schmidt process Goal: given X = [x 1,..., x m ] compute an orthonormal set Q = [q 1,..., q m ] which spans the same susbpace. ALGORITHM : 2 1. For j = 1,..., m Do: Classical Gram-Schmidt 2. Compute r ij = (x j, q i ) for i = 1,..., j 1 3. Compute ˆq j = x j j 1 i=1 r ijq i 4. r jj = ˆq j 2 If r jj == exit ALGORITHM : 3 Modified Gram-Schmidt 1. For j = 1,..., m Do: 2. ˆq j := x j 3. For i = 1,..., j 1 Do 4. r ij = (ˆq j, q i ) 5. ˆq j := ˆq j r ij q i 6. EndDo 7. r jj = ˆq j 2. If r jj == exit 8. q j := ˆq j /r jj 9. EndDo 5. q j = ˆq j /r jj 6. EndDo CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Let: X = [x 1,..., x m ] (n m matrix) Q = [q 1,..., q m ] (n m matrix) R = {r ij } (m m upper triangular matrix) At each step, Arnoldi s Algorithm Goal: to compute an orthogonal basis of K m. Input: Initial vector v 1, with v 1 2 = 1 and m. For j = 1,..., m do Result: j x j = r ij q i i=1 X = QR Compute w := Av j h i,j := (w, v i ) for i = 1,..., j, do w := w h i,j v i h j+1,j = w 2 and v j+1 = w/h j+1,j CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 6

16 Result of orthogonalization process Arnoldi s Method (L m = K m ) 1. V m = [v 1, v 2,..., v m ] orthonormal basis of K m. 2. AV m = V m+1 H m 3. V T m AV m = H m H m last row. Petrov-Galerkin condition when L m = K m, shows: x m = x + V m H 1 m V T m r Select v 1 = r / r 2 r /β in Arnoldi s algorithm, then: x m = x + βv m H 1 m e 1 V m H m = O Equivalent algorithms: * FOM [YS, 1981] (above formulation) * Young and Jea s ORTHORES [1982]. * Axelsson s projection method [1981]. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Minimal residual methods (L m = AK m ) Restarting and Truncating When L m = AK m, we let W m AV m and obtain: x m = x + V m [W T m AV m] 1 W T m r Use again v 1 := r /(β := r 2 ) and: AV m = V m+1 H m x m = x + V m [ H T m H m ] 1 H T m βe 1 = x + V m y m where y m minimizes βe 1 H m y 2 over y R m. Hence, (Generalized Minimal Residual method (GMRES) [Saad-Schultz, 1983]): x m = x + V m y m where y m : min y βe 1 H m y 2 Equivalent methods: Axelsson s CGLS Orthomin (198) Orthodir GCR Difficulty: As m increases, storage and work per step increase fast. First remedy: Restarting. Fix the dimension m of the subspace ALGORITHM : 4 Restarted GMRES (resp. Arnoldi) 1. Start/Restart: Compute r = b Ax, and v 1 = r /(β := r 2 ). 2. Arnoldi Process: generate H m and V m. 3. Compute y m = H 1 m βe 1 (FOM), or 4. x m = x + V m y m y m = argmin βe 1 H m y 2 (GMRES) 5. If r m 2 ǫ r 2 stop else set x := x m and go to 1. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 64

17 Second remedy: Truncate the orthogonalization The direct version of IOM [DIOM]: The formula for v j+1 is replaced by h j+1,j v j+1 = Av j j i=j k+1 h ij v i each v j is made orthogonal to the previous k v i s. x m still computed as x m = x + V m H 1 m βe 1. It can be shown that this is again an oblique projection process. IOM (Incomplete Orthogonalization Method) = replace orthogonalization in FOM, by the above truncated (or incomplete ) orthogonalization. Writing the LU decomposition of H m as H m = L m U m we get x m = x + V m U 1 m L 1 m βe 1 x + P m z m Structure of L m, U m when k = 3 1 x 1 x 1x L m = 1 x 1 x 1x 1 p m = u 1 mm [v m m 1 i=m k+1 u imp i ] z m = x x x x x x x x x U m = x x x x x x x x x z m 1 ζ m CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Result: Can update x m at each step: x m = x m 1 + ζ m p m The Symmetric Case: Observation Note: Several existing pairs of methods have a similar link: they are based on the LU, or other, factorizations of the H m matrix CG-like formulation of IOM called DIOM [Saad, 1982] ORTHORES(k) [Young & Jea 82] equivalent to DIOM(k) SYMMLQ [Paige and Saunders, 77] uses LQ factorization of H m. Can add partial pivoting to LU factorization of H m Observe: When A is real symmetric then in Arnoldi s method: H m = Vm TAV m must be symmetric. Therefore THEOREM. When Arnoldi s algorithm is applied to a (real) symmetric matrix then the matrix H m is symmetric tridiagonal. In other words: 1) h ij = for i j > 1 2) h j,j+1 = h j+1,j, j = 1,..., m CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 68

18 We can write α 1 β 2 β 2 α 2 β 3 β H m = 3 α 3 β β m α m The v i s satisfy a three-term recurrence [Lanczos Algorithm]: β j+1 v j+1 = Av j α j v j β j v j 1 simplified version of Arnoldi s algorithm for sym. systems. Symmetric matrix + Arnoldi Symmetric Lanczos (1) The Lanczos algorithm ALGORITHM : 5 Lanczos 1. Choose an initial vector v 1 of norm unity. Set β 1, v 2. For j = 1, 2,..., m Do: 3. w j := Av j β j v j 1 4. α j := (w j, v j ) 5. w j := w j α j v j 6. β j+1 := w j 2. If β j+1 = then Stop 7. v j+1 := w j /β j+1 8. EndDo CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 7 Lanczos algorithm for linear systems Usual orthogonal projection method setting: L m = K m = span{r, Ar,..., A m 1 r } Basis V m = [v 1,..., v m ] of K m generated by the Lanczos algorithm Three different possible implementations. (1) Arnoldi-like; (2) Exploit tridigonal nature of H m (DIOM); (3) Conjugate gradient. ALGORITHM : 6 Lanczos Method for Linear Systems 1. Compute r = b Ax, β := r 2, and v 1 := r /β 2. For j = 1, 2,..., m Do: 3. w j = Av j β j v j 1 (If j = 1 set β 1 v ) 4. α j = (w j, v j ) 5. w j := w j α j v j 6. β j+1 = w j 2. If β j+1 = set m := j and go to 9 7. v j+1 = w j /β j+1 8. EndDo 9. Set T m = tridiag(β i, α i, β i+1 ), and V m = [v 1,..., v m ]. 1. Compute y m = T 1 m (βe 1) and x m = x + V m y m CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 72

19 ALGORITHM : 7 D-Lanczos 1. Compute r = b Ax, ζ 1 := β := r 2, v 1 := r /β 2. Set λ 1 = β 1 =, p = 3. For m = 1, 2,..., until convergence Do: 4. Compute w := Av m β m v m 1 and α m = (w, v m ) 5. If m > 1: Compute λ m = βm η m 1 & ζ m = λ m ζ m 1 6. η m = α m λ m β m 7. p m = η 1 m (v m β m p m 1 ) 8. x m = x m 1 + ζ m p m 9. If x m has converged then Stop The Conjugate Gradient Algorithm (A S.P.D.) Note: the p i s are A-orthogonal The r i s are orthogonal. And we have x m = x m 1 + ξ m p m So there must be an update of the form: 1. p m = r m 1 + β m p m 1 2. x m = x m 1 + ξ m p m 3. r m = r m 1 ξ m Ap m 1. w := w α m v m 11. β m+1 = w 2, v m+1 = w/β m EndDo CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, ALGORITHM : 8 Conjugate Gradient Start: r := b Ax, p := r. Iterate: Until convergence do, α j := (r j, r j )/(Ap j, p j ) x j+1 := x j + α j p j r j+1 := r j α j Ap j β j := (r j+1, r j+1 )/(r j, r j ) METHODS BASED ON LANCZOS BIORTHOGONALIZATION EndDo p j+1 := r j+1 + β j p j r j = scaling v j+1. The r j s are orthogonal. The p j s are A-conjugate, i.e., (Ap i, p j ) = for i j. CIMPA - Tlemcen May 28April 26, 28 75

20 ALGORITHM : 9 Lanczos Bi-Orthogonalization 1. Choose two vectors v 1, w 1 such that (v 1, w 1 ) = Set β 1 = δ 1, w = v 3. For j = 1, 2,..., m Do: 4. α j = (Av j, w j ) 5. ˆv j+1 = Av j α j v j β j v j 1 Extension of the symmetric Lanczos algorithm Builds a pair of biorthogonal bases for the two subspaces K m (A, v 1 ) and K m (A T, w 1 ) Different ways to choose δ j+1, β j+1 in lines 7 and ŵ j+1 = A T w j α j w j δ j w j 1 7. δ j+1 = (ˆv j+1, ŵ j+1 ) 1/2. If δ j+1 = Stop 8. β j+1 = (ˆv j+1, ŵ j+1 )/δ j+1 9. w j+1 = ŵ j+1 /β j+1 1. v j+1 = ˆv j+1 /δ j EndDo Let α 1 β 2 δ 2 α 2 β 3 T m =... δ m 1 α m 1 β m δ m α m. v i K m (A, v 1 ) and w j K m (A T, w 1 ). CIMPA - Tlemcen May 28April 26, If the algorithm does not break down before step m, then the vectors v i, i = 1,..., m, and w j, j = 1,..., m, are biorthogonal, i.e., (v j, w i ) = δ ij 1 i, j m. Moreover, {v i } i=1,2,...,m is a basis of K m (A, v 1 ) and {w i } i=1,2,...,m is a basis of K m (A T, w 1 ) and AV m = V m T m + δ m+1 v m+1 e T m, A T W m = W m T T m + β m+1w m+1 e T m, W T m AV m = T m. The Lanczos Algorithm for Linear Systems ALGORITHM : 1 Lanczos Alg. for Linear Systems 1. Compute r = b Ax and β := r 2 2. Run m steps of the nonsymmetric Lanczos Algorithm i.e., 3. Start with v 1 := r /β, and any w 1 such that (v 1, w 1 ) = 1 4. Generate the pair of Lanczos vectors v 1,..., v m, and w 1,..., w m 5. and the tridiagonal matrix T m from Algorithm Compute y m = T 1 m (βe 1) and x m := x + V m y m. BCG can be derived from the Lanczos Algorithm similarly to CG CIMPA - Tlemcen May 28April 26, 28 8

21 ALGORITHM : 11 BiConjugate Gradient (BCG) 1. Compute r := b Ax. 2. Choose r such that (r, r ) ; Set p := r, p := r 3. For j =, 1,..., until convergence Do:, 4. α j := (r j, rj )/(Ap j, p j ) 5. x j+1 := x j + α j p j 6. r j+1 := r j α j Ap j 7. rj+1 := r j α ja T p j 8. β j := (r j+1, rj+1 )/(r j, rj ) 9. p j+1 := r j+1 + β j p j 1. p j+1 := r j+1 + β jp j 11. EndDo Quasi-Minimal Residual Algorithm Recall relation from the lanczos algorithm: AV m = V m+1 T m with T m T m = (m + 1) m tridiagonal matrix T m =. δ m+1 e T m Let v 1 βr and x = x + V m y. Residual norm b Ax 2 equals ( r AV m y 2 = βv 1 V m+1 T m y 2 = V m+1 βe1 T m y ) 2 Column-vectors of V m+1 are not ( GMRES). But: reasonable idea to minimize the function J(y) βe 1 T m y 2 Quasi-Minimal Residual Algorithm (Freund, 199). CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Transpose-Free Variants Conjugate Gradient Squared BCG and QMR require a matrix-by-vector product with A and A T at each step. The products with A T do not contribute directly to x m. They allow to determine the scalars (α j and β j in BCG). QUESTION: is it possible to bypass the use of A T? Motivation: in nonlinear equations, A is often not available explicitly but via the Frechet derivative: J(u k )v = F(u k + ǫv) F(u k ) ǫ. * Clever variant of BCG which avoids using A T [Sonneveld, 1984]. In BCG: r i = ρ i (A)r where ρ i = polynomial of degree i. In CGS: r i = ρ 2 i (A)r Define : r j = φ j (A)r, p j = π j (A)r, CIMPA - Tlemcen May 28April 26, 28 83

22 r j = φ j(a T )r, p j = π j(a T )r Solution: Let φ j+1 (t)π j (t), be a third member of the recurrence. For π j (t)φ j (t), note: Scalar α j in BCG is given by α j = (φ j(a)r, φ j (A T )r ) (Aπ j (A)r, π j (A T )r ) = (φ2 j (A)r, r ) (Aπ 2 j (A)r, r ) Possible to get a recursion for the φ 2 j (A)r and π 2 j (A)r? Result: φ j (t)π j (t) = φ j (t) (φ j (t) + β j 1 π j 1 (t)) = φ 2 j (t) + β j 1φ j (t)π j 1 (t). ( ) φ 2 j+1 = φ2 j α jt 2φ 2 j + 2β j 1φ j π j 1 α j t π 2 j φ j+1 (t) = φ j (t) α j tπ j (t), π j+1 (t) = φ j+1 (t) + β j π j (t) Square these equalities φ j+1 π j = φ 2 j + β j 1φ j π j 1 α j t π 2 j π 2 j+1 = φ2 j+1 + 2β jφ j+1 π j + β 2 j π2 j. φ 2 j+1 (t) = φ2 j (t) 2α jtπ j (t)φ j (t) + α 2 j t2 π 2 j (t), π 2 j+1 (t) = φ2 j+1 (t) + 2β jφ j+1 (t)π j (t) + β 2 j π j(t) 2. Problem:..... Cross terms Define: r j = φ 2 j (A)r, p j = π 2 j (A)r, q j = φ j+1 (A)π j (A)r CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Recurrences become: r j+1 = r j α j A (2r j + 2β j 1 q j 1 α j A p j ), q j = r j + β j 1 q j 1 α j A p j, p j+1 = r j+1 + 2β j q j + β 2 j p j. Define auxiliary vector d j = 2r j + 2β j 1 q j 1 α j Ap j one more auxiliary vector, u j = r j + β j 1 q j 1. So d j = u j + q j, q j = u j α j Ap j, p j+1 = u j+1 + β j (q j + β j p j ), vector d j is no longer needed. Sequence of operations to compute the approximate solution, starting with r := b Ax, p := r, q :=, β :=. 1. α j = (r j, r )/(Ap j, r ) 2. d j = 2r j +2β j 1 q j 1 α j Ap j 3. q j = r j + β j 1 q j 1 α j Ap j 5. r j+1 = r j α j Ad j 6. β j = (r j+1, r )/(r j, r ) 7. p j+1 = r j+1 + β j (2q j + β j p j ). 4. x j+1 = x j + α j d j CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 88

23 ALGORITHM : 12 Conjugate Gradient Squared 1. Compute r := b Ax ; r arbitrary. 2. Set p := u := r. Note: no matrix-by-vector products with A T but two matrix-by-vector products with A, at each step. 3. For j =, 1, 2..., until convergence Do: 4. α j = (r j, r )/(Ap j, r ) 5. q j = u j α j Ap j 6. x j+1 = x j + α j (u j + q j ) 7. r j+1 = r j α j A(u j + q j ) 8. β j = (r j+1, r )/(r j, r ) 9. u j+1 = r j+1 + β j q j 1. p j+1 = u j+1 + β j (q j + β j p j ) 11. EndDo Vector: Polynomial in BCG : q i r i (t) p i 1 (t) u i p 2 i (t) r i r 2 i (t) where r i (t) = residual polynomial at step i for BCG,.i.e., r i = r i (A)r, and p i (t) = conjugate direction polynomial at step i, i.e., p i = p i (A)r. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 9 BCGSTAB (van der Vorst, 1992) In CGS: residual polynomial of BCG is squared. bad behavior in case of irregular convergence. Bi-Conjugate Gradient Stabilized (BCGSTAB) = a variation of CGS which avoids this difficulty. Derivation similar to CGS. Residuals in BCGSTAB are of the form, r j = ψ j(a)φ j (A)r in which, φ j (t) = BCG residual polynomial, and.... ψ j (t) = a new polynomial defined recursively as ALGORITHM : 13 BCGSTAB 1. Compute r := b Ax ; r arbitrary; 2. p := r. 3. For j =, 1,..., until convergence Do: 4. α j := (r j, r )/(Ap j, r ) 5. s j := r j α j Ap j 6. ω j := (As j, s j )/(As j, As j ) 7. x j+1 := x j + α j p j + ω j s j 8. r j+1 := s j ω j As j 9. β j := (r j+1,r ) (r j,r ) α j ω j 1. p j+1 := r j+1 + β j (p j ω j Ap j ) 11. EndDo ψ j+1 (t) = (1 ω j t)ψ j (t) ω i chosen to smooth convergence [steepest descent step] CIMPA - Tlemcen May 28April 26, 28 92

24 Preconditioning Basic principles PRECONDITIONING Basic idea is to use the Krylov subspace method on a modified system such as M 1 Ax = M 1 b. The matrix M 1 A need not be formed explicitly; only need to solve Mw = v whenever needed. Consequence: fundamental requirement is that it should be easy to compute M 1 v for an arbitrary vector v. CIMPA - Tlemcen May 28April 26, Left, Right, and Split preconditioning Preconditioned CG (PCG) Left preconditioning: M 1 Ax = M 1 b Right preconditioning: AM 1 u = b, with x = M 1 u Split preconditioning: M 1 L AM 1 R u = M 1 L b, with x = M 1 [Assume M is factored: M = M L M R. ] R u Assume: A and M are both SPD. Applying CG directly to M 1 Ax = M 1 b or AM 1 u = b won t work because coefficient matrices are not symmetric. Alternative: when M = LL T use split preconditioner option Second alternative: Observe that M 1 A is self-adjoint wrt M inner product: (M 1 Ax, y) M = (Ax, y) = (x, Ay) = (x, M 1 Ay) M CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 96

25 Preconditioned CG (PCG) Note M 1 A is also self-adjoint with respect to (.,.) A : (M 1 Ax, y) A = (AM 1 Ax, y) = (x, AM 1 Ay) = (x, M 1 Ay) A ALGORITHM : 14 Preconditioned Conjugate Gradient 1. Compute r := b Ax, z = M 1 r, and p := z 2. For j =, 1,..., until convergence Do: 3. α j := (r j, z j )/(Ap j, p j ) 4. x j+1 := x j + α j p j 5. r j+1 := r j α j Ap j 6. z j+1 := M 1 r j+1 7. β j := (r j+1, z j+1 )/(r j, z j ) 8. p j+1 := z j+1 + β j p j 9. EndDo Can obtain a similar algorithm Assume that M = Cholesky product M = LL T. Then, another possibility: Split preconditioning option, which applies CG to the system L 1 AL T u = L 1 b, with x = L T u Notation: Â = L 1 AL T. All quantities related to the preconditioned system are indicated byˆ. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, ALGORITHM : 15 CG with Split Preconditioner 1. Compute r := b Ax ; ˆr = L 1 r ; and p := L T ˆr. 2. For j =, 1,..., until convergence Do: 3. α j := (ˆr j, ˆr j )/(Ap j, p j ) 4. x j+1 := x j + α j p j 5. ˆr j+1 := ˆr j α j L 1 Ap j 6. β j := (ˆr j+1, ˆr j+1 )/(ˆr j, ˆr j ) 7. p j+1 := L T ˆr j+1 + β j p j 8. EndDo The x j s produced by the above algorithm and PCG are identical (if same initial guess is used). Flexible accelerators Question: What can we do in case M is defined only approximately? i.e., if it can vary from one step to the other.? Applications: Iterative techniques as preconditioners: Block-SOR, SSOR, Multi-grid, etc.. Chaotic relaxation type preconditioners (e.g., in a parallel computing environment) Mixing Preconditioners mixing coarse mesh / fine mesh preconditioners. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 1

26 ALGORITHM : 16 GMRES No preconditioning ALGORITHM : 17 GMRES (right) Preconditioning 1. Start: Choose x and a dimension m of the Krylov subspaces. 2. Arnoldi process: Compute r = b Ax, β = r 2 and v 1 = r /β. For j = 1,..., m do Compute w := Av j { } hi,j := (w, v for i = 1,..., j, do i ) ; w := w h i,j v i h j+1,1 = w 2 ; v j+1 = w h j+1,1 Define V m := [v 1,..., v m ] and H m = {h i,j }. 3. Form the approximate solution: Compute x m = x + V m y m where y m = argmin y βe 1 H m y 2 and e 1 = [1,,..., ] T. 4. Restart: If satisfied stop, else set x x m and goto Start: Choose x and a dimension m 2. Arnoldi process: Compute r = b Ax, β = r 2 and v 1 = r /β. For j = 1,..., m do Compute z j := M 1 v j Compute w := Az j { } hi,j := (w, v for i = 1,..., j, do : i ) w := w h i,j v i h j+1,1 = w 2 ; v j+1 = w/h j+1,1 Define V m := [v 1,..., v m ] and H m = {h i,j }. 3. Form the approximate solution: x m = x + M 1 V m y m where y m = argmin y βe 1 H m y 2 and e 1 = [1,,..., ] T. 4. Restart: If satisfied stop, else set x x m and goto 2. CIMPA - Tlemcen May 28April 26, ALGORITHM : 18 GMRES variable preconditioner 1. Start: Choose x and a dimension m of the Krylov subspaces. 2. Arnoldi process: Compute r = b Ax, β = r 2 and v 1 = r /β. For j = 1,..., m do Compute z j := M 1 j v j ; Compute w := Az j ; { } hi,j := (w, v for i = 1,..., j, do: i ) ; w := w h i,j v i h j+1,1 = w 2 ; v j+1 = w/h j+1,1 Define Z m := [z 1,..., z m ] and H m = {h i,j }. 3. Form the approximate solution: Compute x m = x + Z m y m where y m = argmin y βe 1 H m y 2 and e 1 = [1,,..., ] T. 4. Restart: If satisfied stop, else set x x m and goto 2. Properties x m minimizes b Ax m over Span{Z m }. If Az j = v j (i.e., if preconditioning is exact at step j) then approximation x j is exact. If M j is constant then method is to Right-Preconditioned GMRES. Additional Costs: Arithmetic: none. Memory: Must save the additional set of vectors {z j } j=1,...m Advantage: Flexibility CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 14

27 Standard preconditioners An observation. Introduction to Preconditioning Simplest preconditioner: M = Diag(A) poor convergence. Next to simplest: SSOR M = (D ωe)d 1 (D ωf) Still simple but often more efficient: ILU(). ILU(p) ILU with level of fill p more complex. Take a look back at basic relaxation methods: Jacobi, Gauss-Seidel, SOR, SSOR,... These are iterations of the form x (k+1) = Mx (k) + f where M is of the form M = I P 1 A. For example for SSOR, P SSOR = (D ωe)d 1 (D ωf) Class of ILU preconditioners with threshold Class of approximate inverse preconditioners Class of Multilevel ILU preconditioners: Multigrid, Algebraic Multigrid, M-level ILU,.. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, SSOR attempts to solve the equivalent system P 1 Ax = P 1 b where P P SSOR by the fixed point iteration x (k+1) = (I P 1 A) x }{{} +P 1 b instead of x (k+1) = (I A)x (k) +b M In other words: Relaxation Scheme Preconditioned Fixed Point Iteration The SOR/SSOR preconditioner F D E SOR preconditioning M SOR = (D ωe) SSOR preconditioning M SSOR = (D ωe)d 1 (D ωf) M SSOR = LU, L = lower unit matrix, U = upper triangular. One solve with M SSOR same cost as a MAT-VEC. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 18

28 k-step SOR (resp. SSOR) preconditioning: k steps of SOR (resp. SSOR) Questions: Best ω? For preconditioning can take ω = 1 M = (D E)D 1 (D F) Observe: M = LU + R with R = ED 1 F. Iteration times versus k for SOR(k) preconditioned GMRES Best k? k = 1 is rarely the best. Substantial difference in performance. CIMPA - Tlemcen May 28April 26, ILU() and IC() preconditioners What is the IKJ version of GE? Notation: NZ(X) = {(i, j) X i,j } Different computational patterns for gaussian elimination Formal definition of ILU(): A = LU + R NZ(L) NZ(U) = NZ(A) r ij = for (i, j) NZ(A) This does not define ILU() in a unique way. Constructive definition: Compute the LU factorization of A but drop any fill-in in L and U outside of Struct(A). ILU factorizations are often based on i, k, j version of GE. CIMPA - Tlemcen May 28April 26, KJI,KJI IJK

29 ALGORITHM : 19 Gaussian Elimination IKJ Variant 1. For i = 2,..., n Do: 2. For k = 1,..., i 1 Do: 3. a ik := a ik /a kk 4. For j = k + 1,..., n Do: 5. a ij := a ij a ik a kj 6. EndDo 7. EndDo 8. EndDo IKJ JKI CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Accessed but not modified Accessed and modified Not accessed ILU() zero-fill ILU ALGORITHM : 2 ILU() For i = 1,..., N Do: For k = 1,..., i 1 and if (i, k) NZ(A) Do: Compute a ik := a ik /a kj For j = k + 1,... and if (i, j) NZ(A), Do: compute a ij := a ij a ik a k,j. EndFor EndFor When A is SPD then the ILU factorization = Incomplete Cholesky factorization IC(). Meijerink and Van der Vorst [1977]. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26,

30 Typical eigenvalue distribution of preconditioned matrix Pattern of ILU() for 5-point matrix CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Stencils and ILU factorization Higher order ILU factorization Stencils of A and the L and U parts of A: Higher accuracy incomplete Cholesky: for regularly structured problems, IC(p) allows p additional diagonals in L. Can be generalized to irregular sparse matrices using the notion of level of fill-in [Watts III, 1979] for a ij Initially Lev ij = for a ij == At a given step i of Gaussian elimination: Lev kj = min{lev kj ; Lev ki + Lev ij + 1} CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 12

31 ILU(p) Strategy = drop anything with level of fill-in exceeding p. * Increasing level of fill-in usually results in more accurate ILU and... ILU(1) *...typically in fewer steps and fewer arithmetic operations. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, ALGORITHM : 21 ILU(p) For i = 2, N Do For each k = 1,..., i 1 and if a ij do Compute a ik := a ik /a jj Compute a i, := a i, a ik a k,. Update the levels of a i, Replace any element in row i with lev(a ij ) > p by zero. EndFor EndFor ILU with threshold generic algorithms ILU(p) factorizations are based on structure only and not numerical values potential problems for non M-matrices. One remedy: ILU with threshold (generic name ILUT.) Two broad approaches: First approach [derived from direct solvers]: use any (direct) sparse solver and incorporate a dropping strategy. [Munksgaard (?), Osterby & Zlatev, Sameh & Zlatev[9], D. Young, & al. (Boeing) etc...] The algorithm can be split into a symbolic and a numerical phase. Level-of-fill in Symbolic phase CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26,

32 Second approach : [derived from iterative solvers viewpoint] 1. use a (row or colum) version of the (i, k, j) version of GE; 2. apply a drop strategy for the elment l ik as it is computed; 3. perform the linear combinations to get a i. Use full row expansion of a i ; 4. apply a drop strategy to fill-ins. ILU with threshold: ILUT(k, ǫ) Do the i, k, j version of Gaussian Elimination (GE). During each i-th step in GE, discard any pivot or fill-in whose value is below ǫ row i (A). Once the i-th row of L + U, (L-part + U-part) is computed retain only the k largest elements in both parts. Advantages: controlled fill-in. Smaller memory overhead. Easy to implement Can be made quite inexpensive. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Crout-based ILUT (ILUTC) Terminology: Crout versions of LU compute the k-th row of U and the k-th column of L at the k-th step. Computational pattern Black = part computed at step k Blue = part accessed References: [1] M. Jones and P. Plassman. An improved incomplete Choleski factorization. ACM Transactions on Mathematical Software, 21:5 17, [2] S. C. Eisenstat, M. H. Schultz, and A. H. Sherman. Algorithms and data structures for sparse symmetric Gaussian elimination. SIAM Journal on Scientific Computing, 2: , [3] M. Bollhöfer. A robust ILU with pivoting based on monitoring the growth of the inverse factors. Linear Algebra and its Applications, 338(1 3):21 218, 21. Main advantages: 1. Less expensive than ILUT (avoids sorting) 2. Allows better techniques for dropping [4] N. Li, Y. Saad, and E. Chow. Crout versions of ILU. MSI technical report, 22. CIMPA - Tlemcen May 28April 26,

33 Crout LU (dense case) Go back to delayed update algorithm (IKJ alg.) and observe: we could do both a column and a row version Note: Entries 1 : k 1 in k-th row of figure need not be computed. Available from already computed columns of L. Similar observation for L (right) Left: U computed by rows. Right: L computed by columns CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, ALGORITHM : 22 Crout LU Factorization (dense case) 1. For k = 1 : n Do : 2. For i = 1 : k 1 and if a ki Do : 3. a k,k:n = a k,k:n a ki a i,k:n 4. EndDo 5. For i = 1 : k 1 and if a ik Do : 6. a k+1:n.k = a k+1:n,k a ik a k+1:n,i 7. EndDo 8. a ik = a ik /a kk for i = k + 1,..., n 9. EndDo Comparison with standard techniques Preconditioning Time (sec.) Preconditioning time vs. Lfil for RAEFSKY3 14 ILUC r LLUT c ILUT 12 b ILUT Lfil Precondition time vs. Lfil for ILUC (solid), row-ilut (circles), column- ILUT (triangles) and r-ilut with Binary Search Trees (stars) CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26,

34 Independent set orderings & ILUM (Background) ILUM AND ARMS Independent set orderings permute a matrix into the form B F E C where B is a diagonal matrix. Unknowns associated with the B block form an independent set (IS). IS is maximal if it cannot be augmented by other nodes to form another IS. IS ordering can be viewed as a simplification of multicoloring Main observation: Reduced system obtained by eliminating the unknowns associated with the IS, is still sparse since its coefficient matrix is the Schur complement S = C EB 1 F Idea: apply IS set reduction recursively. When reduced system small enough solve by any method Can devise an ILU factorization based on this strategy. ALGORITHM : 23 ILUM For lev = 1, nlev Do a. Get an independent set for A. b. Form the reduced system associated with this set; c. Apply a dropping strategy to this system; d. Set A := current reduced matrix and go back to (a). EndDo See work by [Botta-Wubbs 96, 97, YS 94, 96, (ILUM), Leuze 89,..] CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26,

35 Group Independent Sets / Aggregates Algebraic Recursive Multilevel Solver (ARMS) Generalizes (common) Independent Sets Original matrix, A, and reordered matrix, A = P T AP. Main goal: to improve robustness Main idea: use independent sets of cliques, or aggregates. There is no coupling between the aggregates No Coupling Reorder equations so nodes of independent sets come first nz = 3155 Block ILU factorization of A l nz = 3155 B l F l L l I U l L 1 l E l C l E l U 1 l I A l+1 I F l CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Diagonal blocks treated as sparse Problem: Fill-in Remedy: dropping strategy Algebraic Recursive Multilevel Solver (ARMS) Basic step: B F y = f E C z g L U L 1 F y = f EU 1 I S z g where S = C EB 1 F = Schur complement Perform block factorization recursively on S nz = 1225 nz = 4255 Next step: treat the Schur complement recursively L, U Blocks: sparse Exploit recursivity CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 14

36 Factorization: at level l Pl T A l P l = B l F l L l I U l L 1 l E l C l E l U 1 l I A l+1 I L-solve restriction. U-solve prolongation. Solve Last level system with, e.g., ILUT+GMRES F l ALGORITHM : If lev = last lev then 2. Compute A lev L lev U lev 3. Else: ARMS(A lev ) factorization 4 Find an independent set permutation P lev 5. Apply permutation A lev := P T lev A levp 6. Compute factorization 7. Call ARMS(A lev+1 ) 8. EndIf CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Group Independent Set reordering Original matrix.19e+7 Separator First Block Simple strategy used: Do a Cuthill-MKee ordering until there are enough points to make a block. Reverse ordering. Start a new block from a non-visited node. Continue until all points are visited. Add criterion for rejecting not sufficiently diagonally dominant rows..1e-6 CIMPA - Tlemcen May 28April 26,

37 Block size of 6 Block size of 2.19E+7.19E+7.1E-6.1E-6 Introduction Premise: we now work directly on a Partial Differential Equation e.g. M U L T I G R I D (VERY BRIEF) u = f, +B.C Main idea of multigrid: exploit a hierarchy of grids to get good convergence from simple iterative schemes Need a good grasp of matrices and spectra of model problems CIMPA - Tlemcen May 28April 26,

38 Richardson s iteration Simple iterative scheme: Richardson s iteration for 1-D case Fixed parameter ω. Iteration: u j+1 = u j + ω(b Au j ) = (I ωa)u j + ωb. Iteration matrix is M ω = I ωa. Recall: convergence takes place for < ω < 2/ρ(A) In practice an upper bound ρ(a) γ is often available take ω = 1/γ converging iteration since 1/γ 1/ρ(A) < 2/ρ(A). Eigenvalues of the iteration matrix are 1 ωλ k, where λ k = 2(1 cos θ k ) = 4 sin 2 θ k 2 Eigenvectors are the same as those of A. If u is the exact solution, the error vector d j u u j, obeys the relation, d j = M j ω d Expand the error vector d in the eigenbasis of M ω, as n d = ξ k w k. k=1 CIMPA - Tlemcen May 28April 26, Then from d j = Mω jd and M ω = I ωa: n ( d j = 1 λ ) j k ξ k w k. γ k=1 Each component is reduced by (1 λ k /γ) j. Slowest converging component corresponds to λ 1 Possibly very slow convergence rate when λ 1 /γ λ1. For the model problem one-dimensional case Gershgorin s theorem yields γ = 4, so the corresponding reduction coefficient is 1 sin 2 π 2(n + 1) 1 (πh/2)2 = 1 O(h 2 ). Consequence: convergence can be quite slow for fine meshes, i.e., when h is small. CIMPA - Tlemcen May 28April 26, Basic observation: convergence is not similar for all components. Half of the error components see a very good decrease. This is the case for the high frequency components, that is, all those components corresponding to k > n/2. [referred to as the oscillatory part] The reduction factors for these components are η k = 1 sin 2 kπ 2(n + 1) = kπ cos2 2(n + 1) 1 2. CIMPA - Tlemcen May 28April 26,

39 1 η1 How can we reduce the other components? Introduce a coarse grid problem. Assume n is odd. Consider the 1/2 problem issued from discretizing the original PDE on a mesh Ω 2h with the mesh-size 2h. Use superscripts h and 2h for the two meshes. η n θ 1 θn/2+1 θ n Observation: note that x 2h i = x h 2i from which it follows that, for k n/2, w h k (xh 2i ) = sin(kπxh 2i ) = sin(kπx2h i ) = w 2h k (x2h i ). Reduction coefficients for Richardson s method applied to the 1-D model problem Observations: Oscillatory components, undergo excellent reduction, Also reduction factor is independent of the step-size h. So: Taking a smooth mode on the fine grid (wk h with k n/2) and canonically injecting it into the coarse grid, i.e., defining its values on the coarse points to be the same as those on the fine points, yields the k-th mode on the coarse grid. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, At some point iteration will fail to make progress on the fine grid: when the only components left are those associated with the smooth modes. Multigrid strategy: do not attempt to eliminate these components on the fine grid. Move down to a coarser grid where smooth modes are translated into oscillatory ones. Then iterate. Practically, need to go back and forth between different grids. The mode w 2 on a fine grid (n = 7) and a coarse grid (n = 3) Some of the modes which were smooth on the fine grid, become oscillatory. The oscillatory modes on the fine mesh are no longer represented on the coarse mesh. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26,

40 Inter-grid operations: Prolongation Restriction A prolongation operation takes a vector from Ω H and defines the analogue vector in Ω h. Common notation in use I h H : Ω H Ω h. Simplest approach : linear interpolation. Example: in 1-D. Given (vi 2h ) i=,...,(n+1)/2, define vector v h = I2h h v2h Ω h : v2j h = vj 2h n + 1 for j =,...,. v2j+1 h = (v2h j + vj+1 2h )/2 2 Given a function v h on the fine mesh, a corresponding function in Ω H must be defined from v h. Reverse of prolongation Simplest example canonical injection: v 2h i = v h 2i. Termed Injection operator. Obvious 2-D analogue: v 2h i,j = vh 2i,2j. More common restriction operator: full weighting (FW), v 2h j = 1 ( ) v h 2j vh 2j + vh 2j+1. Averages the neighboring values using the weights.25,.5,.25. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Standard multigrid techniques Coarse problems and smoothers Simplest idea: obtain an initial guess from interpolating a solution computed on a coarser grid. Repeat recursively. Interpolation from a coarser grid can be followed by a few s of a smoothing iteration. known as nested iteration MG techniques use essentially two main ingredients: a hierarchy of grid problems (restrictions and prolongations) and a smoother. What is a smoother? Ans: any scheme which has the smoothing property of damping quickly the high frequency components of the error. At the highest level (finest grid) A h u h = f h Can define this problem at next level (mesh-size= H) on mesh Ω H Also common (FEM) to define the system by Galerkin projection, A H = I H h A hi h H, fh = I H h fh. Notation: u h ν = smootherν (A h,u h,f h) u h ν == result of ν smoothing s for A hu = f h starting with u h. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 16

41 V-cycles V-cyle multigrid is a canonical application of recursivity - starting from ALGORITHM : 25 u h = V-cycle(A h, u h, fh ) 1. Pre-smooth: u h := smoother ν1 (A h,u h,fh ) 2. Get residual: r h = f h A h u h 3. Coarsen: r H = Ih Hrh 4. If (H == h ) 5. Solve: A H δ H = r H 6. Else 7. Recursion: δ H = V-cycle(A H,, r H ) 8. EndIf the 2-grid cycle as an example. Notation: H = 2h and h is the coarsest mesh-size. Many other options avaialble For Poisson equation MG is a fast solver cost of order N log N. In fact O(N) for FMG. 9. Correct: u h := u h + I h H δh 1. Post-smooth: u h := smoother ν2 (A h,u h,f h ) 11. Return u h CIMPA - Tlemcen May 28April 26, Algebraic Multigrid (AMG) Generalizes geometric or mesh-based MG to general problems Idea: Define coarsening by looking at strong couplings PARALLEL IMPLEMENTATION Define coarse problem from a Galerkin approach, i.e., using the restriction A H = I H h A hi h H... Generally speaking: limited success for problems with several unknowns per mesh point, of for non-pde relatex problems.. CIMPA - Tlemcen May 28April 26,

42 Introduction Parallel preconditioners: A few approaches Thrust of parallel computing techniques in most applications areas. Programming model: Message-passing seems (MPI) dominates Open MP and threads for small number of processors Important new reality: parallel programming has penetrated the applications areas [Sciences and Engineering + industry] Parallel matrix computation viewpoint: Local preconditioners: Polynomial (in the 8s), Sparse Approximate Inverses, [M. Benzi-Tuma & al 99., E. Chow ] Distributed versions of ILU [Ma & YS 94, Hysom & Pothen ] Use of multicoloring to unaravel parallelism Problem 1: algorithms lagging behind somewhat Problem 2: Message passing is painful for large applications. Time to solution up. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Domain Decomposition ideas: Schwarz-type Preconditioners [e.g. Widlund, Bramble-Pasciak-Xu, X. Cai, D. Keyes, Smith,...] Schur-complement techniques [Gropp & Smith, Ferhat et al. (FETI), T.F. Chan et al., YS and Sosonkina 97, J. Zhang,...] Multigrid / AMG viewpoint: Multi-level Multigrid-like preconditioners [e.g., Shadid-Tuminaro et al (Aztec project),...] Intrinsically parallel preconditioners Some alternatives (1) Polynomial preconditioners; (2) Approximate inverse preconditioners; (3) Multi-coloring + independent set ordering; (4) Domain decomposition approach. In practice: Variants of additive Schwarz very common (simplicity) CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26,

43 POLYNOMIAL PRECONDITIONING Principle: M 1 = s(a) where s is a (low) degree polynomial: s(a)ax = s(a)b Problem: how to obtain s? Note: s(a) A 1 * Chebyshev polynomials Several approaches. * Least squares polynomials * Others Polynomial preconditioners are seldom used in practice. Domain Decomposition Problem: u = f in Ω u = u Γ on Γ = Ω. Domain: s Ω = Ω i, i=1 Domain decomposition or substructuring methods attempt to solve a PDE problem (e.g.) on the entire domain from problem solutions on the subdomains Ω i. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Coefficient Matrix Discretization of domain CIMPA - Tlemcen May 28April 26,

44 Types of mappings DISTRIBUTED SPARSE MATRICES (a) Vertex-based; (b) edge-based; and (c) element-based partitioning Can adapt PDE viewpoint to general sparse matrices Will use the graph representation and vertex-based viewpoint CIMPA - Tlemcen May 28April 26, Generalization: Distributed Sparse Systems Simple illustration: Block assignment. Assign equation i and unknown i to a given process Naive partitioning - won t work well in practice Best idea is to use the adjacency graph of A: Vertices = {1, 2,, n}; Edges: i j iff a ij Graph partitioning problem: Want a partition of the vertices of the graph so that (1) partitions have the same sizes (2) interfaces are small in size CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26,

General Partitioning of a sparse linear system Alternative: Map elements / edges rather than vertices 21 22 23 24 25 21 22 23 24 25 S 1 = {1, 2, 6, 7, 11, 12}: This 16 17 18 11 12 13 14 15 6 7 8 9 1

45 General Partitioning of a sparse linear system Alternative: Map elements / edges rather than vertices S 1 = {1, 2, 6, 7, 11, 12}: This means equations and unknowns 1, 2, 3, 6, 7, 11, 12 are assigned to Domain 1. S 2 = {3, 4, 5, 8, 9, 1, 13} S 3 = {16, 17, 18, 21, 22, 23} Equations/unknowns 3, 8, 12 shared by 2 domains. From distributed sparse matrix viewpoint this is an overlap of one layer S 4 = {14, 15, 19, 2, 24, 25} Partitioners : Metis, Chaco, Scotch,.. More recent: Zoltan, H-Metis, PaToH CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, Standard dual objective: minimize communication + balance partition sizes Recent trend: use of hypergraphs [PaToh, Hmetis,...] Hypergraphs are very general.. Ideas borrowed from VLSI work Main motivation: to better represent communication volumes when partitioning a graph. Standard models face many limitations Hypergraphs can better express complex graph partitioning problems and provide better solutions. Example: completely nonsymmetric patterns. CIMPA - Tlemcen May 28April 26, CIMPA - Tlemcen May 28April 26, 28 18

A short course on: Preconditioned Krylov subspace methods. Yousef Saad University of Minnesota Dept. of Computer Science and Engineering

A short course on: Preconditioned Krylov subspace methods Yousef Saad University of Minnesota Dept. of Computer Science and Engineering Universite du Littoral, Jan 19-3, 25 Outline Part 1 Introd., discretization