Topics: The CG Algorithm, Algorithmic Options, CG's Two Main Convergence Theorems


Topics: The CG Algorithm; Algorithmic Options; CG's Two Main Convergence Theorems; What about non-SPD systems? Methods requiring small history; Methods requiring large history; Summary of solvers. 1 / 52

Conjugate gradient method (Hestenes and Stiefel, 1952). For A an N × N SPD matrix. In exact arithmetic it solves Ax = b in N steps. In floating-point arithmetic there is no guaranteed stopping, but it often converges in many fewer than N steps. It is the optimal method for minimizing, with A SPD,
    J(x) = (1/2) x^T A x - x^T b.
2 / 52
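
Aside, not from the slides: a quick MATLAB check that the minimizer of J is the solution of Ax = b, for a small SPD test matrix (the matrix and tolerances here are illustrative choices).

    % Check numerically that the minimizer of J(x) = 0.5*x'*A*x - x'*b
    % is the solution of A*x = b, for a small SPD matrix.
    A = gallery('poisson',10);          % 100 x 100 sparse SPD test matrix
    b = randn(size(A,1),1);
    J = @(x) 0.5*(x'*A*x) - x'*b;       % the quadratic functional on the slide
    xstar = A\b;                        % exact solution of A*x = b
    for k = 1:5                         % J increases in every direction away from xstar
        d = randn(size(b));
        assert(J(xstar + 1e-3*d) > J(xstar));
    end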

Descent method. Given an SPD A, an initial guess x_0, and a maximum number of iterations itmax:
    r_0 = b - A x_0
    for n = 0:itmax
        choose a descent direction d_n
        α_n := arg min_α J(x_n + α d_n) = ⟨d_n, r_n⟩ / ⟨d_n, A d_n⟩
        x_{n+1} = x_n + α_n d_n
        r_{n+1} = b - A x_{n+1}
        if converged, stop, end
    end
Steepest descent uses d_n = r_n; CG uses a conjugate direction (see the sketch below).
3 / 52
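
A minimal MATLAB sketch of the descent loop above with the steepest-descent choice d_n = r_n; the function name and the relative-residual stopping test are illustrative choices, not from the slides.

    % Save as steepest_descent.m.  Steepest descent for SPD A: d_n = r_n,
    % exact line search alpha_n = <d,r>/<d,A d>.
    function [x,nits] = steepest_descent(A,b,x,tol,itmax)
        r = b - A*x;
        for n = 1:itmax
            d = r;                        % steepest descent direction
            alpha = (d'*r)/(d'*(A*d));    % exact line search for J
            x = x + alpha*d;
            r = b - A*x;
            if norm(r) <= tol*norm(b), break, end
        end
        nits = n;
    end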

Conjugate search direction. Definition: conjugate means orthogonal in the A-inner product. Take
    A = [ b^2  0 ;  0  a^2 ]   and   J(x) = x^T A x.
The level curve J(x) = const is the ellipse x^2/a^2 + y^2/b^2 = const. Parametrize it by
    x = ( a cos θ, b sin θ )^T,   with tangent vector   t = dx/dθ = ( -a sin θ, b cos θ )^T.
Then x and t are conjugate:
    ⟨x, t⟩_A = x^T A t = -a^2 b^2 sin θ cos θ + a^2 b^2 sin θ cos θ = 0.
A good choice of d_n: a vector conjugate to the tangent vector.
4 / 52

Conjugate Gradient Algorithm (294). Given an SPD A, an initial guess x_0, and a maximum number of iterations itmax:
    r_0 = b - A x_0
    d_0 = r_0
    for n = 1:itmax
        α_{n-1} = ⟨d_{n-1}, r_{n-1}⟩ / ⟨d_{n-1}, A d_{n-1}⟩
        x_n = x_{n-1} + α_{n-1} d_{n-1}
        r_n = b - A x_n
        if converged, stop, end
        β_n = ⟨r_n, r_n⟩ / ⟨r_{n-1}, r_{n-1}⟩
        d_n = r_n + β_n d_{n-1}
    end
5 / 52
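
A minimal MATLAB sketch of Algorithm 294 as written on the slide; the function name and the relative-residual stopping test are illustrative choices.

    % Save as cg_sketch.m.  Plain CG for SPD A, following Algorithm 294
    % (explicit residual r_n = b - A*x_n, as on the slide).
    function [x,nits] = cg_sketch(A,b,x,tol,itmax)
        r = b - A*x;
        d = r;
        for n = 1:itmax
            Ad    = A*d;
            alpha = (d'*r)/(d'*Ad);
            x     = x + alpha*d;
            rold  = r;
            r     = b - A*x;
            if norm(r) <= tol*norm(b), break, end
            beta  = (r'*r)/(rold'*rold);
            d     = r + beta*d;
        end
        nits = n;
    end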

Features of CG. A three-term recursion, or a coupled two-term recursion. Only three vectors need to be stored. Worst case: O(√cond(A)) iterations per significant digit of accuracy; common cases are much faster. In exact arithmetic, CG reaches the exact solution of an N × N system in N steps or fewer. Low FLOP count once the matrix-vector product is done. 6 / 52

Non-SPD case You can have a short recursion (few vectors need to be stored) or you can have reasonably fast convergence. You cannot have both! 7 / 52

Example
A=gallery('poisson',100); x=rand(10000,1); b=A*x;
[y,flag,relres,iter]=pcg(A,b,1.e-8,10000,[],[],x);
    flag = 0, iter = 0    (the initial guess x is already the exact solution)
[y,flag,relres,iter]=pcg(A,b,1.e-8,10000);    (default: initial guess is zero)
    flag = 0, iter = 259
norm(y-x)/norm(x)
    ans = 4.2285e-07
For comparison, jacobi2d requires 36917 iterations.
8 / 52

Speedy! Repeating [y,flag,relres,iter]=pcg(A,b,tol,10000); iter for decreasing tolerances:
    tol = 1.e-1:  iter = 3     (3 more)
    tol = 1.e-2:  iter = 15    (12 more)
    tol = 1.e-3:  iter = 88    (73 more)
    tol = 1.e-4:  iter = 118   (30 more)
    tol = 1.e-5:  iter = 139   (21 more)
    tol = 1.e-6:  iter = 195   (56 more)
    tol = 1.e-7:  iter = 222   (27 more)
    tol = 1.e-8:  iter = 259   (37 more)
9 / 52

Example (Example 196: Jacobi took 35 iterations to get 2-digit accuracy). Apply CG to the system
    2u_1 -  u_2               = 1
    -u_1 + 2u_2 -  u_3        = 2
          - u_2 + 2u_3 -  u_4 = 3
                 - u_3 + 2u_4 = 4,
whose true solution is u_1 = 4, u_2 = 7, u_3 = 8, u_4 = 6.
10 / 52

Example results
    Iteration    1        2        3        4
    α            1.5      0.6222   0.5357   0.4000
    β            0.8750   0.2963   0.1607   0
    x_n          1.5      2.6667   3.5000   4.0000
                 3.0      5.3333   7.0000   7.0000
                 4.5      8.0000   8.0000   8.0000
                 6.0      6.0000   6.0000   6.0000
11 / 52

Homework. Text Exercises 296 and 297, and Exercise G.
296: Write a program to do CG for a matrix.
297: Modify the previous program to do CG for the MPP.
Exercise G: CG converges in at most N iterations for an N × N matrix. Show that for A = I, where I is the N × N identity matrix, CG converges in a single iteration, no matter how large N is! Hint: consider the system Ix = b, where b is an arbitrary vector of length N. Starting from an arbitrary initial condition x_0, follow the CG algorithm (by hand) and show that x_1 is the exact solution.
12 / 52

What if A is not SPD? It might work! The formula α_{n-1} = ⟨d_{n-1}, r_{n-1}⟩ / ⟨d_{n-1}, A d_{n-1}⟩ might divide by zero. If it never divides by zero, the method should still converge in N steps in exact arithmetic. 13 / 52

Example: CG converges even though A is not SPD. Take the 10 × 10 symmetric indefinite tridiagonal matrix with off-diagonal entries -1 and diagonal (2, 2, 2, 2, 2, -2, -2, -2, -2, -2):
     2 -1  0  0  0  0  0  0  0  0
    -1  2 -1  0  0  0  0  0  0  0
     0 -1  2 -1  0  0  0  0  0  0
     0  0 -1  2 -1  0  0  0  0  0
     0  0  0 -1  2 -1  0  0  0  0
     0  0  0  0 -1 -2 -1  0  0  0
     0  0  0  0  0 -1 -2 -1  0  0
     0  0  0  0  0  0 -1 -2 -1  0
     0  0  0  0  0  0  0 -1 -2 -1
     0  0  0  0  0  0  0  0 -1 -2
The errors over 10 CG iterations: 7.9038, 3.5936, 25.2284, 2.8189, 2.5492, 1.4865, 5.3006, 1.059, 2.0833, 1.1638e-13.
14 / 52
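
A sketch of how this experiment could be reproduced; the right-hand side and initial guess used for the slide are not shown, so the printed error values will differ.

    % Plain CG (no safeguards) applied to the symmetric indefinite matrix above.
    n  = 10;
    dg = [2*ones(5,1); -2*ones(5,1)];
    A  = spdiags([-ones(n,1) dg -ones(n,1)], -1:1, n, n);  % symmetric, indefinite
    xe = randn(n,1);  b = A*xe;                            % manufactured solution
    x  = zeros(n,1);  r = b;  p = r;
    for k = 1:n
        Ap    = A*p;
        alpha = (r'*r)/(p'*Ap);          % may divide by zero for non-SPD A
        x     = x + alpha*p;
        rnew  = r - alpha*Ap;
        beta  = (rnew'*rnew)/(r'*r);
        p     = rnew + beta*p;   r = rnew;
        fprintf('error = %g\n', norm(x - xe));
    end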

Topics: The CG Algorithm; Algorithmic Options; CG's Two Main Convergence Theorems; What about non-SPD systems? Methods requiring small history; Methods requiring large history; Summary of solvers. 15 / 52

Algorithmic options.
1. An equivalent expression for α_n is α_n = ⟨r_n, r_n⟩ / ⟨d_n, A d_n⟩.
2. The update r_{n+1} = b - A x_{n+1} is equivalent, in exact arithmetic, to r_{n+1} = r_n - α_n A d_n: since x_{n+1} = x_n + α_n d_n, multiply by -A and add b to both sides to get
    b - A x_{n+1} (= r_{n+1}) = b - A x_n (= r_n) - α_n A d_n.
16 / 52
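
An illustration (mine, not from the slides) of option 2: inside CG the recursively updated residual and the explicitly computed residual agree only up to rounding, which is why both variants are listed.

    % Compare r_{n+1} = r_n - alpha_n*A*d_n (recursive) with b - A*x_{n+1} (explicit).
    A = gallery('poisson',30);  b = randn(size(A,1),1);
    x = zeros(size(b));  r = b;  d = r;
    for n = 1:50
        Ad    = A*d;
        alpha = (r'*r)/(d'*Ad);
        x     = x + alpha*d;
        rrec  = r - alpha*Ad;            % recursive update
        rexp  = b - A*x;                 % explicit residual
        beta  = (rrec'*rrec)/(r'*r);
        d     = rrec + beta*d;   r = rrec;
        fprintf('n=%2d  ||r_rec - r_exp|| = %.2e\n', n, norm(rrec - rexp));
    end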

Topics: The CG Algorithm; Algorithmic Options; CG's Two Main Convergence Theorems; What about non-SPD systems? Methods requiring small history; Methods requiring large history; Summary of solvers. 17 / 52

Definitions.
Span: let z_1, ..., z_m be m vectors. Then span{z_1, ..., z_m} is the set of all linear combinations of z_1, ..., z_m, i.e., the subspace
    span{z_1, ..., z_m} = { x = Σ_{i=1}^m α_i z_i : α_i ∈ R }.
Krylov subspace: let x_0 be given and r_0 = b - A x_0. The Krylov subspace determined by r_0 and A is
    X_n = X_n(A; r_0) = span{r_0, A r_0, ..., A^{n-1} r_0},
and the affine Krylov space determined by r_0 and A is
    K_n = K_n(A; x_0) = x_0 + X_n = {x_0 + x : x ∈ X_n}.
18 / 52
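
An illustration (mine) of the definition: the raw Krylov vectors r_0, A r_0, ..., A^{n-1} r_0 span X_n but quickly become nearly linearly dependent, which is why practical methods orthogonalize them (the Arnoldi process, later).

    % Build the raw Krylov basis and look at its conditioning.
    A  = gallery('poisson',20);
    x0 = zeros(size(A,1),1);  b = randn(size(A,1),1);
    r0 = b - A*x0;
    m  = 8;
    K  = zeros(length(r0), m);
    K(:,1) = r0;
    for j = 2:m
        K(:,j) = A*K(:,j-1);
    end
    fprintf('condition number of the raw Krylov basis: %.2e\n', cond(K));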

Important facts about CG, Proposition 302. The CG iterates x_j, residuals r_j, and search directions d_j satisfy
    x_j ∈ x_0 + span{r_0, A r_0, ..., A^{j-1} r_0},
    r_j ∈ r_0 + A span{r_0, A r_0, ..., A^{j-1} r_0},
    d_j ∈ span{r_0, A r_0, ..., A^j r_0}.
Note the misprint in the book!
19 / 52

Proof. The third result is proved by induction: d_0 = r_0 ∈ span{r_0}. Assume d_{n-1} ∈ span{r_0, ..., r_{n-1}}; then d_n = r_n + β_n d_{n-1} ∈ span{r_0, ..., r_n}. The proofs of the first and second results are similar. 20 / 52

Homework Exercise F: Complete the proof of Proposition 302. 21 / 52

First convergence theorem 304. Let A be SPD. Then the CG method satisfies the following:
1. The n-th residual is globally optimal over the affine space K_n in the A^{-1}-norm:
    ‖r_n‖_{A^{-1}} = min_{r ∈ r_0 + A X_n} ‖r‖_{A^{-1}}.
2. The n-th error is globally optimal over K_n in the A-norm:
    ‖e_n‖_A = min_{e ∈ e_0 + X_n} ‖e‖_A.
3. J(x_n) is the global minimum over K_n:
    J(x_n) = min_{x ∈ K_n} J(x).
4. Furthermore, the residuals are orthogonal and the search directions are A-orthogonal:
    ⟨r_n, r_k⟩ = 0 for k ≠ n,    ⟨d_n, d_k⟩_A = 0 for k ≠ n.
22 / 52

Prove ⟨r_n, r_k⟩ = ⟨d_n, A d_k⟩ = 0 for k < n.
By induction: vacuously true for n = 0; assume true for n - 1. First prove ⟨r_n, r_{n-1}⟩ = 0 and ⟨d_n, d_{n-1}⟩_A = 0.
Using r_n = r_{n-1} - α_{n-1} A d_{n-1} and α_{n-1} = ⟨r_{n-1}, r_{n-1}⟩ / ⟨d_{n-1}, A d_{n-1}⟩ (note ⟨A d_{n-1}, r_{n-1}⟩ = ⟨A d_{n-1}, d_{n-1}⟩, since d_{n-1} = r_{n-1} + β_{n-1} d_{n-2} and ⟨A d_{n-1}, d_{n-2}⟩ = 0 by the induction hypothesis):
    ⟨r_n, r_{n-1}⟩ = ⟨r_{n-1}, r_{n-1}⟩ - α_{n-1} ⟨A d_{n-1}, d_{n-1}⟩ = 0.
Using d_n = r_n + β_n d_{n-1}, r_n - r_{n-1} = -α_{n-1} A d_{n-1}, and β_n = ⟨r_n, r_n⟩ / ⟨r_{n-1}, r_{n-1}⟩:
    ⟨d_n, A d_{n-1}⟩ = ⟨r_n, A d_{n-1}⟩ + β_n ⟨d_{n-1}, A d_{n-1}⟩
                     = -⟨r_n, r_n - r_{n-1}⟩ / α_{n-1} + β_n ⟨d_{n-1}, A d_{n-1}⟩
                     = -⟨r_n, r_n⟩ / α_{n-1} + β_n ⟨d_{n-1}, A d_{n-1}⟩
                     = -⟨d_{n-1}, A d_{n-1}⟩ ⟨r_n, r_n⟩ / ⟨r_{n-1}, r_{n-1}⟩ + ⟨r_n, r_n⟩ ⟨d_{n-1}, A d_{n-1}⟩ / ⟨r_{n-1}, r_{n-1}⟩
                     = 0.
23 / 52

Continue the proof of ⟨r_n, r_k⟩ = ⟨d_n, A d_k⟩ = 0: the non-adjacent case, k ≤ n - 1 at the step from n to n + 1.
Using r_{n+1} = r_n - α_n A d_n and the fact that r_k ∈ span{d_0, ..., d_k}:
    ⟨r_{n+1}, r_k⟩ = ⟨r_n, r_k⟩ - α_n ⟨A d_n, r_k⟩ = 0 - 0 = 0,
since ⟨r_n, r_k⟩ = 0 by the induction hypothesis and d_n is A-orthogonal to d_0, ..., d_k.
Using d_{n+1} = r_{n+1} + β_{n+1} d_n and r_{k+1} - r_k = -α_k A d_k:
    ⟨d_{n+1}, A d_k⟩ = ⟨r_{n+1}, A d_k⟩ + β_{n+1} ⟨d_n, A d_k⟩
                     = -(1/α_k) ⟨r_{n+1}, r_{k+1} - r_k⟩ + β_{n+1} ⟨d_n, A d_k⟩ = 0,
since the residual orthogonality just proved kills the first term and the induction hypothesis kills the second.
24 / 52

Prove J(x_n) = min_{x ∈ K_n} J(x): the case n = 1.
Induction proof. For n = 1, x_1 = x_0 + α_0 d_0 ∈ K_1. For arbitrary α,
    J(x_0 + α d_0) = (1/2) ⟨x_0 + α d_0, A(x_0 + α d_0)⟩ - ⟨b, x_0 + α d_0⟩
                   = (1/2) ⟨x_0, A x_0⟩ + α ⟨x_0, A d_0⟩ + (1/2) α^2 ⟨d_0, A d_0⟩ - ⟨b, x_0⟩ - α ⟨b, d_0⟩
                   = J(x_0) + α ⟨A x_0 - b, d_0⟩ + (1/2) α^2 ⟨d_0, A d_0⟩
                   = J(x_0) - α ⟨r_0, d_0⟩ + (1/2) α^2 ⟨d_0, A d_0⟩.
This is minimized when
    -⟨r_0, d_0⟩ + α ⟨d_0, A d_0⟩ = 0,
so α = α_0 yields the minimum over K_1.
25 / 52

Prove J(x_n) = min_{x ∈ K_n} J(x): the induction step.
For x = x_{n-1} + α d_{n-1}, the previous calculation shows that the minimizing α is ⟨r_{n-1}, d_{n-1}⟩ / ⟨d_{n-1}, A d_{n-1}⟩ = α_{n-1}, so x_n minimizes J over the line x_{n-1} + α d_{n-1}.
To see that x_n actually minimizes J over K_n, suppose ỹ ∈ K_n. Write ỹ = x_n + y, so y ∈ X_n. Computing,
    J(x_n + y) = J(x_n) + ⟨x_n, A y⟩ + (1/2) ⟨y, A y⟩ - ⟨b, y⟩
               = J(x_n) + ⟨A x_n - b, y⟩ + (1/2) ⟨y, A y⟩
               = J(x_n) - ⟨r_n, y⟩ + (1/2) ⟨y, A y⟩.
But y ∈ X_n = span{r_0, ..., A^{n-1} r_0} = span{r_0, ..., r_{n-1}}, so ⟨r_n, y⟩ = 0 because ⟨r_n, r_k⟩ = 0 for k < n. Hence
    J(x_n + y) = J(x_n) + (1/2) ⟨y, A y⟩ > J(x_n) unless y = 0,
so J(x_n) is the minimum over all of K_n.
26 / 52

Finite termination. Let A be SPD. Then in exact arithmetic CG produces the exact solution of an N × N system in N steps or fewer. Proof: as long as the residuals r_0, r_1, ..., r_{l-1} are nonzero, they are orthogonal and hence linearly independent; since R^N contains at most N linearly independent vectors, r_l = 0 for some l ≤ N. 27 / 52

Remark on development Sections 6.2 and 6.3 in the text present a way to develop the CG method. We will be skipping it because of lack of time. 28 / 52

Convergence rate of CG. Notation: the set of real polynomials of degree n is denoted Π_n.
Theorem 327. Let A be SPD. The error at CG step n is bounded by
    ‖x - x_n‖_A ≤ ( min_{p ∈ Π_n, p(0)=1}  max_{λ_min ≤ x ≤ λ_max} |p(x)| ) ‖e_0‖_A.
Theorem 329. Given any ε > 0, for
    n ≥ (1/2) √cond(A) ln(2/ε) + 1
the error in the CG iterations is reduced by the factor ε:
    ‖x_n - x‖_A ≤ ε ‖x_0 - x‖_A.
29 / 52

Idea of proof. 1. e_n ∈ e_0 + X_n ⟹ e_n = (polynomial in A) e_0. 2. e_n is optimal. 3. Chebychev polynomials satisfy known bounds. 4. CG must be no worse than the Chebychev bounds. 30 / 52

Polynomial bounds.
    r_n ∈ r_0 + A ( span{r_0, A r_0, ..., A^{n-1} r_0} )
    e_n ∈ e_0 + span{A e_0, A^2 e_0, ..., A^n e_0}
    e_n = [ I + a_1 A + a_2 A^2 + ... + a_n A^n ] e_0 = p(A) e_0
    ‖e_n‖_A = min_{p ∈ Π_n, p(0)=1} ‖p(A) e_0‖_A ≤ ( min_{p ∈ Π_n, p(0)=1} ‖p(A)‖_A ) ‖e_0‖_A
    ‖p(A)‖_A = max_{λ ∈ spectrum(A)} |p(λ)| ≤ max_{λ_min ≤ x ≤ λ_max} |p(x)|.
31 / 52

Chebychev polynomials, min-max problem. The Chebychev polynomials T_n(x) = cos(n cos^{-1}(x)) can be scaled and translated to [a, b] (where a = λ_min, b = λ_max):
    p_n(x) = T_n( (b + a - 2x)/(b - a) ) / T_n( (b + a)/(b - a) ).
These p_n are known to attain the min-max value. Hence
    min_{p ∈ Π_n, p(0)=1}  max_{λ_min ≤ x ≤ λ_max} |p(x)| = max_{a ≤ x ≤ b} |T_n((b + a - 2x)/(b - a))| / T_n((b + a)/(b - a))
                                                          = 1 / T_n((b + a)/(b - a)) = 2 σ^n / (1 + σ^{2n}),
where
    σ = (1 - √(a/b)) / (1 + √(a/b)) = (√κ - 1)/(√κ + 1) = 1 - 2/√κ + O(1/κ),    κ = cond(A) = b/a.
32 / 52

Scaled Chebychev polynomials. [Figure: the scaled Chebychev polynomial p_n(x) plotted on [λ_min, λ_max], equioscillating between its maximum and minimum values.] 33 / 52

Chebychev polynomials: demo script do_cheby.m. 34 / 52
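
The script do_cheby.m itself is not reproduced in the transcript. A sketch of a script that plots the scaled, translated Chebychev polynomial of the previous slide; all names and parameter values here are illustrative guesses, not the course file.

    % Plot p_n(x) = T_n((b+a-2x)/(b-a)) / T_n((b+a)/(b-a)) on [a,b].
    a = 1; b = 100;                        % stand-ins for lambda_min, lambda_max
    x = linspace(a, b, 1000);
    Tin  = @(n,t) cos(n*acos(t));          % Chebychev T_n for |t| <= 1
    Tout = @(n,t) cosh(n*acosh(t));        % Chebychev T_n for t > 1
    hold on
    for n = [2 5 10]
        p = Tin(n, (b + a - 2*x)/(b - a)) / Tout(n, (b + a)/(b - a));
        plot(x, p)
    end
    xlabel('x'), ylabel('p_n(x)'), legend('n=2','n=5','n=10')
    hold off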

How many iterations? How many iterations are needed to get 2 σ^n / (1 + σ^{2n}) ≤ ε? Since σ ≈ 1 - 2/√κ, we have log σ ≈ -2/√κ, and
    σ^n ≤ ε/2  ⟺  n log σ ≤ log(ε/2),  i.e.  n ≥ (√κ/2) log(2/ε),
so it suffices to take n ≥ (√κ/2) log(2/ε) + 1.
35 / 52
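
A small check (mine) of this estimate against the Poisson example from earlier in the slides; condest is used as a stand-in for cond(A).

    % Compare the upper-bound estimate n >= 0.5*sqrt(kappa)*log(2/tol) + 1
    % with the observed pcg iteration count.
    A     = gallery('poisson',100);
    b     = A*rand(size(A,1),1);
    tol   = 1.e-8;
    kappa = condest(A);                                % estimate of cond(A)
    n_est = ceil(0.5*sqrt(kappa)*log(2/tol) + 1);
    [~,~,~,n_obs] = pcg(A,b,tol,10000);
    fprintf('upper-bound estimate: %d, observed: %d iterations\n', n_est, n_obs);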

Polynomial error. e_n = [ I + a_1 A + a_2 A^2 + ... + a_n A^n ] e_0 = p(A) e_0. Repeated eigenvalues are treated as a single eigenvalue, which accelerates convergence: K distinct eigenvalues ⟹ convergence in K iterations (recall Exercise G). Clusters of eigenvalues also speed up convergence. 36 / 52
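
An illustration (mine) of the "K distinct eigenvalues" claim: an SPD matrix built to have only three distinct eigenvalues, on which pcg needs only about three iterations.

    % 200 x 200 SPD matrix with eigenvalues {1, 2, 5} only.
    n = 200;
    [Q,~] = qr(randn(n));                    % random orthogonal matrix
    lam   = [ones(100,1); 2*ones(60,1); 5*ones(40,1)];
    A     = Q*diag(lam)*Q';
    A     = (A + A')/2;                      % symmetrize against rounding
    b     = randn(n,1);
    [~,flag,relres,iter] = pcg(A,b,1e-10,n);
    fprintf('flag = %d, iter = %d, relres = %.1e\n', flag, iter, relres);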

Preconditioning. Instead of solving Ax = b, solve MAx = Mb. M = A^{-1}: converges in 1 iteration. M ≈ A^{-1}, but computing Mx is fast. M = (L L^T)^{-1}, where L L^T are approximate factors of A. A few iterations of another method (such as Gauss-Seidel). A universe of alternatives. 37 / 52

PCG Algorithm for solving Ax = b. Given an SPD matrix A, a preconditioner M, an initial guess x_0, a right-hand side b, and a maximum number of iterations itmax:
    r_0 = b - A x_0
    solve M d_0 = r_0
    z_0 = d_0
    for n = 0:itmax
        α_n = ⟨r_n, z_n⟩ / ⟨d_n, A d_n⟩
        x_{n+1} = x_n + α_n d_n
        r_{n+1} = b - A x_{n+1}
        if converged, stop, end
        solve M z_{n+1} = r_{n+1}
        β_{n+1} = ⟨r_{n+1}, z_{n+1}⟩ / ⟨r_n, z_n⟩
        d_{n+1} = z_{n+1} + β_{n+1} d_n
    end
38 / 52
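
A minimal MATLAB sketch of the PCG algorithm above; the function name and the interface (a handle Msolve that returns M\r) are illustrative choices.

    % Save as pcg_sketch.m.  Preconditioned CG with explicit residuals,
    % following the slide; Msolve(r) should return M\r.
    function [x,nits] = pcg_sketch(A,b,x,Msolve,tol,itmax)
        r = b - A*x;
        z = Msolve(r);
        d = z;
        for n = 0:itmax
            Ad    = A*d;
            alpha = (r'*z)/(d'*Ad);
            x     = x + alpha*d;
            rnew  = b - A*x;
            if norm(rnew) <= tol*norm(b), break, end
            znew  = Msolve(rnew);
            beta  = (rnew'*znew)/(r'*z);
            d     = znew + beta*d;
            r = rnew;  z = znew;
        end
        nits = n;
    end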

Example. Write A = U^T U, where U is upper triangular: the Cholesky factorization. The factors have fill-in (curse of dimensionality). Only keep nonzeros where A is nonzero: incomplete Cholesky. 39 / 52

Example
N=50; A=gallery('poisson',N); xact=sin(1:N^2)'; b=A*xact; tol=1.e-6; maxit=N^2;
tic; [x,flag,relres,iter0] = pcg(A,b,tol,maxit); toc
iter0
    Elapsed time is 0.023921 seconds.
    iter0 = 73
U=chol(A); U(A==0)=0;
tic; [x,flag,relres,iter] = pcg(A,b,tol,maxit,U',U); toc
iter
    Elapsed time is 0.011928 seconds.
    iter = 19
40 / 52
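
As an aside (not on the slide), MATLAB also provides ichol, which returns a lower-triangular incomplete Cholesky factor; a sketch of the equivalent call, assuming the same A, b, tol, maxit as above:

    % ichol gives L with A ~ L*L' (zero fill-in by default);
    % pass it to pcg as the split preconditioner (L, L').
    L = ichol(A);
    [x,flag,relres,iter_ic] = pcg(A,b,tol,maxit,L,L');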

Topics: The CG Algorithm; Algorithmic Options; CG's Two Main Convergence Theorems; What about non-SPD systems? Methods requiring small history; Methods requiring large history; Summary of solvers. 41 / 52

GMRES is most popular. Youcef Saad and Martin H. Schultz, GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems, SIAM J. Sci. and Stat. Comput. 7 (1986), pp. 856-869.
x_n is the vector that minimizes ‖Ax - b‖ over the Krylov space K_n = span{r^(0), A r^(0), A^2 r^(0), ..., A^{n-1} r^(0)}.
The basis for K_n is formed by modified Gram-Schmidt orthogonalization, the Arnoldi process:
    w^(n) = A v^(n)
    for k = 1, ..., n
        w^(n) = w^(n) - (w^(n), v^(k)) v^(k)
    end
    v^(n+1) = w^(n) / ‖w^(n)‖
42 / 52
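
A MATLAB sketch of the Arnoldi process above (modified Gram-Schmidt), building an orthonormal basis V of the Krylov space together with the small Hessenberg matrix H that GMRES works with; the function name is an illustrative choice.

    % Save as arnoldi_sketch.m.
    function [V,H] = arnoldi_sketch(A,r0,m)
        n = length(r0);
        V = zeros(n,m+1);  H = zeros(m+1,m);
        V(:,1) = r0/norm(r0);
        for j = 1:m
            w = A*V(:,j);
            for k = 1:j
                H(k,j) = w'*V(:,k);
                w = w - H(k,j)*V(:,k);    % modified Gram-Schmidt step
            end
            H(j+1,j) = norm(w);
            if H(j+1,j) == 0, break, end  % happy breakdown: K_j is A-invariant
            V(:,j+1) = w/H(j+1,j);
        end
    end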

Features of GMRES. A not SPD ⟹ no three-term relation. All of the basis vectors must be kept. Usually the whole process is restarted after m steps: GMRES(m). Reference: Yousef Saad, Iterative Methods for Sparse Linear Systems, Second edition, SIAM, 2003, www-users.cs.umn.edu/%7esaad/itermethbook_2nded.pdf. 43 / 52

Topics: The CG Algorithm; Algorithmic Options; CG's Two Main Convergence Theorems; What about non-SPD systems? Methods requiring small history; Methods requiring large history; Summary of solvers. 44 / 52

Conjugate gradient squared (cgs). Conjugate gradient applied to A^H A x = A^H b (A^H is the Hermitian transpose: transpose plus complex conjugate). Convergence behavior can be irregular. Two (not independent) matrix-vector multiplies per iteration. A^H A: the condition number is squared. Works on general matrices, even complex. 45 / 52

Conjugate gradient applied to the normal equations (cgn). Conjugate gradient applied to A^H A x = A^H b directly; A^H is formed. Two matrix-vector multiplies per iteration. 46 / 52

minres and symmlq. Variants of the CG method for symmetric indefinite systems. MINRES minimizes the residual in the 2-norm. SYMMLQ solves the projected system, but does not minimize anything; it keeps the residual orthogonal to all previous ones. SYMMLQ uses an LQ decomposition to solve an intermediate system. 47 / 52

Bi-conjugate gradient (bicg). Two coupled recurrences, one for the residuals and directions r, p and one for the shadow residuals and directions r̃, p̃:
    r^(n) = r^(n-1) - α_n A p^(n),          r̃^(n) = r̃^(n-1) - α_n A^T p̃^(n)
    p^(n) = r^(n-1) + β_{n-1} p^(n-1),      p̃^(n) = r̃^(n-1) + β_{n-1} p̃^(n-1)
    α_n = ⟨r̃^(n-1), r^(n-1)⟩ / ⟨p̃^(n), A p^(n)⟩,    β_n = ⟨r̃^(n), r^(n)⟩ / ⟨r̃^(n-1), r^(n-1)⟩
The p̃^(n) are conjugate-orthogonal to the p^(k). No minimization principle ⟹ irregular convergence. Two matrix-vector products per iteration, one with A^T. Works for non-symmetric matrices. Can break down.
48 / 52
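
A small usage sketch (mine): MATLAB's bicg on a nonsymmetric test matrix; the matrix and tolerances are illustrative choices.

    n = 400;
    A = gallery('poisson',20) + 0.3*spdiags(ones(n,1),2,n,n);  % break symmetry
    b = randn(n,1);
    [x,flag,relres,iter] = bicg(A,b,1e-8,n);
    fprintf('bicg: flag = %d, iter = %d, relres = %.1e\n', flag, iter, relres);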

Topics: The CG Algorithm; Algorithmic Options; CG's Two Main Convergence Theorems; What about non-SPD systems? Methods requiring small history; Methods requiring large history; Summary of solvers. 49 / 52

GMRES. x_n is the vector that minimizes ‖Ax - b‖ over the Krylov space K_n = span{r^(0), A r^(0), A^2 r^(0), ..., A^{n-1} r^(0)}. All basis vectors for K_n must be kept. Restart to keep the history size under control. 50 / 52
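
A small usage sketch (mine) of restarted GMRES in MATLAB: gmres(A,b,restart,tol,maxit) keeps at most restart basis vectors before restarting.

    n = 400;
    A = gallery('poisson',20) + 0.3*spdiags(ones(n,1),2,n,n);  % nonsymmetric
    b = randn(n,1);
    [x,flag,relres,iter] = gmres(A,b,20,1e-8,50);   % GMRES(20), up to 50 restarts
    fprintf('flag = %d, outer = %d, inner = %d, relres = %.1e\n', ...
            flag, iter(1), iter(2), relres);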

Topics: The CG Algorithm; Algorithmic Options; CG's Two Main Convergence Theorems; What about non-SPD systems? Methods requiring small history; Methods requiring large history; Summary of solvers. 51 / 52

How do I choose?
    Symmetric?
        yes -> Definite?
            yes -> CG
            no  -> MINRES or SYMMLQ or CR
        no  -> GMRES or CGN
Best to use no preconditioner at first; good methods will be better when preconditioned.
52 / 52