ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

LONG CHEN

Date: March 6, 2016.

We shall present iterative methods for solving the linear algebraic equation Au = b based on Krylov subspaces. We derive the conjugate gradient (CG) method, developed by Hestenes and Stiefel in the 1950s [1] for symmetric and positive definite matrices A, and briefly mention the GMRES method [3] for general non-symmetric systems.

1. CONJUGATE GRADIENT METHOD

In this section we introduce the conjugate gradient (CG) method for solving Au = b with a symmetric and positive definite (SPD) operator A defined on a Hilbert space V with dim V = N, together with its preconditioned version PCG. We derive CG from the point of view of projection in the A-inner product. We use (·,·) for the standard inner product and (·,·)_A for the inner product induced by the SPD operator A. When we speak of orthogonal vectors, we refer to the default inner product (·,·), and we emphasize orthogonality in (·,·)_A by calling it A-orthogonality.

1.1. Gradient Method. Recall that the simplest iterative method, the Richardson method, has the form

    u_{k+1} = u_k + α r_k,

where α is a fixed constant. The correction α r_k can be thought of as an approximation to the solution of the residual equation A e_k = r_k. Now we solve the residual equation in the one-dimensional subspace spanned by r_k, i.e., we look for ê_k = α_k r_k such that

    A(α_k r_k) = r_k   in span{r_k}.

Testing with r_k, we get

(1)  α_k = (r_k, r_k)/(A r_k, r_k) = (u - u_k, r_k)_A/(r_k, r_k)_A.

The iterative method

    u_{k+1} = u_k + α_k r_k

is called the gradient method. Since α_k is a nonlinear function of u_k, the method is nonlinear.

The second identity in (1) shows that the correction ê_k = α_k r_k is the projection of the error e_k = u - u_k onto the space span{r_k} in the A-inner product. From this we immediately get the A-orthogonality

(2)  (e_k - ê_k, r_k)_A = 0,

which can be rewritten as

(3)  (u - u_{k+1}, r_k)_A = (r_{k+1}, r_k) = 0.

Namely, consecutive residuals are orthogonal.
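As an illustration only (this listing is not part of the original notes; the function name, argument list, and stopping rule are made-up choices), a minimal MATLAB sketch of the gradient method with the exact line search (1) might look as follows:

    function u = gradientmethod(A, b, u, tol, maxit)
    % Gradient (steepest descent) method for SPD A; a sketch, not the notes' code.
    r = b - A*u;                     % initial residual
    for k = 1:maxit
        if norm(r) <= tol*norm(b), break; end
        Ar = A*r;
        alpha = (r'*r)/(Ar'*r);      % exact line search (1)
        u = u + alpha*r;             % u_{k+1} = u_k + alpha_k r_k
        r = r - alpha*Ar;            % r_{k+1} = r_k - alpha_k A r_k
    end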

Remark 1.1. The name gradient method comes from the minimization point of view. Let E(u) = ‖u‖_A^2/2 - (b, u) and consider the optimization problem min_{u∈V} E(u). Given a current approximation u_k, the descent direction is the negative gradient -∇E(u_k) = b - A u_k = r_k. The optimal step size α_k is obtained from the one-dimensional minimization problem min_α E(u_k + α r_k). The residual is simply the negative gradient of the energy.

We now give a convergence analysis of the gradient method to show that it converges as fast as the optimal Richardson method. Recall that in the Richardson method the optimal choice α = 2/(λ_min(A) + λ_max(A)) requires information on the eigenvalues of A, while in the gradient method α_k is computed using the action of A only. The optimality is built in through the optimization of the step size (the so-called exact line search).

Theorem 1.2. Let A be SPD and let u_k be the k-th iterate of the gradient method with initial guess u_0. Then

(4)  ‖u - u_k‖_A ≤ ((κ(A) - 1)/(κ(A) + 1))^k ‖u - u_0‖_A,

where κ(A) = λ_max(A)/λ_min(A) is the condition number of A.

Proof. Since we solve the residual equation in the subspace span{r_k} exactly, or in other words α_k r_k is the A-projection of u - u_k, we have

    ‖u - u_{k+1}‖_A = ‖e_k - ê_k‖_A = inf_{α∈R} ‖e_k - α r_k‖_A = inf_{α∈R} ‖(I - αA)(u - u_k)‖_A.

Since I - αA is symmetric, ‖I - αA‖_A = ρ(I - αA). Consequently

    ‖u - u_{k+1}‖_A ≤ inf_{α∈R} ρ(I - αA) ‖u - u_k‖_A = ((κ(A) - 1)/(κ(A) + 1)) ‖u - u_k‖_A.

The proof is completed by recursion on k.

The level set of ‖x‖_A^2 is an ellipsoid, and the condition number of A is the ratio of the major and minor axes of this ellipsoid. When κ(A) ≫ 1, the ellipsoid is very skinny and the gradient method follows a long zig-zag path towards the solution. For example, for κ(A) = 100 the contraction factor is 99/101 ≈ 0.98, so roughly 230 iterations are needed to reduce the error by a factor of 100.

1.2. Conjugate Gradient Method. Tracing back to the initial guess u_0, the (k+1)-th approximation of the gradient method can be written as

    u_{k+1} = u_0 + α_0 r_0 + α_1 r_1 + ... + α_k r_k.

Let

    V_k = span{r_0, r_1, ..., r_k}.

The correction Σ_{i=0}^{k} α_i r_i can be thought of as an approximate solution of the residual equation

    A e_0 = r_0   in V_k,

obtained by computing the coefficient along each basis vector r_0, r_1, ..., r_k separately. It is, however, not the best approximation of e_0 = u - u_0 in the subspace V_k, since the basis {r_0, r_1, ..., r_k} is not A-orthogonal. If we can find an A-orthogonal basis {p_0, p_1, ..., p_k} of V_k, i.e., (p_i, p_j)_A = 0 for i ≠ j, then the best approximation, i.e., the A-projection P_k e_0 ∈ V_k, is given by the formulae

    P_k e_0 = Σ_{i=0}^{k} α_i p_i,   α_i = (e_0, p_i)_A/(p_i, p_i)_A,   for i = 0, ..., k.
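As a small illustration of this projection formula (added here, not in the original; the function name and interface are hypothetical), one could A-orthogonalize an arbitrary basis by Gram-Schmidt in the (·,·)_A inner product and then assemble the A-projection term by term:

    function Pe = Aprojection(A, V, e)
    % Sketch: A-orthogonalize the columns of V by Gram-Schmidt in the A-inner
    % product, then form the A-projection of e onto span(V) via the formula above.
    m = size(V, 2);
    P = zeros(size(V));
    for i = 1:m
        p = V(:, i);
        for j = 1:i-1                % remove A-components along p_1, ..., p_{i-1}
            p = p - ((P(:,j)'*(A*p))/(P(:,j)'*(A*P(:,j)))) * P(:, j);
        end
        P(:, i) = p;
    end
    Pe = zeros(size(e));
    for i = 1:m                      % Pe = sum_i ((e,p_i)_A/(p_i,p_i)_A) p_i
        Pe = Pe + ((P(:,i)'*(A*e))/(P(:,i)'*(A*P(:,i)))) * P(:, i);
    end

The discussion below shows that, for the Krylov basis generated by CG, this full Gram-Schmidt loop collapses to a three-term recursion.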

Let u_k = u_0 + Σ_{i=0}^{k-1} α_i p_i for k ≥ 1. Then u_{k+1} can be computed recursively by

    u_{k+1} = u_k + α_k p_k,   α_k = (u - u_0, p_k)_A/(p_k, p_k)_A.

The residual

    r_{k+1} = b - A u_{k+1} = b - A u_k - α_k A p_k = r_k - α_k A p_k

can also be updated recursively. The vector p_k is called the conjugate direction. It is better than the gradient direction since it is A-orthogonal to all previous directions, while the gradient direction is only orthogonal (not even A-orthogonal) to the previous one.

The questions are:
- How to find such an A-orthogonal basis?
- Which A-orthogonal basis is better in terms of storage and computation?

The answer to the first question is easy: the Gram-Schmidt process. We begin with p_0 = r_0, update u_1 = u_0 + α_0 p_0, and compute the new residual r_1 = b - A u_1, or update r_1 = r_0 - α_0 A p_0. If r_1 = 0, we have obtained the exact solution and stop. Otherwise we have two vectors {p_0, r_1} and apply the Gram-Schmidt process to obtain

    p_1 = r_1 + β_0 p_0,   β_0 = -(r_1, p_0)_A/(p_0, p_0)_A,

which is A-orthogonal to p_0. We can then update u_2 = u_1 + α_1 p_1 and continue.

In general, suppose we have found an A-orthogonal basis {p_0, p_1, ..., p_k} of V_k. We update the approximation u_{k+1} = u_k + α_k p_k and the residual r_{k+1} = r_k - α_k A p_k. Again we apply the Gram-Schmidt process to look for the next conjugate direction

    p_{k+1} = r_{k+1} + β_k p_k + γ_{k-1} p_{k-1} + ... + γ_0 p_0.

The coefficients can be computed from the A-orthogonality (p_{k+1}, p_i)_A = 0 for 0 ≤ i ≤ k; in particular γ_i = -(r_{k+1}, p_i)_A/(p_i, p_i)_A. If we had to compute all these coefficients, we would need to store all previous basis vectors, which is efficient in neither time nor space. The amazing fact is that all γ_i, 0 ≤ i ≤ k - 1, are zero, so we can simply compute the new conjugate direction as

(5)  p_{k+1} = r_{k+1} + β_k p_k,   β_k = -(r_{k+1}, p_k)_A/(p_k, p_k)_A.

The formula (5) is justified by the following orthogonality.

Lemma 1.3. The residual r_{k+1} is (·,·)-orthogonal to V_k and A-orthogonal to V_{k-1}.

Proof. Since {p_0, p_1, ..., p_k} is an A-orthogonal basis of V_k, the formula u_{k+1} - u_0 = Σ_{i=0}^{k} α_i p_i, as mentioned before, computes the A-projection of the error, i.e., u_{k+1} - u_0 = P_k(u - u_0). We write u - u_{k+1} = u - u_0 - (u_{k+1} - u_0) = (I - P_k)(u - u_0). Therefore u - u_{k+1} is A-orthogonal to V_k, i.e., r_{k+1} = A(u - u_{k+1}) ⊥ V_k. The first orthogonality is proved.

By the update formula for the residual, r_i = r_{i-1} - α_{i-1} A p_{i-1}, we get A p_{i-1} ∈ span{r_{i-1}, r_i} ⊆ V_k for i ≤ k. Since we have proved r_{k+1} ⊥ V_k, we get (r_{k+1}, p_i)_A = (r_{k+1}, A p_i) = 0 for i ≤ k - 1, i.e., r_{k+1} is A-orthogonal to V_{k-1}. In particular, all the coefficients γ_i above vanish.

This lemma shows the advantage of the conjugate gradient method over the gradient method: the new residual is orthogonal to the whole space V_k, not only to the single residual of the previous step. This additional orthogonality reduces the Gram-Schmidt process to a three-term recursion.
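As a quick numerical sanity check (added; not part of the notes, and the test matrix is an arbitrary choice), one can run a few CG steps explicitly with the two-term recursion (5) and verify that the resulting directions are mutually A-orthogonal:

    % Run three CG steps and check that P'*A*P is (numerically) diagonal.
    A = gallery('poisson', 5);            % 25x25 SPD test matrix
    b = ones(25, 1);  u = zeros(25, 1);
    r = b - A*u;  p = r;
    P = [];                               % collect the conjugate directions
    for k = 1:3
        Ap = A*p;
        alpha = (r'*p)/(p'*Ap);           % computable form of alpha_k
        u = u + alpha*p;
        rnew = r - alpha*Ap;
        beta = -(rnew'*Ap)/(p'*Ap);       % beta_k = -(r_{k+1},p_k)_A/(p_k,p_k)_A
        P = [P, p];                       % store the current direction
        p = rnew + beta*p;                % two-term recursion (5)
        r = rnew;
    end
    disp(P'*A*P)                          % off-diagonal entries are ~ round-off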

In summary, starting from an initial guess u_0 and p_0 = r_0, for k = 0, 1, 2, ..., we use three recursion formulae to compute

    u_{k+1} = u_k + α_k p_k,   α_k = (u - u_0, p_k)_A/(p_k, p_k)_A,
    r_{k+1} = r_k - α_k A p_k,
    p_{k+1} = r_{k+1} + β_k p_k,   β_k = -(r_{k+1}, p_k)_A/(p_k, p_k)_A.

The method is called the conjugate gradient (CG) method since the conjugate direction is obtained by a correction of the gradient (residual) direction. The importance of the gradient direction lies in the orthogonality given in Lemma 1.3.

We now derive more efficient formulae for α_k and β_k.

Proposition 1.4.

    α_k = (r_k, r_k)/(A p_k, p_k),   β_k = (r_{k+1}, r_{k+1})/(r_k, r_k).

Proof. First, since r_k = r_0 - Σ_{i=0}^{k-1} α_i A p_i and (p_k, A p_i) = 0 for 0 ≤ i ≤ k - 1, we have

    (u - u_0, p_k)_A = (r_0, p_k) = (r_k, p_k).

Then, since p_k = r_k + β_{k-1} p_{k-1} and (r_k, p_{k-1}) = 0, we have (r_k, p_k) = (r_k, r_k). Recall that α_k is the coefficient of u - u_0 corresponding to the basis vector p_k, i.e.,

    α_k = (u - u_0, p_k)_A/(p_k, p_k)_A = (r_k, r_k)/(A p_k, p_k).

Second, to prove the formula for β_k, we use the recursion formula for the residual to get

    (r_{k+1}, A p_k) = -(1/α_k)(r_{k+1}, r_{k+1}).

Then, using the formula just proved for α_k, we replace (A p_k, p_k) = (1/α_k)(r_k, r_k) to get

    β_k = -(r_{k+1}, p_k)_A/(p_k, p_k)_A = -(r_{k+1}, A p_k)/(A p_k, p_k) = (r_{k+1}, r_{k+1})/(r_k, r_k).

We present the algorithm of the conjugate gradient method as follows.

    function u = CG(A,b,u,tol)
    % Conjugate gradient method for an SPD matrix A: solve A*u = b starting from u.
    tol = tol*norm(b);
    k = 1;
    r = b - A*u;
    p = r;
    r2 = r'*r;
    while sqrt(r2) > tol && k < length(b)
        Ap = A*p;
        alpha = r2/(p'*Ap);          % alpha_k = (r_k,r_k)/(A p_k,p_k)
        u = u + alpha*p;
        r = r - alpha*Ap;
        r2old = r2;
        r2 = r'*r;
        beta = r2/r2old;             % beta_k = (r_{k+1},r_{k+1})/(r_k,r_k)
        p = r + beta*p;
        k = k + 1;
    end
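A possible way to call this routine (added for illustration; the test matrix, tolerance, and the assumption that the listing is saved as CG.m are my own choices):

    A  = gallery('poisson', 10);     % 100x100 SPD matrix (5-point Laplacian)
    b  = ones(size(A,1), 1);
    u0 = zeros(size(A,1), 1);
    u  = CG(A, b, u0, 1e-8);
    fprintf('relative residual = %g\n', norm(b - A*u)/norm(b));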

Several remarks on the realization are listed below:
- The most time-consuming part is the matrix-vector multiplication A*p.
- The error is measured by the relative residual, ‖r‖ ≤ tol ‖b‖.
- A maximum number of iterations is imposed to avoid an excessively long loop.

1.3. Convergence Analysis of the Conjugate Gradient Method. Recall that

    V_k = span{r_0, r_1, ..., r_k} = span{p_0, p_1, ..., p_k}.

An important observation is the following.

Lemma 1.5. V_k = span{r_0, A r_0, ..., A^k r_0}.

Proof. We prove it by induction. For k = 0 the statement is trivial. Suppose it holds for k = i; we prove it for k = i + 1 by noting that

    V_{i+1} = V_i + span{r_{i+1}} = V_i + span{r_i - α_i A p_i} = V_i + span{A p_i} = V_i + A V_i,

and by the induction hypothesis V_i + A V_i = span{r_0, A r_0, ..., A^{i+1} r_0}.

The space span{r_0, A r_0, ..., A^k r_0} is called a Krylov subspace, and the CG method belongs to the large class of Krylov subspace iterative methods.

By construction, u_k = u_0 + P_{k-1}(u - u_0). Writing u - u_k = u - u_0 - (u_k - u_0), we have

(6)  ‖u - u_k‖_A = inf_{v ∈ V_{k-1}} ‖u - u_0 + v‖_A.

Theorem 1.6. Let A be SPD and let u_k be the k-th iterate of the CG method with initial guess u_0. Then

(7)  ‖u - u_k‖_A = inf_{p_k ∈ P_k, p_k(0)=1} ‖p_k(A)(u - u_0)‖_A,

(8)  ‖u - u_k‖_A ≤ inf_{p_k ∈ P_k, p_k(0)=1} sup_{λ ∈ σ(A)} |p_k(λ)| ‖u - u_0‖_A.

Proof. Any v ∈ V_{k-1} can be expanded as

    v = Σ_{i=0}^{k-1} c_i A^i r_0 = Σ_{i=1}^{k} c_{i-1} A^i (u - u_0).

Let p_k(x) = 1 + Σ_{i=1}^{k} c_{i-1} x^i. Then

    u - u_0 + v = p_k(A)(u - u_0).

The identity (7) then follows from (6). Since A^i is symmetric in the A-inner product, we have

    ‖p_k(A)‖_A = ρ(p_k(A)) = sup_{λ ∈ σ(A)} |p_k(λ)|,

which leads to the estimate (8).

The polynomial p_k ∈ P_k with the constraint p_k(0) = 1 is called a residual polynomial. Various convergence results for the CG method can be obtained by choosing specific residual polynomials; for example, the choice p_k(x) = (1 - 2x/(λ_min(A) + λ_max(A)))^k in (8) already recovers the gradient-method bound (4).

Corollary 1.7. Let A be SPD with eigenvectors {φ_i}_{i=1}^{N}. Let b ∈ span{φ_{i_1}, φ_{i_2}, ..., φ_{i_k}}, k ≤ N. Then the CG method with u_0 = 0 will find the solution of Au = b in at most k iterations. In particular, for a given b ∈ R^N, the CG algorithm will find the solution of Au = b within N iterations; namely, CG can be viewed as a direct method.

Proof. Choose the polynomial p(x) = Π_{l=1}^{k} (1 - x/λ_{i_l}). Then p is a residual polynomial of order k with p(0) = 1 and p(λ_{i_l}) = 0. Let b = Σ_{l=1}^{k} b_l φ_{i_l}. Then u = Σ_{l=1}^{k} u_l φ_{i_l} with u_l = b_l/λ_{i_l}, and

    p(A) Σ_{l=1}^{k} u_l φ_{i_l} = Σ_{l=1}^{k} u_l p(λ_{i_l}) φ_{i_l} = 0.

The assertion then follows from the identity (7).

Remark 1.8. The CG method can also be applied to a symmetric and positive semi-definite matrix A. Let {φ_i}_{i=1}^{k} be the eigenvectors associated with λ_min(A) = 0. By Corollary 1.7, if b ∈ range(A) = span{φ_{k+1}, φ_{k+2}, ..., φ_N}, then the CG method with u_0 ∈ range(A) will find a solution of Au = b within N - k iterations.

CG was invented as a direct method, but it is more effective when used as an iterative method. The rate of convergence depends crucially on the distribution of the eigenvalues of A, and the iteration may reach a given tolerance within k ≪ N steps.

Theorem 1.9. Let u_k be the k-th iterate of the CG method with initial guess u_0. Then

(9)  ‖u - u_k‖_A ≤ 2 ((√κ(A) - 1)/(√κ(A) + 1))^k ‖u - u_0‖_A.

Proof. Let b = λ_max(A) and a = λ_min(A). Choose

    p_k(x) = T_k((b + a - 2x)/(b - a)) / T_k((b + a)/(b - a)),

where T_k is the Chebyshev polynomial of degree k, given by

    T_k(x) = cos(k arccos x) if |x| ≤ 1,   T_k(x) = cosh(k arccosh x) if |x| ≥ 1.

It is easy to see that p_k(0) = 1, and if x ∈ [a, b] then |(b + a - 2x)/(b - a)| ≤ 1. Hence

    inf_{p_k ∈ P_k, p_k(0)=1} sup_{λ ∈ σ(A)} |p_k(λ)| ≤ [T_k((b + a)/(b - a))]^{-1}.

We set

    (b + a)/(b - a) = cosh σ = (e^σ + e^{-σ})/2.

Solving this equation for e^σ, we have

    e^σ = (√κ(A) + 1)/(√κ(A) - 1),   with κ(A) = b/a.

We then obtain

    T_k((b + a)/(b - a)) = cosh(kσ) = (e^{kσ} + e^{-kσ})/2 ≥ (1/2) e^{kσ} = (1/2) ((√κ(A) + 1)/(√κ(A) - 1))^k,

which completes the proof.
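A quick numerical illustration of Corollary 1.7 (added; it uses MATLAB's built-in pcg rather than the listings in these notes, and all parameter choices are arbitrary): if b lies in the span of three eigenvectors, CG with u_0 = 0 converges in at most three iterations, up to round-off.

    n = 50;
    Q = orth(randn(n));                    % orthonormal eigenvector basis
    lambda = linspace(1, 100, n)';         % prescribed positive spectrum
    A = Q*diag(lambda)*Q';  A = (A + A')/2;
    b = Q(:,1) + 2*Q(:,5) + 3*Q(:,20);     % b in the span of three eigenvectors
    [u, flag, relres, iter] = pcg(A, b, 1e-10, n);   % built-in CG, x0 = 0
    fprintf('iterations = %d, relative residual = %g\n', iter, relres);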

The estimate (9) shows that CG is in general better than the gradient method. Furthermore, if the condition number of A is close to one, the CG iteration converges very fast. Even if κ(A) is large, the iteration performs well when the eigenvalues are clustered in a few small intervals; see the following estimate.

Corollary 1.10. Assume that σ(A) = σ_0(A) ∪ σ_1(A) and that l is the number of elements of σ_0(A). Then

(10)  ‖u - u_k‖_A ≤ 2M ((√(b/a) - 1)/(√(b/a) + 1))^{k-l} ‖u - u_0‖_A,

where a = min_{λ ∈ σ_1(A)} λ, b = max_{λ ∈ σ_1(A)} λ, and M = max_{λ ∈ σ_1(A)} Π_{μ ∈ σ_0(A)} |1 - λ/μ|.

1.4. Preconditioned Conjugate Gradient (PCG) Method. Let B be an SPD matrix. We apply the Gram-Schmidt process to the subspace

    V_k = span{B r_0, B r_1, ..., B r_k}

to get another A-orthogonal basis. That is, starting from an initial guess u_0 and p_0 = B r_0, for k = 0, 1, 2, ..., we use three recursion formulae to compute

    u_{k+1} = u_k + α_k p_k,   α_k = (B r_k, r_k)/(A p_k, p_k),
    r_{k+1} = r_k - α_k A p_k,
    p_{k+1} = B r_{k+1} + β_k p_k,   β_k = (B r_{k+1}, r_{k+1})/(B r_k, r_k).

We need to justify that the direction p_{k+1} computed above is A-orthogonal to V_k.

Lemma 1.11. B r_{k+1} is A-orthogonal to V_{k-1}, and p_{k+1} is A-orthogonal to V_k.

Exercise 1.12. Prove Lemma 1.11.

We present the algorithm of the preconditioned conjugate gradient method as follows.

    function u = pcg(A,b,u,B,tol)
    % Preconditioned conjugate gradient method: solve A*u = b with an SPD
    % preconditioner B, starting from u.
    tol = tol*norm(b);
    k = 1;
    r = b - A*u;
    rho = 1;
    while sqrt(rho) > tol && k < length(b)
        Br = B*r;
        rho = r'*Br;
        if k == 1
            p = Br;
        else
            beta = rho/rho_old;      % beta_k = (B r_{k+1},r_{k+1})/(B r_k,r_k)
            p = Br + beta*p;
        end
        Ap = A*p;
        alpha = rho/(p'*Ap);         % alpha_k = (B r_k,r_k)/(A p_k,p_k)
        u = u + alpha*p;
        r = r - alpha*Ap;
        rho_old = rho;
        k = k + 1;
    end
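A possible call of this routine with the diagonal (Jacobi) preconditioner B = D^{-1} discussed at the end of this section (added for illustration; the test matrix and tolerance are arbitrary, and the listing is assumed to be saved as pcg.m, shadowing MATLAB's built-in pcg):

    A  = gallery('poisson', 10);                 % 100x100 SPD matrix
    n  = size(A, 1);
    b  = ones(n, 1);
    u0 = zeros(n, 1);
    B  = spdiags(1./diag(A), 0, n, n);           % B = inv(diag(A))
    u  = pcg(A, b, u0, B, 1e-8);                 % the notes' pcg above
    fprintf('relative residual = %g\n', norm(b - A*u)/norm(b));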

Similarly, we collect several remarks on the implementation:
- If we think of A : V → V′, the residual lies in the dual space V′, and B is the Riesz representation from V′ to V with respect to some inner product. In the original CG, B is the identity matrix.
- A general guiding principle for designing a good preconditioner is to find the correct inner product such that the operator A is continuous and stable.
- For an effective PCG, the computation B*r is crucial. The matrix B does not have to be formed explicitly; all we need is the matrix-vector multiplication, which can be replaced by a subroutine.

To perform the convergence analysis, we let r̃_0 = B r_0. Then one can easily verify that

    V_k = span{r̃_0, (BA) r̃_0, ..., (BA)^k r̃_0}.

Since the optimality

    ‖u - u_k‖_A = inf_{v ∈ V_{k-1}} ‖u - u_0 + v‖_A

still holds, and BA is symmetric in the A-inner product, we obtain the following convergence rate of PCG.

Theorem 1.13. Let A be SPD and let u_k be the k-th iterate of the PCG method with preconditioner B and initial guess u_0. Then

(11)  ‖u - u_k‖_A = inf_{p_k ∈ P_k, p_k(0)=1} ‖p_k(BA)(u - u_0)‖_A,

(12)  ‖u - u_k‖_A ≤ inf_{p_k ∈ P_k, p_k(0)=1} sup_{λ ∈ σ(BA)} |p_k(λ)| ‖u - u_0‖_A,

(13)  ‖u - u_k‖_A ≤ 2 ((√κ(BA) - 1)/(√κ(BA) + 1))^k ‖u - u_0‖_A.

The SPD matrix B is called a preconditioner. A good preconditioner should have the properties that the action of B is easy to compute and that κ(BA) is significantly smaller than κ(A). Designing a good preconditioner requires knowledge of the problem, and finding good preconditioners is a central topic of scientific computing.

For a linear iterative method Φ_B with a symmetric iterator B, the iterator can be used as a preconditioner for A, and the corresponding PCG method converges at a faster rate. Let ρ denote the convergence rate of Φ_B, i.e., ρ = ρ(I - BA). Then σ(BA) ⊂ [1 - ρ, 1 + ρ], so κ(BA) ≤ (1 + ρ)/(1 - ρ) and

    δ = (√κ(BA) - 1)/(√κ(BA) + 1) ≤ (1 - √(1 - ρ^2))/ρ < ρ.
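A small numerical check of this chain of inequalities (added; the damped Jacobi iterator and test matrix are arbitrary choices, not from the notes):

    A = gallery('poisson', 10);  n = size(A, 1);
    omega = 0.5;                                   % damping, so that rho(I - B*A) < 1
    B = omega * spdiags(1./diag(A), 0, n, n);      % damped Jacobi iterator/preconditioner
    rho   = max(abs(eig(full(speye(n) - B*A))));
    ev    = sort(real(eig(full(B*A))));            % spectrum of BA (real and positive)
    delta = (sqrt(ev(end)/ev(1)) - 1)/(sqrt(ev(end)/ev(1)) + 1);
    fprintf('rho = %.3f, delta = %.3f, (1-sqrt(1-rho^2))/rho = %.3f\n', ...
            rho, delta, (1 - sqrt(1 - rho^2))/rho);   % observe delta <= bound < rho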

A more interesting fact is that the scheme Φ_B may not be convergent at all, whereas B can always be used as a preconditioner; that is, even if ρ > 1, the PCG rate δ < 1. For example, the Jacobi method is not convergent for all SPD systems, but B = D^{-1} can always be used as a preconditioner, often known as the diagonal preconditioner. Another popular preconditioner is the incomplete Cholesky factorization.

2. GMRES

For a general matrix A, we can reformulate the equation Au = b as the least squares problem

    min_{u ∈ R^N} ‖b - Au‖,

and restrict the minimization to the affine subspace u_0 + V_k, i.e.,

(14)  min_{v ∈ u_0 + V_k} ‖b - Av‖.

Let u_k be the solution of (14) and r_k = b - A u_k. Identically to the proof for CG, we immediately get

(15)  ‖r_k‖ = min_{p ∈ P_k, p(0)=1} ‖p(A) r_0‖ ≤ min_{p ∈ P_k, p(0)=1} ‖p(A)‖ ‖r_0‖.

Unlike the case when A is SPD, the norm ‖p(A)‖ is now not easy to estimate, since the spectrum of a general matrix A includes complex numbers and is much harder to characterize. A convergence analysis for diagonalizable matrices can be found in [2].

At the algorithmic level, GMRES is also more expensive than CG: there is no recursive formula to update the approximation. One has to
(1) compute and store an orthogonal basis of V_k;
(2) solve the least squares problem (14).
The k orthogonal basis vectors can be computed by the Gram-Schmidt process, requiring O(kN) memory and operations, and the least squares problem can be solved with O(k^3) complexity. Adding one matrix-vector product with O(sN) operations, where s is the sparsity of the matrix (the average number of non-zeros per row), the total complexity of m GMRES iterations is O(smN + m^2 N + m^4). As the iteration progresses, the term m^2 N + m^4 grows quickly; in addition, mN memory is required to store the m orthogonal basis vectors. GMRES is therefore not efficient for large m.

One simple fix is restarting: choose an upper bound for m, say 20, and after that many iterations discard the previous basis and build a new one. The restart loses the optimality (15) and in turn slows down the convergence, since the space in (14) is changed. A good preconditioner is thus even more important for GMRES.

Suppose we can choose a preconditioner B such that ‖I - BA‖ = ρ < 1, and solve BAu = Bb by GMRES. The estimate (15) then leads to the convergence

    ‖B r_k‖ ≤ ρ^k ‖B r_0‖.

One can also solve ABv = b by GMRES, provided ‖I - AB‖ = ρ < 1, and set u = Bv. These two variants are called left and right preconditioning, respectively. Again, a general guiding principle for designing a good preconditioner is to find the correct inner product such that the operator A is continuous and stable, and to use the corresponding Riesz representation as the preconditioner.
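As an illustration only (this uses MATLAB's built-in gmres and ilu, not code from these notes; the nonsymmetric test matrix and all parameters are arbitrary choices), restarted GMRES with an incomplete-LU preconditioner might be called as follows:

    n = 400;
    P = gallery('poisson', 20);                 % SPD diffusion-like part
    S = spdiags(ones(n,1), 1, n, n);            % superdiagonal shift
    A = P + 0.5*(S - S');                       % nonsymmetric (convection-like) matrix
    b = ones(n, 1);
    setup.type = 'nofill';
    [L, U] = ilu(A, setup);                     % ILU(0) preconditioner
    restart = 20;                               % restart length m
    [u, flag, relres, iter] = gmres(A, b, restart, 1e-8, 20, L, U);
    fprintf('flag = %d, relative residual = %g\n', flag, relres);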

REFERENCES

[1] M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 1952.
[2] C. Kelley. Iterative Methods for Linear and Nonlinear Equations. Society for Industrial and Applied Mathematics, 1995.
[3] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7:856-869, 1986.