Lecture # 20
The Preconditioned Conjugate Gradient Method

We wish to solve

$$Ax = b \qquad (1)$$

where $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite (SPD). We think of $n$ as being VERY LARGE, say, $n = 10^6$ or $n = 10^7$. Usually, the matrix is also sparse (mostly zeros) and Cholesky factorization is not feasible.

When $A$ is SPD, solving (1) is equivalent to finding $x$ such that

$$x = \arg\min_{x \in \mathbb{R}^n} \tfrac{1}{2} x^T A x - x^T b.$$

The conjugate gradient algorithm is one way to solve this problem.

Algorithm 1 (The Conjugate Gradient Algorithm)

    $x_0$ = initial guess (usually 0).
    $p_1 = r_0 = b - A x_0$; $w = A p_1$;
    $\alpha_1 = r_0^T r_0 / (p_1^T w)$;
    $x_1 = x_0 + \alpha_1 p_1$;
    $r_1 = r_0 - \alpha_1 w$;
    $k = 1$;
    while $\|r_k\|_2 > \epsilon$ (some tolerance)
        $\beta_k = r_k^T r_k / (r_{k-1}^T r_{k-1})$;
        $p_{k+1} = r_k + \beta_k p_k$;
        $w = A p_{k+1}$;
        $\alpha_{k+1} = r_k^T r_k / (p_{k+1}^T w)$;
        $x_{k+1} = x_k + \alpha_{k+1} p_{k+1}$;
        $r_{k+1} = r_k - \alpha_{k+1} w$;
        $k = k + 1$;
    end;
    $x = x_k$;
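Algorithm 1 can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the matrix is stored densely as a list of rows (a real solver would use sparse storage), the names `matvec`, `dot`, and `conjugate_gradient` are hypothetical, the first step is folded into the loop rather than written out separately, and the `max_iter` cap is an added safeguard that is not part of the algorithm above.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Sketch of Algorithm 1 with x_0 = 0. Returns (x, iterations)."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n              # assumed safeguard, not in the notes
    x = [0.0] * n                      # x_0 = 0
    r = list(b)                        # r_0 = b - A x_0 = b
    p = list(r)                        # p_1 = r_0
    rr = dot(r, r)                     # r_k^T r_k
    iters = 0
    while math.sqrt(rr) > tol and iters < max_iter:
        w = matvec(A, p)               # w = A p_{k+1}
        alpha = rr / dot(p, w)         # alpha_{k+1}
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        rr_new = dot(r, r)
        beta = rr_new / rr             # beta for the next direction
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iters += 1
    return x, iters

# Small SPD example: the exact solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iters = conjugate_gradient(A, b)
```

Only one matrix-vector product is needed per iteration, which is what makes the method attractive when $A$ is large and sparse.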
The conjugate gradient algorithm is essentially optimal in the following sense. We note that

$$x_k = x_0 + \sum_{j=1}^{k} \alpha_j p_j = x_0 + P_k a$$

where

$$P_k = (p_1, \ldots, p_k), \qquad a = (\alpha_1, \ldots, \alpha_k)^T.$$

One can show that

$$x_k = \arg\min_{x = x_0 + P_k c} \tfrac{1}{2} x^T A x - x^T b.$$

If we let the $A$-norm be given by

$$\|w\|_A = \sqrt{w^T A w}$$

then one can show that

$$\|x_k - x\|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \|x_0 - x\|_A$$

where

$$\kappa = \kappa_2(A) = \|A^{-1}\|_2 \, \|A\|_2.$$

Thus if $A$ is well conditioned, the conjugate gradient method converges quickly. This bound is actually very pessimistic!

However, if $A$ is badly conditioned, conjugate gradient can converge slowly, so we look to transform (1) into a well conditioned problem. We write $A$ in the form of a splitting

$$A = M - N$$

where $M$ is SPD. Since $M$ is SPD, it has a Cholesky factorization, thus

$$M = R^T R.$$

If we let

$$\hat{A} = R^{-T} A R^{-1}, \qquad \hat{b} = R^{-T} b$$
then solving (1) is equivalent to solving

$$\hat{A} y = \hat{b}$$

where $y = Rx$. We hope that $\kappa_2(\hat{A}) \ll \kappa_2(A)$, and we solve (1) by performing conjugate gradient on $\hat{A}$. In that formulation, we would have

$$y_k = R x_k$$

and

$$\hat{r}_k = \hat{b} - \hat{A} y_k = R^{-T}(b - A x_k) = R^{-T} r_k.$$

A conjugate gradient step would be

$$\hat{p}_{k+1} = \hat{r}_k + \beta_k \hat{p}_k$$

so

$$\hat{p}_{k+1} = R^{-T} r_k + \beta_k \hat{p}_k, \qquad y_{k+1} = y_k + \alpha_{k+1} \hat{p}_{k+1}.$$

We prefer to update $x_k = R^{-1} y_k$. We have that

$$x_{k+1} = R^{-1} y_{k+1} = R^{-1} y_k + \alpha_{k+1} R^{-1} \hat{p}_{k+1}.$$

Thus, writing $p_{k+1} = R^{-1} \hat{p}_{k+1}$,

$$x_{k+1} = x_k + \alpha_{k+1} p_{k+1}, \qquad p_{k+1} = R^{-1} R^{-T} r_k + \beta_k p_k.$$

Since

$$M^{-1} = R^{-1} R^{-T},$$

if we let

$$z_k = M^{-1} r_k$$

then

$$p_{k+1} = z_k + \beta_k p_k$$

is the update to the descent direction. The formulas for $\beta_k$ and $\alpha_{k+1}$ update to

$$\beta_k = (r_k^T z_k) / (r_{k-1}^T z_{k-1}), \qquad \alpha_{k+1} = (r_k^T z_k) / (p_{k+1}^T A p_{k+1}).$$

These modifications lead to the preconditioned conjugate gradient method.
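The updated coefficient formulas come from substituting the transformed quantities into the unpreconditioned ones. A short derivation, assuming the relation $\hat{p}_k = R p_k$ between the two sets of search directions and using $\hat{r}_k = R^{-T} r_k$, $\hat{A} = R^{-T} A R^{-1}$, and $z_k = M^{-1} r_k$:

```latex
\begin{aligned}
\hat{r}_k^T \hat{r}_k
  &= (R^{-T} r_k)^T (R^{-T} r_k)
   = r_k^T R^{-1} R^{-T} r_k
   = r_k^T M^{-1} r_k
   = r_k^T z_k, \\
\hat{p}_{k+1}^T \hat{A}\, \hat{p}_{k+1}
  &= (R p_{k+1})^T R^{-T} A R^{-1} (R p_{k+1})
   = p_{k+1}^T A\, p_{k+1}.
\end{aligned}
```

Taking the ratio of the first identity at steps $k$ and $k-1$ gives $\beta_k$, and dividing the first identity by the second gives $\alpha_{k+1}$. Note that $R$ itself never appears in the final formulas; only solves with $M$ are needed.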
Algorithm 2 (The Preconditioned Conjugate Gradient Algorithm)

    $x_0$ = initial guess (usually 0).
    $r_0 = b - A x_0$;
    Solve $M z_0 = r_0$;
    $p_1 = z_0$; $w = A p_1$;
    $\alpha_1 = r_0^T z_0 / (p_1^T w)$;
    $x_1 = x_0 + \alpha_1 p_1$;
    $r_1 = r_0 - \alpha_1 w$;
    $k = 1$;
    while $\|r_k\|_2 > \epsilon$ (some tolerance)
        Solve $M z_k = r_k$;
        $\beta_k = r_k^T z_k / (r_{k-1}^T z_{k-1})$;
        $p_{k+1} = z_k + \beta_k p_k$;
        $w = A p_{k+1}$;
        $\alpha_{k+1} = r_k^T z_k / (p_{k+1}^T w)$;
        $x_{k+1} = x_k + \alpha_{k+1} p_{k+1}$;
        $r_{k+1} = r_k - \alpha_{k+1} w$;
        $k = k + 1$;
    end;
    $x = x_k$;

So far, I have ignored the question of how to choose $M$. It is a very important question. Although this version of the conjugate gradient method dates from Concus, Golub, and O'Leary (1976), people still write papers on how to choose preconditioners (including your instructor!).

Here are some straightforward choices.

The Jacobi preconditioner. Choose

$$M = D = \mathrm{diag}(a_{11}, a_{22}, \ldots, a_{nn}).$$

The resulting $\hat{A}$ is

$$\hat{A} = D^{-1/2} A D^{-1/2}$$

which is the same as diagonally scaling $A$ to produce a matrix $\hat{A}$ with ones on the diagonal.
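As an illustration, Algorithm 2 with the Jacobi preconditioner might be sketched in pure Python as follows. The assumptions are the same kind as before: dense list-of-rows storage, hypothetical names (`pcg`, `jacobi_solver`, `solve_M`), the first step folded into the loop, and an added `max_iter` cap. The preconditioner is passed as a function that returns the solution $z$ of $Mz = r$, so other choices of $M$ can be swapped in.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def pcg(A, b, solve_M, tol=1e-10, max_iter=1000):
    """Sketch of Algorithm 2 with x_0 = 0; solve_M(r) returns z with M z = r."""
    n = len(b)
    x = [0.0] * n                      # x_0 = 0
    r = list(b)                        # r_0 = b
    z = solve_M(r)                     # solve M z_0 = r_0
    p = list(z)                        # p_1 = z_0
    rz = dot(r, z)                     # r_k^T z_k
    iters = 0
    while math.sqrt(dot(r, r)) > tol and iters < max_iter:
        w = matvec(A, p)               # w = A p_{k+1}
        alpha = rz / dot(p, w)         # alpha_{k+1}
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        z = solve_M(r)                 # solve M z_k = r_k
        rz_new = dot(r, z)
        beta = rz_new / rz             # beta for the next direction
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
        iters += 1
    return x, iters

def jacobi_solver(A):
    """Jacobi preconditioner M = diag(A): solving M z = r is a division."""
    d = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, d)]

# Badly scaled SPD example whose exact solution is (1, 1, 1).
A = [[100.0, 1.0, 0.0],
     [1.0, 10.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [101.0, 12.0, 2.0]
x, iters = pcg(A, b, jacobi_solver(A))
```

Passing `lambda r: list(r)` as `solve_M` (that is, $M = I$) reduces this sketch to the unpreconditioned method, which gives a simple way to compare iteration counts on a given matrix.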
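Block versions of the Jacobi preconditioner are also possible. That is, suppose we write $A$ in the form

$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1s} \\ A_{21} & A_{22} & \cdots & A_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ A_{s1} & A_{s2} & \cdots & A_{ss} \end{pmatrix}$$

where each $A_{ij}$ is a $q \times q$ matrix. A block Jacobi preconditioner is the matrix

$$M = D = \begin{pmatrix} A_{11} & 0 & \cdots & 0 \\ 0 & A_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & A_{ss} \end{pmatrix}.$$

This matrix $M$ is symmetric and positive definite. The solution of

$$Mz = r$$

solves

$$A_{kk} z_k = r_k, \qquad k = 1, \ldots, s,$$

where

$$z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_s \end{pmatrix}, \qquad r = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_s \end{pmatrix}.$$

The SSOR preconditioner. Write $A = L + D + L^T$, where $D$ is the diagonal of $A$ and $L$ is its strictly lower triangular part. For $\omega \in (0, 2)$ choose

$$M = [\omega(2 - \omega)]^{-1} (D + \omega L) D^{-1} (D + \omega L^T).$$

The value of $\omega$ matters little so we usually choose $\omega = 1$, which yields

$$M = (D + L) D^{-1} (D + L^T).$$

In this case,

$$N = L D^{-1} L^T.$$

The work here is just a back and forward solve.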
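The "back and forward solve" for the SSOR preconditioner with $\omega = 1$ can be sketched as follows. This is a minimal pure-Python illustration with dense storage; the function names `ssor_solve` and `ssor_apply` are hypothetical, and $L$ is taken to be the strictly lower triangular part of $A$. Solving $Mz = r$ with $M = (D + L) D^{-1} (D + L^T)$ splits into a forward solve, a diagonal scaling, and a back solve, so $M$ is never formed.

```python
def ssor_solve(A, r):
    """Solve M z = r for M = (D + L) D^{-1} (D + L^T), reading D and L from A."""
    n = len(r)
    # Forward solve: (D + L) u = r.
    u = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * u[j] for j in range(i))
        u[i] = (r[i] - s) / A[i][i]
    # Diagonal scaling: v = D u.
    v = [A[i][i] * u[i] for i in range(n)]
    # Back solve: (D + L^T) z = v; by symmetry L^T is the strict upper part of A.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (v[i] - s) / A[i][i]
    return z

def ssor_apply(A, z):
    """Compute M z directly; used here only to check ssor_solve."""
    n = len(z)
    t = [sum(A[i][j] * z[j] for j in range(i, n)) for i in range(n)]   # (D + L^T) z
    t = [t[i] / A[i][i] for i in range(n)]                             # D^{-1} t
    return [sum(A[i][j] * t[j] for j in range(i + 1)) for i in range(n)]  # (D + L) t

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
r = [1.0, 2.0, 3.0]
z = ssor_solve(A, r)
check = ssor_apply(A, z)   # should reproduce r up to rounding
```

Each triangular solve touches every nonzero of $A$ once, so for a sparse matrix one application of this preconditioner costs about as much as one matrix-vector product.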
Incomplete Cholesky preconditioner. Do Cholesky, but ignore fill elements. If $A$ is large and sparse, then in the Cholesky factorization

$$A = R^T R \qquad (2)$$

the matrix $R$ will often have many more nonzeros than $A$. This is one of the reasons that conjugate gradient is cheaper than Cholesky in some instances.

First, let us write a componentwise version of the Cholesky algorithm to compute (2).

    for $k = 1 : n-1$
        $r_{kk} = \sqrt{a_{kk}}$;
        for $j = k+1 : n$
            $r_{kj} = a_{kj} / r_{kk}$;
        end;
        for $i = k+1 : n$
            for $j = i : n$  % Upper triangle only!
                $a_{ij} = a_{ij} - r_{ki} r_{kj}$;
            end;
        end;
    end;
    $r_{nn} = \sqrt{a_{nn}}$;

Incomplete Cholesky ignores the fill elements. We get

$$M = \hat{R}^T \hat{R}, \qquad \hat{R} = (\hat{r}_{kj})$$

from the following algorithm. Here $\hat{r}_{kj} = 0$ if $a_{kj} = 0$.

Algorithm 3 (Incomplete Cholesky Factorization)

    for $k = 1 : n-1$
        $\hat{r}_{kk} = \sqrt{a_{kk}}$;
        for $j = k+1 : n$
            $\hat{r}_{kj} = a_{kj} / \hat{r}_{kk}$;
        end;
        for $i = k+1 : n$
            for $j = i : n$  % Upper triangle only!
                if $a_{ij} \neq 0$
                    $a_{ij} = a_{ij} - \hat{r}_{ki} \hat{r}_{kj}$;
                end;
            end;
        end;
    end;
    $\hat{r}_{nn} = \sqrt{a_{nn}}$;

One downside to this algorithm is that even if $A$ is SPD, it is possible that $a_{kk}$ could be negative or zero when it is time for $\hat{r}_{kk}$ to be evaluated at the beginning of the main loop. Thus, unlike the Jacobi and SSOR preconditioners, the incomplete Cholesky preconditioner is not defined for all SPD matrices!

However, if, in addition, $A$ is irreducibly diagonally dominant, that is,

$$a_{jj} \ge \sum_{i \ne j} |a_{ij}|, \qquad j = 1, \ldots, n,$$

with at least one inequality strict and no decoupling, that is, no permutation $P$ such that

$$P A P^T = \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix},$$

then incomplete Cholesky is guaranteed to exist.

Incomplete Cholesky is a popular preconditioner in many environments. Much has been written about it and continues to be written about it. The many variants of it consist of algorithms for specifying which fill elements to accept and which to reject. They can become quite sophisticated.
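A minimal pure-Python sketch of Algorithm 3 follows. Several choices here are assumptions made for illustration: the matrix is stored densely (a practical code would work on a sparse structure), the name `incomplete_cholesky` is hypothetical, the fill test uses the sparsity pattern of the original $A$ rather than the overwritten entries, and a nonpositive pivot raises an error instead of continuing, since the factorization is then undefined.

```python
import math

def incomplete_cholesky(A):
    """Zero-fill incomplete Cholesky: returns upper triangular R_hat with
    M = R_hat^T R_hat, keeping nonzeros only where A itself is nonzero."""
    n = len(A)
    a = [row[:] for row in A]              # working copy; upper triangle is updated
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):                     # the last pass only takes the square root
        if a[k][k] <= 0.0:
            raise ValueError("nonpositive pivot: incomplete Cholesky breaks down")
        R[k][k] = math.sqrt(a[k][k])
        for j in range(k + 1, n):
            R[k][j] = a[k][j] / R[k][k]    # stays 0 wherever a[k][j] is 0
        for i in range(k + 1, n):
            for j in range(i, n):          # upper triangle only
                if A[i][j] != 0.0:         # ignore fill outside the pattern of A
                    a[i][j] -= R[k][i] * R[k][j]
    return R

def rt_r(R):
    """Form R^T R densely; used here only to inspect the result."""
    n = len(R)
    return [[sum(R[k][i] * R[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Tridiagonal matrix: exact Cholesky has no fill, so the two factorizations agree.
T = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
R_T = incomplete_cholesky(T)

# "Arrow" matrix: exact Cholesky would fill position (2, 3); here it stays zero.
W = [[4.0, 1.0, 1.0], [1.0, 4.0, 0.0], [1.0, 0.0, 4.0]]
R_W = incomplete_cholesky(W)
```

For the arrow matrix, $\hat{R}^T \hat{R}$ matches $W$ on the nonzero pattern of $W$ but differs from it in the position where fill was discarded, which is the sense in which $M$ only approximates $A$.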