TMA 4180 Optimeringsteori THE CONJUGATE GRADIENT METHOD
H. E. Krogstad, IMF

1 INTRODUCTION

This note summarizes the main points in the numerical analysis of the Conjugate Gradient (CG) method. Most of the material is also found in N&W, but the note contains the proof of Eqn. 5.36, which is the main convergence result for CG. The numerical experiments reported in the lecture are also included.

2 DERIVATION

The Conjugate Gradient method is derived for the quadratic test problem

    \min_{x \in \mathbb{R}^n} \phi(x),                                        (1)

where $\phi(x) = \tfrac{1}{2} x^T A x - b^T x$ and $A \in \mathbb{R}^{n \times n}$ is positive definite, $A > 0$. The method was originally considered to be a direct method for linear equations, but its favorable properties as an iterative method were soon realized, and it was later generalized to more general optimization problems.

A key point in the derivation is clever use of the A-scalar product, $\langle x, y \rangle_A = x^T A y$, and the corresponding norm $\|x\|_A = (x^T A x)^{1/2}$ (see the Basic Tools note). We already know that $\nabla\phi(x) = g = Ax - b$ and $\nabla^2\phi = A$, so that

    x^* = \arg\min_{x \in \mathbb{R}^n} \phi(x) = A^{-1} b.                   (2)

The first observation is that a set of $n$ A-orthogonal vectors $p_0, p_1, \ldots, p_{n-1}$, that is,

    p_i^T A p_j = p_i^T A p_i \, \delta_{ij},                                 (3)

is a basis for $\mathbb{R}^n$ (recall your basic linear algebra course for the terminology!). This property can be stated in several equivalent ways:

1. The $n$ vectors $\{p_j\}$ are linearly independent.
2. $\mathrm{span}\{p_0, p_1, \ldots, p_{n-1}\} = \mathbb{R}^n$.
3. Every $x \in \mathbb{R}^n$ can be written uniquely as $x = \sum_{j=0}^{n-1} \sigma_j p_j$.

For the derivation of the CG method, let us assume that we already have a basis $\{p_0, p_1, \ldots, p_{n-1}\}$, that we start at $x_0 = 0$, and let $p_0 = -g_0 = b$ (the starting point is easily adjusted later). The A-orthogonality enables us to write $x^* = \sum_{j=0}^{n-1} \sigma_j p_j$, where

    \sigma_k = \frac{p_k^T A x^*}{p_k^T A p_k} = \frac{p_k^T b}{p_k^T A p_k} = -\frac{p_k^T g_0}{p_k^T A p_k}.      (4)
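A small Matlab sanity check of Eqn. 4 may be illustrative. The sketch below is not from the note: the matrix size is an arbitrary choice, and the eigenvectors of $A$ are used as a convenient A-orthogonal set (they are orthonormal and satisfy $A p_k = \lambda_k p_k$):

    % Sketch: the coefficients (4), computed from b alone, rebuild x* = A\b.
    n = 6;                                % arbitrary size for the check
    R = randn(n);  A = R'*R + n*eye(n);   % a positive definite matrix
    b = randn(n,1);
    [P,~] = eig(A);                       % columns are A-orthogonal directions p_k
    x = zeros(n,1);
    for k = 1:n
        pk    = P(:,k);
        sigma = (pk'*b)/(pk'*A*pk);       % Eqn. (4): x* itself is never used
        x     = x + sigma*pk;
    end
    disp(norm(x - A\b))                   % should be at round-off level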
It is interesting that we can compute the coefficients in the expansion for $x^*$ without knowing $x^*$ itself!

Below, the subspaces

    B_k = \mathrm{span}\{p_0, \ldots, p_{k-1}\}, \quad k = 1, \ldots, n,       (5)

will play an important role. Observe that

    B_1 \subset B_2 \subset \cdots \subset B_n = \mathbb{R}^n.                 (6)

It is a general property of expansions in orthogonal bases that partial sums are the best approximations in the corresponding subspaces spanned by the basis vectors. This means that

    x_k = \sum_{j=0}^{k-1} \sigma_j p_j = \arg\min_{y \in B_k} \|y - x^*\|_A.  (7)

As long as we look around for a good approximation to $x^*$ in $B_k$, $x_k$ is the best choice (in the A-norm)! In the present case, $x_k$ is also best in another sense, namely, it solves the optimization problem

    x_k = \arg\min_{y \in B_k} \phi(y).                                        (8)

Show this yourself by deriving that $\|y - x^*\|_A^2 = 2\phi(y) + b^T A^{-1} b$ (expand the square and use $A x^* = b$).

Equation 8 represents a constrained problem (minimum constrained to $B_k$). The feasible directions $d$ from $x_k$ are all vectors in $B_k$, and since $\nabla\phi(x_k)^T d \ge 0$ for every such $d$, while $-d \in B_k$ whenever $d$ is,

    \nabla\phi(x_k)^T d = g_k^T d = 0 \quad \text{for all } d \in B_k.         (9)

This could equally well be derived from the famous Projection Theorem, since the error $x^* - x_k$ is A-orthogonal to $B_k$:

    0 = (x^* - x_k)^T A d = (b - A x_k)^T d = -g_k^T d \quad \text{for all } d \in B_k.   (10)

The only remaining problem is to actually find $\{p_j\}_{j=0}^{n-1}$, and the standard way would be to do this by a Gram-Schmidt procedure, if we had a set of vectors to start with. If we know $p_0, p_1, \ldots, p_{k-1}$, it is tempting to try

    p_k = -g_k + \beta_k p_{k-1} + \tilde\beta_{k-2} p_{k-2} + \cdots + \tilde\beta_0 p_0,   (11)

since we already know from above that $g_k$ is orthogonal to, and hence linearly independent of, $\{p_0, p_1, \ldots, p_{k-1}\}$. The surprising result is now that it suffices to use

    p_k = -g_k + \beta_k p_{k-1},                                              (12)

and then determine $\beta_k$ so that $p_k^T A p_{k-1} = 0$,

    \beta_k = \frac{p_{k-1}^T A g_k}{p_{k-1}^T A p_{k-1}}.                     (13)

In order to prove that this is really true, we recall that $B_i \subset B_{i+1}$.
The argument uses induction, and the start is easy: Starting with $p_0$, the vector $p_1 = -g_1 + \beta_1 p_0$ can be made A-orthogonal to $p_0$ by choosing $\beta_1$ as in Eqn. 13.

One then needs to verify that $\{p_0, p_1, \ldots, p_{k-1}, p_k\}$ is an A-orthogonal set if the vectors $\{p_0, p_1, \ldots, p_{k-1}\}$ are A-orthogonal. We already know that $p_k^T A p_{k-1} = 0$, but what about $p_k$ and $p_0, p_1, \ldots, p_{k-2}$? Consider $p_k$ and $p_i$ for $i \le k-2$:

    p_i^T A p_k = p_i^T A (-g_k + \beta_k p_{k-1}) = -p_i^T A g_k,             (14)

since we know $p_i^T A p_{k-1} = 0$ by the induction hypothesis. The trick is now to write

    x_{i+1} = x_i + \alpha_i p_i,                                              (15)

multiply by $A$ and subtract $b$ on both sides, so that

    g_{i+1} = g_i + \alpha_i A p_i.                                            (16)

In general, $g_i \in B_{i+1} \subset B_{i+2} \subset \cdots \subset B_k$ and $g_{i+1} \in B_{i+2}$, and then from Eqn. 16, $A p_i = (g_{i+1} - g_i)/\alpha_i \in B_k$. But $g_k$ is orthogonal to $B_k$, so $(A p_i)^T g_k = 0$ for all $i \le k-2$!

Before stating the algorithm, let us consider the modifications when we start at an arbitrary $x_0 \ne 0$. The series expansion for $x^*$ will then be

    x^* = x_0 + \sum_{j=0}^{n-1} \sigma_j p_j,                                 (17)

and exactly as above,

    \sigma_k = \frac{p_k^T A (x^* - x_0)}{p_k^T A p_k} = -\frac{p_k^T (A x_0 - b)}{p_k^T A p_k} = -\frac{p_k^T g_0}{p_k^T A p_k}.    (18)

The CG algorithm may then be written:

    Given x_0
    Define p_0 = -g_0 = -(A x_0 - b)
    for k = 0 : k_max
        alpha_k  = -p_k^T g_k / (p_k^T A p_k)
        x_{k+1}  = x_k + alpha_k p_k
        g_{k+1}  = g_k + alpha_k A p_k
        beta_{k+1} = p_k^T A g_{k+1} / (p_k^T A p_k)
        p_{k+1}  = -g_{k+1} + beta_{k+1} p_k
    end

The expression for $\beta_{k+1}$ may be simplified in several ways, e.g. by utilizing that the gradients are mutually orthogonal (in the ordinary inner product):

    \beta_{k+1} = \frac{p_k^T A g_{k+1}}{p_k^T A p_k} = \frac{g_{k+1}^T (g_{k+1} - g_k)}{p_k^T (g_{k+1} - g_k)} = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}.   (19)
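A direct Matlab transcription of this algorithm could look as follows. This is my own sketch, not code from the note; the function name is arbitrary, and the residual-based stopping test is an addition (the loop above simply runs k_max steps):

    function x = cg_quadratic(A, b, x0, kmax, tol)
    % Conjugate Gradient for phi(x) = 0.5*x'*A*x - b'*x with A symmetric
    % positive definite, following the algorithm in the text.
    x = x0;
    g = A*x - b;                      % g_0
    p = -g;                           % p_0
    for k = 1:kmax
        Ap    = A*p;                  % the only matrix-vector product per step
        alpha = -(p'*g)/(p'*Ap);
        x     = x + alpha*p;
        g     = g + alpha*Ap;         % g_{k+1} = g_k + alpha_k*A*p_k
        if norm(g) < tol, break; end  % extra convenience, not in the note
        beta  = (g'*Ap)/(p'*Ap);      % by Eqn. 19 this equals g_{k+1}'g_{k+1}/(g_k'g_k)
        p     = -g + beta*p;
    end
    end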
3 ERROR ANALYSIS

We know that $\{x_k\}$ is a sequence of best approximations, but how good is it really? It turns out that there exists a fairly complete theory for the error analysis, which involves some interesting mathematics.

Let us start at a general $x_0$ with the first search direction $p_0 = -g_0 = -(A x_0 - b)$. Then, from Eqns. 16 and 12,

    p_1 = -g_1 + \beta_1 p_0 = -(g_0 - \alpha_0 A g_0) - \beta_1 g_0 \in \mathrm{span}\{g_0, A g_0\}.    (20)

Continuing in the same way, it is easy to prove that

    p_k \in \mathrm{span}\{g_0, A g_0, A^2 g_0, \ldots, A^k g_0\}.             (21)

Hence, $B_k$ could as well be defined as

    B_k = \mathrm{span}\{g_0, A g_0, A^2 g_0, \ldots, A^{k-1} g_0\},           (22)

and

    x_k - x_0 \in \mathrm{span}\{g_0, A g_0, A^2 g_0, \ldots, A^{k-1} g_0\}.   (23)

The sequence

    g_0, A g_0, A^2 g_0, \ldots                                                (24)

is called a Krylov sequence, and $\{B_k\}$ the Krylov subspaces, after Alexei Nikolaevich Krylov (1863-1945), famous Russian naval architect and applied mathematician. The group of algorithms based on Krylov sequences and subspaces (of which CG is the main one) belongs to the 20th Century Top Ten Algorithms.

Because of (23), we can write

    x_k - x_0 = \sum_{j=0}^{k-1} \gamma_j A^j g_0 = \Big( \sum_{j=0}^{k-1} \gamma_j A^j \Big) g_0 = P_k(A)\, g_0,     (25)

where $P_k(A)$ is a matrix polynomial of degree $k-1$. Furthermore,

    x_k - x^* = P_k(A) g_0 + x_0 - x^* = P_k(A)(A x_0 - A x^*) + x_0 - x^*
              = (P_k(A) A + I)(x_0 - x^*) = Q_k(A)(x_0 - x^*),                 (26)

where $Q_k$ is a matrix polynomial of degree $k$ with $Q_k(0) = 1$.

Like we did for the analysis of the Steepest Descent method, we observe that $A$ has eigenvalues $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ and a corresponding set of orthogonal, normalized eigenvectors $e_1, e_2, \ldots, e_n$. Recall that for $y = \sum_{j=1}^n t_j e_j$,

    \|y\|_A^2 = y^T A y = \sum_{j=1}^n \lambda_j t_j^2.                        (27)
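The next step also uses that a matrix polynomial acts on an eigenvector of $A$ through its scalar value; for completeness, writing $Q_k(\lambda) = \sum_{m=0}^{k} c_m \lambda^m$,

    Q_k(A) e_j = \sum_{m=0}^{k} c_m A^m e_j = \sum_{m=0}^{k} c_m \lambda_j^m e_j = Q_k(\lambda_j)\, e_j.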
Similarly, noting that also $Q_k(A) e_j = Q_k(\lambda_j) e_j$,

    \|Q_k(A) y\|_A^2 = y^T Q_k(A) A Q_k(A) y = \sum_{j=1}^n Q_k^2(\lambda_j) \lambda_j t_j^2.    (28)

If we apply Eqn. 28 to Eqn. 26 with $x_0 - x^* = \sum_{j=1}^n \tau_j e_j$, we obtain

    \|x_k - x^*\|_A^2 = \sum_{j=1}^n Q_k^2(\lambda_j) \lambda_j \tau_j^2.      (29)

This is an exact expression, and since the norm is as small as possible, it will be impossible to decrease it by choosing a different polynomial. Thus, the polynomial $Q_k$ is the optimal solution to the problem

    Q_k = \arg\min_{\deg(Q)=k,\ Q(0)=1} \sum_{j=1}^n Q^2(\lambda_j) \lambda_j \tau_j^2.          (30)

The expression is not particularly tractable, since the optimal $Q_k$ will vary for each $x_0$, but we can also write

    \|x_k - x^*\|_A^2 \le \max_j Q_k^2(\lambda_j) \sum_{j=1}^n \lambda_j \tau_j^2 = \max_j Q_k^2(\lambda_j)\, \|x_0 - x^*\|_A^2,   (31)

which shows that the relative error of $x_k$ is bounded by $\max_j |Q_k(\lambda_j)|$:

    \frac{\|x_k - x^*\|_A}{\|x_0 - x^*\|_A} \le \max_j |Q_k(\lambda_j)|.       (32)

The quite general error estimate in Eqn. 32 has some surprising implications. We are free to put in any polynomial we like, as long as it has degree $k$ and $Q(0) = 1$. Thus, if $A$ happens to have only $k$ different eigenvalues, there exists a $k$-th order polynomial having those eigenvalues as zeros and, by a suitable scaling, we can also obtain $Q(0) = 1$. Then $\|x_k - x^*\|_A = 0$, and the method converges to $x^*$ in $k$ iterations!

The following result is somewhat deeper. It considers the case where we have an idea about the extreme eigenvalues of $A$, $\lambda_1$ and $\lambda_n$, but not about the eigenvalues in between. It is then reasonable to ask how small the right-hand side of Eqn. 32 could be, since we clearly have

    \|x_k - x^*\|_A \le \|x_0 - x^*\|_A \; \min_{Q \text{ of order } k,\ Q(0)=1} \; \max_{\lambda_1 \le \lambda \le \lambda_n} |Q(\lambda)|.    (33)

This turns out to be a classic so-called min-max problem in the approximation of functions, and the solution is given by

    Q_k(\lambda) = T_k\!\Big(\frac{\lambda_n + \lambda_1 - 2\lambda}{\lambda_n - \lambda_1}\Big) \Big/ T_k\!\Big(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\Big),    (34)

where $T_k$ is the Chebyshev polynomial of degree $k$. Look up plots and properties of the Chebyshev polynomials. (This is the spelling accepted by MathWorld. When I was a student, we were told that there are 52 different spellings of the name Chebyshev, but MathWorld cites 4.)
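A quick way to inspect the optimal polynomial (34) in Matlab is sketched below; this is not code from the note. Here $T_k$ is evaluated as $\cos(k \arccos x)$ for $|x| \le 1$ and $\cosh(k\,\mathrm{arccosh}\, x)$ for $x \ge 1$, and the interval $[0.3, 2]$ and the degrees anticipate the example and Figure 1 below:

    lambda1 = 0.3; lambdan = 2;                    % interval used in the example below
    lam = linspace(lambda1, lambdan, 400);
    Tk  = @(k,x) (abs(x)<=1).*cos(k*acos(min(max(x,-1),1))) ...
                + (x>1).*cosh(k*acosh(max(x,1)));
    for k = [2 5 10]
        arg = (lambdan + lambda1 - 2*lam)/(lambdan - lambda1);
        Qk  = Tk(k,arg) ./ Tk(k,(lambdan + lambda1)/(lambdan - lambda1));
        plot(lam, Qk); hold on                     % Q_k from Eqn. (34)
    end
    hold off; xlabel('\lambda'); ylabel('Q_k(\lambda)'); legend('k=2','k=5','k=10')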
Since $\max_{|x| \le 1} |T_k(x)| = 1$, inequality 32 now leads to

    \frac{\|x_k - x^*\|_A}{\|x_0 - x^*\|_A} \le \Big[ T_k\!\Big(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\Big) \Big]^{-1},    (35)

and

    \frac{\|x_k - x^*\|_A}{\|x_0 - x^*\|_A} \le \Big[ T_k\!\Big(\frac{\kappa + 1}{\kappa - 1}\Big) \Big]^{-1},    (36)

where $\kappa = \lambda_n / \lambda_1$ is the condition number of $A$.

The Chebyshev polynomials satisfy a lot of relations, and one definition (not included in the MathWorld Encyclopedia) is simply

    T_k(x) = \frac{1}{2}\Big[ \big(x + \sqrt{x^2 - 1}\big)^k + \big(x - \sqrt{x^2 - 1}\big)^k \Big], \quad |x| \ge 1.    (37)

With $x = (\kappa + 1)/(\kappa - 1)$, we obtain

    x + \sqrt{x^2 - 1} = \frac{(\sqrt{\kappa} + 1)^2}{\kappa - 1} = \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1},
    \qquad
    x - \sqrt{x^2 - 1} = \frac{(\sqrt{\kappa} - 1)^2}{\kappa - 1} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1},    (38)

and

    T_k\!\Big(\frac{\kappa + 1}{\kappa - 1}\Big) = \frac{1}{2}\Big[ \Big(\frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}\Big)^k + \Big(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\Big)^k \Big] > \frac{1}{2}\Big(\frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1}\Big)^k,    (39)

so that

    \Big[ T_k\!\Big(\frac{\kappa + 1}{\kappa - 1}\Big) \Big]^{-1} < 2 \Big(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\Big)^k    (40)

when $\kappa > 1$. This leads finally to Eqn. 5.36 in N&W, 2nd Ed. (Note: the corresponding equation in N&W, 1st Ed. states the result with the wrong power, the wrong constant, and the wrong definition of $\kappa$!):

    \|x_k - x^*\|_A \le 2 \Big(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\Big)^k \|x_0 - x^*\|_A.    (41)

Figure 1 shows the polynomials $Q_k$ for some $k$'s when $\lambda_1 = 0.3$ and $\lambda_n = 2$. The maximum deviation for $Q_{10}$ in the interval $[0.3, 2]$ is only $5.6 \times 10^{-4}$, not bad! Thus, if we have a system of equations and know that the coefficient matrix has eigenvalues in the interval $[0.3, 2]$, then after 10 CG iterations,

    \|x_{10} - x^*\|_A \le 5.6 \times 10^{-4} \, \|x_0 - x^*\|_A,    (42)

regardless of the size of the system!

Penalty and barrier methods (N&W, Chapter 17) always lead to Hessians which have a few very large eigenvalues (actually tending to infinity as the iteration proceeds), whereas the rest are clustered in a fixed range.
Figure 1: The optimal Chebyshev polynomials for the interval [0.3, 2] when k = 2, 5 and 10.

Theorem 5.5 in N&W has some quite important implications for this case:

    \|x_{k+1} - x^*\|_A \le \frac{\lambda_{n-k} - \lambda_1}{\lambda_{n-k} + \lambda_1} \, \|x_0 - x^*\|_A.    (43)

The bound is independent of the size of $\lambda_{n-k+1}, \ldots, \lambda_n$. It is easy to derive the bound from 32 by constructing a $(k+1)$-th order polynomial with zeros at $(\lambda_1 + \lambda_{n-k})/2$ and at $\lambda_{n-k+1}, \ldots, \lambda_n$, when $\lambda_{n-k+1}, \ldots, \lambda_n$ are unique (try it!).

We recognize (43) as the Steepest Descent error bound for a matrix with extreme eigenvalues $\lambda_1$ and $\lambda_{n-k}$, and this is a key observation: By restarting the CG method every $k+1$ iterations, we obtain a convergence rate similar to the SD method, but with a greatly reduced condition number (counting the $k+1$ CG steps as one iteration). The convergence behavior with several clustered groups of eigenvalues is illustrated in N&W.

Another good reference for the Conjugate Gradient method is the book by G. H. Golub and C. F. Van Loan: Matrix Computations, Johns Hopkins University Press (an excellent general reference book in numerical linear algebra!).

4 SOME NUMERICAL EXPERIMENTS

It is easy to experiment with the CG method using Matlab. In the following code we first generate a positive definite matrix $R^T R$, and then the matrix $A = (R^T R)^{npot}$. The power npot steers the eigenvalue distribution and hence the condition number.

    ndim = 100;                 % problem size (value lost in the source; assumed)
    R = randn(ndim);
    for npot = 0.1:0.4:2.1      % range partly lost in the source; assumed
        A = (R'*R)^npot;
        lamb  = eig(A);
        kappa = max(lamb)/min(lamb);
        xsol  = rand(ndim,1);
        b     = A*xsol;
        Norm2 = sqrt(xsol'*xsol);
        NormA = sqrt(xsol'*A*xsol);
        x = zeros(size(b));
        g = A*x - b;
        p = -g;
        for loop = 1:ndim
            Ap   = A*p;               % Only one matrix-vector product!
            alfa = -(p'*g)/(p'*Ap);
            x    = x + alfa*p;
            g    = g + alfa*Ap;       % g = A*x - b;
            beta = (g'*Ap)/(p'*Ap);
            p    = -g + beta*p;
            err2(loop) = sqrt((x-xsol)'*(x-xsol))/Norm2;
            erra(loop) = sqrt((x-xsol)'*A*(x-xsol))/NormA;
        end
        subplot(1,2,1);
        plot(lamb, zeros(size(lamb)), 'o');
        Tittel = ['Eigenvalues, ndim=' num2str(ndim)];
        title(Tittel);
        subplot(1,2,2);
        semilogy(1:ndim, err2, 1:ndim, erra);
        legend('2-norm','A-norm');
        xlabel('Iteration number'); ylabel('Error');
        Tittel = ['npot=' num2str(npot) ', \kappa=' num2str(kappa)];
        title(Tittel);
        pause
    end

Figure 2: Well-conditioned matrix. Fast convergence.

4.1 Exercise

(a) Implement and plot the error bound in Eqn. 41 in the Matlab code. How does this compare with the actual decrease of the error? (N&W say "This bound often gives a large overestimate".) One possible starting point is sketched after this exercise.

(b) Modify the matrix $A$ (as generated above) so that it has $m$ (say, $m = 5$) large eigenvalues by adding a rank-$m$ matrix $L L^T$,

    A = (R^T R)^{npot} + L L^T.

Here $L$ is $n \times m$ and consists of $m$ random column vectors. Test the performance of the CG method.
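As a sketch of one way to start on part (a) of the exercise (not the only one), the bound (41) can be overlaid on the error plot using the variables kappa, erra and ndim from the script above:

    k     = 1:ndim;
    bound = 2*((sqrt(kappa)-1)/(sqrt(kappa)+1)).^k;   % Eqn. (41), relative A-norm error
    semilogy(k, erra, k, bound, '--');
    legend('A-norm error','bound (41)'); xlabel('Iteration number');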
Figure 3: Medium conditioned matrix.

Figure 4: Ill-conditioned matrix. The convergence is very slow, and rounding errors destroy the accuracy severely.