Lecture 7: Conjugate Gradient Method (CG)
Jinn-Liang Liu, 2017/6/7

The CG method is used for solving symmetric and positive definite (SPD) systems

$$A\vec x = \vec b \tag{7.1}$$

with

$$A = A^T, \tag{7.2}$$
$$\vec x^T A \vec x > 0 \quad \text{for all non-zero vectors } \vec x \in \mathbb{R}^N. \tag{7.3}$$

Algorithm CG: Conjugate Gradient Method

$\vec x^{(0)} = \vec 0$  (arbitrary)
$\vec r^{(0)} = \vec b - A\vec x^{(0)}$  (residual vector)
$\vec p^{(1)} = \vec r^{(0)}$
for $k = 1, \dots, N$  (Why $N$, not kmax?)
  (S1) $\alpha_k := \dfrac{\langle \vec p^{(k)}, \vec r^{(k-1)}\rangle}{\langle \vec p^{(k)}, A\vec p^{(k)}\rangle} = \dfrac{\langle \vec r^{(k-1)}, \vec r^{(k-1)}\rangle}{\langle \vec p^{(k)}, A\vec p^{(k)}\rangle}$  (Why?)
  (S2) $\vec x^{(k)} = \vec x^{(k-1)} + \alpha_k \vec p^{(k)}$
  (S3) $\vec r^{(k)} = \vec b - A\vec x^{(k)} = \vec r^{(k-1)} - \alpha_k A\vec p^{(k)}$  (Why?)
       check convergence; continue if necessary
  (S4) $\beta_k := -\dfrac{\langle \vec r^{(k)}, A\vec p^{(k)}\rangle}{\langle \vec p^{(k)}, A\vec p^{(k)}\rangle} = \dfrac{\langle \vec r^{(k)}, \vec r^{(k)}\rangle}{\langle \vec r^{(k-1)}, \vec r^{(k-1)}\rangle}$  (Why?)
  (S5) $\vec p^{(k+1)} = \vec r^{(k)} + \beta_k \vec p^{(k)}$
end for

In this algorithm, $\alpha_k$ is called a stepping length and $\vec p^{(k)}$ is called a search direction (or predictor, or prediction vector). We can geometrically interpret (S2) as follows: we are currently at $\vec x^{(k-1)}$ and then take a next step of length $\alpha_k$ in the direction of $\vec p^{(k)}$ to go to the point $\vec x^{(k)}$. Therefore, it is critical to determine the next $\alpha_k$ and $\vec p^{(k)}$.
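A direct transcription of Algorithm CG into code may help with Project 7.1 below. The following Python sketch (not part of the original notes; NumPy assumed, names mirror (S1)-(S5)) is one possible implementation:

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-10, maxit=None):
    """Conjugate gradient for SPD A, following steps (S1)-(S5)."""
    N = len(b)
    x = np.zeros(N) if x0 is None else x0.astype(float)
    r = b - A @ x                      # residual vector r^(0)
    p = r.copy()                       # first search direction p^(1)
    rs_old = r @ r
    for k in range(1, (maxit or N) + 1):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # (S1): <r,r>/<p,Ap>
        x = x + alpha * p              # (S2): step of length alpha along p
        r = r - alpha * Ap             # (S3): update residual without A @ x
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # check convergence
            break
        beta = rs_new / rs_old         # (S4): <r,r>/<r_old,r_old>
        p = r + beta * p               # (S5): next search direction
        rs_old = rs_new
    return x, k
```

On the matrix of Example 7.5 with the right-hand side of Example 7.6 below, `cg(np.array([[3., 2.], [2., 6.]]), np.array([2., -8.]))` returns the exact solution $(2, -2)$ in two steps, as Theorem 7.2 below guarantees.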
Example 7.1. Given $A = [\,\cdots\,]$ and $\vec b = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, find the solution of $A\vec x = \vec b$ by using the conjugate gradient method with the initial guess $\vec x^{(0)} = \begin{pmatrix} 1 \\ 1/9 \end{pmatrix}$.

Project 7.1. First, write a program for the CG algorithm and test it by Example 7.1. Second, run your CG program for the 1D Poisson problem (1.1) (with $f(x) = 2$, $g_D = 0$, and $g_N = 0$) discretized by FDM. Input: $N$, $A$, $\vec b$, tol (write the input in the program). Output: $N$, $k$, $E_{\vec x}$, $E_u$, $\alpha$.

The CG is a nonstationary iterative method, i.e., in matrix form

$$\vec x^{(k)} = B^{(k)}\vec x^{(k-1)} + \vec c^{(k)} \tag{7.4}$$

where $B^{(k)}$ and $\vec c^{(k)}$ depend on the iteration (i.e., $B$ and $\vec c$ change from iteration to iteration).

Table 7.1. Time Complexity of Various Methods

| Method    | 2D            | 3D            |
|-----------|---------------|---------------|
| GE        | $O(N^3)$      | $O(N^3)$      |
| JM        | $O(N^2)$      | $O(N^{5/3})$  |
| GS        | $O(N^2)$      | $O(N^{5/3})$  |
| SOR       | $O(N^{3/2})$  | $O(N^{4/3})$  |
| CG        | $O(N^{3/2})$  | $O(N^{4/3})$  |
| PCG       | $O(N^{1.2})$  | $O(N^{1.17})$ |
| Multigrid | $O(N\log N)$  | $O(N\log N)$  |
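Problem (1.1) is not reproduced in this lecture, so for the second part of Project 7.1 the sketch below assumes the model $-u'' = f$ on $(0,1)$ with $u(0) = g_D = 0$ and $u'(1) = g_N = 0$; the grid, stencil, and error measure are likewise assumptions to be adapted to the actual (1.1). It reuses `cg` from above; for $f(x) = 2$ the assumed model has exact solution $u(x) = x(2 - x)$, which the second-order FDM reproduces to rounding error.

```python
import numpy as np

def poisson_1d(N):
    """Assumed model: -u'' = 2 on (0,1), u(0) = g_D = 0, u'(1) = g_N = 0.
    Unknowns u_1..u_N at x_i = i*h; ghost-point treatment at the Neumann end."""
    h = 1.0 / N
    x = np.linspace(h, 1.0, N)
    A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    A[-1, -1] = 1.0                      # (-u_{N-1} + u_N)/h^2 = f/2 at x = 1
    b = 2.0 * h**2 * np.ones(N)
    b[-1] *= 0.5                         # half weight from the ghost point
    return A, b, x

A, b, x = poisson_1d(64)
u, k = cg(A, b, maxit=10 * len(b))       # allow extra steps for rounding
E_u = np.max(np.abs(u - x * (2.0 - x)))  # error vs. exact u(x) = x(2 - x)
print(f"N = {len(b)}, CG steps k = {k}, E_u = {E_u:.2e}")
```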
Remark 7.1. (1) CG can also be used to solve unconstrained optimization problems. (2) The biconjugate gradient method (BiCG) is a generalization of CG for non-symmetric systems. (3) CG with preconditioning requires about $\sqrt{N}$ steps to determine an approximate solution.

We say that two non-zero vectors $\vec u$ and $\vec v$ are conjugate with respect to $A$ if

$$\langle \vec u, \vec v\rangle_A := \vec u^T A\vec v = 0. \tag{7.5}$$

The left-hand side defines an inner product since $A$ is SPD, i.e.,

$$\langle \vec u, \vec v\rangle_A = \vec u^T A\vec v = \langle \vec u, A\vec v\rangle = \langle A^T\vec u, \vec v\rangle = \langle A\vec u, \vec v\rangle = \langle \vec v, \vec u\rangle_A,$$

where $\langle\cdot,\cdot\rangle$ is the standard inner product defined on $\mathbb{R}^N$. This conjugacy is not related to the notion of complex conjugate.

Example 7.2. Show that $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ is SPD. Let $\vec u = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\vec v = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$. Show that $\langle \vec u, \vec v\rangle = 0$, i.e., $\vec u \perp \vec v$ in the sense of $\langle\cdot,\cdot\rangle$. Does $\langle \vec u, \vec v\rangle_A = 0$? Find any two vectors $\vec x$ and $\vec y$ such that $\langle \vec x, \vec y\rangle_A = 0$, i.e., $\vec x \perp \vec y$ in the sense of $\langle\cdot,\cdot\rangle_A$.
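The conjugacy test (7.5) is easy to experiment with numerically. The sketch below uses the matrix and vectors written above for Example 7.2; some of these entries did not survive transcription, so treat the specific values as assumptions for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # SPD: eigenvalues 1 and 3
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

print(u @ v)                    # <u,v>   = 0 : u and v are orthogonal
print(u @ A @ v)                # <u,v>_A = 1 : but NOT A-orthogonal

x = np.array([1.0, 0.0])        # one A-orthogonal pair:
y = np.array([1.0, -2.0])       # x^T A y = 2*1 + 1*(-2) = 0
print(x @ A @ y)                # <x,y>_A = 0
```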
Minimization Problem: With $A$ being SPD, solve (7.1) for $\vec x^*$ via minimizing the quadratic function

$$\phi(\vec x) = \frac{1}{2}\vec x^T A\vec x - \vec x^T\vec b = \frac{1}{2}\langle A\vec x, \vec x\rangle - \langle \vec b, \vec x\rangle, \quad \vec x \in \mathbb{R}^N. \tag{7.6}$$

Example 7.3. The simplest example for (7.1) and (7.6) is given by $N = 1$, $A = 2$, $b = 1$. The solution of the linear problem $2x = 1$ is $x^* = 1/2$. Find the minimizer $x^*$ of the following quadratic function (a nonlinear problem):

$$\phi(x) = \frac{1}{2}x^T A x - x^T b = x^2 - x. \tag{7.7}$$

Example 7.4. Solve the linear problem

$$\begin{pmatrix} 1 & 0 \\ 0 & 9 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

and then solve it via minimizing the quadratic function

$$\phi(\vec x) = \frac{1}{2}\vec x^T A\vec x - \vec x^T\vec b = \frac{1}{2}\begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 9 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \end{pmatrix}. \tag{7.8}$$

1 Minimization of the Quadratic Function

As noted above, $\phi(\vec x)$ is nonlinear and $A\vec x = \vec b$ is linear. Given any vectors $\vec x$ and $\vec p$, let $h$ be a function of $\alpha$ defined by

$$h(\alpha) = \phi(\vec x + \alpha\vec p) \tag{7.9}$$
$$= \frac{1}{2}\langle A(\vec x + \alpha\vec p), \vec x + \alpha\vec p\rangle - \langle \vec b, \vec x + \alpha\vec p\rangle$$
$$= \frac{\alpha^2}{2}\langle A\vec p, \vec p\rangle + \alpha\langle A\vec p, \vec x\rangle + \frac{1}{2}\langle A\vec x, \vec x\rangle - \langle \vec b, \vec x\rangle - \alpha\langle \vec b, \vec p\rangle$$
$$= \frac{1}{2}\langle A\vec p, \vec p\rangle\,\alpha^2 + \langle A\vec x - \vec b, \vec p\rangle\,\alpha + \phi(\vec x). \tag{7.10}$$

Theorem 7.1. $\vec x^*$ minimizes $\phi(\vec x)$ $\iff$ $A\vec x^* = \vec b$. (7.11)

Proof: ($\Longrightarrow$) For any vector $\vec p$, define $h(\alpha) = \phi(\vec x^* + \alpha\vec p)$. Since $\phi(\vec x^*)$ is the minimum, the function $h(\alpha)$ attains its minimum at $\alpha = 0$
with the minimum value $h(0) = \phi(\vec x^*)$. This implies that

$$0 = \frac{dh(0)}{d\alpha} = \lim_{\alpha\to 0}\frac{h(\alpha) - h(0)}{\alpha} = \lim_{\alpha\to 0}\frac{\phi(\vec x^* + \alpha\vec p) - \phi(\vec x^*)}{\alpha}$$
$$= \lim_{\alpha\to 0}\frac{\frac{1}{2}\langle A\vec p, \vec p\rangle\alpha^2 + \langle A\vec x^* - \vec b, \vec p\rangle\alpha + \phi(\vec x^*) - \phi(\vec x^*)}{\alpha}$$
$$= \lim_{\alpha\to 0}\left(\frac{1}{2}\langle A\vec p, \vec p\rangle\alpha + \langle A\vec x^* - \vec b, \vec p\rangle\right) = \langle A\vec x^* - \vec b, \vec p\rangle \quad \forall\,\vec p$$
$$\Longrightarrow A\vec x^* = \vec b.$$

($\Longleftarrow$) $A\vec x^* = \vec b$ implies

$$\phi(\vec x^* + \vec p) = \phi(\vec x^*) + \langle A\vec x^* - \vec b, \vec p\rangle + \frac{1}{2}\langle A\vec p, \vec p\rangle \ge \phi(\vec x^*) \quad \forall\,\vec p,$$

so $\phi(\vec x^*)$ is the minimum. $\square$

As a by-product of the proof, we see that, for any given vectors $\vec x$ and $\vec p \ne \vec 0$,

$$0 = \frac{d}{d\alpha}\phi(\vec x + \alpha\vec p) = h'(\alpha) = \langle A\vec p, \vec p\rangle\,\alpha + \langle A\vec x - \vec b, \vec p\rangle \;\Longrightarrow\; \alpha = -\frac{\langle A\vec x - \vec b, \vec p\rangle}{\langle A\vec p, \vec p\rangle} = \frac{\langle \vec b - A\vec x, \vec p\rangle}{\langle A\vec p, \vec p\rangle}, \tag{7.12}$$

which gives an optimal stepping length as defined in (S1).

Remark 7.2. This theorem says that solving the linear system (7.1) is equivalent to minimizing the nonlinear function (7.6). In general, a linear problem is much easier to solve than a nonlinear one.

Question 7.1. Why do we make a detour from an easy road to a rough road?
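Theorem 7.1 and (7.12) can be sanity-checked numerically on Example 7.4's system. A small sketch (the probe offsets 0.1 and the starting point are arbitrary choices, not from the notes): the $\alpha$ from (7.12) minimizes $h$, and the solution of $A\vec x = \vec b$ minimizes $\phi$.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 9.0]])   # Example 7.4
b = np.zeros(2)
phi = lambda y: 0.5 * y @ A @ y - b @ y  # quadratic function (7.6)

x = np.array([1.0, 1.0 / 9.0])           # current point
p = b - A @ x                             # a search direction (the residual)
alpha = ((b - A @ x) @ p) / (A @ p @ p)   # optimal step length by (7.12)

h = lambda a: phi(x + a * p)              # h(alpha) = phi(x + alpha p), (7.9)
print(h(alpha) <= min(h(alpha - 0.1), h(alpha + 0.1)))  # True: alpha minimizes h
print(phi(np.linalg.solve(A, b)) <= phi(x))             # True: minimizer solves Ax = b
```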
2 Directional Derivative (Gradient Operator)

$$D_{\vec p}\,\phi(\vec x) = D_{\vec p}\,\phi(x_1, x_2) = \lim_{t\to 0}\frac{\phi(\vec x + t\vec p) - \phi(\vec x)}{t}$$

($\vec x = (x_1, x_2)$ any fixed point, $\vec p = (p_1, p_2)$ any unit direction.)

$$= \lim_{t\to 0}\frac{\phi(x_1 + tp_1, x_2 + tp_2) - \phi(x_1, x_2)}{t}$$
$$= \lim_{t\to 0}\frac{\phi(x_1 + tp_1, x_2 + tp_2) - \phi(x_1, x_2 + tp_2) + \phi(x_1, x_2 + tp_2) - \phi(x_1, x_2)}{t}$$
$$= \lim_{t\to 0}\left\{\frac{\phi(x_1 + tp_1, x_2 + tp_2) - \phi(x_1, x_2 + tp_2)}{tp_1}\,p_1 + \frac{\phi(x_1, x_2 + tp_2) - \phi(x_1, x_2)}{tp_2}\,p_2\right\} \tag{7.13}$$
$$= \lim_{t\to 0}\left\{\frac{\phi(x_1 + \Delta x_1, x_2 + tp_2) - \phi(x_1, x_2 + tp_2)}{\Delta x_1}\,p_1 + \frac{\phi(x_1, x_2 + \Delta x_2) - \phi(x_1, x_2)}{\Delta x_2}\,p_2\right\}$$
$$= \frac{\partial\phi(x_1, x_2)}{\partial x_1}\,p_1 + \frac{\partial\phi(x_1, x_2)}{\partial x_2}\,p_2$$

($\nabla = \left(\frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}\right)$ = grad = del, the vector differential operator gradient)

$$= \nabla\phi(x_1, x_2)\cdot\vec p \tag{7.14}$$
$$= \|\nabla\phi\|\,\|\vec p\|\cos\theta.$$

The maximum value of $D_{\vec p}\,\phi$ is $\|\nabla\phi\|$, attained with $\theta = 0$ and $\|\vec p\| = 1$. The minimum value of $D_{\vec p}\,\phi$ is $-\|\nabla\phi\|$, attained in the direction $\vec p = -\nabla\phi/\|\nabla\phi\|$, i.e., $\theta = 180^\circ$.

Conclusion 7.1. At any point $\vec x$, the functional $\phi$ decreases most rapidly (steepest) in the direction of the negative gradient $-\nabla\phi(\vec x)$.

Now compute $\nabla\phi$ for the quadratic function:

$$\phi(\vec x) = \phi(x_1, x_2, \dots, x_N) = \frac{1}{2}\langle A\vec x, \vec x\rangle - \langle \vec b, \vec x\rangle$$
$$= \frac{1}{2}\begin{pmatrix} x_1 & \cdots & x_N \end{pmatrix}\begin{pmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}\begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} - \begin{pmatrix} b_1 & \cdots & b_N \end{pmatrix}\begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix}$$
$$= \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_{ij}x_j x_i - \sum_{i=1}^N x_i b_i$$
$$= \frac{1}{2}\sum_{i=1}^N\left(a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{ii}x_i + \cdots + a_{ik}x_k + \cdots\right)x_i - \sum_{i=1}^N x_i b_i. \tag{7.16}$$

Differentiating with respect to $x_k$, each row $i$ contributes $a_{ik}x_i$ and row $k$ contributes the full row sum, so by the symmetry $a_{ik} = a_{ki}$,

$$\phi_{x_k}(\vec x) = \frac{\partial\phi(\vec x)}{\partial x_k} = \sum_{i=1}^N a_{ik}x_i - b_k, \tag{7.17}$$

$$\nabla\phi(\vec x) = (\phi_{x_1}, \phi_{x_2}, \dots, \phi_{x_N})^T = A\vec x - \vec b. \tag{7.18}$$
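Formula (7.18) can be checked against a finite-difference approximation; a small sketch with randomly generated SPD data (all values assumed, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)             # a random SPD matrix
b = rng.standard_normal(4)
x = rng.standard_normal(4)

phi = lambda y: 0.5 * y @ A @ y - b @ y
grad = A @ x - b                         # gradient by (7.18)

eps = 1e-6                               # central differences in each coordinate
fd = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])
print(np.max(np.abs(fd - grad)))         # ~1e-9: (7.18) confirmed
```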
The slope (grade) of a line $\phi(x) = ax + b$ (a scalar function) is defined as the rise over the run, $m = \frac{\Delta\phi}{\Delta x} = \phi'(x) = \frac{d\phi}{dx}$. The gradient (or gradient vector field) of a scalar function $\phi(\vec x)$ with respect to a vector variable $\vec x$ is denoted by $\nabla\phi = \operatorname{grad}\phi$, where $\nabla$ (the nabla symbol) denotes the vector differential operator del (grad). By definition, the gradient is a vector field whose components are the partial derivatives of $\phi$, as in (7.18).

HW 7.1. Consider a unit circle. Define the unit circle by using a quadratic functional $\phi(\vec x)$. Find $\nabla\phi$ at the point $(1, 0)$ on the circle. What can you say (geometrically) about the gradient at that point, and then at any point? Tangent vector? Normal vector? So what is your conclusion? If we change the unit circle to a set of level curves, does your conclusion still hold? Prove and draw a picture to explain the conclusion.

3 Method of Steepest Descent

Therefore, we should choose the direction $\vec p$ as $\vec b - A\vec x$ if we want to minimize $\phi$ in the steepest way, i.e.,

$$D_{\vec p}\,\phi(\vec x) = \nabla\phi(\vec x)\cdot\vec p \tag{7.19}$$
$$\vec p = -\nabla\phi(\vec x) \tag{7.20}$$
$$\vec p = \vec b - A\vec x \tag{7.21}$$
$$\vec r = \vec b - A\vec x \quad \text{(the residual)} \tag{7.22}$$

Iteration:

$$\vec x^{(k)} = \vec x^{(k-1)} + \alpha_k\vec p^{(k)} \tag{7.23}$$
$$\vec p^{(k)} = -\nabla\phi(\vec x^{(k-1)}) = \vec r^{(k-1)} \tag{7.24}$$

Algorithm MSD: Method of Steepest Descent

$\vec x^{(0)} = \vec 0$  (initial guess, or any other vector)
for $k = 1, 2, \dots,$ kmax
  $\vec r^{(k-1)} = \vec b - A\vec x^{(k-1)}$
  if $\vec r^{(k-1)} = \vec 0$ then quit  (solution found)
  else $\vec p^{(k)} = \vec r^{(k-1)}$  (search direction $\vec p = -\nabla\phi$)
  $\alpha_k = \dfrac{\langle \vec p^{(k)}, \vec r^{(k-1)}\rangle}{\langle \vec p^{(k)}, A\vec p^{(k)}\rangle} = \dfrac{\langle \vec r^{(k-1)}, \vec r^{(k-1)}\rangle}{\langle \vec r^{(k-1)}, A\vec r^{(k-1)}\rangle}$  (by (7.12))
  $\vec x^{(k)} = \vec x^{(k-1)} + \alpha_k\vec p^{(k)}$
end for

Remark 7.3. The convergence results for MSD are (see J. R. Shewchuk 1994)

$$\|\vec e^{(k)}\|_A \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^k \|\vec e^{(0)}\|_A \tag{7.25}$$

where

$$\vec e^{(0)} = \vec x^* - \vec x^{(0)}, \quad \vec e^{(k)} = \vec x^* - \vec x^{(k)}, \quad \|\vec e\|_A = \sqrt{\langle \vec e, A\vec e\rangle}, \quad \kappa = \frac{\lambda_{\max}}{\lambda_{\min}}, \quad \{\lambda_i\} \text{ are the eigenvalues of } A. \tag{7.26}$$

Since $A$ is SPD, $0 < \lambda_1 \le \lambda_2 \le \lambda_3 \le \cdots \le \lambda_N$ and $\kappa = \lambda_N/\lambda_1$. The convergence is slow if $A$ is ill-conditioned, i.e., $\kappa \gg 1$.

HW 7.2. Prove (7.25).

Example 7.5. $A = \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}$.

Solution:

$$\det(A - \lambda I) = \det\begin{pmatrix} 3 - \lambda & 2 \\ 2 & 6 - \lambda \end{pmatrix} = 0 \;\Longrightarrow\; (3 - \lambda)(6 - \lambda) - 4 = 0 \;\Longrightarrow\; \lambda^2 - 9\lambda + 14 = 0$$
$$\lambda = \frac{9 \pm \sqrt{81 - 56}}{2} = \frac{9 \pm 5}{2} \;\Longrightarrow\; \lambda_{\max} = 7, \quad \lambda_{\min} = 2, \quad \kappa = \frac{7}{2}.$$

Example 7.6. $A = \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}$, $\vec b = \begin{pmatrix} 2 \\ -8 \end{pmatrix}$.
Solution:

$$\phi(\vec x) = \frac{1}{2}\langle A\vec x, \vec x\rangle - \langle \vec b, \vec x\rangle = \frac{1}{2}\left\langle \begin{pmatrix} 3x_1 + 2x_2 \\ 2x_1 + 6x_2 \end{pmatrix}, \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right\rangle - \left\langle \begin{pmatrix} 2 \\ -8 \end{pmatrix}, \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right\rangle$$
$$= \frac{1}{2}\left(3x_1^2 + 4x_1x_2 + 6x_2^2\right) - (2x_1 - 8x_2) = C$$

(different contours with different constants $C$).

Example 7.7. $N = 2$, $A = \begin{pmatrix} d_1 & 0 \\ 0 & d_2 \end{pmatrix}$, $\vec b = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\vec x = \begin{pmatrix} x \\ y \end{pmatrix}$, $d_2 \ge d_1 > 0$. (Recall the general ellipse in rotated coordinates $\tilde x_1 = x_1\cos\theta + x_2\sin\theta$, $\tilde x_2 = -x_1\sin\theta + x_2\cos\theta$, centered at $(p_1, p_2)$: $\frac{(\tilde x_1 - p_1)^2}{a^2} + \frac{(\tilde x_2 - p_2)^2}{b^2} = 1$.)

Solution:

$$\phi(\vec x) = \frac{1}{2}\langle A\vec x, \vec x\rangle = \frac{1}{2}\begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} d_1 & 0 \\ 0 & d_2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \frac{1}{2}(d_1x^2 + d_2y^2) = C_1 \;(\text{constant}) > 0,$$

a level curve: an ellipse $\dfrac{x^2}{a^2} + \dfrac{y^2}{b^2} = 1$, $a \ge b$, with

$$\frac{a}{b} = \sqrt{\frac{d_2}{d_1}} = \sqrt{\kappa}.$$

HW 7.3. Let $d_1 = 1$ and $d_2 = 1/9$. Draw level curves (contours) of the function $\phi(\vec x)$ and the gradient $\nabla\phi(\vec x)$ at the point $\vec x = (1, 1/9)$. Find the tangent line at $\vec x = (1, 1/9)$ to the level curve passing through $(1, 1/9)$. What is the relation between $\nabla\phi(1, 1/9)$ and this line? Find the minimum value and minimizer of $\phi(\vec x)$ by MSD starting from the point $(1, 1/9)$.
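For experimenting with HW 7.3 and for the comparison in the next section, here is one possible MSD implementation (a sketch, not from the notes), run on Example 7.6:

```python
import numpy as np

def msd(A, b, x0, tol=1e-10, kmax=1000):
    """Method of steepest descent: p^(k) = r^(k-1) = b - A x^(k-1)."""
    x = x0.astype(float)
    k = 0
    for k in range(1, kmax + 1):
        r = b - A @ x                    # residual = steepest-descent direction
        if np.linalg.norm(r) < tol:      # solution found
            break
        alpha = (r @ r) / (r @ A @ r)    # optimal step length by (7.12)
        x = x + alpha * r
    return x, k

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # Example 7.6, kappa = 7/2
b = np.array([2.0, -8.0])
x, k = msd(A, b, np.zeros(2))
print(x, k)   # -> (2, -2) after dozens of steps; cg above needs only 2
```

Running this alongside the `cg` sketch from the beginning of the lecture already previews the picture of Fig. 1 below: for the same system, MSD zigzags for dozens of iterations while CG finishes in two.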
Figure 1: Steepest Descent vs. Conjugate Gradient.

Conclusion 7.2. Case (1): If the contours are circles, i.e., $\kappa = 1$, the solution is found in one step by MSD. Case (2): $A$ is ill-conditioned $\Longrightarrow$ $\kappa \gg 1$ $\Longrightarrow$ the level curves are very elongated $\Longrightarrow$ many iterations are required to find an approximate solution $\Longrightarrow$ MSD converges very slowly.

HW 7.4. Prove Case (1).

Example 7.8. A comparison (Fig. 1) of the convergence of gradient (steepest) descent with optimal step size (in green) and conjugate gradient (in red) for minimizing the quadratic form associated with a given linear system. Conjugate gradient converges in at most $N$ steps, where $N$ is the size of the matrix of the system (here $N = 2$).

4 Conjugate Gradient Method

To avoid Case (2), another approach was proposed: choose search directions

$$\{\vec p^{(1)}, \vec p^{(2)}, \vec p^{(3)}, \dots\} \quad \text{(built from the residuals } \{\vec r^{(0)}, \vec r^{(1)}, \vec r^{(2)}, \dots\}\text{)} \tag{7.27}$$

that satisfy

$$\langle \vec p^{(i)}, A\vec p^{(j)}\rangle = 0, \quad i \ne j. \tag{7.28}$$

This is called an A-orthogonality (A-conjugacy) condition, and the set $\{\vec p^{(1)}, \vec p^{(2)}, \vec p^{(3)}, \dots\}$ is said to be A-orthogonal. The vectors $\vec p^{(i)}$ are called conjugate directions. Since $A$ is SPD, the set $\{\vec p^{(1)}, \vec p^{(2)}, \vec p^{(3)}, \dots, \vec p^{(N)}\}$ is linearly independent, i.e.,

$$\mathbb{R}^N = \operatorname{span}\{\vec p^{(1)}, \vec p^{(2)}, \vec p^{(3)}, \dots, \vec p^{(N)}\}? \tag{7.29}$$

By (7.12), $h'(\alpha) = 0$ gives

$$\alpha_k = \frac{\langle \vec b - A\vec x^{(k-1)}, \vec p^{(k)}\rangle}{\langle A\vec p^{(k)}, \vec p^{(k)}\rangle} = \frac{\langle \vec r^{(k-1)}, \vec p^{(k)}\rangle}{\langle A\vec p^{(k)}, \vec p^{(k)}\rangle} \tag{7.30}$$
$$\vec x^{(k)} = \vec x^{(k-1)} + \alpha_k\vec p^{(k)} \quad \text{(next iterate)}. \tag{7.31}$$

Theorem 7.2. Let $\{\vec p^{(1)}, \vec p^{(2)}, \vec p^{(3)}, \dots, \vec p^{(N)}\}$ be an A-orthogonal set of nonzero vectors associated with the SPD matrix $A$, and let $\vec x^{(0)}$ be arbitrary. Define the iterates (7.31) for $k = 1, 2, 3, \dots$, with (7.30). Then, assuming exact arithmetic, we have

$$A\vec x^{(N)} = \vec b \quad (N \text{ steps to get the solution}). \tag{7.32}$$
Proof: Recall $A\vec x = \vec b$ with $A$ an $N \times N$ SPD matrix and $\vec x \in \mathbb{R}^N$,

$$\phi(\vec x) = \frac{1}{2}\langle A\vec x, \vec x\rangle - \langle \vec b, \vec x\rangle, \quad \vec x \in \mathbb{R}^N,$$
$$h(\alpha) = \phi(\vec x + \alpha\vec p) = \phi(\vec x) + \alpha\langle A\vec x - \vec b, \vec p\rangle + \frac{\alpha^2}{2}\langle A\vec p, \vec p\rangle,$$
$$-\nabla\phi(\vec x) = \vec r = \vec b - A\vec x, \qquad \vec x^{(k)} = \vec x^{(k-1)} + \alpha_k\vec p^{(k)}.$$

Then

$$A\vec x^{(N)} = A\vec x^{(N-1)} + \alpha_N A\vec p^{(N)} = A\vec x^{(N-2)} + \alpha_{N-1}A\vec p^{(N-1)} + \alpha_N A\vec p^{(N)}$$
$$= \cdots = A\vec x^{(0)} + \alpha_1 A\vec p^{(1)} + \alpha_2 A\vec p^{(2)} + \cdots + \alpha_{N-1}A\vec p^{(N-1)} + \alpha_N A\vec p^{(N)},$$

so that

$$A\vec x^{(N)} - \vec b = A\vec x^{(0)} - \vec b + \sum_{i=1}^N \alpha_i A\vec p^{(i)}.$$

For each $k = 1, \dots, N$,

$$\langle A\vec x^{(N)} - \vec b, \vec p^{(k)}\rangle = \langle A\vec x^{(0)} - \vec b, \vec p^{(k)}\rangle + \sum_{i=1}^N \alpha_i\langle A\vec p^{(i)}, \vec p^{(k)}\rangle$$
$$= \langle A\vec x^{(0)} - \vec b, \vec p^{(k)}\rangle + \alpha_k\langle A\vec p^{(k)}, \vec p^{(k)}\rangle \quad \left(\{\vec p^{(i)}\} \text{ is A-orthogonal}\right)$$
$$= \langle A\vec x^{(0)} - \vec b, \vec p^{(k)}\rangle + \frac{\langle \vec b - A\vec x^{(k-1)}, \vec p^{(k)}\rangle}{\langle A\vec p^{(k)}, \vec p^{(k)}\rangle}\langle A\vec p^{(k)}, \vec p^{(k)}\rangle$$
$$= \langle A\vec x^{(0)} - A\vec x^{(k-1)}, \vec p^{(k)}\rangle$$
$$\left(A\vec x^{(k-1)} = A\vec x^{(k-2)} + \alpha_{k-1}A\vec p^{(k-1)} = A\vec x^{(0)} + \alpha_1 A\vec p^{(1)} + \cdots + \alpha_{k-1}A\vec p^{(k-1)}\right)$$
$$= -\sum_{i=1}^{k-1}\alpha_i\langle A\vec p^{(i)}, \vec p^{(k)}\rangle = 0, \quad k = 1, \dots, N$$
$$\Longrightarrow A\vec x^{(N)} = \vec b. \;\square$$

How do we determine the A-orthogonal set $\{\vec p^{(1)}, \vec p^{(2)}, \vec p^{(3)}, \dots, \vec p^{(N)}\}$ (the set of search directions, i.e., the conjugate directions)?
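Before the CG-specific answer via (S5) is developed below, note that one generic answer is a Gram-Schmidt process carried out in the A-inner product. The following sketch (not from the notes; the standard basis is an assumed starting set) builds conjugate directions this way and verifies Theorem 7.2 on Example 7.6:

```python
import numpy as np

def conjugate_directions(A):
    """A-orthogonalize the standard basis: <p_i, A p_j> = 0 for i != j."""
    N = A.shape[0]
    P = []
    for e in np.eye(N):
        p = e.copy()
        for q in P:                      # subtract A-projections (Gram-Schmidt)
            p -= (e @ A @ q) / (q @ A @ q) * q
        P.append(p)
    return np.array(P)

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
P = conjugate_directions(A)
x = np.zeros(2)
for p in P:                              # iterates (7.31) with steps (7.30)
    x += ((b - A @ x) @ p) / (p @ A @ p) * p
print(x)                                 # (2, -2): A x^(N) = b after N steps
```

This costs as much as a direct solve, which is why CG's cheap recursion (S5) for generating the same kind of directions on the fly is the key point of the method.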
Lemma 7.1. Show that

$$\langle \vec r^{(k)}, \vec p^{(k)}\rangle = 0 \quad \text{for } k = 1, \dots, N, \tag{7.33}$$
$$\langle \vec p^{(k+1)}, A\vec p^{(k)}\rangle = 0 \quad \text{for } k = 1, \dots, N-1, \tag{7.34}$$
$$\langle \vec p^{(k)}, A\vec p^{(k)}\rangle = \langle \vec r^{(k-1)}, A\vec p^{(k)}\rangle \quad \text{for } k = 1, \dots, N, \tag{7.35}$$
$$\langle \vec r^{(k)}, \vec r^{(k-1)}\rangle = 0 \quad \text{for } k = 1, \dots, N. \tag{7.36}$$

Proof: By (S2) and (S3), $\vec x^{(k)} = \vec x^{(k-1)} + \alpha_k\vec p^{(k)}$ gives

$$\vec r^{(k)} = \vec b - A\vec x^{(k)} = \vec b - A\vec x^{(k-1)} - \frac{\langle \vec b - A\vec x^{(k-1)}, \vec p^{(k)}\rangle}{\langle A\vec p^{(k)}, \vec p^{(k)}\rangle}A\vec p^{(k)},$$

so

$$\langle \vec r^{(k)}, \vec p^{(k)}\rangle = \langle \vec b - A\vec x^{(k-1)}, \vec p^{(k)}\rangle - \langle \vec b - A\vec x^{(k-1)}, \vec p^{(k)}\rangle = 0,$$

which is (7.33).

By (S5), $A\vec p^{(k+1)} = A\vec r^{(k)} + \beta_k A\vec p^{(k)}$, so

$$\langle A\vec p^{(k+1)}, \vec p^{(k)}\rangle = \langle A\vec r^{(k)}, \vec p^{(k)}\rangle + \beta_k\langle A\vec p^{(k)}, \vec p^{(k)}\rangle = 0,$$

which is (7.34), by the definition of $\beta_k$ in (S4); this is the answer to the first "Why?" in (S4).

For $k = 1$, $\langle \vec p^{(1)}, A\vec p^{(1)}\rangle = \langle \vec r^{(0)}, A\vec p^{(1)}\rangle$ by the definition of $\vec p^{(1)}$. For $2 \le k \le N$,

$$\langle \vec p^{(k)}, A\vec p^{(k)}\rangle = \langle \vec r^{(k-1)} + \beta_{k-1}\vec p^{(k-1)}, A\vec p^{(k)}\rangle \quad \text{by (S5)}$$
$$= \langle \vec r^{(k-1)}, A\vec p^{(k)}\rangle + \beta_{k-1}\langle \vec p^{(k-1)}, A\vec p^{(k)}\rangle = \langle \vec r^{(k-1)}, A\vec p^{(k)}\rangle \quad \text{by (7.34)},$$

which is (7.35).

HW 7.5. Show the second equality in (S1) first and then (7.36).

Theorem 7.3. $\langle \vec p^{(j)}, A\vec p^{(m)}\rangle = 0$ and $\langle \vec r^{(j-1)}, \vec r^{(m-1)}\rangle = 0$ for $j \ne m$. (7.37)
Proof: This result is proved by induction on $m$. From (7.34) and (7.36), we first note that $\langle \vec p^{(2)}, A\vec p^{(1)}\rangle = \langle \vec r^{(1)}, \vec r^{(0)}\rangle = 0$. Assume now that

$$\langle \vec p^{(l)}, A\vec p^{(k)}\rangle = \langle \vec r^{(l-1)}, \vec r^{(k-1)}\rangle = 0 \quad \text{for } 1 \le k < l \le m. \tag{7.38}$$

We want to show that the same relation holds for $1 \le k < l \le m + 1$. It has been shown in Lemma 7.1 for $k = m$ and $l = m + 1$. Taking $1 \le k < m$ and $l = m + 1$, it follows that

$$\langle \vec r^{(m)}, \vec r^{(k-1)}\rangle = \langle \vec r^{(m-1)}, \vec r^{(k-1)}\rangle - \alpha_m\langle A\vec p^{(m)}, \vec r^{(k-1)}\rangle$$
$$= -\alpha_m\langle A\vec p^{(m)}, \vec r^{(k-1)}\rangle \quad \text{by (7.38)}$$
$$= -\alpha_m\langle A\vec p^{(m)}, \vec p^{(k)} - \beta_{k-1}\vec p^{(k-1)}\rangle \quad \text{by (S5)}$$
$$= 0 \quad \text{by (7.38)} \tag{7.39}$$

and

$$\langle \vec p^{(m+1)}, A\vec p^{(k)}\rangle = \langle \vec r^{(m)} + \beta_m\vec p^{(m)}, A\vec p^{(k)}\rangle = \langle \vec r^{(m)}, A\vec p^{(k)}\rangle + \beta_m\langle \vec p^{(m)}, A\vec p^{(k)}\rangle$$
$$= \langle \vec r^{(m)}, A\vec p^{(k)}\rangle \quad \text{by (7.38)}$$
$$= \langle \vec r^{(m)}, (\vec r^{(k-1)} - \vec r^{(k)})\,\alpha_k^{-1}\rangle \quad \text{by (S3)}$$
$$= 0 \quad \text{by (7.36) and (7.39)}. \;\square$$
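The relations of Lemma 7.1 and Theorem 7.3 can be observed in floating point by instrumenting a CG run. The sketch below (random SPD data assumed, for illustration) stores all directions and residuals and prints the largest off-diagonal entries of the two Gram matrices, which are zero up to rounding:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)              # a random SPD matrix
b = rng.standard_normal(6)

x = np.zeros(6)
r = b - A @ x
p = r.copy()
R, P = [r.copy()], []                    # store r^(0), r^(1), ... and p^(1), ...
for k in range(6):
    P.append(p.copy())
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x += alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
    R.append(r.copy())

Pm, Rm = np.array(P), np.array(R)
G = Pm @ A @ Pm.T                        # Gram matrix <p^(i), A p^(j)>
H = Rm @ Rm.T                            # Gram matrix <r^(i), r^(j)>
print(np.max(np.abs(G - np.diag(np.diag(G)))))   # ~0: (7.37), A-orthogonality
print(np.max(np.abs(H - np.diag(np.diag(H)))))   # ~0: residual orthogonality
```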