The Conjugate Gradient Method
Jason E. Hicken
Aerospace Design Lab, Department of Aeronautics & Astronautics, Stanford University
14 July 2011
Lecture Objectives
- describe when CG can be used to solve Ax = b
- relate CG to the method of conjugate directions
- describe what CG does geometrically
- explain each line in the CG algorithm
We are interested in solving the linear system $Ax = b$, where $x, b \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$. The matrix is symmetric positive-definite (SPD):
- $A^T = A$ (symmetric)
- $x^T A x > 0$ for all $x \neq 0$ (positive-definite)
Such systems arise in the discretization of elliptic PDEs, the optimization of quadratic functionals, and nonlinear optimization problems.
When A is SPD, solving the linear system is equivalent to minimizing the quadratic form
\[ f(x) = \tfrac{1}{2} x^T A x - b^T x. \]
Why? If $x^*$ is the minimizing point, then
\[ \nabla f(x^*) = A x^* - b = 0 \]
and, for $x \neq x^*$, $f(x) - f(x^*) > 0$ (homework).
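To make the equivalence concrete, here is a small NumPy check; this is a sketch, not from the slides, and it borrows the $2 \times 2$ model problem introduced below:

```python
import numpy as np

# SPD test system: the 2x2 model problem introduced on the next slide
A = np.array([[5.0, -3.0], [-3.0, 5.0]])
b = np.array([4.0, 4.0])

def f(x):
    """Quadratic form f(x) = 1/2 x^T A x - b^T x."""
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    """Gradient of f: A x - b (the negative of the residual)."""
    return A @ x - b

x_star = np.linalg.solve(A, b)            # exact solution, here [2, 2]
assert np.allclose(grad_f(x_star), 0.0)   # gradient vanishes at the minimizer

# f(x) > f(x*) for any x != x*, because A is positive-definite
x = x_star + np.array([0.1, -0.2])
assert f(x) > f(x_star)
```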
Definitions
Let $x_i$ be the approximate solution to $Ax = b$ at iteration i.
- error: $e_i \equiv x_i - x^*$
- residual: $r_i \equiv b - A x_i$
The following identities for the residual will be useful later:
\[ r_i = -A e_i, \qquad r_i = -\nabla f(x_i). \]
Model problem
\[ \begin{bmatrix} 5 & -3 \\ -3 & 5 \end{bmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \end{pmatrix} \]
[Figure: contours of the quadratic form, with the lines $5x_1 - 3x_2 = 4$ and $-3x_1 + 5x_2 = 4$ intersecting at the solution $x^* = [2 \;\; 2]^T$.]
Review: Steepest Descent Method
Qualitatively, how will steepest descent proceed on our model problem, starting at $x_0 = (-\tfrac{1}{3}, 1)^T$?
[Figure: successive iterates $x_0, x_1, \dots$ zig-zag toward $x^*$, each step orthogonal to the previous one.]
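For reference, a minimal NumPy sketch of steepest descent with exact line search; the starting point follows the slide, while the tolerance and iteration cap are implementation choices:

```python
import numpy as np

A = np.array([[5.0, -3.0], [-3.0, 5.0]])
b = np.array([4.0, 4.0])

def steepest_descent(x, tol=1e-10, max_iter=1000):
    """Steepest descent with exact line search for f(x) = 1/2 x^T A x - b^T x."""
    path = [x.copy()]
    for _ in range(max_iter):
        r = b - A @ x                    # residual = -grad f = descent direction
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ A @ r)    # exact minimizer along r
        x = x + alpha * r
        path.append(x.copy())
    return x, path

x0 = np.array([-1.0 / 3.0, 1.0])
x, path = steepest_descent(x0)
print(len(path) - 1, "iterations")       # many more than n = 2: the zig-zag
assert np.allclose(x, [2.0, 2.0])
```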
How can we eliminate this zig-zag behaviour? To find the answer, we begin by considering the easier problem
\[ \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 4\sqrt{2} \\ 0 \end{pmatrix}, \qquad f(x) = x_1^2 + 4x_2^2 - 4\sqrt{2}\,x_1. \]
Here, the equations are decoupled, so we can minimize in each direction independently. What do the contours of the corresponding quadratic form look like?
Simplified problem
[Figure: the contours of f are ellipses aligned with the coordinate axes; the initial error $e_0$ splits into independent components along $x_1$ and $x_2$.]
Method of Orthogonal Directions
Idea: express the error as a sum of n orthogonal search directions,
\[ e_0 \equiv x_0 - x^* = \sum_{i=0}^{n-1} \alpha_i d_i. \]
At iteration i+1, eliminate the component $\alpha_i d_i$:
- never need to search along $d_i$ again
- converge in n iterations!
How would we apply the method of orthogonal directions to a non-diagonal matrix?
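On the decoupled system above, the method is just one exact line search per coordinate. A minimal sketch, using the right-hand side $4\sqrt{2}$ reconstructed above:

```python
import numpy as np

# Decoupled (diagonal) system from the simplified problem
A = np.diag([2.0, 8.0])
b = np.array([4.0 * np.sqrt(2.0), 0.0])

n = len(b)
x = np.zeros(n)
for i in range(n):
    d = np.zeros(n)
    d[i] = 1.0                        # coordinate search direction d_i
    r = b - A @ x
    alpha = (d @ r) / (d @ A @ d)     # exact minimization along d_i
    x = x + alpha * d                 # the component along d_i is eliminated

assert np.allclose(A @ x, b)          # converged in exactly n = 2 steps
```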
Review of Inner Products
The search directions in the method of orthogonal directions are orthogonal with respect to the dot product. The dot product is an example of an inner product.
Inner Product: for $x, y, z \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$, an inner product $(\cdot,\cdot) : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ satisfies
- symmetry: $(x, y) = (y, x)$
- linearity: $(\alpha x + y, z) = \alpha (x, z) + (y, z)$
- positive-definiteness: $(x, x) > 0$ for all $x \neq 0$
Fact: $(x, y)_A \equiv x^T A y$ is an inner product.
A-orthogonality (conjugacy): we say two vectors $x, y \in \mathbb{R}^n$ are A-orthogonal, or conjugate, if
\[ (x, y)_A = x^T A y = 0. \]
What happens if we use A-orthogonality rather than standard orthogonality in the method of orthogonal directions?
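A small numeric illustration (a sketch; the two vectors are chosen for this A, not taken from the slides): conjugate vectors need not be orthogonal in the ordinary sense.

```python
import numpy as np

A = np.array([[5.0, -3.0], [-3.0, 5.0]])
x = np.array([1.0, 0.0])
y = np.array([3.0, 5.0])

print(x @ y)      # 3.0 -> x and y are NOT orthogonal in the dot product
print(x @ A @ y)  # 0.0 -> but they ARE A-orthogonal (conjugate)
```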
Let $\{p_0, p_1, \dots, p_{n-1}\}$ be a set of n linearly independent vectors that are A-orthogonal. If $p_i$ is the i-th column of P, then
\[ P^T A P = \Sigma, \]
where $\Sigma$ is a diagonal matrix. Substitute $x = Py$ into the quadratic form:
\[ f(Py) = \tfrac{1}{2} y^T \Sigma y - (P^T b)^T y. \]
We can apply the method of orthogonal directions in y-space.
New Problem: how do we get the set $\{p_i\}$ of conjugate vectors?
Gram-Schmidt Conjugation
Let $\{d_0, d_1, \dots, d_{n-1}\}$ be a set of linearly independent vectors, e.g., the coordinate axes. Set $p_0 = d_0$ and, for $i > 0$,
\[ p_i = d_i - \sum_{j=0}^{i-1} \beta_{ij} p_j, \qquad \beta_{ij} = (d_i, p_j)_A / (p_j, p_j)_A. \]
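A direct sketch of this procedure (the routine name and the choice of coordinate axes for the $d_i$ are illustrative); the final check is the property $P^T A P = \Sigma$ used on the previous slide:

```python
import numpy as np

def gram_schmidt_conjugate(D, A):
    """A-orthogonalize the columns of D.

    Note the cost: every new p_i is corrected against all previous
    p_j, so all directions must be kept (O(n^2) storage, O(n^3) work).
    """
    n = D.shape[1]
    P = np.zeros_like(D)
    for i in range(n):
        p = D[:, i].copy()
        for j in range(i):
            beta = (D[:, i] @ A @ P[:, j]) / (P[:, j] @ A @ P[:, j])
            p = p - beta * P[:, j]
        P[:, i] = p
    return P

A = np.array([[5.0, -3.0], [-3.0, 5.0]])
P = gram_schmidt_conjugate(np.eye(2), A)              # d_i = coordinate axes
Sigma = P.T @ A @ P
assert np.allclose(Sigma, np.diag(np.diag(Sigma)))    # P^T A P is diagonal
```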
The Method of Conjugate Directions
Force the error at iteration i+1 to be conjugate to the search direction $p_i$:
\[ p_i^T A e_{i+1} = p_i^T A (e_i + \alpha_i p_i) = 0 \quad\Rightarrow\quad \alpha_i = -\frac{p_i^T A e_i}{p_i^T A p_i} = \frac{p_i^T r_i}{p_i^T A p_i}. \]
- never need to search along $p_i$ again
- converge in n iterations!
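Combining conjugation with this step length gives a complete two-step solve of the model problem. A sketch; here the conjugate pair is computed inline rather than via the routine above:

```python
import numpy as np

A = np.array([[5.0, -3.0], [-3.0, 5.0]])
b = np.array([4.0, 4.0])

# Conjugate directions for this A: the coordinate axes run through
# one step of Gram-Schmidt conjugation
p0 = np.array([1.0, 0.0])
e1 = np.array([0.0, 1.0])
p1 = e1 - ((e1 @ A @ p0) / (p0 @ A @ p0)) * p0
assert abs(p0 @ A @ p1) < 1e-12          # p0 and p1 are conjugate

x = np.zeros(2)
for p in (p0, p1):
    r = b - A @ x
    alpha = (p @ r) / (p @ A @ p)        # forces p^T A e_{i+1} = 0
    x = x + alpha * p

assert np.allclose(x, [2.0, 2.0])        # exact solution in n = 2 steps
```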
The Method of Conjugate Directions
\[ \begin{bmatrix} 5 & -3 \\ -3 & 5 \end{bmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \end{pmatrix} \]
[Figures: on the model problem, the coordinate directions $d_0, d_1$ are conjugated into $p_0, p_1$; starting from $x_0$, one minimization along $p_0$ and one along $p_1$ reach $x^*$ exactly.]
The Method of Conjugate Directions is well defined and avoids the zig-zagging of Steepest Descent. What about computational expense? If we choose the $d_i$ in Gram-Schmidt conjugation to be the coordinate axes, the Method of Conjugate Directions is equivalent to Gaussian elimination. Keeping all the $p_i$ is the same as storing a dense matrix! Can we find a smarter choice for $d_i$?
Error Decomposition Using $p_i$
[Figure: on the model problem, the initial error $e_0$ decomposed into its components $\alpha_0 p_0$ and $\alpha_1 p_1$.]
The error at iteration i can be expressed as
\[ e_i = -\sum_{k=i}^{n-1} \alpha_k p_k, \]
so the error must be conjugate to $p_j$ for $j < i$:
\[ p_j^T A e_i = 0 \quad\Rightarrow\quad p_j^T r_i = 0. \]
But from Gram-Schmidt conjugation we have
\[ p_j = d_j - \sum_{k=0}^{j-1} \beta_{jk} p_k, \]
so
\[ p_j^T r_i = d_j^T r_i - \sum_{k=0}^{j-1} \beta_{jk}\, p_k^T r_i \quad\Rightarrow\quad 0 = d_j^T r_i, \quad j < i, \]
since every $p_k^T r_i$ in the sum also vanishes ($k < j < i$).
Thus, the residual at iteration i is orthogonal to the vectors $d_j$ used in the previous iterations:
\[ d_j^T r_i = 0, \quad j < i \]
(we showed this is true for any choice of the $d_i$).
Idea: what happens if we choose $d_i = r_i$?
- the residuals become mutually orthogonal
- $r_i$ is orthogonal to $p_j$, for $j < i$
- $r_{i+1}$ becomes conjugate to $p_j$, for $j < i$
The last point is not immediately obvious, so we will prove it; the result has significant implications for Gram-Schmidt conjugation.
The solution is updated according to
\[ x_{j+1} = x_j + \alpha_j p_j, \qquad r_{j+1} = r_j - \alpha_j A p_j \quad\Rightarrow\quad A p_j = \frac{1}{\alpha_j}\left(r_j - r_{j+1}\right). \]
Next, take the dot product of both sides with an arbitrary residual $r_i$ and use the mutual orthogonality of the residuals:
\[ r_i^T A p_j = \begin{cases} \dfrac{r_i^T r_i}{\alpha_i}, & i = j \\[1ex] -\dfrac{r_i^T r_i}{\alpha_{i-1}}, & i = j+1 \\[1ex] 0, & \text{otherwise.} \end{cases} \]
We can show that the first case (i = j) contains no new information (homework). Divide the remaining cases by $p_j^T A p_j$ and insert the definition of $\alpha_{i-1}$:
\[ \underbrace{\frac{r_i^T A p_j}{p_j^T A p_j}}_{\beta_{ij}} = \begin{cases} -\dfrac{r_i^T r_i}{r_{i-1}^T r_{i-1}}, & i = j+1 \\[1ex] 0, & \text{otherwise.} \end{cases} \]
We recognize the L.H.S. as the coefficients in Gram-Schmidt conjugation: only one coefficient is nonzero! With $d_i = r_i$, the Gram-Schmidt update therefore collapses to the single term $p_{i+1} = r_{i+1} - \beta_{i+1,i}\, p_i$.
The Conjugate Gradient Method
Set $p_0 = r_0 = b - A x_0$ and $i = 0$, then repeat:
\[
\begin{aligned}
\alpha_i &= (p_i^T r_i)/(p_i^T A p_i) && \text{(step length)} \\
x_{i+1} &= x_i + \alpha_i p_i && \text{(sol. update)} \\
r_{i+1} &= r_i - \alpha_i A p_i && \text{(resid. update)} \\
\beta_{i+1,i} &= -(r_{i+1}^T r_{i+1})/(r_i^T r_i) && \text{(G.S. coeff.)} \\
p_{i+1} &= r_{i+1} - \beta_{i+1,i} p_i && \text{(Gram-Schmidt)} \\
i &:= i + 1
\end{aligned}
\]
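A line-for-line NumPy sketch of the algorithm above; the tolerance and iteration cap are added stopping criteria, not part of the slide:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate Gradient for SPD A, following the slide line by line."""
    x = x0.copy()
    r = b - A @ x                            # p_0 = r_0 = b - A x_0
    p = r.copy()
    for _ in range(max_iter or len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (p @ r) / (p @ Ap)           # step length
        x = x + alpha * p                    # sol. update
        r_new = r - alpha * Ap               # resid. update
        beta = -(r_new @ r_new) / (r @ r)    # G.S. coeff. beta_{i+1,i}
        p = r_new - beta * p                 # Gram-Schmidt (single term)
        r = r_new
    return x

A = np.array([[5.0, -3.0], [-3.0, 5.0]])
b = np.array([4.0, 4.0])
x = conjugate_gradient(A, b, np.array([-1.0 / 3.0, 1.0]))
assert np.allclose(x, [2.0, 2.0])            # exact after at most n = 2 steps
```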
The Conjugate Gradient Method
\[ \begin{bmatrix} 5 & -3 \\ -3 & 5 \end{bmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \end{pmatrix} \]
[Figure: on the model problem, CG starting from $x_0$ reaches $x^* = [2 \;\; 2]^T$ in two iterations.]
Lecture Objectives
- describe when CG can be used to solve Ax = b
  A must be symmetric positive-definite.
- relate CG to the method of conjugate directions
  CG is a method of conjugate directions with the choice $d_i = r_i$, which simplifies Gram-Schmidt conjugation.
- describe what CG does geometrically
  CG performs the method of orthogonal directions in a transformed space where the contours of the quadratic form are aligned with the coordinate axes.
- explain each line in the CG algorithm
  (see the annotated algorithm slide above)