Conjugate Gradient Tutorial

Prof. Chung-Kuan Cheng
Computer Science and Engineering Department
University of California, San Diego
ckcheng@ucsd.edu

CSE291: Topics on Scientific Computation
December 1, 2015
Overview

1 Introduction
  Overview
  Formulation
2 Steepest Descent: Descent in One Vector Direction
  Steepest Descent Formula
  Steepest Descent Properties
  Steepest Descent Convergence
  Preconditioning
3 Conjugate Gradient: Descent with Multiple Vectors
  Multiple Vector Optimization
  Global Procedure in Matrix Form $V_k$
  Conjugate Gradient: Wish List
  Conjugate Gradient Descent: Formula
  Validation of the Properties
4 Summary
5 References
Introduction: Overview

Conjugate gradient is an extension of steepest gradient descent. In steepest descent, we step in one direction per iteration. Over the iterations, the new directions may contain components of the old directions, so the process walks in a zig-zag pattern. In conjugate gradient, we consider multiple directions simultaneously, and hence avoid repeating the old directions. In 1952, Hestenes and Stiefel introduced the conjugate gradient formula to simplify the multiple-direction search.
Introduction: Overview

Steepest Gradient Descent: We derive the method and its properties, viewing it as a one-direction-per-iteration approach. The method suffers from slow zig-zag winding in a narrow valley of the equipotential terrain.
Preconditioning: From the properties of the steepest descent method, we find that preconditioning improves the convergence rate.
Conjugate Gradient in Global View: We view the conjugate gradient method from the perspective of gradient descent; however, this descent considers multiple directions simultaneously.
Conjugate Gradient Formula: We state the formula of the conjugate gradient method.
Conjugate Gradient Method Properties: We show that the global view of the conjugate gradient method allows each step to be optimized independently of the other steps. Therefore, the process can repeat recursively and converges after $n$ iterations, where $n$ is the number of variables. Finally, we show and prove the property that validates the formula.
Introduction: Formulation

The original problem is to solve a system of simultaneous linear equations, $Ax = b$, where matrix $A$ is symmetric and positive definite. Computing the inverse in $x = A^{-1}b$ can be expensive, e.g. when $n$ is huge. To avoid a direct solver, we formulate the problem with a quadratic convex objective function.

Formulation: minimize $\frac{1}{2} x^T A x - b^T x$, with $A \in S^n_{++}$.

Solution: $x = A^{-1}b$. To avoid direct solvers, we use gradient descent iteratively to find the answer.
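The formulation above can be checked numerically. This is a minimal sketch using NumPy with a small illustrative SPD system (the matrix values are my own, not from the slides): the minimizer of the quadratic objective is exactly the solution of $Ax = b$, where the gradient $Ax - b$ vanishes.

```python
import numpy as np

# A small symmetric positive definite system (illustrative values).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def f(x):
    """Quadratic objective f(x) = 1/2 x^T A x - b^T x."""
    return 0.5 * x @ A @ x - b @ x

# The minimizer of f is the solution of Ax = b.
x_star = np.linalg.solve(A, b)

# The gradient Ax - b vanishes at the minimizer.
grad = A @ x_star - b
print(np.allclose(grad, 0.0))  # True
```

Perturbing `x_star` in any direction increases `f`, consistent with convexity of the objective for SPD $A$.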
Steepest Descent Formula

Given initial $k = 0$, $x_k = x_0$, we descend in one direction per iteration, along the gradient of the objective function.

Derive the residual $r_k = -\nabla f(x_k) = b - Ax_k$.
Set $x_{k+1} = x_k + \alpha_k r_k$, where the step size $\alpha_k$ is derived analytically:
$\alpha_k = \operatorname{argmin}_{s \ge 0} f(x_k + s r_k)$. From $\frac{d}{d\alpha} f(x_k + \alpha r_k) = 0$, we have
$\alpha_k = \frac{r_k^T r_k}{r_k^T A r_k}$.
Therefore, $x_{k+1} = x_k + \frac{r_k^T r_k}{r_k^T A r_k} r_k$.
Repeat the above steps with $k = k+1$ until the norm of $r_k$ is within tolerance.
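The iteration above translates directly into code. This is a minimal sketch (the function name, tolerance, and test matrix are illustrative choices of mine):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    x = x0.astype(float)
    for k in range(max_iter):
        r = b - A @ x                  # residual = negative gradient
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ A @ r)  # exact line-search step size
        x = x + alpha * r
    return x, k

x_sd, iters = steepest_descent(A, b, np.zeros(2))
print(np.allclose(A @ x_sd, b))  # True
```

On badly conditioned systems the iteration count grows sharply, which is the zig-zag behavior the later slides address.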
Steepest Descent Properties

Formula: $x_{k+1} = x_k + \alpha_k r_k = x_k + \frac{r_k^T r_k}{r_k^T A r_k} r_k$.

Objective function decrease: $f(x_k) - f(x_k + \alpha_k r_k) = \frac{(r_k^T r_k)^2}{2\, r_k^T A r_k}$.

Residual: $r_{k+1} = (I - \alpha_k A) r_k = \left(I - \frac{r_k^T r_k}{r_k^T A r_k} A\right) r_k$.
Proof: $r_{k+1} = b - Ax_{k+1} = b - A(x_k + \alpha_k r_k) = r_k - \alpha_k A r_k = (I - \alpha_k A) r_k$.

Property of the next direction: $r_{k+1} \perp r_k$.
Proof: $r_k^T r_{k+1} = r_k^T \left(I - \frac{r_k^T r_k}{r_k^T A r_k} A\right) r_k = r_k^T r_k - \frac{r_k^T r_k}{r_k^T A r_k}\, r_k^T A r_k = 0$.
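The orthogonality of consecutive residuals can be observed directly. A quick numerical check on a small illustrative system (values are mine, not from the slides):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)

# One steepest-descent step with the analytic step size.
r = b - A @ x
alpha = (r @ r) / (r @ A @ r)
x_next = x + alpha * r
r_next = b - A @ x_next

# Consecutive residuals are orthogonal: r_{k+1} ⊥ r_k.
print(abs(r @ r_next) < 1e-12)  # True
```

Since each new residual is perpendicular to the previous one, successive steps alternate between two orthogonal directions on a 2D problem, producing the zig-zag path.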
Steepest Descent: Convergence

We write $x_k = x^* - e_k$, where $x^*$ is the optimal solution and $e_k = x^* - x_k$ is the error that we try to reduce. We decrease the residual so that the error is reduced: since $Ax^* = b$,
$r_k = b - Ax_k = Ax^* - Ax_k = Ae_k$,
so as $r_k \to 0$, $e_k \to 0$.
Gradient Descent: Preconditioning

We want to reduce the residual $r_k = Ae_k$. Let $e_k = \sum_{i=1}^n \xi_i v_i$, where $v_i$, $i = 1, 2, \ldots, n$, are the orthonormal eigenvectors of $A$. Then we have $r_k = Ae_k = \sum_{i=1}^n \lambda_i \xi_i v_i$, where $\lambda_i$ are the eigenvalues of $A$. Thus, the next residual becomes

$r_{k+1} = \left(I - \frac{r_k^T r_k}{r_k^T A r_k} A\right) r_k = \sum_{i=1}^n \lambda_i \xi_i v_i - \frac{\sum_{i=1}^n \lambda_i^2 \xi_i^2}{\sum_{i=1}^n \lambda_i^3 \xi_i^2} \sum_{i=1}^n \lambda_i^2 \xi_i v_i.$

Suppose that all eigenvalues are equal, i.e. $\lambda_i = \lambda$ for all $i$. We have

$r_{k+1} = \lambda \sum_{i=1}^n \xi_i v_i - \frac{\lambda^2 \sum_{i=1}^n \xi_i^2}{\lambda^3 \sum_{i=1}^n \xi_i^2}\, \lambda^2 \sum_{i=1}^n \xi_i v_i = 0.$

Therefore, the convergence accelerates if we can precondition matrix $A$ so that its eigenvalues cluster.
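The equal-eigenvalue case can be demonstrated in a few lines. A sketch with $A = \lambda I$ (my own toy example), where one steepest-descent step lands exactly on the solution:

```python
import numpy as np

# If A = λI (all eigenvalues equal), one steepest-descent step is exact.
lam = 3.0
A = lam * np.eye(4)
b = np.array([1.0, -2.0, 0.5, 4.0])

x = np.zeros(4)
r = b - A @ x
alpha = (r @ r) / (r @ A @ r)  # equals 1/λ here
x = x + alpha * r

print(np.allclose(A @ x, b))  # True: converged in a single step
```

With spread-out eigenvalues the same loop can take hundreds of iterations, which motivates transforming $A$ so that its spectrum clusters.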
Gradient Descent: Preconditioning

$\nabla f(x) = Ax - b = 0 \;\Rightarrow\; Ax = b$.

Preconditioning transforms $Ax = b$ into another system with properties more favorable for iterative solution. With a preconditioner $M$, we solve $M^{-1}Ax = M^{-1}b$ (e.g. incomplete LU factorization, or scaling).
Conjugate Gradient: Descent with Multiple Vectors

For conjugate gradient, we consider multiple vectors $V_k = [v_0, v_1, \ldots, v_k]$ in stage $k$. Let $x_{k+1} = x_k + V_k y$, where $y = [y_0, y_1, \ldots, y_k]^T$ is a vector of parameters; that is, $V_k y = \sum_{i=0}^k y_i v_i$.

To minimize $f(x_{k+1})$, the solution is $y = (V_k^T A V_k)^{-1} V_k^T r_k$. Therefore,
$x_{k+1} = x_k + V_k y = x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k.$

Proof: To minimize $f(x_{k+1})$, we want $\nabla_y f(x_{k+1}) = 0$. We have
$\nabla_y f(x_{k+1}) = \nabla_y \left\{ \frac{1}{2}(x_k + V_k y)^T A (x_k + V_k y) - b^T (x_k + V_k y) \right\} = V_k^T A V_k\, y + V_k^T A x_k - V_k^T b = V_k^T A V_k\, y - V_k^T r_k = 0,$
so $y = (V_k^T A V_k)^{-1} V_k^T r_k$.
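A numerical sketch of one multi-direction step, using a small illustrative system and an arbitrary pair of search directions (both of my choosing): the optimal $y$ makes the new residual orthogonal to every column of $V$.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x0 = np.zeros(3)
r0 = b - A @ x0

# Two search directions collected as columns of V (arbitrary choice).
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Optimal coefficients y = (V^T A V)^{-1} V^T r.
y = np.linalg.solve(V.T @ A @ V, V.T @ r0)
x1 = x0 + V @ y
r1 = b - A @ x1

# The new residual is orthogonal to every direction in V (Property A).
print(np.allclose(V.T @ r1, 0.0))  # True
```

The orthogonality holds regardless of which directions were chosen, matching the slide's remark that the proof does not depend on $V_k$.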
Conjugate Gradient: Multiple Vector Optimization

For the descent on multiple directions, we have the following properties.

Function: Since $y = (V_k^T A V_k)^{-1} V_k^T r_k$, we have
$f(x_{k+1}) = f(x_k) + \frac{1}{2} y^T V_k^T A V_k\, y + y^T V_k^T (A x_k - b) = f(x_k) - \frac{1}{2} r_k^T V_k (V_k^T A V_k)^{-1} V_k^T r_k.$

Residual:
$r_{k+1} = b - Ax_{k+1} = b - A\left(x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k\right) = \left(I - A V_k (V_k^T A V_k)^{-1} V_k^T\right) r_k.$

Property A: $r_{k+1} \perp V_k$. The proof is independent of the choice of $V_k$.
Proof: $V_k^T r_{k+1} = V_k^T \left(I - A V_k (V_k^T A V_k)^{-1} V_k^T\right) r_k = (V_k^T - V_k^T) r_k = 0.$
Global Procedure in Matrix Form $V_k$

Through the iterations, we grow the matrix $V_k = [v_0, v_1, \ldots, v_k]$ into $V_{k+1}$ by appending a new vector $v_{k+1}$ as the last column for iteration $k+1$.

Initialize $k = 0$, $v_0 = r_0 = b - Ax_0$.
Repeat:
  Update $x_{k+1} = x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k$ and $r_{k+1} = b - Ax_{k+1}$.
  Exit if the norm of $r_{k+1}$ is below the tolerance.
  Derive $v_{k+1}$ as a function of $r_{k+1}$ and $V_k$ (to be described in the CG formula).
  Construct $V_{k+1}$ by appending $v_{k+1}$ as the last column of $V_k$.
  $k = k+1$.

Property B (independent of the choice of $v_k$): According to the procedure, we have $V_k^T r_k = [0, \ldots, 0, v_k^T r_k]^T$.
Proof: From Property A, we have $V_{k-1}^T r_k = 0$; thus $V_k^T r_k = [0, \ldots, 0, v_k^T r_k]^T$.
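The global procedure can be sketched directly, before any conjugacy is imposed. Here the new column is simply the current residual (an illustrative choice on a toy system of mine; the CG formula on a later slide gives a cheaper, conjugate choice of $v_{k+1}$):

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Grow V by one column per iteration; re-optimize over all columns each time.
x = np.zeros(3)
r = b - A @ x
V = r.reshape(-1, 1)

for k in range(3):
    y = np.linalg.solve(V.T @ A @ V, V.T @ r)
    x = x + V @ y
    r = b - A @ x
    if np.linalg.norm(r) < 1e-12:
        break
    V = np.hstack([V, r.reshape(-1, 1)])

print(np.allclose(A @ x, b))  # True: exact after at most n = 3 iterations
```

The cost of this naive version grows with $k$, since each step solves a dense $(k{+}1)\times(k{+}1)$ system; the wish list on the next slide removes exactly that cost.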
Conjugate Gradient: Wish List

We hope that $V^T A V = D = \operatorname{diag}(d_i)$ is a diagonal matrix. In this case, we say that the vectors $v_i$ in $V$ are mutually conjugate with respect to matrix $A$. If $V^T A V = D = \operatorname{diag}(d_i)$, we have $d_i = v_i^T A v_i$. Therefore, by Property B,
$x_{k+1} = x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k = x_k + V_k D^{-1} [0, \ldots, 0, v_k^T r_k]^T = x_k + \alpha_k v_k,$
where $\alpha_k = \frac{v_k^T r_k}{v_k^T A v_k}$.

Hopefully, for the new matrix $V_{k+1}$, the conjugate property remains true. Then we can repeat the steps with $k = k+1$. When $k = n-1$, we have $r_n^T V_{n-1} = 0$ (Property A). The last residual $r_n = 0$, since matrix $V_{n-1}$ has full rank. Thus, we have the solution $x_n = x^*$.
Conjugate Gradient Descent Formula

Given $x_0$, initialize $k = 0$, $v_k = r_k = b - Ax_0$.

$x_{k+1} = x_k + \alpha_k v_k$, where $\alpha_k = \frac{v_k^T r_k}{v_k^T A v_k} \left( = \frac{r_k^T r_k}{v_k^T A v_k} \right)$.
$r_{k+1} = b - Ax_{k+1} = b - Ax_k - \alpha_k A v_k = r_k - \alpha_k A v_k$.
$v_{k+1} = r_{k+1} + \beta_{k+1} v_k$, where $\beta_{k+1} = \frac{1}{\alpha_k} \frac{r_{k+1}^T r_{k+1}}{v_k^T A v_k} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$.
Repeat the iteration with $k = k+1$ until the residual is smaller than the tolerance.

Lemma: $v_k^T r_k = r_k^T r_k$.
Proof: From Property A, $v_{k-1}^T r_k = 0$, so $v_k^T r_k = (r_k + \beta_k v_{k-1})^T r_k = r_k^T r_k$.
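The formula above fits in a short routine. A sketch assuming NumPy and an SPD input (the function name, defaults, and test system are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """CG iteration per the slide's formula (assumes A is SPD)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x
    v = r.copy()
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Av = A @ v
        alpha = (r @ r) / (v @ Av)        # step size α_k
        x = x + alpha * v
        r_new = r - alpha * Av            # residual update, no extra A product
        beta = (r_new @ r_new) / (r @ r)  # β_{k+1}
        v = r_new + beta * v              # next conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # True
```

Each iteration needs only one matrix-vector product and a few inner products, versus the growing dense solve of the global procedure.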
Validation of the Properties

Theorem: The solution $x_{k+1}$ of the conjugate gradient formula is consistent with the global procedure, i.e. the vectors $v_i$ produced by the formula are mutually conjugate. The consistency is based on the following three equalities.

Property A: $r_i^T v_j = 0$, $i > j$.
Residuals: $r_i^T r_j = 0$, $i > j$.
Conjugates: $v_i^T A v_j = 0$, $i > j$.

Proof: We prove the three equalities by induction. For the base case $i = 1$, we have
Property A: $r_1^T v_0 = 0$.
Residuals: $r_1^T r_0 = 0$ (since $r_0 = v_0$).
Conjugates:
$v_1^T A v_0 = (r_1 + \beta_1 v_0)^T A v_0 = r_1^T A v_0 + \beta_1 v_0^T A v_0 = r_1^T \frac{r_0 - r_1}{\alpha_0} + \frac{1}{\alpha_0} \frac{r_1^T r_1}{v_0^T A v_0}\, v_0^T A v_0 = 0$
(using $r_1^T v_0 = 0$ and $r_0 = v_0$).
Validation of the Wish List

Proof by induction (continued): Suppose that the statement is true up to index $i = k$. By the assumption of the three equalities, the conjugate gradient formula is consistent with the global procedure up to $x_{k+1} = x_k + \alpha_k v_k$. For index $i = k+1$, we have

Property A: $r_{k+1}^T V_k = 0$.
Residuals: $r_{k+1}^T r_j = r_{k+1}^T (v_j - \beta_j v_{j-1}) = 0$, $j \le k$.
Conjugates:
Case $j = k$: $v_{k+1}^T A v_k = (r_{k+1} + \beta_{k+1} v_k)^T A v_k = r_{k+1}^T A v_k + \beta_{k+1} v_k^T A v_k = r_{k+1}^T \frac{r_k - r_{k+1}}{\alpha_k} + \frac{1}{\alpha_k} \frac{r_{k+1}^T r_{k+1}}{v_k^T A v_k}\, v_k^T A v_k = 0$ (using $r_{k+1}^T r_k = 0$).
Case $j < k$: $v_{k+1}^T A v_j = (r_{k+1} + \beta_{k+1} v_k)^T A v_j = r_{k+1}^T A v_j = r_{k+1}^T \frac{r_j - r_{j+1}}{\alpha_j} = 0$.
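The mutual conjugacy just proved can be verified numerically: run the CG recurrence while recording each direction, then check that $V^T A V$ is diagonal. A sketch on the same style of toy SPD system (values are mine):

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Run the CG recurrence, recording every search direction v_k.
x = np.zeros(3)
r = b - A @ x
v = r.copy()
directions = []
for _ in range(3):
    directions.append(v.copy())
    Av = A @ v
    alpha = (r @ r) / (v @ Av)
    x = x + alpha * v
    r_new = r - alpha * Av
    beta = (r_new @ r_new) / (r @ r)
    v = r_new + beta * v
    r = r_new

# v_i^T A v_j = 0 for i != j: V^T A V is (numerically) diagonal.
V = np.column_stack(directions)
G = V.T @ A @ V
off_diag = G - np.diag(np.diag(G))
print(np.allclose(off_diag, 0.0))  # True
```

This is exactly the wish list granted: with a diagonal $V^T A V$, each step of the global procedure collapses to the scalar update $x_{k+1} = x_k + \alpha_k v_k$.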
Summary

We view the conjugate gradient method as an extension from the one-direction descent of the steepest gradient method to a multiple-direction descent. From the global procedure of the multiple-vector search, we derive the basic properties of the optimization. The result shows that optimizing over all previous directions at once removes the zig-zag winding of the steepest descent approach, at the cost of inverting $V^T A V$. The conjugate gradient formula makes the product $V^T A V$ a diagonal matrix and thus simplifies the optimization procedure. Consequently, we achieve the desired properties and the convergence of the solution.

Acknowledgement: This note was scribed by YT Jerry Peng for class CSE291, Fall 2015.
References

J.R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, CMU Technical Report, 1994.
S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 2013.
W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 2007.