Conjugate Gradient Tutorial

Conjugate Gradient Tutorial
Prof. Chung-Kuan Cheng
Computer Science and Engineering Department, University of California, San Diego
ckcheng@ucsd.edu
CSE291: Topics on Scientific Computation, December 1, 2015

Overview

1 Introduction
  - Overview
  - Formulation
2 Steepest Descent: Descent in One Vector Direction
  - Steepest Descent Formula
  - Steepest Descent Properties
  - Steepest Descent Convergence
  - Preconditioning
3 Conjugate Gradient: Descent with Multiple Vectors
  - Multiple Vector Optimization
  - Global Procedure in Matrix Form V_k
  - Conjugate Gradient: Wish List
  - Conjugate Gradient Descent: Formula
  - Validation of the Properties
4 Summary
5 References

Introduction: Overview

The conjugate gradient method is an extension of steepest gradient descent. In steepest descent we step along one direction per iteration. Over the iterations, the new directions can contain components of the old directions, so the process walks in a zig-zag pattern. In the conjugate gradient method we consider multiple directions simultaneously, and thus avoid repeating the old directions. In 1952, Hestenes and Stiefel, who had arrived at the method independently, introduced the conjugate gradient formula to simplify the multiple-direction search.

Introduction: Overview

- Steepest Gradient Descent: We derive the method and its properties. We view steepest descent as a one-direction-per-iteration approach. The method suffers from slow zig-zag winding in a narrow valley of equal-potential terrain.
- Preconditioning: From the properties of steepest descent, we find that preconditioning improves the convergence rate.
- Conjugate Gradient in a Global View: We view the conjugate gradient method from the perspective of gradient descent, but the descent now considers multiple directions simultaneously.
- Conjugate Gradient Formula: We state the conjugate gradient formula.
- Conjugate Gradient Method Properties: We show that the global view of the conjugate gradient method allows each step to be optimized independently of the other steps. Therefore, the process can be repeated recursively and converges after n iterations, where n is the number of variables. Finally, we state and prove the property that validates the formula.

Introduction: Formulation

The original problem is to solve the system of linear equations $Ax = b$, where the matrix $A$ is symmetric and positive definite. Computing the inverse, $x^* = A^{-1} b$, can be expensive, e.g. when $n$ is huge. To avoid a direct solver, we formulate the problem as a convex quadratic minimization:

  minimize $\; f(x) = \frac{1}{2} x^T A x - b^T x, \quad A \in S^n_{++}$

The minimizer is $x^* = A^{-1} b$. To avoid direct solvers, we apply gradient descent iteratively to find the answer.
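
The following minimal sketch (not from the slides; the test matrix and names are made up for illustration) builds a small SPD system and checks that the minimizer of the quadratic objective indeed solves $Ax = b$:

```python
import numpy as np

# Minimal sketch: build a small SPD system and check that the minimizer of
# f(x) = 0.5 x^T A x - b^T x solves A x = b.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - b @ x    # the quadratic objective
grad = lambda x: A @ x - b               # its gradient, A x - b

x_star = np.linalg.solve(A, b)           # direct solution, for reference only
print(np.linalg.norm(grad(x_star)))      # ~0: gradient vanishes at x* = A^{-1} b
print(f(x_star) <= f(rng.standard_normal(n)))  # True: x* minimizes f
```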

Steepest Descent Formula

Given an initial guess, set $k = 0$, $x_k = x_0$. We descend along one direction per iteration, namely the negative gradient of the objective function.
- Residual: $r_k = -\nabla f(x_k) = b - A x_k$.
- Update: $x_{k+1} = x_k + \alpha_k r_k$, where the step size $\alpha_k$ is derived analytically.
- Step size: $\alpha_k = \arg\min_{s \ge 0} f(x_k + s r_k)$. From $\left.\frac{\partial f(x_k + \alpha r_k)}{\partial \alpha}\right|_{\alpha_k} = 0$, we obtain $\alpha_k = \frac{r_k^T r_k}{r_k^T A r_k}$.
- Therefore $x_{k+1} = x_k + \frac{r_k^T r_k}{r_k^T A r_k}\, r_k$.
Repeat the above steps with $k = k + 1$ until the norm of $r_k$ is within tolerance.
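
As a concrete companion to the iteration above, here is a minimal Python sketch; the function name `steepest_descent` and the keyword arguments are illustrative, not from the slides:

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Steepest descent on f(x) = 0.5 x^T A x - b^T x for SPD A (sketch)."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = b - A @ x                      # residual r_k = -grad f(x_k)
        if np.linalg.norm(r) < tol:        # stop when the residual is small
            break
        alpha = (r @ r) / (r @ (A @ r))    # exact line search along r_k
        x = x + alpha * r                  # x_{k+1} = x_k + alpha_k r_k
    return x
```

With the A and b from the earlier snippet, `steepest_descent(A, b, np.zeros(len(b)))` agrees with `np.linalg.solve(A, b)`; convergence slows markedly as A becomes ill-conditioned, which motivates the following slides.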

Steepest Descent Properties

- Formula: $x_{k+1} = x_k + \alpha_k r_k = x_k + \frac{r_k^T r_k}{r_k^T A r_k}\, r_k$
- Objective function: $f(x_k) - f(x_k + \alpha_k r_k) = \frac{(r_k^T r_k)^2}{2\, r_k^T A r_k}$
- Residual: $r_{k+1} = (I - \alpha_k A) r_k = \left(I - \frac{r_k^T r_k}{r_k^T A r_k} A\right) r_k$
  Proof: $r_{k+1} = b - A x_{k+1} = b - A(x_k + \alpha_k r_k) = r_k - \alpha_k A r_k = (I - \alpha_k A) r_k$.
- Property of the next direction: $r_{k+1} \perp r_k$
  Proof: $r_k^T r_{k+1} = r_k^T \left(I - \frac{r_k^T r_k}{r_k^T A r_k} A\right) r_k = 0$.
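
A quick numerical check of the two properties above (the setup is illustrative, not from the slides): after one steepest-descent step the new residual is orthogonal to the old one, and the objective drops by exactly $(r_k^T r_k)^2 / (2 r_k^T A r_k)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda z: 0.5 * z @ A @ z - b @ z
r = b - A @ x
alpha = (r @ r) / (r @ (A @ r))
x_next = x + alpha * r
r_next = b - A @ x_next

print(r_next @ r)                                             # ~0 (orthogonality)
print(f(x) - f(x_next) - (r @ r) ** 2 / (2 * (r @ (A @ r))))  # ~0 (predicted decrease)
```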

Steepest Descent: Convergence

We write $x^* = x_k + e_k$, where $x^*$ is the optimal solution and $e_k$ is the error that we try to reduce. We drive the residual to zero so that the error is reduced: as $r_k \to 0$, $e_k \to 0$, since

  $r_k = b - A x_k = b - A(x^* - e_k) = A e_k.$

Gradient Descent: Preconditioning

We want to reduce the residual $r_k = A e_k$. Let $e_k = \sum_{i=1}^n \xi_i v_i$, where the $v_i$, $i = 1, 2, \ldots, n$, are the eigenvectors of $A$. Then $r_k = A e_k = \sum_{i=1}^n \lambda_i \xi_i v_i$, where the $\lambda_i$ are the eigenvalues of $A$. Thus, the next residual becomes

  $r_{k+1} = \left(I - \frac{r_k^T r_k}{r_k^T A r_k} A\right) r_k = \sum_{i=1}^n \lambda_i \xi_i v_i - \frac{\sum_{i=1}^n \lambda_i^2 \xi_i^2}{\sum_{i=1}^n \lambda_i^3 \xi_i^2} \sum_{i=1}^n \lambda_i^2 \xi_i v_i.$

Suppose that all eigenvalues are equal, i.e. $\lambda_i = \lambda$ for all $i$. We have

  $r_{k+1} = \lambda \sum_{i=1}^n \xi_i v_i - \frac{\lambda^2 \sum_{i=1}^n \xi_i^2}{\lambda^3 \sum_{i=1}^n \xi_i^2} \sum_{i=1}^n \lambda^2 \xi_i v_i = 0.$

Therefore, the convergence accelerates if we can precondition matrix $A$.

Gradient Descent: Preconditioning

$\nabla f(x) = Ax - b = 0 \;\Leftrightarrow\; Ax = b$

Preconditioning: transform $Ax = b$ into another system with more favorable properties for iterative solution. With a preconditioner $M$, solve $M^{-1} A x = M^{-1} b$ (e.g. incomplete LU, scaling).
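
The slide does not fix a particular preconditioner; as an illustration only, the sketch below uses a simple Jacobi (diagonal) preconditioner $M = \mathrm{diag}(A)$, applied symmetrically so the preconditioned matrix stays symmetric. The badly scaled test matrix and all names are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
B = rng.standard_normal((n, n))
S = B @ B.T + n * np.eye(n)                 # well-conditioned SPD core
D = np.diag(np.logspace(0, 3, n))           # severe row/column scaling
A = D @ S @ D                               # SPD but very ill-conditioned

M_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(A)))   # Jacobi preconditioner M = diag(A)
A_prec = M_inv_sqrt @ A @ M_inv_sqrt              # preconditioned system matrix

print("cond(A)               :", np.linalg.cond(A))
print("cond(M^-1/2 A M^-1/2) :", np.linalg.cond(A_prec))
```

The same idea underlies preconditioned gradient and CG iterations: the method is run on the better-conditioned system, whose spectrum is better behaved.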

Conjugate Gradient: Descent with Multiple Vectors

For conjugate gradient, we consider multiple vectors $V_k = [v_0, v_1, \ldots, v_k]$ at stage $k$. Let $x_{k+1} = x_k + V_k y$, where $y = [y_0, y_1, \ldots, y_k]^T$ is a vector of parameters, so that $V_k y = \sum_{i=0}^k y_i v_i$. To minimize $f(x_{k+1})$, the solution is $y = (V_k^T A V_k)^{-1} V_k^T r_k$. Therefore,

  $x_{k+1} = x_k + V_k y = x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k.$

Proof: To minimize $f(x_{k+1})$, we require $\nabla_y f(x_{k+1}) = 0$. We have

  $\nabla_y f(x_{k+1}) = \nabla_y \left\{ \frac{1}{2} (x_k + V_k y)^T A (x_k + V_k y) - b^T (x_k + V_k y) \right\} = V_k^T A V_k y + V_k^T A x_k - V_k^T b = V_k^T A V_k y - V_k^T r_k = 0,$

hence $y = (V_k^T A V_k)^{-1} V_k^T r_k$.
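
A small sketch of the multi-direction update above (the function and variable names are illustrative): given any basis $V$ of search directions, the optimal combination is $y = (V^T A V)^{-1} V^T r$, and the new residual turns out to be orthogonal to $V$ (Property A on the next slide).

```python
import numpy as np

def multi_direction_step(A, b, x, V):
    """One multi-direction update x + V (V^T A V)^{-1} V^T r, r = b - A x (sketch)."""
    r = b - A @ x
    y = np.linalg.solve(V.T @ A @ V, V.T @ r)   # y = (V^T A V)^{-1} V^T r
    return x + V @ y

rng = np.random.default_rng(2)
n, k = 6, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
V = rng.standard_normal((n, k))            # any k linearly independent directions

x1 = multi_direction_step(A, b, np.zeros(n), V)
print(np.linalg.norm(V.T @ (b - A @ x1)))  # ~0: the new residual is orthogonal to V
```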

Conjugate Gradient: Multiple Vector Optimization

For the descent on multiple directions, we have the following properties.
- Function: Since $y = (V_k^T A V_k)^{-1} V_k^T r_k$, we have
  $f(x_{k+1}) = f(x_k) + \frac{1}{2} y^T V_k^T A V_k y + y^T V_k^T (A x_k - b) = f(x_k) - \frac{1}{2} r_k^T V_k (V_k^T A V_k)^{-1} V_k^T r_k.$
- Residual:
  $r_{k+1} = b - A x_{k+1} = b - A\left(x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k\right) = \left(I - A V_k (V_k^T A V_k)^{-1} V_k^T\right) r_k.$
- Property A: $r_{k+1} \perp V_k$. The proof is independent of the choice of $V_k$.
  Proof: $V_k^T r_{k+1} = V_k^T \left(I - A V_k (V_k^T A V_k)^{-1} V_k^T\right) r_k = (V_k^T - V_k^T) r_k = 0.$

Global Procedure in Matrix Form V_k

Through the iterations, we grow the matrix $V_k = [v_0, v_1, \ldots, v_k]$ into $V_{k+1}$ by appending a new vector $v_{k+1}$ as the last column for iteration $k + 1$.

Initialize $k = 0$, $v_0 = r_0 = b - A x_0$. Repeat:
- Update $x_{k+1} = x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k$ and $r_{k+1} = b - A x_{k+1}$.
- Exit if the norm of $r_{k+1}$ is below the tolerance.
- Derive $v_{k+1}$ as a function of $r_{k+1}$ and $V_k$ (to be described in the CG formula).
- Construct $V_{k+1}$ by appending $v_{k+1}$ as the last column of $V_k$; set $k = k + 1$.

Property B (independent of the choice of $v_k$): By the procedure, $V_k^T r_k = [0, \ldots, 0, v_k^T r_k]^T$.
Proof: From Property A, $V_{k-1}^T r_k = 0$; thus $V_k^T r_k = [0, \ldots, 0, v_k^T r_k]^T$.
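
The sketch below transcribes this global procedure; since the slide defers the choice of $v_{k+1}$ to the CG formula, the code deliberately uses the naive choice $v_{k+1} = r_{k+1}$ as a placeholder (names are illustrative).

```python
import numpy as np

def global_multi_direction(A, b, x0, tol=1e-10):
    """Sketch of the global procedure; v_{k+1} is simply taken to be r_{k+1} here."""
    x = x0.astype(float).copy()
    r = b - A @ x
    if np.linalg.norm(r) < tol:
        return x
    V = r.reshape(-1, 1)                        # V_0 = [v_0] with v_0 = r_0
    for _ in range(len(b)):
        y = np.linalg.solve(V.T @ A @ V, V.T @ r)
        x = x + V @ y                           # x_{k+1} = x_k + V_k y
        r = b - A @ x                           # r_{k+1} = b - A x_{k+1}
        if np.linalg.norm(r) < tol:
            break
        V = np.hstack([V, r.reshape(-1, 1)])    # append v_{k+1} as the last column
    return x
```

Note that this version re-solves a growing $(k+1) \times (k+1)$ system at every step; removing that cost is exactly what the conjugacy wish list on the next slide achieves.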

Conjugate Gradient: Wish List

We hope that $V^T A V = D = \mathrm{diag}(d_i)$ is a diagonal matrix. In this case we say that the vectors $v_i$ in $V$ are mutually conjugate with respect to matrix $A$. If $V^T A V = D = \mathrm{diag}(d_i)$, then $d_i = v_i^T A v_i$. Therefore, using Property B,

  $x_{k+1} = x_k + V_k (V_k^T A V_k)^{-1} V_k^T r_k = x_k + V_k D^{-1} [0, \ldots, 0, v_k^T r_k]^T = x_k + \alpha_k v_k, \quad \text{where } \alpha_k = \frac{v_k^T r_k}{v_k^T A v_k}.$

Hopefully, for the new matrix $V_{k+1}$, the conjugacy property remains true; then we can repeat the steps with $k = k + 1$. When $k = n - 1$, we have $r_n^T V_{n-1} = 0$ (Property A). The last residual is $r_n = 0$, since the matrix $V_{n-1}$ has full rank. Thus we obtain the solution $x_n = x^*$.

Conjugate Gradient Descent Formula

Given $x_0$, initialize $k = 0$, $v_0 = r_0 = b - A x_0$.
- $x_{k+1} = x_k + \alpha_k v_k$, where $\alpha_k = \frac{v_k^T r_k}{v_k^T A v_k} \left(= \frac{r_k^T r_k}{v_k^T A v_k}\right)$.
- $r_{k+1} = b - A x_{k+1} = b - A x_k - \alpha_k A v_k = r_k - \alpha_k A v_k$.
- $v_{k+1} = r_{k+1} + \beta_{k+1} v_k$, where $\beta_{k+1} = \frac{1}{\alpha_k} \cdot \frac{r_{k+1}^T r_{k+1}}{v_k^T A v_k} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$.
Repeat the iteration with $k = k + 1$ until the residual is smaller than the tolerance.

Lemma: $v_k^T r_k = r_k^T r_k$.
Proof: From Property A, $v_k^T r_k = (r_k + \beta_k v_{k-1})^T r_k = r_k^T r_k$.
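
A direct, minimal transcription of the recurrence above into Python (unpreconditioned; the function name and default arguments are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Unpreconditioned CG, transcribed from the recurrence above (sketch)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x                     # r_0 = b - A x_0
    v = r.copy()                      # v_0 = r_0
    rs_old = r @ r
    for _ in range(max_iter or n):
        Av = A @ v
        alpha = rs_old / (v @ Av)     # alpha_k = r_k^T r_k / v_k^T A v_k
        x = x + alpha * v             # x_{k+1} = x_k + alpha_k v_k
        r = r - alpha * Av            # r_{k+1} = r_k - alpha_k A v_k
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        beta = rs_new / rs_old        # beta_{k+1} = r_{k+1}^T r_{k+1} / r_k^T r_k
        v = r + beta * v              # v_{k+1} = r_{k+1} + beta_{k+1} v_k
        rs_old = rs_new
    return x
```

In exact arithmetic this converges in at most n iterations for an n x n SPD matrix, as the full-rank argument on the previous slide shows; in floating point it is simply run until the residual tolerance is met.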

Validation of the Properties

Theorem: The solution $x_{k+1}$ of the conjugate gradient formula is consistent with the global procedure, i.e. the vectors $v_i$ produced by the formula are mutually conjugate. The consistency rests on the following three equalities.
- Property A: $r_i^T v_j = 0$, $i > j$.
- Residuals: $r_i^T r_j = 0$, $i > j$.
- Conjugates: $v_i^T A v_j = 0$, $i > j$.

Proof: We prove the three equalities by induction. For the base case $i = 1$, we have
- Property A: $r_1^T v_0 = 0$.
- Residuals: $r_1^T r_0 = 0$ (since $r_0 = v_0$).
- Conjugates:
  $v_1^T A v_0 = (r_1 + \beta_1 v_0)^T A v_0 = r_1^T A v_0 + \beta_1 v_0^T A v_0 = r_1^T \frac{r_0 - r_1}{\alpha_0} + \frac{1}{\alpha_0} \frac{r_1^T r_1}{v_0^T A v_0}\, v_0^T A v_0 = 0$  (using $r_1^T v_0 = 0$, $r_0 = v_0$).

Validation of the Wish List

Proof by induction (continued): Suppose the statement holds up to index $i = k$. Under the three equalities, the conjugate gradient formula is consistent with the global procedure up to $x_{k+1} = x_k + \alpha_k v_k$. For index $i = k + 1$, we have
- Property A: $r_{k+1}^T V_k = 0$.
- Residuals: $r_{k+1}^T r_j = r_{k+1}^T (v_j - \beta_j v_{j-1}) = 0$, $j \le k$.
- Conjugates:
  Case $j = k$: $v_{k+1}^T A v_k = (r_{k+1} + \beta_{k+1} v_k)^T A v_k = r_{k+1}^T A v_k + \beta_{k+1} v_k^T A v_k = r_{k+1}^T \frac{r_k - r_{k+1}}{\alpha_k} + \frac{1}{\alpha_k} \frac{r_{k+1}^T r_{k+1}}{v_k^T A v_k}\, v_k^T A v_k = 0$  (using $r_{k+1}^T r_k = 0$).
  Case $j < k$: $v_{k+1}^T A v_j = (r_{k+1} + \beta_{k+1} v_k)^T A v_j = r_{k+1}^T A v_j = r_{k+1}^T \frac{r_j - r_{j+1}}{\alpha_j} = 0$, $j < k$.
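
As an empirical confirmation of these equalities (not a proof), the following sketch records the directions $v_k$ and residuals $r_k$ of a short CG run on a small made-up SPD system and checks that $V^T A V$ and $R^T R$ are numerically diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

x = np.zeros(n)
r = b - A @ x
v = r.copy()
Vs, Rs = [v.copy()], [r.copy()]
for _ in range(n - 1):
    Av = A @ v
    alpha = (r @ r) / (v @ Av)
    x = x + alpha * v
    r_new = r - alpha * Av
    if np.linalg.norm(r_new) < 1e-12:
        break
    beta = (r_new @ r_new) / (r @ r)
    v = r_new + beta * v
    r = r_new
    Vs.append(v.copy())
    Rs.append(r.copy())

V, R = np.column_stack(Vs), np.column_stack(Rs)
off = lambda B: B - np.diag(np.diag(B))
print(np.max(np.abs(off(V.T @ A @ V))))  # ~0: directions are mutually A-conjugate
print(np.max(np.abs(off(R.T @ R))))      # ~0: residuals are mutually orthogonal
```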

Summary

We view the conjugate gradient method as an extension from the one-direction descent of the steepest gradient method to a multiple-direction descent. From the global procedure of the multiple-vector search, we can derive the basic properties of the optimization. The optimization result shows that the inversion of $V^T A V$ is one main cause of the zig-zag winding of the steepest descent approach. The conjugate gradient formula transforms the product $V^T A V$ into a diagonal matrix and thus simplifies the optimization procedure. Consequently, we achieve the desired properties and the convergence of the solution.

Acknowledgement: These notes were scribed by YT Jerry Peng for class CSE291, Fall 2015.

References

- J.R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, CMU Technical Report, 1994.
- S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
- G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 2013.
- W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 2007.