The Conjugate Gradient Method


The Conjugate Gradient Method Lecture 5, Continuous Optimisation Oxford University Computing Laboratory, HT 2006 Notes by Dr Raphael Hauser (hauser@comlab.ox.ac.uk)

The notion of per-iteration complexity of an algorithm we have used so far is simplistic: we counted the number of basic computer operations without taking into account that some operations are more costly than others. Nor did we take into account the memory requirements of an algorithm and the time a computer spends shifting data between different levels of the memory hierarchy.

Memory requirements: Quasi-Newton methods need to keep an n × n matrix H_k (the inverse of the approximate Hessian B_k) or L_k (the Cholesky factor of B_k) in computer memory, i.e., O(n^2) data units. The steepest descent method only occupies O(n) memory at any given time, by storing x_k and ∇f(x_k) and overwriting registers with new data. It can therefore cope with much larger n than quasi-Newton methods.

The conjugate gradient method has O(n) memory requirement and O(n) complexity per iteration, but converges much faster than steepest descent. This method can be used when the memory requirement of quasi-Newton methods exceeds the active memory of the CPU, or alternatively, to solve positive definite systems of linear equations.

Let B ∈ R^{n×n} be symmetric positive definite (B ≻ 0) and consider

(P)  min_{x ∈ R^n} f(x) = x^T B x + b^T x + a.

Since f is convex, ∇f(x) = 0 is a sufficient optimality condition, i.e., (P) is equivalent to solving the positive definite linear system 2Bx = −b with solution x* = −(1/2) B^{-1} b.
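A quick numerical check of this equivalence (illustration only; the concrete B, b, a below are randomly generated, not from the notes): the solution of 2Bx = −b makes the gradient 2Bx + b vanish and minimises f.

```python
# Check: the minimiser of f(x) = x^T B x + b^T x + a solves 2 B x = -b.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4 * np.eye(4)          # symmetric positive definite
b = rng.standard_normal(4)
a = 1.0

def f(x):
    return x @ B @ x + b @ x + a

x_star = np.linalg.solve(2 * B, -b)  # x* = -(1/2) B^{-1} b

assert np.allclose(2 * B @ x_star + b, 0)   # gradient vanishes at x*
assert all(f(x_star + 0.1 * rng.standard_normal(4)) > f(x_star)
           for _ in range(5))               # perturbations increase f
```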

Let A ∈ R^{n×n} be real symmetric and recall: A has real eigenvalues λ_1 ≤ … ≤ λ_n and there exists an orthogonal Q such that A = Q Diag(λ) Q^T. Then A^{-1} = Q Diag(λ)^{-1} Q^T, i.e., A is nonsingular iff λ_i ≠ 0 for all i; A is positive definite iff λ_i > 0 for all i, and then A^{1/2} := Q Diag(λ^{1/2}) Q^T is the unique symmetric positive definite matrix such that A^{1/2} A^{1/2} = A.
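The square root construction above can be sketched directly with an eigendecomposition (a minimal illustration; the random test matrix is ours):

```python
# The symmetric positive definite square root A^{1/2} = Q Diag(lambda^{1/2}) Q^T.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)              # symmetric positive definite
lam, Q = np.linalg.eigh(A)           # real eigenvalues, orthogonal Q
A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

assert np.allclose(A_half, A_half.T)            # symmetric
assert np.allclose(A_half @ A_half, A)          # A^{1/2} A^{1/2} = A
assert np.all(np.linalg.eigvalsh(A_half) > 0)   # positive definite
```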

Geometric motivation of CG: Adding a constant to the objective function of (P) min_{x ∈ R^n} f(x) = x^T B x + b^T x + a does not change the global minimiser x* = −(1/2) B^{-1} b. It is therefore equivalent to solve

(P')  min (x − x*)^T B (x − x*) = y^T y = g(y),   where y = B^{1/2}(x − x*).

Thus, the objective function of our minimisation problem looks particularly simple in the transformed variables y. We use these to understand the geometry of the method.

We aim to construct an iterative sequence (x_k)_{k ∈ N} such that the corresponding sequence of points y_k = B^{1/2}(x_k − x*) behaves sensibly. Let the current iterate be x_k and apply an exact line search α_k = arg min_α f(x_k + α d_k) to x_k in the search direction d_k. Translated into y-coordinates, with p_k = B^{1/2} d_k,

α_k = arg min_α g(y_k + α p_k) = arg min_α ||y_k||^2 + 2α p_k^T y_k + α^2 ||p_k||^2,

which gives

α_k = − p_k^T y_k / ||p_k||^2.

If we set y_{k+1} = y_k + α_k p_k, then we find

y_{k+1}^T p_k = ( y_k − (p_k^T y_k / ||p_k||^2) p_k )^T p_k = y_k^T p_k − y_k^T p_k = 0.   (1)

Key observation: (1) holds independently of the location of x_k. Applying an exact line search α* = arg min_{α ∈ R} f(x + α d) at an arbitrary point x in the search direction d = ±d_k, the point x_+ = x + α* d ends up lying in the affine hyperplane π_k := x* + B^{-1/2} {p_k}^⊥, where {p_k}^⊥ denotes the hyperplane through the origin orthogonal to p_k.
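A tiny numeric check of (1) (illustration only; y and p are arbitrary random vectors standing in for y_k and p_k): with α = −p^T y / ||p||^2 from the exact line search, the next point is orthogonal to p.

```python
# Check orthogonality (1): (y + alpha p)^T p = 0 for alpha = -p^T y / ||p||^2.
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(6)
p = rng.standard_normal(6)
alpha = -(p @ y) / (p @ p)
y_next = y + alpha * p

assert abs(y_next @ p) < 1e-12
```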

In subsequent searches, it therefore never makes sense to leave π_k again!

[Figure: the exact line search in x-coordinates (iterates x_k, x_+, minimiser x*, direction d_k, hyperplane π_k) and in y-coordinates (points y_k, y_+, y*, direction p_k orthogonal to the hyperplane).]

The requirement that all subsequent line searches are to be conducted within π_k amounts to the condition p_j ⊥ p_k for all j > k, or, equivalently expressed in x-coordinates,

d_k^T B d_j = 0   for all j ≥ k + 1.   (2)

If this relation holds, we say that d_k and d_j are B-conjugate (which is the same as orthogonality with respect to the inner product ⟨u, v⟩ = u^T B v defined by B).

Observations: The restriction f|_{π_k} is a strictly convex quadratic function on π_k. Choosing d_{k+1} to satisfy (2), we can thus repeat our argument and find that x_{k+2} lies in an affine subspace π_{k+1} ⊂ π_k of dimension one lower, to which all future line searches must be restricted. Arguing iteratively, the dimension of the search space π_k decreases by 1 in every iteration, so termination occurs after at most n iterations. By then we will have chosen mutually B-conjugate search directions, d_i^T B d_j = 0 for all i ≠ j.

Theorem 1. Let f(x) := x^T B x + b^T x + a, where B ≻ 0, and for k = 0, …, n−1 let the d_k be chosen such that

d_i^T B d_j = 0   for all i ≠ j.

Let x_0 ∈ R^n be arbitrary and

x_{k+1} = x_k + α_k d_k   (k = 0, …, n − 1),

where α_k = arg min_{α ∈ R} f(x_k + α d_k). Then x_n is the global minimiser of f.

How do we choose B-conjugate search directions? Lemma 1 (Gram–Schmidt orthogonalisation). Let v_0, …, v_{n−1} ∈ R^n be linearly independent vectors, and let d_0, …, d_{n−1} be recursively defined as follows:

d_k = v_k − Σ_{j=0}^{k−1} (d_j^T B v_k / d_j^T B d_j) d_j.   (3)

Then d_i^T B d_k = 0 for all i ≠ k.

Proof: Induction over k. For k = 0 there is nothing to prove. Assume that d_i^T B d_j = 0 for all i, j ∈ {0, …, k−1} with i ≠ j. For i < k,

d_i^T B d_k = d_i^T B v_k − Σ_{j=0}^{k−1} (d_j^T B v_k / d_j^T B d_j) d_i^T B d_j = d_i^T B v_k − d_i^T B v_k = 0.

The linear independence of the v_j guarantees that none of the d_j is zero, and hence d_j^T B d_j > 0 for all j.
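Lemma 1 and Theorem 1 together can be sketched in a few lines (a runnable illustration under our own randomly generated B, b, and v_j; the closed-form step length α = −d^T ∇f / (2 d^T B d) is the exact line search for this quadratic):

```python
# B-conjugation via the Gram-Schmidt recursion (3), then n exact line
# searches along the resulting directions reach x* = -(1/2) B^{-1} b.
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n))
B = M @ M.T + np.eye(n)              # symmetric positive definite
b = rng.standard_normal(n)

V = rng.standard_normal((n, n))      # columns v_0, ..., v_{n-1} (a.s. independent)
D = []
for k in range(n):
    d = V[:, k].copy()
    for dj in D:                     # recursion (3)
        d -= (dj @ B @ V[:, k]) / (dj @ B @ dj) * dj
    D.append(d)

# mutual B-conjugacy: d_i^T B d_j = 0 for i != j
for i in range(n):
    for j in range(i):
        assert abs(D[i] @ B @ D[j]) < 1e-6

# Theorem 1: exact line searches along d_0, ..., d_{n-1}
x = rng.standard_normal(n)
for d in D:
    g = 2 * B @ x + b                # gradient of f at x
    alpha = -(d @ g) / (2 * d @ B @ d)
    x = x + alpha * d

assert np.allclose(x, np.linalg.solve(2 * B, -b))
```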

Unfortunately, this procedure would require that we hold the vectors d_j (j < k) in computer memory; as k approaches n, the method would thus require O(n^2) memory. A second key observation shows that we can get away with O(n) storage if we choose the steepest descent direction as v_k. Lemma 2 (Orthogonality). Choose d_0 = −∇f(x_0) and for k = 1, …, n−1 let d_k be computed via

d_k = −∇f(x_k) − Σ_{j=0}^{k−1} (d_j^T B(−∇f(x_k)) / d_j^T B d_j) d_j.   (4)

Then ∇f(x_j)^T ∇f(x_k) = 0 and d_j^T ∇f(x_k) = 0 for j < k.

Proof: Note that ∇f(x_k) = 2Bx_k + b for all k. By induction over k we prove d_j^T ∇f(x_k) = 0 for all j < k. This holds vacuously for k = 0. Assume it holds for k. Then

d_j^T ∇f(x_{k+1}) = d_j^T (2B(x_k + α_k d_k) + b) = d_j^T ∇f(x_k) + 2α_k d_j^T B d_k = 0   (j = 0, …, k − 1).

Furthermore, d_k^T ∇f(x_{k+1}) = 0 is the first-order optimality condition for the line search min_α f(x_k + α d_k) defining x_{k+1}.

Next, (4) implies that for all k, span(d_0, …, d_k) = span(∇f(x_0), …, ∇f(x_k)). For j < k there exist therefore λ_0, …, λ_j such that ∇f(x_j) = Σ_{i=0}^{j} λ_i d_i, and we have

∇f(x_j)^T ∇f(x_k) = Σ_{i=0}^{j} λ_i d_i^T ∇f(x_k) = 0.

Putting the pieces together: Recall (4),

d_k = −∇f(x_k) − Σ_{j=0}^{k−1} (d_j^T B(−∇f(x_k)) / d_j^T B d_j) d_j.

Substituting ∇f(x_{j+1}) − ∇f(x_j) = 2α_j B d_j into (4),

d_k = −∇f(x_k) + Σ_{j=0}^{k−1} [ (∇f(x_{j+1}) − ∇f(x_j))^T ∇f(x_k) / (∇f(x_{j+1}) − ∇f(x_j))^T d_j ] d_j.

Lemma 2 implies that all but the last summand on the right-hand side are zero, so

d_k = −∇f(x_k) − (∇f(x_k)^T ∇f(x_k) / ∇f(x_{k−1})^T d_{k−1}) d_{k−1}.   (5)

Multiplying (4) by ∇f(x_k)^T and then replacing k by k−1, Lemma 2 implies d_{k−1}^T ∇f(x_{k−1}) = −||∇f(x_{k−1})||^2. Substituting into (5),

d_k = −∇f(x_k) + (||∇f(x_k)||^2 / ||∇f(x_{k−1})||^2) d_{k−1}.

This is the conjugate gradient rule for updating the search direction.

In the computation of d_k we only need to keep two vectors and one number in main memory: d_{k−1}, x_k, and ||∇f(x_{k−1})||^2. The registers occupied by these data can be overwritten during the computation of the new data d_k, x_{k+1}, and ||∇f(x_k)||^2. The method terminates in at most n iterations. Furthermore, in general x_k approximates x* closely after very few iterations, and the remaining iterations only fine-tune the result.

Algorithm 1: Conjugate Gradients. Choose x_0 ∈ R^n and set d_0 := −∇f(x_0). For k = 0, 1, …, n−1 repeat:

S1 Compute α_k = arg min_α f(x_k + α d_k) and set x_{k+1} = x_k + α_k d_k.

S2 If k < n − 1, compute d_{k+1} = −∇f(x_{k+1}) + (||∇f(x_{k+1})||^2 / ||∇f(x_k)||^2) d_k.

Return x* = x_n.
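A runnable sketch of Algorithm 1 for the quadratic f(x) = x^T B x + b^T x + a (the closed-form α_k = −d^T ∇f / (2 d^T B d) is the exact line search for this f; the early-exit test for a near-zero gradient is an added numerical safeguard, and the test problem is randomly generated):

```python
# Algorithm 1 (Conjugate Gradients) specialised to f(x) = x^T B x + b^T x + a.
import numpy as np

def conjugate_gradients(B, b, x0):
    n = len(x0)
    x = np.asarray(x0, dtype=float)
    g = 2 * B @ x + b                       # gradient of f at x_0
    d = -g                                  # d_0 = -grad f(x_0)
    for k in range(n):
        if np.linalg.norm(g) < 1e-14:       # already at the minimiser
            break
        alpha = -(d @ g) / (2 * d @ B @ d)  # exact line search (S1)
        x = x + alpha * d
        g_new = 2 * B @ x + b
        if k < n - 1:                       # direction update (S2)
            d = -g_new + (g_new @ g_new) / (g @ g) * d
        g = g_new
    return x

rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
B = M @ M.T + np.eye(6)                     # symmetric positive definite
b = rng.standard_normal(6)

x_cg = conjugate_gradients(B, b, rng.standard_normal(6))
assert np.allclose(x_cg, np.linalg.solve(2 * B, -b))   # x_n = x*
```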

The Fletcher–Reeves Method: Algorithm 1 can be adapted to the minimisation of an arbitrary C^1 objective function f, and is then called the Fletcher–Reeves method. The main differences are the following: Exact line searches have to be replaced by practical line searches. A termination criterion ||∇f(x_k)|| < ε has to be used to guarantee that the algorithm terminates in finite time. Since Lemma 2 only holds for quadratic functions, the conjugacy of the d_k is only achieved approximately; to overcome this problem, d_k is reset to −∇f(x_k) periodically.
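The adaptations above can be sketched as follows (our own minimal version, not the definitive implementation: a backtracking Armijo line search stands in for "practical line search", and the descent safeguard and the test function are additions of ours):

```python
# Fletcher-Reeves sketch: practical line search, termination test
# ||grad f(x_k)|| < eps, and periodic restart of d_k to -grad f(x_k).
import numpy as np

def fletcher_reeves(f, grad, x0, eps=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    n = len(x)
    g = grad(x)
    d = -g
    for k in range(max_iter):
        if np.linalg.norm(g) < eps:           # termination criterion
            break
        alpha, slope = 1.0, g @ d
        while alpha > 1e-12 and f(x + alpha * d) > f(x) + 1e-4 * alpha * slope:
            alpha *= 0.5                      # backtracking (Armijo) line search
        x = x + alpha * d
        g_new = grad(x)
        if (k + 1) % n == 0:                  # periodic restart
            d = -g_new
        else:                                 # Fletcher-Reeves update
            d = -g_new + (g_new @ g_new) / (g @ g) * d
            if g_new @ d >= 0:                # safeguard: keep a descent direction
                d = -g_new
        g = g_new
    return x

# smooth strictly convex test function with known minimiser c (our choice)
c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.log(np.cosh(x - c)))
grad = lambda x: np.tanh(x - c)

x_min = fletcher_reeves(f, grad, np.zeros(3))
assert np.allclose(x_min, c, atol=1e-6)
```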

Reading Assignment: Lecture-Note 5.