Programming, numerics and optimization


Programming, numerics and optimization
Lecture C-3: Unconstrained optimization II
Łukasz Jankowski, ljank@ippt.pan.pl
Institute of Fundamental Technological Research
Room 4.32, Phone +22.8261281 ext. 428
June 5, 2018
The current version is available at http://info.ippt.pan.pl/~ljank.

Outline
1 Line search methods
2 Zero order methods
3 Steepest descent
4 Conjugate gradient methods
5 Newton methods
6 Quasi-Newton methods
7 Least-squares problems

Outline
1 Line search methods

Line search methods

The basic outline of all line search algorithms is: select any feasible point $x_0$. Having calculated the points $x_0, x_1, \ldots, x_k$, iteratively compute the successive point $x_{k+1}$:
1 Choose the direction $d_k$ of optimization.
2 Starting from $x_k$, perform a (usually approximate) 1D optimization in the direction $d_k$, that is, find $s_k \in \mathbb{R}$ that sufficiently decreases $f_k$ and $|f_k'|$, where $f_k(s) = f(x_k + s\, d_k)$. Then take the step $x_{k+1} := x_k + s_k d_k$.
3 Check the stop conditions.

"Sufficiently" means significantly enough to guarantee convergence to the minimum. Exact minimization is usually not necessary; it can be costly and sometimes it can even make the convergence slower.
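
As a concrete reference for the outline above, here is a minimal Python sketch of the generic line-search loop. The helper names (`choose_direction`, `step_length`, `backtracking`) are illustrative and not from the lecture, and the backtracking rule enforces only the Armijo (sufficient decrease) condition rather than the full Wolfe conditions.

```python
import numpy as np

def line_search_minimize(f, grad, x0, choose_direction, step_length,
                         tol=1e-8, max_iter=1000):
    """Generic line-search loop: pick a direction, do a 1D search, step, repeat.

    `choose_direction(x, g)` and `step_length(f, grad, x, d)` are placeholders
    for the strategies discussed in the rest of the lecture.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stop condition on the gradient
            break
        d = choose_direction(x, g)           # 1. direction choice
        s = step_length(f, grad, x, d)       # 2. (approximate) 1D minimization
        x = x + s * d                        # take the step
    return x

# Example strategies: steepest-descent direction with simple backtracking.
def steepest_direction(x, g):
    return -g

def backtracking(f, grad, x, d, s=1.0, c1=1e-4, beta=0.5):
    # Shrink s until the Armijo (sufficient decrease) condition holds.
    while f(x + s * d) > f(x) + c1 * s * grad(x) @ d:
        s *= beta
    return s
```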

Line search methods

1 Choose the direction of optimization.
2 Perform a (usually approximate) 1D optimization in that direction.
3 Check the stop conditions.

Two problems:
1 Direction choice
2 Step size

Outline
2 Zero order methods
  Coordinate descent
  Powell's direction set
  Rosenbrock method

Zero order methods: Coordinate descent

The simplest method: all coordinate directions $e_1, \ldots, e_n$ are used sequentially,
$$d_k := e_{k \bmod n}.$$
Very slow or even nonconvergent.
In 2D and with exact line minimizations it is equivalent to steepest descent (apart from the first step).
The set of search directions can be periodically modified to include directions deemed to be more effective.
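
A minimal sketch of coordinate descent, cycling over the coordinate directions with a 1D minimizer; the use of scipy's `minimize_scalar` as the inner line search is an assumption made here for illustration.

```python
import numpy as np

def coordinate_descent(f, x0, n_cycles=100, exact_1d=None):
    """Sketch of coordinate descent: cycle over the coordinate directions
    e_1, ..., e_n and do a 1D minimization along each."""
    from scipy.optimize import minimize_scalar
    if exact_1d is None:
        exact_1d = lambda phi: minimize_scalar(phi).x
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(n_cycles):
        for i in range(n):                       # d_k := e_{k mod n}
            e = np.zeros(n); e[i] = 1.0
            s = exact_1d(lambda s: f(x + s * e)) # 1D minimization along e_i
            x = x + s * e
    return x
```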

Zero order methods: Powell's direction set

Given $x_0$:
1 Optimize along all the directions $e_1, \ldots, e_n$, yielding the points $x_1, \ldots, x_n$.
2 Modify the directions: $e_1 := e_2$, ..., $e_{n-1} := e_n$, $e_n := x_n - x_0$.
3 Optimize along $e_n$, yielding the new point $x_0$.
4 Repeat until the stop conditions are satisfied.
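
The following Python sketch implements one cycle of Powell's direction-set method as described above; the helper name `powell_cycle` and the use of scipy's `minimize_scalar` for the exact 1D minimizations are assumptions of this illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def powell_cycle(f, x0, dirs):
    """One cycle of Powell's direction-set method (a sketch)."""
    x = np.asarray(x0, dtype=float)
    line_min = lambda x, d: x + minimize_scalar(lambda s: f(x + s * d)).x * d
    for d in dirs:                       # 1. optimize along e_1, ..., e_n
        x = line_min(x, d)
    new_dir = x - x0                     # 2. e_n := x_n - x_0
    dirs = dirs[1:] + [new_dir]          #    drop e_1, shift the rest
    x = line_min(x, new_dir)             # 3. optimize along the new direction
    return x, dirs                       # x becomes the next cycle's x_0

# Usage sketch: start from the coordinate directions and iterate cycles.
# x, dirs = x0, [np.eye(len(x0))[i] for i in range(len(x0))]
# for _ in range(50):
#     x, dirs = powell_cycle(f, x, dirs)
```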

Zero order methods: Powell's direction set

For a quadratic objective function and exact line minimizations, Powell's method generates a set of conjugate directions (apart from the first step).
Iteratively generated directions tend to fold up on each other and become linearly dependent. Possible solutions:
- Every m iterations reset the directions to the original set.
- Every m iterations reset the directions to any orthogonal basis (making use of some of the already generated directions).
- When modifying the directions (step 2), instead of discarding the first direction $e_1$, discard the direction of the largest decrease (since it is anyway a major component of the newly generated direction $e_n$).

Zero order methods: Rosenbrock method

The Rosenbrock method [Rosenbrock 1960] involves approximate optimization that cycles over all the directions. The direction set is modified after each approximate optimization.

Single stage of the Rosenbrock method. Given $x_0$:
1 Approximately optimize f using all the directions $e_1, \ldots, e_n$, yielding the points $x_1, \ldots, x_n$.
2 Modify the directions so that $e_n := x_n - x_0$. Orthogonalize the resulting set.
3 Let $x_0 := x_n$.

H.H. Rosenbrock, An Automatic Method for Finding the Greatest or Least Value of a Function, The Computer Journal 3(3):175-184, 1960. Full text: http://dx.doi.org/10.1093/comjnl/3.3.175

Zero order methods: Rosenbrock method

Approximate optimization:
1 Let $s_i$ be an arbitrary initial step length in the i-th direction.
2 Let $\alpha > 1$ and $\beta \in (0, 1)$ be two given constants (step elongation and step shortening).
3 Repeat for every direction $i = 1, \ldots, n$:
  1 Make a step $s_i$ in the i-th direction.
  2 If successful (f not increased), then $s_i := \alpha s_i$; else, if failed (f increased), then $s_i := \beta s_i$.
  until the loop is executed N times, or until all $s_i$ become too small, or until at least one success and one failure occur in each direction.

The Rosenbrock method requires a cheap objective function.
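
A rough Python sketch of the approximate-optimization stage described above; the concrete values of alpha, beta and the loop limits are illustrative assumptions, not values from the slides.

```python
import numpy as np

def rosenbrock_sweep(f, x, dirs, steps, alpha=3.0, beta=0.5,
                     max_loops=50, min_step=1e-8):
    """Sketch of the approximate optimization stage of the Rosenbrock method."""
    x = np.asarray(x, dtype=float)
    steps = np.asarray(steps, dtype=float)
    success = np.zeros(len(dirs), dtype=bool)
    failure = np.zeros(len(dirs), dtype=bool)
    for _ in range(max_loops):
        for i, d in enumerate(dirs):
            trial = x + steps[i] * d              # make a step s_i along e_i
            if f(trial) <= f(x):                  # success: f not increased
                x = trial
                steps[i] *= alpha
                success[i] = True
            else:                                 # failure: shorten the step
                steps[i] *= beta
                failure[i] = True
        if np.all(np.abs(steps) < min_step):      # all steps too small
            break
        if np.all(success & failure):             # >=1 success and failure each
            break
    return x, steps
```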

Zero order methods: Rosenbrock method (figure)

Outline
3 Steepest descent
  Simplified version of the method
  Plotting the number of iterations
  Attraction basins of minima

Simplified steepest descent

The steepest descent method is often implemented in the following simplified version:
$$x_{k+1} = x_k - \alpha \nabla f(x_k),$$
where $\alpha > 0$ is a given coefficient.
This version does not satisfy the Wolfe (strong Wolfe, Goldstein and Price, backtracking) conditions, so it can perform extremely poorly and should not be used. However, investigation of its properties reveals astonishing complexity (see C. A. Reiter's home page at http://webbox.lafayette.edu/~reiterc).
Consider the following characteristics:
- the number of iterations necessary for the method to converge to the minimum from a given starting point,
- the attraction basins of the minima.

Plotting the number of iterations

The simplified steepest descent method: $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
For every value of the coefficient $\alpha$, the regions of quick and slow convergence of the method can be illustrated using the following characteristic of the starting point $x_0$:
$$n_{\text{steps}}(x; \alpha, \epsilon) = \arg\min_k \left[ \min_m \|x_k - x_m\| < \epsilon \right], \quad x_0 = x,$$
where $x_m$ are all local minima of the objective function f. Thus, $n_{\text{steps}}(x; \alpha, \epsilon)$ is the number of iterations necessary for $\{x_k\}$ to converge from x to any minimum of f.
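
The characteristic plotted on the following slides can be computed with a few lines of Python; this is a sketch only, with the list of known minima supplied explicitly by the caller.

```python
import numpy as np

def n_steps(grad, x0, alpha, minima, eps=1e-3, max_iter=20000):
    """Count iterations of x_{k+1} = x_k - alpha*grad f(x_k) until the iterate
    comes within eps of any known minimum (the n_steps map of the slides)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        if min(np.linalg.norm(x - m) for m in minima) < eps:
            return k
        x = x - alpha * grad(x)
    return max_iter              # treated as non-convergent
```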

Plotting the number of iterations

Consider the Rosenbrock banana function
$$f(x, y) = 100\,(y - x^2)^2 + (x - 1)^2,$$
focusing on the region $[0, 2] \times [1, 3]$. The function f has a single global minimum at $(x^*, y^*) = (1, 1)$.

Plotting the number of iterations

Legend: few iterations ... many iterations. The scale begins at zero iterations and ends at the maximum number of iterations in the current frame (but not more than 20 000).
All images have been computed with the resolution $400 \times 400$ ($600 \times 600$ for the accompanying video). The accuracy is $\epsilon = 0.001$.

Plotting the number of iterations (figures)

Plotting the number of iterations: zoom for α = 0.00195 (figure)

Plotting the attraction basins of minima

The simplified steepest descent method: $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
For every value of the coefficient $\alpha$, the attraction basins of the minima of the objective function can be illustrated by plotting
$$n_{\min}(x; \alpha, \epsilon) = \arg\min_m \left\| x_m - \lim_{k \to \infty} x_k \right\|, \quad x_0 = x,$$
if $\{x_k\}$ is convergent, and 0 otherwise. Thus, $n_{\min}(x; \alpha, \epsilon)$ is the index of the minimum to which $\{x_k\}$ converges from x (or 0, if $\{x_k\}$ is nonconvergent).

Plotting the attraction basins of minima

Consider a two-minimum modification of the Rosenbrock banana function:
$$f(x, y) = 100\,(y - x^2)^2 + (x + \tfrac{1}{2})^2 (x - 1)^2.$$
The function f has two global minima, at $(-\tfrac{1}{2}, \tfrac{1}{4})$ and $(1, 1)$.

Plotting the attraction basins of minima

Legend: no convergence / minimum $(1, 1)$ / minimum $(-\tfrac{1}{2}, \tfrac{1}{4})$.
All images have been computed with the resolution $400 \times 400$ ($600 \times 600$ for the accompanying video). The accuracy is $\epsilon = 0.001$. Up to 20 000 iterations were performed.

Plotting the attraction basins of minima (figure)

Outline
4 Conjugate gradient methods
  Conjugate directions
  Linear conjugate gradient
  Nonlinear conjugate gradient

Search directions: Conjugate direction

Any (smooth enough) function can be approximated by a quadratic form, that is, by its second-order Taylor series,
$$f(x_k + s\,d_k) \approx f(x_k) + s\, d_k^T \nabla f(x_k) + \tfrac{1}{2} s^2\, d_k^T \nabla^2 f(x_k)\, d_k.$$
The gradient of the approximation is
$$\nabla f(x_k + s\,d_k) \approx \nabla f(x_k) + s\, \nabla^2 f(x_k)\, d_k.$$

Search directions: Conjugate direction

If $x_k$ is the exact minimum along the previous search direction $d_{k-1}$, the gradient at the minimum $x_k$ is perpendicular to the search direction $d_{k-1}$:
$$d_{k-1}^T \nabla f(x_k) = 0.$$
It is reasonable to expect that the minimization along the next direction $d_k$ does not jeopardize the minimization along the previous direction $d_{k-1}$. Therefore, the gradient at the points along the new search direction $d_k$ should still stay perpendicular to $d_{k-1}$. Thus we require that
$$0 = d_{k-1}^T \nabla f(x_k + s\,d_k) \approx d_{k-1}^T \left[ \nabla f(x_k) + s\, \nabla^2 f(x_k)\, d_k \right] = s\, d_{k-1}^T \nabla^2 f(x_k)\, d_k.$$

Conjugate directions

Conjugate direction: every direction $d_k$ that satisfies
$$0 = d_{k-1}^T \nabla^2 f(x_k)\, d_k$$
is said to be conjugate to $d_{k-1}$ (at x with respect to f).

Conjugate set: a set of vectors $d_i$ that pairwise satisfy
$$0 = d_i^T \nabla^2 f(x)\, d_j, \quad i \neq j,$$
is called a conjugate set (at x with respect to f).

If f is an n-dimensional quadratic form, then n global conjugate directions can always be found. If f is also positive definite, then a single exact optimization along each of them leads directly to the global minimum.
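
A small numerical illustration (not from the slides): build a conjugate set by A-orthogonalizing the coordinate basis and verify that n exact line minimizations of a positive definite quadratic form reach its global minimum. The matrix and vector used below are arbitrary example data.

```python
import numpy as np

def conjugate_directions(A):
    """Build A-conjugate directions from the coordinate basis by a
    Gram-Schmidt-like A-orthogonalization (a sketch; A must be s.p.d.)."""
    n = A.shape[0]
    dirs = []
    for e in np.eye(n):
        d = e.copy()
        for p in dirs:                       # remove conjugate components
            d -= (p @ A @ e) / (p @ A @ p) * p
        dirs.append(d)
    return dirs

# For f(x) = c + x^T b + 1/2 x^T A x with A positive definite,
# n exact line minimizations along conjugate directions hit the minimum.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 2.0])
x = np.zeros(2)
for d in conjugate_directions(A):
    r = b + A @ x                            # gradient at x
    x = x - (d @ r) / (d @ A @ d) * d        # exact minimizer along d
print(np.allclose(x, np.linalg.solve(A, -b)))  # True: x = -A^{-1} b
```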

Conjugate directions: coordinate descent vs. conjugate directions (figure)

Linear conjugate directions

Let A be an $n \times n$ positive definite matrix,
$$f(x) := c + x^T b + \tfrac{1}{2} x^T A x, \qquad r(x) := \nabla f(x) = b + A x,$$
and let $d_i$ ($i = 1, 2, \ldots, n$) be mutually conjugate directions with respect to A,
$$d_i^T A d_j = 0 \quad \text{for } i \neq j.$$

Linear conjugate directions

According to the line search principle, $x_{k+1} := x_k + s_k d_k$, where $s_k$ minimizes $f_k(s) := f(x_k + s\,d_k)$. For quadratic f, $f_k$ is a parabola with the exact minimizer
$$s_k = -\frac{d_k^T r_k}{d_k^T A d_k},$$
where $r_k = r(x_k) = \nabla f(x_k) = b + A x_k$. The matrix A is $n \times n$, hence exactly n such optimum steps along all the conjugate directions $d_i$ lead to the global minimum,
$$x_n = x^* = x_0 + \sum_{k=0}^{n-1} s_k d_k.$$

Linear conjugate directions: Choosing the next direction

Let $d_i$, $i = 0, 1, \ldots, k$, be (already computed) conjugate directions with respect to A. Then
$$x_{k+1} = x_0 + \sum_{i=0}^{k} s_i d_i$$
is the minimum in the subspace $x_0 + \operatorname{span}\{d_0, \ldots, d_k\}$, and thus the gradient $r_{k+1} = \nabla f(x_{k+1})$ is perpendicular to $\operatorname{span}\{d_0, \ldots, d_k\}$, so that
$$r_{k+1}^T d_i = 0, \quad i = 0, 1, \ldots, k.$$

Linear conjugate directions: Choosing the next direction

The next conjugate direction $d_{k+1}$ can hence be computed using the gradient $r_{k+1}$ as
$$d_{k+1} = -r_{k+1} + \sum_{i=0}^{k} \eta_{k+1,i}\, d_i,$$
where the coefficients $\eta_{k+1,i}$ ensure conjugacy of $d_{k+1}$ with all the previous directions,
$$d_{k+1}^T A d_i = 0, \quad i = 0, 1, \ldots, k,$$
which yields
$$\eta_{k+1,i} = \frac{r_{k+1}^T A d_i}{d_i^T A d_i}.$$
Therefore,
$$d_{k+1} = -r_{k+1} + \sum_{i=0}^{k} \frac{r_{k+1}^T A d_i}{d_i^T A d_i}\, d_i.$$

Linear conjugate directions

Linear conjugate directions (for a quadratic form):
$$d_{k+1} := -r_{k+1} + \sum_{i=0}^{k} \eta_{k+1,i}\, d_i, \qquad \eta_{k+1,i} = \frac{r_{k+1}^T A d_i}{d_i^T A d_i},$$
where $f(x) = c + x^T b + \tfrac{1}{2} x^T A x$ and $r(x) := \nabla f(x) = b + A x$. Therefore, the computation of
- the next conjugate direction $d_{k+1}$ requires $k + 1$ coefficients $\eta_{k+1,i}$ ($i = 0, 1, \ldots, k$) to be computed,
- the entire set of all conjugate directions $d_i$, $i = 0, \ldots, n-1$, requires $O(n^2)$ time with a large constant and can be numerically unstable (a scheme similar to Gram-Schmidt orthogonalization).

Linear conjugate gradient method

Fortunately, it turns out that if the first direction $d_0$ is the steepest descent direction,
$$d_0 = -r_0 = -\nabla f(x_0),$$
then in
$$d_{k+1} := -r_{k+1} + \sum_{i=0}^{k} \eta_{k+1,i}\, d_i$$
it is enough to take into account the last direction only,
$$d_{k+1} := -r_{k+1} + \eta_k d_k = -r_{k+1} + \frac{r_{k+1}^T A d_k}{d_k^T A d_k}\, d_k,$$
and the resulting direction $d_{k+1}$ is automatically conjugate to all previous directions $d_i$, $i = 0, 1, \ldots, k$. This is called (a bit misleadingly) the conjugate gradient method.

Linear conjugate gradient method

Using
$$r_{k+1} - r_k = s_k A d_k, \qquad d_{k+1} = -r_{k+1} + \eta_k d_k, \qquad r_{k+1}^T d_i = 0 \ \text{ for } i = 0, 1, \ldots, k,$$
it is easy to
- express the optimum step length $s_k$ for use with $x_{k+1} := x_k + s_k d_k$ as
$$s_k = -\frac{d_k^T r_k}{d_k^T A d_k} = \frac{r_k^T r_k}{d_k^T A d_k},$$
- simplify $\eta_k$ for use with $d_{k+1} := -r_{k+1} + \eta_k d_k$,
$$\eta_k = \frac{r_{k+1}^T A d_k}{d_k^T A d_k} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}.$$

Linear conjugate gradient method

Linear conjugate gradient. Given $f(x) = c + x^T b + \tfrac{1}{2} x^T A x$, choose the initial point $x_0$.
Initialize: $r_0 := b + A x_0$, $d_0 := -r_0$ and $k := 0$.
While the stop conditions are not satisfied, do
$$s_k := \frac{r_k^T r_k}{d_k^T A d_k}, \qquad x_{k+1} := x_k + s_k d_k, \qquad r_{k+1} := r_k + s_k A d_k,$$
$$\eta_k := \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, \qquad d_{k+1} := -r_{k+1} + \eta_k d_k, \qquad k := k + 1.$$
This is basically the CG method for solving $Ax = b$ (Lecture B-3).
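
A direct Python transcription of the algorithm above; the stopping test on the gradient norm and the small example data are assumptions added here for illustration.

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-10, max_iter=None):
    """Linear conjugate gradient for f(x) = c + x^T b + 1/2 x^T A x,
    following the slide's update formulas (A symmetric positive definite)."""
    x = np.asarray(x0, dtype=float)
    r = b + A @ x                    # r_0 := b + A x_0
    d = -r                           # d_0 := -r_0
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:  # stop condition on the gradient
            break
        Ad = A @ d
        s = (r @ r) / (d @ Ad)       # s_k := r_k^T r_k / d_k^T A d_k
        x = x + s * d                # x_{k+1} := x_k + s_k d_k
        r_new = r + s * Ad           # r_{k+1} := r_k + s_k A d_k
        eta = (r_new @ r_new) / (r @ r)
        d = -r_new + eta * d         # d_{k+1} := -r_{k+1} + eta_k d_k
        r = r_new
    return x

# Example: minimize f(x) = x^T b + 1/2 x^T A x, i.e. solve A x = -b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
print(linear_cg(A, b, np.zeros(2)), np.linalg.solve(A, -b))
```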

Nonlinear conjugate gradient method

The conjugate gradient algorithm can be (almost) directly used with non-quadratic objective functions, provided the exactly minimizing step is replaced with a line search. It is inexpensive numerically (no Hessian, just gradients; data storage reaches only one step back) and in general yields superlinear convergence.
The line search need not be exact. However, the step length $s_k$ has to satisfy the strong Wolfe conditions with $0 < c_1 < c_2 < 0.5$ (Lecture C-2) in order to ensure that the next direction $d_{k+1} := -r_{k+1} + \eta_k d_k$ is a descent direction, with
$$\eta_k := \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} = \frac{\nabla f(x_{k+1})^T \nabla f(x_{k+1})}{\nabla f(x_k)^T \nabla f(x_k)}.$$

Nonlinear conjugate gradient method

Nonlinear conjugate gradient. Given f(x), choose the initial point $x_0$.
Initialize: $r_0 := \nabla f(x_0)$, $d_0 := -r_0$ and $k := 0$.
While the stop conditions are not satisfied, do
1 Find a step length $s_k$ satisfying the strong Wolfe conditions with $0 < c_1 < c_2 < 0.5$, and set $x_{k+1} := x_k + s_k d_k$.
2 Compute $r_{k+1} := \nabla f(x_{k+1})$.
3 Compute
  Fletcher-Reeves: $\eta_k := \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$, or
  Polak-Ribière: $\eta_k := \max\left[ 0, \dfrac{r_{k+1}^T (r_{k+1} - r_k)}{r_k^T r_k} \right]$.
4 Compute $d_{k+1} := -r_{k+1} + \eta_k d_k$.
5 Update the step number $k := k + 1$.
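
A hedged Python sketch of the loop above, with both the Fletcher-Reeves and the safeguarded Polak-Ribière coefficients; scipy's `line_search` is used as the Wolfe line search, and the fallback on a failed search is an ad-hoc addition of this sketch.

```python
import numpy as np
from scipy.optimize import line_search   # Wolfe line search

def nonlinear_cg(f, grad, x0, variant="PR", tol=1e-6, max_iter=500):
    """Sketch of nonlinear CG with Fletcher-Reeves ("FR") or
    Polak-Ribiere ("PR") updates."""
    x = np.asarray(x0, dtype=float)
    r = grad(x)
    d = -r
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        s = line_search(f, grad, x, d)[0]
        if s is None:                 # line search failed: restart from -grad
            d, s = -r, 1e-4
        x = x + s * d
        r_new = grad(x)
        if variant == "FR":
            eta = (r_new @ r_new) / (r @ r)
        else:                         # PR with the max(0, .) safeguard
            eta = max(0.0, r_new @ (r_new - r) / (r @ r))
        d = -r_new + eta * d
        r = r_new
    return x
```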

Nonlinear conjugate gradient method

Iteratively generated search directions $d_k$ can tend to fold up on each other and become linearly dependent. Therefore, the directions should be periodically reset by assuming $\eta_k := 0$, which effectively restarts the procedure of generating the directions from scratch (that is, from the steepest descent direction).
- Restart every N iterations.
- With a quadratic objective function and exact line searches, consecutive gradients are orthogonal, $r_i^T r_j = 0$ if $i \neq j$. The procedure can thus be restarted when
$$\frac{\left| r_{k+1}^T r_k \right|}{\|r_{k+1}\|\, \|r_k\|} > c \approx 0.1,$$
which in practice happens rather frequently.
- The Polak-Ribière method restarts anyway when the computed correction term is negative (rather infrequently).

Outline
5 Newton methods
  Inexact Newton methods
  Modified Newton methods

Newton methods

The Newton direction
$$d_k := -\left[ \nabla^2 f(x_k) \right]^{-1} \nabla f(x_k)$$
is based on the quadratic approximation to the objective function:
$$f(x_k + \Delta x) \approx f(x_k) + \Delta x^T \nabla f(x_k) + \tfrac{1}{2} \Delta x^T \nabla^2 f(x_k)\, \Delta x.$$
If the Hessian $\nabla^2 f(x_k)$ is positive definite (the approximation is convex), the minimum is found by solving
$$\nabla f(x_k + d_k) \approx \nabla f(x_k) + \nabla^2 f(x_k)\, d_k = 0.$$
- The Hessian has to be computed (a 2nd order method).
- Inverting a large Hessian is time-consuming.
- Far from the minimum the Hessian may not be positive definite and $d_k$ may be an ascent direction.
- Quick quadratic convergence near the minimum.
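
A minimal Newton iteration on the Rosenbrock banana function of the earlier slides, solving the linear system instead of inverting the Hessian; no line search or safeguards are included, so this is a sketch of the pure method only, with an analytic gradient and Hessian written out for this particular f.

```python
import numpy as np

def newton_step(grad, hess, x):
    """Plain Newton direction: solve hess(x) d = -grad(x) instead of forming
    the inverse explicitly (assumes the Hessian is positive definite)."""
    return np.linalg.solve(hess(x), -grad(x))

# f(x, y) = 100 (y - x^2)^2 + (x - 1)^2
def grad(p):
    x, y = p
    return np.array([-400 * x * (y - x**2) + 2 * (x - 1), 200 * (y - x**2)])

def hess(p):
    x, y = p
    return np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                     [-400 * x, 200.0]])

x = np.array([-1.2, 1.0])
for _ in range(20):                   # pure Newton iteration, no safeguards
    x = x + newton_step(grad, hess, x)
print(x)                              # approaches the minimum (1, 1)
```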

Newton methods

Far from the minimum the Hessian may not be positive definite and the exact Newton direction may be an ascent direction. In practice the Newton method is implemented as
- Inexact Newton methods, which solve $\nabla^2 f(x_k)\, d_k = -\nabla f(x_k)$ inexactly to ensure that $d_k$ is a descent direction (Newton conjugate gradient method).
- Modified Newton methods, which modify the Hessian matrix $\nabla^2 f(x_k)$ so that it becomes positive definite. The solution to $H_k d_k = -\nabla f(x_k)$, where $H_k$ is the modified Hessian, is then a descent direction.

Inexact Newton methods

Inexact solution of $\nabla^2 f(x_k)\, d_k = -\nabla f(x_k)$
- is quicker, since the Hessian is not inverted,
- can guarantee that the inexact solution is a descent direction.
The stop condition is usually based on the norm of the residuum, normalized with respect to the norm of the RHS ($\nabla f(x_k)$):
$$\frac{\|r_k\|}{\|\nabla f(x_k)\|} = \frac{\left\| \nabla^2 f(x_k)\, d_k + \nabla f(x_k) \right\|}{\|\nabla f(x_k)\|} \leq \alpha_k,$$
where at least $\alpha_k < \alpha < 1$ or preferably $\alpha_k \to 0$, for example
$$\alpha_k = \min\left(\tfrac{1}{2}, \sqrt{\|\nabla f(x_k)\|}\right) \quad \text{or} \quad \alpha_k = \min\left(\tfrac{1}{2}, \|\nabla f(x_k)\|\right).$$

Inexact Newton methods: Newton conjugate gradient

The Newton conjugate gradient method solves in each step
$$\nabla^2 f(x_k)\, d_k = -\nabla f(x_k)$$
iteratively (an inner iteration in each step of the outer iteration) using the linear conjugate gradient method, with the stop conditions:
- a direction $p_{i+1}$ of negative curvature is generated, $p_{i+1}^T \nabla^2 f(x_k)\, p_{i+1} \leq 0$, or
- the Newton equation is inexactly solved,
$$\frac{\|r_k\|}{\|\nabla f(x_k)\|} \leq \alpha_k, \quad \text{where } \alpha_k = \min\left(\tfrac{1}{2}, \sqrt{\|\nabla f(x_k)\|}\right).$$
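
A sketch of the inner CG loop with the two stop conditions above; the commented outer loop uses the square-root forcing term, and taking the full step there is a simplification of this sketch, not part of the slides.

```python
import numpy as np

def newton_cg_direction(g, H, tol):
    """Inner CG loop of Newton-CG: approximately solve H d = -g, stopping on
    negative curvature or on a small relative residual (a sketch)."""
    d = np.zeros_like(g)
    r = g.copy()                       # residual of H d + g = 0 at d = 0
    p = -r
    for _ in range(len(g)):
        Hp = H @ p
        if p @ Hp <= 0:                # negative curvature detected
            return -g if np.all(d == 0) else d
        s = (r @ r) / (p @ Hp)
        d = d + s * p
        r_new = r + s * Hp
        if np.linalg.norm(r_new) <= tol * np.linalg.norm(g):
            return d
        p = -r_new + (r_new @ r_new) / (r @ r) * p
        r = r_new
    return d

# Outer loop sketch (full Newton-CG step, no line search shown):
# while not converged:
#     g, H = grad(x), hess(x)
#     alpha_k = min(0.5, np.sqrt(np.linalg.norm(g)))
#     x = x + newton_cg_direction(g, H, alpha_k)
```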

Modified Newton methods

Modified Newton methods modify the Hessian matrix $\nabla^2 f(x_k)$ (as little as possible) by
$$H_k = \nabla^2 f(x_k) + E_k,$$
so that it becomes sufficiently positive definite and the solution to the modified equation
$$\left[ \nabla^2 f(x_k) + E_k \right] d_k = -\nabla f(x_k)$$
is a descent direction. There are several possibilities for choosing $E_k$:
1 A multiple of the identity, $E_k = \tau I$.
2 Obtained during a Cholesky factorization of the Hessian with an immediate increase of the diagonal elements (Cholesky modification).
3 Others (Gershgorin modification, etc.).
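
A sketch of option 1 above: repeatedly add a multiple of the identity until a Cholesky factorization succeeds, then reuse the factor to solve for the direction. The starting value and growth factor of τ are illustrative assumptions.

```python
import numpy as np

def modified_newton_direction(g, H, tau0=1e-3, grow=10.0):
    """Hessian modification with a multiple of the identity: increase tau
    until H + tau*I admits a Cholesky factorization, then solve for d."""
    n = H.shape[0]
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(H + tau * np.eye(n))   # fails if not s.p.d.
            break
        except np.linalg.LinAlgError:
            tau = tau0 if tau == 0.0 else grow * tau
    # Solve (H + tau I) d = -g using the Cholesky factor L.
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)
```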

Outline
6 Quasi-Newton methods
  Approximating the Hessian
  DFP method
  BFGS method
  Broyden class and SR1 method

Quasi-Newton methods

All Newton methods require the Hessian to be computed, and some of them (modified Newton) invert or factorize it. Both tasks are time-consuming and error-prone. Quasi-Newton methods compute the search direction using a gradient-based approximation to the Hessian (or to its inverse):
$$d_k = -H_k^{-1} \nabla f(x_k) \quad \text{or} \quad d_k = -B_k \nabla f(x_k).$$
Quasi-Newton methods are useful
- in large problems (the Hessian is dense and too large),
- with severely ill-conditioned Hessians,
- when second derivatives are unavailable.

Quasi-Newton methods: Approximating the Hessian

The approximation is based on gradients and updated after each step using the previous step length and the change of the gradient. Approximate the function around the last two iterates:
$$f_{k-1}(\Delta x) := f(x_{k-1} + \Delta x) \approx f(x_{k-1}) + \Delta x^T \nabla f(x_{k-1}) + \tfrac{1}{2} \Delta x^T H_{k-1} \Delta x,$$
$$f_k(\Delta x) := f(x_k + \Delta x) \approx f(x_k) + \Delta x^T \nabla f(x_k) + \tfrac{1}{2} \Delta x^T H_k \Delta x,$$
where $x_k := x_{k-1} + s_{k-1} d_{k-1}$. Assume the previous-step approximate Hessian $H_{k-1}$ is known. The next approximation $H_k$ can be obtained by comparing the gradients at $x_{k-1}$:
$$\nabla f_k(-s_{k-1} d_{k-1}) = \nabla f_{k-1}(0).$$

Quasi-Newton methods: Approximating the Hessian

The requirement $\nabla f_k(-s_{k-1} d_{k-1}) = \nabla f_{k-1}(0)$ leads to
$$H_k (x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1}),$$
which is usually stated in the shorter form, called the secant equation,
$$H_k y_k = g_k, \quad \text{where } y_k := x_k - x_{k-1}, \quad g_k := \nabla f(x_k) - \nabla f(x_{k-1}).$$
A symmetric positive definite $H_k$ of this kind exists (non-uniquely) only if
$$y_k^T H_k y_k = y_k^T g_k > 0,$$
which is guaranteed if the step length $s_k$ satisfies the Wolfe conditions.

Quasi-Newton methods: Approximating the Hessian

Since the solution is non-unique, $H_k$ should be chosen to be the closest to the last-step approximation $H_{k-1}$:
- Find a symmetric positive definite matrix $H_k$ which satisfies the secant equation $H_k y_k = g_k$ and minimizes $\|H_k - H_{k-1}\|$ for a given norm.
The approximation is used to find the quasi-Newton direction $d_k = -H_k^{-1} \nabla f(x_k)$, which requires inverting the approximated Hessian. The task is thus often reformulated to find an approximation to the inverse, $B_k = H_k^{-1}$:
- Find a symmetric positive definite matrix $B_k$ which satisfies the inverse secant equation $B_k g_k = y_k$ and minimizes $\|B_k - B_{k-1}\|$ for a given norm.

Quasi-Newton methods: Approximating the Hessian

In both cases, the solution is easy to find if the weighted Frobenius norm is used,
$$\|A\|_W = \left\| W^{1/2} A W^{1/2} \right\|_F,$$
where the weighting matrix W satisfies the inverse (or direct) secant equation. If $W^{-1}$ (or W) is the average Hessian over the last step,
$$\int_0^1 \nabla^2 f(x_{k-1} + \tau s_{k-1} d_{k-1}) \, d\tau,$$
then two updating formulas are obtained: the DFP (Davidon, Fletcher, Powell) formula for Hessian updating and the BFGS (Broyden, Fletcher, Goldfarb, Shanno) formula for inverse Hessian updating. Both formulas use rank-two updates to the previous-step matrix.
(The Frobenius norm of a matrix A is defined as $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$.)

Quasi-Newton methods: DFP method

DFP (Davidon, Fletcher, Powell) Hessian updating formula:
$$H_k^{\text{DFP}} = \left( I - \frac{g_k y_k^T}{g_k^T y_k} \right) H_{k-1} \left( I - \frac{y_k g_k^T}{g_k^T y_k} \right) + \frac{g_k g_k^T}{g_k^T y_k}.$$
The corresponding formula for updating the inverse Hessian:
$$B_k^{\text{DFP}} = B_{k-1} - \frac{B_{k-1} g_k g_k^T B_{k-1}}{g_k^T B_{k-1} g_k} + \frac{y_k y_k^T}{g_k^T y_k}.$$

Quasi-Newton methods: BFGS method

BFGS (Broyden, Fletcher, Goldfarb, Shanno) inverse Hessian updating formula:
$$B_k^{\text{BFGS}} = \left( I - \frac{y_k g_k^T}{g_k^T y_k} \right) B_{k-1} \left( I - \frac{g_k y_k^T}{g_k^T y_k} \right) + \frac{y_k y_k^T}{g_k^T y_k}.$$
The corresponding formula for updating the Hessian:
$$H_k^{\text{BFGS}} = H_{k-1} - \frac{H_{k-1} y_k y_k^T H_{k-1}}{y_k^T H_{k-1} y_k} + \frac{g_k g_k^T}{g_k^T y_k}.$$
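
A compact quasi-Newton loop with the BFGS inverse-Hessian update above, written in the slide's notation (y = step difference, g = gradient difference); the Wolfe line search from scipy and the curvature-condition skip are standard ingredients but are added here as assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import line_search

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    """Sketch of a quasi-Newton loop with the BFGS inverse-Hessian update."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    B = np.eye(n)                          # initial inverse-Hessian approximation
    grad_x = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(grad_x) < tol:
            break
        d = -B @ grad_x                    # quasi-Newton direction d = -B grad f
        s = line_search(f, grad, x, d)[0]  # Wolfe step length
        if s is None:
            s = 1e-3                       # crude fallback if line search fails
        x_new = x + s * d
        grad_new = grad(x_new)
        y = x_new - x                      # y_k := x_k - x_{k-1}
        g = grad_new - grad_x              # g_k := grad f(x_k) - grad f(x_{k-1})
        if g @ y > 1e-12:                  # curvature condition y^T g > 0
            I = np.eye(n)
            rho = 1.0 / (g @ y)
            B = (I - rho * np.outer(y, g)) @ B @ (I - rho * np.outer(g, y)) \
                + rho * np.outer(y, y)
        x, grad_x = x_new, grad_new
    return x
```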

Quasi-Newton methods: Broyden class

The DFP and the BFGS formulas for updating the Hessian can be linearly combined to form a general updating formula for the Broyden class methods:
$$H_k = (1 - \phi)\, H_k^{\text{BFGS}} + \phi\, H_k^{\text{DFP}}.$$
The restricted Broyden class is defined by $0 \leq \phi \leq 1$.

Quasi-Newton methods: SR1 method

A member of the Broyden class is the SR1 (symmetric rank one) method, which yields rank-one updates to the Hessian:
$$\phi_k^{\text{SR1}} = \frac{y_k^T g_k}{y_k^T (g_k - H_{k-1} y_k)}.$$
The weighting coefficient $\phi_k^{\text{SR1}}$ may fall outside the interval $[0, 1]$. The SR1 formula approximates the Hessian well, but
- the approximated Hessian might not be positive semidefinite,
- the denominator in the above formula can vanish or be small.
It is thus usually used with modified Newton methods (trust region), and sometimes the update is skipped.

Outline
7 Least-squares problems
  Hessian approximation
  Gauss-Newton method
  Levenberg-Marquardt method

Least-squares problems

In many problems (like parameter fitting, structural optimization, etc.) the objective function is a sum of squares (residuals):
$$f(x) = \tfrac{1}{2}\, r(x)^T r(x) = \tfrac{1}{2} \sum_{i=1}^{n} r_i^2(x).$$
Then
$$\nabla f(x) = \sum_{i=1}^{n} r_i(x)\, \nabla r_i(x) = J(x)^T r(x),$$
$$\nabla^2 f(x) = \sum_{i=1}^{n} \nabla r_i(x)\, \nabla r_i(x)^T + \sum_{i=1}^{n} r_i(x)\, \nabla^2 r_i(x) = J(x)^T J(x) + \sum_{i=1}^{n} r_i(x)\, \nabla^2 r_i(x),$$
where $J(x)$ is the Jacobi matrix of the residuals, $J(x) = \left[ \frac{\partial r_i(x)}{\partial x_j} \right]_{i,j}$.

Least-squares problems: Hessian approximation

If the residuals $r_i(x)$, at least near the minimum,
- vanish (the optimum point yields a good fit between the model and the data), that is $r_i(x) \approx 0$, or
- are linear functions of x, that is $\nabla^2 r_i(x) \approx 0$,
then the Hessian simplifies to
$$\nabla^2 f(x) = J(x)^T J(x) + \sum_{i=1}^{n} r_i(x)\, \nabla^2 r_i(x) \approx J(x)^T J(x),$$
which is always positive semidefinite and can be computed using the first derivatives of the residuals only.

Least-squares problems: Gauss-Newton method

Substitution of the approximation into the Newton formula,
$$\nabla^2 f(x_k)\, d_k = -\nabla f(x_k),$$
yields the Gauss-Newton formula
$$J(x_k)^T J(x_k)\, d_k = -J(x_k)^T r(x_k),$$
which may be problematic if $J(x_k)^T J(x_k)$ is not a good approximation to the Hessian.
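
A sketch of the Gauss-Newton iteration with full, undamped steps; the exponential-fit example data below are hypothetical and only illustrate the residual/Jacobian interface.

```python
import numpy as np

def gauss_newton(residuals, jacobian, x0, n_iter=50, tol=1e-10):
    """Sketch of Gauss-Newton: solve J^T J d = -J^T r at each step
    (full steps, no damping; assumes J^T J stays well conditioned)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r = residuals(x)
        J = jacobian(x)
        d = np.linalg.solve(J.T @ J, -J.T @ r)   # Gauss-Newton direction
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

# Example: fit y = a*exp(b*t) to hypothetical data points (t_i, y_i).
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 4.9])
res = lambda p: p[0] * np.exp(p[1] * t) - y
jac = lambda p: np.column_stack([np.exp(p[1] * t),
                                 p[0] * t * np.exp(p[1] * t)])
print(gauss_newton(res, jac, np.array([1.0, 0.1])))
```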

Least-squares problems: Levenberg-Marquardt method

A modified version of the Gauss-Newton method is called the Levenberg-Marquardt method (notice that this is essentially a trust-region approach).
Levenberg-Marquardt formula:
$$\left[ J(x_k)^T J(x_k) + \lambda I \right] d_k = -J(x_k)^T r(x_k).$$
The parameter λ allows the step length to be smoothly controlled:
- For small λ, the steps are Gauss-Newton (or Newton) in character and take advantage of the superlinear convergence near the minimum.
- For large λ, the identity matrix dominates and the steps are similar to steepest descent steps.
Sometimes a heuristic is advocated which uses the diagonal of the approximated Hessian $J(x_k)^T J(x_k)$ instead of the identity matrix.

Least-squares problems: Levenberg-Marquardt method (figure: quadratically approximated objective function; steepest descent, identity matrix and diagonal heuristic steps)

Least-squares problems: Levenberg-Marquardt method

The rule with the diagonal of the Hessian matrix (instead of the identity) is a heuristic only. It can be quicker, but sometimes (and not so rarely) it may also be much slower.
(Figure: steepest descent, identity matrix and diagonal heuristic steps.)

Least-squares problems: Levenberg-Marquardt method

In the Levenberg-Marquardt formula,
$$\left[ J(x_k)^T J(x_k) + \lambda I \right] d_k = -J(x_k)^T r(x_k),$$
the parameter λ controls the step length. It is updated in each optimization step based on the accuracy
$$\rho = \frac{f(x_k) - f(x_k + d_k)}{f_k(0) - f_k(d_k)} = \frac{\text{actual decrease}}{\text{predicted decrease}}$$
of the quadratic approximation $f_k$ to the objective function,
$$f(x_k + \Delta x) \approx f_k(\Delta x) := f(x_k) + \Delta x^T J(x_k)^T r(x_k) + \tfrac{1}{2} \Delta x^T J(x_k)^T J(x_k)\, \Delta x.$$
The denominator in the formula for ρ simplifies to
$$f_k(0) - f_k(d_k) = \tfrac{1}{2}\, d_k^T \left[ \lambda d_k - \nabla f(x_k) \right].$$

Least-squares problems: Levenberg-Marquardt method

Given the coefficients $\rho_{\min} \approx 0.1$, $\rho_{\max} \approx 0.75$ and $k \approx 25$.

Single optimization step of the Levenberg-Marquardt method. Compute $r(x_k)$, $J(x_k)$, $J(x_k)^T J(x_k)$ and $f(x_k)$.
do
1 Solve $\left[ J(x_k)^T J(x_k) + \lambda I \right] d_k = -J(x_k)^T r(x_k)$.
2 Compute $f(x_k + d_k)$.
3 Compute $\rho = \dfrac{f(x_k) - f(x_k + d_k)}{f_k(0) - f_k(d_k)} = \dfrac{\text{actual decrease}}{\text{predicted decrease}}$.
4 If $\rho < \rho_{\min}$, then $\lambda := k \lambda$; else, if $\rho > \rho_{\max}$, then $\lambda := \lambda / k$.
while $\rho < 0$
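
A Python sketch of the single Levenberg-Marquardt step above; the name `lm_step` and its default coefficients mirror the slide's ρ_min, ρ_max and k, while the acceptance test on ρ ≥ 0 corresponds to the `while ρ < 0` loop.

```python
import numpy as np

def lm_step(f, residuals, jacobian, x, lam,
            rho_min=0.1, rho_max=0.75, factor=25.0):
    """Single Levenberg-Marquardt step following the slide's scheme;
    `factor` plays the role of the slide's coefficient k."""
    r = residuals(x)
    J = jacobian(x)
    JTJ = J.T @ J
    g = J.T @ r                       # gradient of f(x) = 1/2 r^T r
    while True:
        d = np.linalg.solve(JTJ + lam * np.eye(len(x)), -g)
        predicted = 0.5 * d @ (lam * d - g)          # f_k(0) - f_k(d_k)
        rho = (f(x) - f(x + d)) / predicted          # actual / predicted decrease
        if rho < rho_min:
            lam *= factor             # poor agreement: increase lambda
        elif rho > rho_max:
            lam /= factor             # very good agreement: allow longer steps
        if rho >= 0:                  # accept the step once f did not increase
            return x + d, lam
```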