Methods for Unconstrained Optimization: Numerical Optimization Lectures 1-2


Methods for Unconstrained Optimization. Numerical Optimization Lectures 1-2. Coralia Cartis, University of Oxford. InFoMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems.

Problems and solutions

minimize $f(x)$ subject to $x \in \Omega \subseteq \mathbb{R}^n$, $\quad(*)$

where $f : \Omega \to \mathbb{R}$ is (sufficiently) smooth ($f \in C^i(\Omega)$, $i \in \{1,2\}$). $f$ is the objective, $x$ the variables, $\Omega$ the feasible set (determined by finitely many constraints); $n$ may be large. Minimizing $f(x)$ is equivalent to maximizing $-f(x)$; wlog, we minimize.

$x^*$ is a global minimizer of $f$ over $\Omega$ iff $f(x) \ge f(x^*)$ for all $x \in \Omega$. $x^*$ is a local minimizer of $f$ over $\Omega$ iff there exists $N(x^*,\delta)$ such that $f(x) \ge f(x^*)$ for all $x \in \Omega \cap N(x^*,\delta)$, where $N(x^*,\delta) := \{x \in \mathbb{R}^n : \|x - x^*\| \le \delta\}$ and $\|\cdot\|$ is the Euclidean norm.

Example problem in one dimension

Example: $\min f(x)$ subject to $a \le x \le b$. [Figure: graph of $f$ on $[a,b]$ with points $x_1$ and $x_2$ marked.] The feasible region $\Omega$ is the interval $[a,b]$. The point $x_1$ is the global minimizer; $x_2$ is a local (non-global) minimizer; $x = a$ is a constrained local minimizer.

Example problems in two dimensions

[Figure: surface plots of Ackley's test function (left) and Rosenbrock's test function (right); see Wikipedia.]

Main classes of continuous optimization problems

Linear (quadratic) programming: linear (quadratic) objective and linear constraints in the variables,
$\min_{x \in \mathbb{R}^n} c^T x \ \left(+\ \tfrac12 x^T H x\right)$ subject to $a_i^T x = b_i$, $i \in E$; $a_i^T x \ge b_i$, $i \in I$,
where $c, a_i \in \mathbb{R}^n$ for all $i$ and $H$ is an $n \times n$ matrix; $E$ and $I$ are finite index sets.

Unconstrained (constrained) nonlinear programming:
$\min_{x \in \mathbb{R}^n} f(x)$ (subject to $c_i(x) = 0$, $i \in E$; $c_i(x) \ge 0$, $i \in I$),
where $f, c_i : \mathbb{R}^n \to \mathbb{R}$ are (smooth, possibly nonlinear) functions for all $i$; $E$ and $I$ are finite index sets.

Most real-life problems are nonlinear, often large-scale!

Example: an OR application [Gould 06]

Optimization of a high-pressure gas network: pressures $p = (p_i, \forall i)$; flows $q = (q_j, \forall j)$; demands $d = (d_k, \forall k)$; compressors. Maximize net flow subject to the constraints:
$Aq - d = 0$, $\quad A^T p^2 + K q^{2.8359} = 0$, $\quad A_2^T q + z\,c(p,q) = 0$, $\quad p_{\min} \le p \le p_{\max}$, $\quad q_{\min} \le q \le q_{\max}$,
with $A, A_2 \in \{\pm 1, 0\}$ and $z \in \{0,1\}$. 200 nodes and pipes, 26 machines: 400 variables. Variable demand $(p, d)$ every 10 mins: 58,000 variables; real-time solution needed.

Example: an inverse problem application [Met Office]

Data assimilation for weather forecasting: best estimate of the current state of the atmosphere. Find initial conditions $x_0$ for the numerical forecast by solving the (ill-posed) nonlinear inverse problem
$\min_{x_0} \sum_{i=0}^{m} (H_i[x_i] - y_i)^T R_i^{-1} (H_i[x_i] - y_i)$,
where $x_i = S(t_i, t_0, x_0)$, $S$ is the solution operator of the discrete nonlinear model, $H_i$ maps $x_i$ to the observations $y_i$, and $R_i$ is the error covariance matrix of the observations at $t_i$. $x_0$ is of size $10^7$-$10^8$; number of observations $m \approx 250{,}000$.

Optimality conditions for unconstrained problems

Optimality conditions are algebraic characterizations of solutions suitable for computations. They provide a way to guarantee that a candidate point is optimal (sufficient conditions), and indicate when a point is not optimal (necessary conditions).

minimize $f(x)$ subject to $x \in \mathbb{R}^n$. (UP)

First-order necessary conditions: $f \in C^1(\mathbb{R}^n)$; $x^*$ a local minimizer of $f$ $\implies$ $\nabla f(x^*) = 0$. $\nabla f(x) = 0$ $\iff$ $x$ is a stationary point of $f$.

Optimality conditions for unconstrained problems...

Lemma. Let $f \in C^1$, $x \in \mathbb{R}^n$ and $s \in \mathbb{R}^n$ with $s \ne 0$. Then $\nabla f(x)^T s < 0 \implies f(x + \alpha s) < f(x)$ for all $\alpha > 0$ sufficiently small.

Proof. $f \in C^1$ $\implies$ there exists $\bar\alpha > 0$ such that $\nabla f(x + \alpha s)^T s < 0$ for all $\alpha \in [0, \bar\alpha]$. Taylor's/mean value theorem: $f(x + \alpha s) = f(x) + \alpha \nabla f(x + \tilde\alpha s)^T s$ for some $\tilde\alpha \in (0, \alpha)$ $\implies$ $f(x + \alpha s) < f(x)$ for all $\alpha \in (0, \bar\alpha]$.

$s$ is a descent direction for $f$ at $x$ if $\nabla f(x)^T s < 0$.

Proof of first-order necessary conditions: assume $\nabla f(x^*) \ne 0$. Then $s := -\nabla f(x^*)$ is a descent direction for $f$ at $x^*$: $\nabla f(x^*)^T (-\nabla f(x^*)) = -\|\nabla f(x^*)\|^2 < 0$, since $\nabla f(x^*) \ne 0$ and $\|a\| \ge 0$ with equality iff $a = 0$. Thus, by the Lemma, $x^*$ is not a local minimizer of $f$.

Optimality conditions for unconstrained problems...

$-\nabla f(x)$ is a descent direction for $f$ at $x$ whenever $\nabla f(x) \ne 0$. $s$ is a descent direction for $f$ at $x$ if $\nabla f(x)^T s < 0$, which is equivalent to
$\cos \angle(-\nabla f(x), s) = \frac{(-\nabla f(x))^T s}{\|\nabla f(x)\|\,\|s\|} = \frac{-\nabla f(x)^T s}{\|\nabla f(x)\|\,\|s\|} > 0$,
and so $\angle(-\nabla f(x), s) \in [0, \pi/2)$. [Figure: a descent direction $p_k$ makes an acute angle with $-\nabla f_k$.]

Summary of first-order conditions. A look ahead

minimize $f(x)$ subject to $x \in \mathbb{R}^n$. (UP)

First-order necessary optimality conditions: $f \in C^1(\mathbb{R}^n)$; $x^*$ a local minimizer of $f$ $\implies$ $\nabla f(x^*) = 0$. But also $\bar x = \arg\max_{x \in \mathbb{R}^n} f(x) \implies \nabla f(\bar x) = 0$. [Figure: the one-dimensional example with $x_1$, $x_2$ on $[a,b]$ again.]

Look at higher-order derivatives to distinguish between minimizers and maximizers... except for convex functions: $x^*$ local minimizer/stationary point $\implies$ $x^*$ global minimizer. [see Problem Sheet]

Second-order optimality conditions (nonconvex functions)

Second-order necessary conditions: $f \in C^2(\mathbb{R}^n)$; $x^*$ local minimizer of $f$ $\implies$ $\nabla^2 f(x^*)$ positive semidefinite, i.e. $s^T \nabla^2 f(x^*) s \ge 0$ for all $s \in \mathbb{R}^n$. Example: $f(x) := x^3$; $x^* = 0$ is not a local minimizer, but $f'(0) = f''(0) = 0$.

Second-order sufficient conditions: $f \in C^2(\mathbb{R}^n)$; $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ positive definite, namely $s^T \nabla^2 f(x^*) s > 0$ for all $s \ne 0$ $\implies$ $x^*$ is a (strict) local minimizer of $f$. [local convexity] Example: $f(x) := x^4$; $x^* = 0$ is a (strict) local minimizer but $f''(0) = 0$.

Stationary points of quadratic functions

[Figure: surfaces and contours of quadratic functions: (a) minimum, (b) maximum, (c) saddle, (d) semidefinite; (e) maximum or minimum, (f) saddle, (g) semidefinite.]

For $q(x) := g^T x + \frac12 x^T H x$ with $H \in \mathbb{R}^{n \times n}$ and $g \in \mathbb{R}^n$: $\nabla q(x^*) = 0 \iff Hx^* + g = 0$; $\nabla^2 q(x) = H$ for all $x$. General $f$: approximately locally quadratic around a stationary point $x^*$.

Methods for local unconstrained optimization

minimize $f(x)$ subject to $x \in \mathbb{R}^n$ (UP) [$f \in C^1(\mathbb{R}^n)$ or $f \in C^2(\mathbb{R}^n)$]

A Generic Method (GM). Choose $\epsilon > 0$ and $x_0 \in \mathbb{R}^n$. While the termination criteria are not achieved, repeat: compute the change $x_{k+1} - x_k = F(x_k, \text{problem data})$ [linesearch, trust-region] so as to ensure $f(x_{k+1}) < f(x_k)$; set $x_{k+1} := x_k + F(x_k, \text{prob. data})$, $k := k + 1$.

TC: $\|\nabla f(x_k)\| \le \epsilon$; maybe also $\lambda_{\min}(\nabla^2 f(x_k)) \ge -\epsilon$.

E.g., $x_{k+1}$ is the minimizer of some (simple) model of $f$ around $x_k$: linesearch, trust-region methods. If $F = F(x_k, x_{k-1}, \text{problem data})$: conjugate gradients method.

Issues to consider about GM

Global convergence of GM: if $\epsilon := 0$ and for any $x_0 \in \mathbb{R}^n$: $\nabla f(x_k) \to 0$ as $k \to \infty$? [maybe also, $\liminf_k \lambda_{\min}(\nabla^2 f(x_k)) \ge 0$?]

Local convergence of GM: if $\epsilon := 0$ and $x_0$ is sufficiently close to $x^*$, a stationary point/local minimizer of $f$: $x_k \to x^*$ as $k \to \infty$?

Global/local complexity of GM: count the number of iterations, and their cost, required by GM to generate $x_k$ within desired accuracy $\epsilon > 0$, e.g. such that $\|\nabla f(x_k)\| \le \epsilon$. [connection to convergence and its rate]

Rate of global/local convergence of GM.

Rates of convergence of sequences: an example

$l_k := (1/2)^k \to 0$ linearly, $q_k := (1/2)^{2^k} \to 0$ quadratically, and $s_k := k^{-k} \to 0$ superlinearly as $k \to \infty$.

[Figure and table: $l_k$, $q_k$ and $s_k$ plotted on a log scale, with a table of $l_k$ and $q_k$ for $k = 0, \dots, 6$. Notation: $(-i) := 10^{-i}$.]

Rates of convergence of sequences

$\{x_k\} \subset \mathbb{R}^n$, $x^* \in \mathbb{R}^n$; $x_k \to x^*$ as $k \to \infty$. $p$-rate of convergence: $x_k \to x^*$ with rate $p \ge 1$ if there exist $\rho > 0$ and $k_0 \ge 0$ such that $\|x_{k+1} - x^*\| \le \rho \|x_k - x^*\|^p$ for all $k \ge k_0$. $\rho$ is the convergence factor; $e_k := \|x_k - x^*\|$ is the error in $x_k \approx x^*$.

Linear convergence: $p = 1$, $\rho < 1$; (asymptotically,) the number of correct digits grows linearly in the number of iterations.

Quadratic convergence: $p = 2$; (asymptotically,) the number of correct digits grows exponentially in the number of iterations.

Superlinear convergence: $\|x_{k+1} - x^*\| / \|x_k - x^*\| \to 0$ as $k \to \infty$. [faster than linear, slower than quadratic; practically very acceptable]
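The rate definitions above can be checked numerically on the model sequences of the previous slide: a minimal sketch (the `ratios` helper is illustrative, not from the slides), using exact rational arithmetic so the ratios come out exactly.

```python
# Sketch: checking the p-rate definition for the model sequences
# l_k = (1/2)^k (linear) and q_k = (1/2)^(2^k) (quadratic), x* = 0.
from fractions import Fraction

def ratios(seq, p):
    """Return e_{k+1} / e_k^p for the errors e_k = |x_k - 0|."""
    return [seq[k + 1] / seq[k] ** p for k in range(len(seq) - 1)]

l = [Fraction(1, 2) ** k for k in range(8)]          # linear sequence
q = [Fraction(1, 2) ** (2 ** k) for k in range(6)]   # quadratic sequence

print(ratios(l, 1))   # constant 1/2: linear with factor rho = 1/2 < 1
print(ratios(q, 2))   # constant 1: quadratic with factor rho = 1
```

For $q_k$ the ratio $q_{k+1}/q_k^2 = 2^{-2^{k+1}}/2^{-2\cdot 2^k} = 1$ exactly, which is why the number of correct digits doubles per iteration.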

Linesearch methods

A generic linesearch method

(UP): minimize $f(x)$ subject to $x \in \mathbb{R}^n$, where $f \in C^1$ or $C^2$.

A Generic Linesearch Method (GLM). Choose $\epsilon > 0$ and $x_0 \in \mathbb{R}^n$. While $\|\nabla f(x_k)\| > \epsilon$, repeat: compute a descent search direction $s_k \in \mathbb{R}^n$, $\nabla f(x_k)^T s_k < 0$; compute a stepsize $\alpha_k > 0$ along $s_k$ such that $f(x_k + \alpha_k s_k) < f(x_k)$; set $x_{k+1} := x_k + \alpha_k s_k$ and $k := k + 1$.

Recall the property of descent directions.

Performing a linesearch

How to compute $\alpha_k$? Exact linesearch: $\alpha_k := \arg\min_{\alpha > 0} f(x_k + \alpha s_k)$; computationally expensive for nonlinear objectives. [Figure: exact linesearch for quadratic functions.]

Performing a linesearch...

Inexact linesearch: want the stepsize $\alpha_k$ to be neither too short nor too long compared to the decrease in $f$. [Figure: two sequences of iterates $(x_i, f(x_i))$, one with steps too long, one with steps too short; neither reaches the minimizer.]

Performing a linesearch...

Inexact linesearch: want the stepsize $\alpha_k$ not too long compared to the decrease in $f$. The Armijo condition: choose $\beta \in (0,1)$; compute $\alpha_k > 0$ such that
$f(x_k + \alpha_k s_k) \le f(x_k) + \beta \alpha_k \nabla f(x_k)^T s_k \quad(*)$
is satisfied. In practice, $\beta := 0.1$ or even $\beta := 0.001$. Due to the descent condition, there exists $\bar\alpha_k > 0$ such that $(*)$ holds for all $\alpha \in [0, \bar\alpha_k]$. Choose $\alpha_k \in (0, \bar\alpha_k]$ as large as possible.

Performing a linesearch...

Inexact linesearch... Let $\Phi_k : \mathbb{R} \to \mathbb{R}$, $\Phi_k(\alpha) := f(x_k + \alpha s_k)$, $\alpha \ge 0$. Then Armijo $\iff$ $\Phi_k(\alpha_k) \le \Phi_k(0) + \beta \alpha_k \Phi_k'(0)$. Let $y_\beta(\alpha) := \Phi_k(0) + \beta \alpha \Phi_k'(0)$, $\alpha \ge 0$. [Figure: $y = \Phi(\alpha)$ plotted against the lines $y_0$, $y_{0.6}$ and $y_1$; the acceptable stepsizes form an interval $\alpha \in (0, A]$.]

Performing a linesearch...

The backtracking-Armijo (bArmijo) linesearch algorithm. Choose $\alpha^{(0)} > 0$, $\tau \in (0,1)$ and $\beta \in (0,1)$. While $f(x_k + \alpha^{(i)} s_k) > f(x_k) + \beta \alpha^{(i)} \nabla f(x_k)^T s_k$, repeat: set $\alpha^{(i+1)} := \tau \alpha^{(i)}$ and $i := i + 1$. End. Set $\alpha_k := \alpha^{(i)}$.

Backtracking $\implies$ stepsize $\alpha_k$ not too short. $\alpha^{(0)} := 1$, $\tau := 0.5$ $\implies$ $\alpha^{(0)} = 1$, $\alpha^{(1)} = 0.5$, $\alpha^{(2)} = 0.25$, ... The bArmijo linesearch algorithm terminates in a finite number of steps with $\alpha_k > 0$, due to the descent condition.
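The bArmijo algorithm above can be sketched in a few lines; this is a minimal illustration (function and variable names are my own), assuming the gradient at $x_k$ has been precomputed.

```python
# A minimal sketch of the backtracking-Armijo linesearch.
import numpy as np

def backtracking_armijo(f, xk, grad_fk, sk, alpha0=1.0, tau=0.5, beta=0.1):
    """Shrink alpha by tau until the Armijo condition (*) holds."""
    slope = grad_fk @ sk          # must be < 0: s_k is a descent direction
    assert slope < 0, "s_k is not a descent direction"
    alpha = alpha0
    while f(xk + alpha * sk) > f(xk) + beta * alpha * slope:
        alpha *= tau              # backtrack: keeps alpha "not too short"
    return alpha

# Usage: f(x) = x^2 in 1-d, from x_k = 1 along s_k = -f'(x_k) = -2.
f = lambda x: float(x[0] ** 2)
xk, gk = np.array([1.0]), np.array([2.0])
alpha = backtracking_armijo(f, xk, gk, -gk)   # alpha = 1 fails, alpha = 0.5 accepted
```

Here $\alpha = 1$ violates $(*)$ (it overshoots to $x = -1$ with no decrease), so one backtracking step gives $\alpha_k = 0.5$.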

Steepest descent method

Steepest descent (SD) direction: set $s_k := -\nabla f(x_k)$, $k \ge 0$. $s_k$ is a descent direction whenever $\nabla f(x_k) \ne 0$: $\nabla f(x_k)^T s_k = \nabla f(x_k)^T (-\nabla f(x_k)) = -\|\nabla f(x_k)\|^2 < 0$.

$s_k$ is the "steepest" descent direction: the unique global solution of minimize$_{s \in \mathbb{R}^n}$ $f(x_k) + s^T \nabla f(x_k)$ subject to $\|s\| = \|\nabla f(x_k)\|$. Cauchy-Schwarz: $s^T \nabla f(x_k) \ge -\|s\|\,\|\nabla f(x_k)\|$ for all $s$, with equality iff $s$ is proportional to $-\nabla f(x_k)$.

Method of steepest descent (SD): the GLM with $s_k$ := SD direction; any linesearch. SD-e := the SD method with exact linesearches; SD-bA := the SD method with bArmijo linesearches.

Global convergence of steepest descent methods

$f \in C^1(\mathbb{R}^n)$; $\nabla f$ is Lipschitz continuous (on $\mathbb{R}^n$) iff there exists $L > 0$ such that $\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|$ for all $x, y \in \mathbb{R}^n$.

Theorem. Let $f \in C^1(\mathbb{R}^n)$ be bounded below on $\mathbb{R}^n$ and let $\nabla f$ be Lipschitz continuous. Apply the SD-e or the SD-bA method to minimizing $f$ with $\epsilon := 0$. Then both variants of the SD method have the property: either there exists $l \ge 0$ such that $\nabla f(x_l) = 0$, or $\nabla f(x_k) \to 0$ as $k \to \infty$.

SD methods have excellent global convergence properties (under weak assumptions).

Some disadvantages of steepest descent methods

SD methods are scale-dependent: poorly scaled variables $\implies$ the SD direction gives little progress.

[Figure: the effect of problem scaling on SD-e performance for $f(x_1, x_2) = \frac12 (a x_1^2 + x_2^2)$, $a > 0$. Left: $a = 100.6$ (mildly poor scaling). Right: $a = 1$ (perfect scaling).]

Some disadvantages of steepest descent methods...

Usually, SD methods converge very slowly to the solution, asymptotically. Theory: very slow convergence. Numerics: break-down (accumulation of round-off and ill-conditioning).

[Figure: SD-bA applied to the Rosenbrock function $f(x_1, x_2) = 10(x_2 - x_1^2)^2 + (x_1 - 1)^2$.]

Local rate of convergence for steepest descent

Locally, SD converges linearly to a solution, namely $f(x_{k+1}) - f(x^*) \le \rho\,(f(x_k) - f(x^*))$ for $k$ sufficiently large, BUT the convergence factor $\rho$ is usually very close to 1!

Theorem. $f \in C^2$; $x^*$ a local minimizer of $f$ with $\nabla^2 f(x^*)$ positive definite, with extreme eigenvalues $\lambda_{\max}$, $\lambda_{\min}$. Apply SD-e to $\min f$. If $x_k \to x^*$ as $k \to \infty$, then $f(x_k)$ converges linearly to $f(x^*)$, with
$\rho \le \left(\frac{\kappa(x^*) - 1}{\kappa(x^*) + 1}\right)^2 =: \rho_{SD}$,
where $\kappa(x^*) = \lambda_{\max}/\lambda_{\min}$ is the condition number of $\nabla^2 f(x^*)$.

Practice: $\rho \approx \rho_{SD}$. For the Rosenbrock $f$: $\kappa(x^*) = 258.10$, $\rho_{SD} \approx 0.984$. For $\kappa(x^*) = 800$, $f(x_0) = 1$, $f(x^*) = 0$: SD-e gives $f(x_k) \approx 0.007$ after 1000 iterations!

Other directions for GLMs

Let $B_k$ be a symmetric, positive definite matrix, and let $s_k$ be defined by
$B_k s_k = -\nabla f(x_k). \quad(*)$
$\implies$ $s_k$ is a descent direction; $\implies$ $s_k$ solves the problem minimize$_{s \in \mathbb{R}^n}$ $m_k(s) = f(x_k) + \nabla f(x_k)^T s + \frac12 s^T B_k s$.

If $\nabla^2 f(x_k)$ is positive definite and $B_k := \nabla^2 f(x_k)$ $\implies$ the damped Newton method.

$(*)$ is a scaled steepest descent direction; the resulting GLMs can be made scale-invariant for some $B_k$ (e.g., $B_k = \nabla^2 f(x_k)$); local convergence can be made faster than SD methods for some $B_k$ (e.g., quadratic for Newton).
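A minimal sketch of computing the direction from $B_k s_k = -\nabla f(x_k)$; the helper name is mine, and the usage evaluates the gradient and Hessian of the Rosenbrock-type function from these slides at the origin by hand.

```python
# Sketch: search directions from B_k s_k = -grad f(x_k).  With
# B_k = Hessian (when positive definite) this is the damped Newton
# direction; B_k = I recovers steepest descent.
import numpy as np

def model_direction(Bk, gk):
    """Solve B_k s = -g_k; for s.p.d. B_k this minimises the model
    m_k(s) = f_k + g_k^T s + s^T B_k s / 2 and is a descent direction."""
    s = np.linalg.solve(Bk, -gk)
    assert gk @ s < 0            # descent: g^T s = -g^T B_k^{-1} g < 0
    return s

# Usage: f = 10 (x2 - x1^2)^2 + (x1 - 1)^2 at x = (0, 0), where the
# Hessian is positive definite.
g = np.array([-2.0, 0.0])                 # gradient of f at (0, 0)
H = np.array([[2.0, 0.0], [0.0, 20.0]])   # Hessian of f at (0, 0)
s = model_direction(H, g)                 # Newton direction (1, 0)
```

Note how the Newton direction $(1, 0)$ rescales the raw negative gradient $(2, 0)$ by the local curvature.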

Global convergence for general GLMs

Theorem. Let $f \in C^1(\mathbb{R}^n)$ be bounded below on $\mathbb{R}^n$ and let $\nabla f$ be Lipschitz continuous. Let the eigenvalues of $B_k$ be uniformly bounded above and below, away from zero, for all $k$. Apply the GLM to $\min f$ with the bArmijo linesearch, $s_k$ from $(*)$ and $\epsilon = 0$. Then either there exists $l \ge 0$ such that $\nabla f(x_l) = 0$, or $\nabla f(x_k) \to 0$ as $k \to \infty$.

The theorem requires locally strongly convex quadratic models of $f$ for all $k$. In particular, the theorem holds if $f$ is locally strongly convex at all iterates. See the Problem Sheet for examples where Newton's method (without linesearch) fails to converge globally.

Local convergence for damped Newton's method

Theorem. Let $f \in C^2(\mathbb{R}^n)$ and let $\nabla^2 f$ be Lipschitz continuous and uniformly positive definite. Let $s_k$ be the Newton direction in the GLM and let the bArmijo linesearch have $\beta < 0.5$ and $\alpha^{(0)} = 1$. Then, provided limit points of the sequence of iterates exist: $\alpha_k = 1$ for all sufficiently large $k$, and $x_k \to x^*$ as $k \to \infty$ quadratically.

Local convergence for damped Newton...

$f(x_1, x_2) = 10(x_2 - x_1^2)^2 + (x_1 - 1)^2$; $x^* = (1, 1)$. [Figure: damped Newton with bArmijo linesearch applied to the Rosenbrock function $f$.]

Modified Newton methods

If $\nabla^2 f(x_k)$ is not positive definite, it is usual to solve instead
$\left(\nabla^2 f(x_k) + M_k\right) s_k = -\nabla f(x_k)$,
where $M_k$ is chosen such that $\nabla^2 f(x_k) + M_k$ is sufficiently positive definite, and $M_k := 0$ when $\nabla^2 f(x_k)$ is itself sufficiently positive definite.

Popular option, modified Cholesky: compute the Cholesky factorization $\nabla^2 f(x_k) = L_k (L_k)^T$, where $L_k$ is a lower triangular matrix; modify the generated $L_k$ if the factorization is in danger of failing (modify small or negative diagonal pivots, etc.).
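A simpler relative of modified Cholesky, shown here as an assumption of this note rather than the slides' exact procedure, is to add a multiple $\tau I$ of the identity (so $M_k = \tau I$) and retry the factorization until it succeeds:

```python
# Sketch of a modified Newton direction: shift the Hessian by tau * I
# until Cholesky succeeds, then solve (H + tau I) s = -g.
import numpy as np

def modified_newton_direction(H, g, tau0=1e-3):
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(H + tau * np.eye(len(g)))
            break                      # H + tau I is positive definite
        except np.linalg.LinAlgError:
            tau = max(2 * tau, tau0)   # increase the shift and retry
    # Solve (H + tau I) s = -g via the triangular factors L L^T.
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)

# Usage: an indefinite Hessian, where the raw Newton step need not
# be a descent direction.
H = np.array([[-1.0, 0.0], [0.0, 2.0]])
g = np.array([1.0, 1.0])
s = modified_newton_direction(H, g)    # descent: g^T s < 0
```

The shifted system is positive definite by construction, so the resulting $s_k$ is always a descent direction, at the cost of distorting the Newton model more than a pivot-wise modified Cholesky would.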

Quasi-Newton methods

Secant approximations for computing $B_k \approx \nabla^2 f(x_k)$ [when $\nabla^2 f$ is unavailable exactly or by finite differences of function or gradient values]. At the start of the GLM, choose $B_0$ (say, $B_0 := I$). After computing $s_k$ from $B_k s_k = -\nabla f(x_k)$ and $x_{k+1} = x_k + \alpha_k s_k$, compute an update $B_{k+1}$ of $B_k$.

Wish list: compute $B_{k+1}$ as a function of already-computed quantities $\nabla f(x_{k+1}), \nabla f(x_k), \dots, \nabla f(x_0), B_k, s_k$; $B_{k+1}$ should be symmetric and nonsingular (positive definite?); $B_{k+1}$ close to $B_k$, a cheap update of $B_k$; $B_k \approx \nabla^2 f(x_k)$, etc.

$\implies$ a new class of methods: faster than the steepest descent method, cheaper to compute per iteration than Newton's.

Quasi-Newton methods...

For the first wish, choose $B_{k+1}$ to satisfy the secant equation
$\gamma_k := \nabla f(x_{k+1}) - \nabla f(x_k) = B_{k+1}(x_{k+1} - x_k) = B_{k+1} \alpha_k s_k$.
It is satisfied by $B_{k+1} = \nabla^2 f$ when $f$ is a quadratic function: the change in gradient contains information about the Hessian.

There are many ways to compute $B_{k+1}$ satisfying the secant equation; trade-offs between the wishes: rank-1 update of $B_k$ (Symmetric Rank 1, SR1); rank-2 updates of $B_k$ (BFGS (Broyden-Fletcher-Goldfarb-Shanno), DFP (Davidon-Fletcher-Powell)), etc.

Work per iteration: $O(n^2)$ (as opposed to the $O(n^3)$ of Newton). For global convergence, one must use a Wolfe linesearch instead of the bArmijo linesearch. Local Q-superlinear convergence. [see Problem Sheet]
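As a concrete instance of a rank-2 update named above, here is a sketch of the BFGS formula (the standard expression from the literature; variable names are mine). Whatever the data, the update reproduces the secant equation exactly, which the usage checks.

```python
# Sketch of the BFGS rank-2 update: symmetric, satisfies the secant
# equation B_{k+1} d = gamma, and stays positive definite when the
# curvature condition gamma^T d > 0 holds (guaranteed by a Wolfe
# linesearch).
import numpy as np

def bfgs_update(B, d, gamma):
    """d = x_{k+1} - x_k = alpha_k s_k; gamma = grad f(x_{k+1}) - grad f(x_k)."""
    Bd = B @ d
    return (B - np.outer(Bd, Bd) / (d @ Bd)
              + np.outer(gamma, gamma) / (gamma @ d))

# Usage: an illustrative step and gradient change.
B = np.eye(2)
d = np.array([1.0, 0.5])
gamma = np.array([2.0, 1.0])
B1 = bfgs_update(B, d, gamma)
# B1 is symmetric and B1 @ d == gamma (the secant equation).
```

Each update costs $O(n^2)$, which is where the per-iteration saving over Newton's $O(n^3)$ factorization comes from.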

Trust-region methods

Linesearch versus trust-region methods

(UP): minimize $f(x)$ subject to $x \in \mathbb{R}^n$.

Linesearch methods: liberal in the choice of search direction, keeping bad behaviour in check through the choice of $\alpha_k$. Choose a descent direction $s_k$, compute a stepsize $\alpha_k$ to reduce $f(x_k + \alpha s_k)$, update $x_{k+1} := x_k + \alpha_k s_k$.

Trust-region (TR) methods: conservative in the choice of search direction, so that a full stepsize along it may genuinely reduce the objective. Pick a direction $s_k$ to reduce a local model of $f(x_k + s_k)$; accept $x_{k+1} := x_k + s_k$ if the decrease in the model is also achieved by $f(x_k + s_k)$; else set $x_{k+1} := x_k$ and refine the model.

Trust-region models for unconstrained problems

Approximate $f(x_k + s)$ by: a linear model $l_k(s) := f(x_k) + s^T \nabla f(x_k)$, or a quadratic model $q_k(s) := f(x_k) + s^T \nabla f(x_k) + \frac12 s^T \nabla^2 f(x_k) s$.

Impediments: the models may not resemble $f(x_k + s)$ when $s$ is large; the models may be unbounded from below. $l_k(s)$ is always unbounded below (unless $\nabla f(x_k) = 0$); $q_k(s)$ is always unbounded below if $\nabla^2 f(x_k)$ is negative definite or indefinite, and sometimes if $\nabla^2 f(x_k)$ is positive semidefinite.

Trust-region models and subproblem

Prevent bad approximations by trusting the model only in a trust region, defined by the trust-region constraint
$\|s\| \le \Delta_k, \quad(\mathrm{R})$
for some appropriate radius $\Delta_k > 0$. The constraint (R) also prevents $l_k$, $q_k$ from unboundedness!

$\implies$ the trust-region subproblem
$\min_{s \in \mathbb{R}^n} m_k(s)$ subject to $\|s\| \le \Delta_k$, (TR)
where $m_k := l_k$, $k \ge 0$, or $m_k := q_k$, $k \ge 0$. From now on, $m_k := q_k$. (TR) is easier to solve than (UP); we may even solve (TR) only approximately.

Trust-region models and subproblem: an example

[Figure: trust-region models of $f(x) = x_1^4 + x_1 x_2 + (1 + x_2)^2$: quadratic TR model about $x = (1, -0.5)$, $\Delta = 1$; linear TR model about $x = (1, -0.5)$, $\Delta = 1$; quadratic TR model about $x = (0, 0)$, $\Delta = 1$; quadratic TR model about $x = (-0.25, 0.5)$, $\Delta = 1$.]

Generic trust-region method

Let $s_k$ be a(n approximate) solution of (TR). Then: predicted model decrease $m_k(0) - m_k(s_k) = f(x_k) - m_k(s_k)$; actual function decrease $f(x_k) - f(x_k + s_k)$. The trust-region radius $\Delta_k$ is chosen based on the value of
$\rho_k := \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - m_k(s_k)}$.
If $\rho_k$ is not too much smaller than 1, set $x_{k+1} := x_k + s_k$, $\Delta_{k+1} \ge \Delta_k$. If $\rho_k$ is close to or larger than 1, $\Delta_k$ is increased. If $\rho_k \ll 1$, set $x_{k+1} = x_k$ and reduce $\Delta_k$.

A Generic Trust Region (GTR) method

Given $\Delta_0 > 0$, $x_0 \in \mathbb{R}^n$, $\epsilon > 0$. While $\|\nabla f(x_k)\| \ge \epsilon$, do:

1. Form the local quadratic model $m_k(s)$ of $f(x_k + s)$.
2. Solve (approximately) the (TR) subproblem for $s_k$ with $m_k(s_k) < f(x_k)$ ("sufficiently"). Compute $\rho_k := [f(x_k) - f(x_k + s_k)]/[f(x_k) - m_k(s_k)]$.
3. If $\rho_k \ge 0.9$ [very successful step]: set $x_{k+1} := x_k + s_k$ and $\Delta_{k+1} := 2\Delta_k$. Else if $\rho_k \ge 0.1$ [successful step]: set $x_{k+1} := x_k + s_k$ and $\Delta_{k+1} := \Delta_k$. Else [unsuccessful step]: set $x_{k+1} := x_k$ and $\Delta_{k+1} := \frac12 \Delta_k$.
4. Let $k := k + 1$.
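The GTR loop above can be sketched compactly; here, as a simplifying assumption of this note, the subproblem is solved approximately by exact minimisation of $m_k$ along $-\nabla f(x_k)$ inside the region (a Cauchy-type step), with all names illustrative.

```python
# Sketch of the GTR method with a steepest-descent (Cauchy-type) step
# as the approximate (TR) subproblem solution.
import numpy as np

def gtr(f, grad, hess, x0, delta0=1.0, eps=1e-6, max_iter=1000):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        H = hess(x)
        # minimise m(s) along s = -alpha g, 0 < alpha <= delta / ||g||
        gHg = g @ H @ g
        alpha_max = delta / np.linalg.norm(g)
        alpha = min((g @ g) / gHg, alpha_max) if gHg > 0 else alpha_max
        s = -alpha * g
        pred = -(g @ s + 0.5 * s @ H @ s)     # f(x_k) - m_k(s_k) > 0
        rho = (f(x) - f(x + s)) / pred
        if rho >= 0.9:                        # very successful
            x, delta = x + s, 2 * delta
        elif rho >= 0.1:                      # successful
            x = x + s
        else:                                 # unsuccessful
            delta = 0.5 * delta
    return x

# Usage: convex quadratic f(x) = (x1^2 + 10 x2^2) / 2, minimiser at 0.
f = lambda x: 0.5 * (x[0] ** 2 + 10 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10 * x[1]])
hess = lambda x: np.diag([1.0, 10.0])
xstar = gtr(f, grad, hess, [1.0, 1.0])
```

On a quadratic the model is exact, so every step has $\rho_k \approx 1$ and the radius only grows; the step-acceptance machinery matters once the model and $f$ disagree.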

Trust-region methods

Other sensible values of the parameters of the GTR are possible. Solving the (TR) subproblem
$\min_{s \in \mathbb{R}^n} m_k(s)$ subject to $\|s\| \le \Delta_k$, (TR)
exactly, or even approximately, may imply substantial work. We want the minimal condition of sufficient decrease in the model that ensures global convergence of the GTR method (the Cauchy condition). In practice, we (usually) do much better than this condition!

Example of applying a trust-region method: [Sartenaer, 2008]; approximate solution of the (TR) subproblem: better than Cauchy, but not exact; notation: the ratio of decreases $\Delta f / \Delta m_k$ corresponds to $\rho_k$.

The Cauchy point of the (TR) subproblem

Recall that the steepest descent method has strong (theoretical) global convergence properties; the same will hold for the GTR method with the SD direction. Minimal condition of sufficient decrease in the model: require $m_k(s_k) \le m_k(s_k^c)$ and $\|s_k\| \le \Delta_k$, where $s_k^c := -\alpha_k^c \nabla f(x_k)$, with
$\alpha_k^c := \arg\min_{\alpha > 0} m_k(-\alpha \nabla f(x_k))$ subject to $\alpha \|\nabla f(x_k)\| \le \Delta_k$.
[I.e. a linesearch along the steepest descent direction is applied to $m_k$ at $x_k$ and is restricted to the trust region.]

Easy: $\alpha_k^c = \arg\min_\alpha m_k(-\alpha \nabla f(x_k))$ subject to $0 < \alpha \le \Delta_k / \|\nabla f(x_k)\|$. $y_k^c := x_k + s_k^c$ is the Cauchy point. [see Problem Sheet]
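For the quadratic model, the constrained one-dimensional minimisation defining $\alpha_k^c$ has a closed form, which the following sketch implements (subscripts dropped; the helper name is mine): the unconstrained minimiser along $-g$ is $g^T g / g^T H g$ when the curvature $g^T H g$ is positive, clipped to the boundary $\alpha = \Delta / \|g\|$; with nonpositive curvature the model decreases all the way to the boundary.

```python
# Sketch: the Cauchy step s_c = -alpha_c g for the quadratic model
# m(s) = f + g^T s + s^T H s / 2 inside ||s|| <= delta.
import numpy as np

def cauchy_step(g, H, delta):
    alpha_max = delta / np.linalg.norm(g)
    gHg = g @ H @ g
    if gHg <= 0:                   # nonpositive curvature along -g:
        alpha = alpha_max          # minimiser is on the boundary
    else:
        alpha = min((g @ g) / gHg, alpha_max)
    return -alpha * g

# Usage
g = np.array([1.0, 0.0])
H = np.diag([2.0, 1.0])
s_interior = cauchy_step(g, H, delta=10.0)   # interior: alpha = 0.5
s_boundary = cauchy_step(g, H, delta=0.1)    # clipped to ||s|| = 0.1
```

Any $s_k$ that does at least as well as this step in the model satisfies the Cauchy condition and so inherits the global convergence of the next slide.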

Global convergence of the GTR method

Theorem (GTR Global Convergence). Let $f$ be smooth and bounded below on $\mathbb{R}^n$, and let $\nabla f$ be Lipschitz continuous on $\mathbb{R}^n$. Let $\{x_k\}$, $k \ge 0$, be generated by the generic trust-region (GTR) method, and let the computation of $s_k$ be such that $m_k(s_k) \le m_k(s_k^c)$ for all $k$. Then either there exists $k \ge 0$ such that $\nabla f(x_k) = 0$, or $\lim_{k \to \infty} \nabla f(x_k) = 0$.

Solving the (TR) subproblem

On each TR iteration we compute or approximate the solution of
$\min_{s \in \mathbb{R}^n} m_k(s) = f(x_k) + s^T \nabla f(x_k) + \frac12 s^T \nabla^2 f(x_k) s$ subject to $\|s\| \le \Delta_k$.
Also, $s_k$ must satisfy the Cauchy condition $m_k(s_k) \le m_k(s_k^c)$, where $s_k^c := -\alpha_k^c \nabla f(x_k)$, with $\alpha_k^c := \arg\min_{\alpha > 0} m_k(-\alpha \nabla f(x_k))$ subject to $\alpha \|\nabla f(x_k)\| \le \Delta_k$. [The Cauchy condition ensures global convergence.]

Solve (TR) exactly (i.e., compute a global minimizer of (TR)) $\implies$ TR akin to a Newton-like method. Solve (TR) approximately (i.e., an approximate global minimizer) $\implies$ large-scale problems.

Solving the (TR) subproblem exactly

For $h \in \mathbb{R}$, $\Delta > 0$, $g \in \mathbb{R}^n$ and $H$ an $n \times n$ real matrix, consider
$\min_{s \in \mathbb{R}^n} m(s) := h + s^T g + \frac12 s^T H s$, s.t. $\|s\| \le \Delta$. (TR)

Characterization result for the solution of (TR). Theorem. Any global minimizer $s^*$ of (TR) satisfies the equation $(H + \lambda^* I)s^* = -g$, where $H + \lambda^* I$ is positive semidefinite, $\lambda^* \ge 0$, $\lambda^*(\|s^*\| - \Delta) = 0$ and $\|s^*\| \le \Delta$. If $H + \lambda^* I$ is positive definite, then $s^*$ is unique.

The above theorem gives necessary and sufficient global optimality conditions for a nonconvex optimization problem!

Solving the (TR) subproblem exactly

Computing the global solution $s^*$ of (TR):

Case 1. If $H$ is positive definite and the solution of $Hs = -g$ satisfies $\|s\| \le \Delta$: $s^* := s$ (unique), $\lambda^* := 0$ (by the Theorem).

Case 2. If $H$ is positive definite but $\|s\| > \Delta$, or $H$ is not positive definite, the Theorem implies that $s^*$ satisfies
$(H + \lambda I)s = -g$, $\|s\| = \Delta$, $\quad(**)$
for some $\lambda \ge \max\{0, -\lambda_{\min}(H)\} =: \bar\lambda$. Let $s(\lambda) := -(H + \lambda I)^{-1} g$ for any $\lambda > \bar\lambda$. Then $s^* = s(\lambda^*)$, where $\lambda^* \ge \bar\lambda$ is a solution of $\|s(\lambda)\| = \Delta$, $\lambda \ge \bar\lambda$: a nonlinear equation in the one variable $\lambda$. Use Newton's method to solve it.
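A sketch of Case 2, substituting bisection for the Newton iteration the slide mentions (a simplification of mine), using that $\|s(\lambda)\|$ is decreasing for $\lambda > \bar\lambda$; it assumes the Case 2 situation where the unconstrained step lies outside the region, and ignores the degenerate "hard case".

```python
# Sketch: solve ||s(lambda)|| = Delta for the boundary solution of (TR),
# with s(lambda) = -(H + lambda I)^{-1} g, by bisection on lambda.
import numpy as np

def tr_boundary_solution(H, g, delta):
    lam_bar = max(0.0, -np.linalg.eigvalsh(H)[0])   # smallest eigenvalue
    s = lambda lam: np.linalg.solve(H + lam * np.eye(len(g)), -g)
    lo = lam_bar + 1e-12                 # here ||s(lo)|| > delta (Case 2)
    hi = lam_bar + 1.0
    while np.linalg.norm(s(hi)) > delta: # bracket the root from the right
        hi *= 2
    for _ in range(200):                 # bisection on ||s(lam)|| - delta
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(s(mid)) > delta:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return s(lam), lam

# Usage: H positive definite, but the Newton step (-1, -0.5) has norm
# about 1.118 > Delta, so the solution lies on the boundary.
H = np.diag([1.0, 2.0])
g = np.array([1.0, 1.0])
s_star, lam_star = tr_boundary_solution(H, g, delta=0.5)
```

By construction the returned pair satisfies $(H + \lambda I)s = -g$ with $\|s\| = \Delta$ to bisection accuracy; production solvers use the safeguarded Newton iteration on the equivalent secular equation $1/\|s(\lambda)\| = 1/\Delta$ instead, which converges much faster.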

Further considerations on trust-region methods

Solving the (TR) subproblem accurately for large-scale (and dense) problems uses iterative methods (conjugate gradients, Lanczos); the Cauchy condition remains satisfied. Quasi-Newton methods/approximate derivatives are also possible in the trust-region framework (no need for positive definite updates of the Hessian): replace $\nabla^2 f(x_k)$ with an approximation $B_k$ in the quadratic local model $m_k(s)$.

Conclusions: state-of-the-art software exists implementing both linesearch and TR methods; the two have similar performance (linesearch methods need more heuristic approaches to deal with negative curvature). Choosing between the two is mostly a matter of taste.

Numerical optimization bibliography

J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 2nd edition, 2006.
R. Fletcher, Practical Methods of Optimization, Wiley, 2nd edition, 2000.
J. Dennis and R. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, (republished by) SIAM, 1996.

Information on existing software can be found at the NEOS Center: http://www.neos-guide.org (look under the Optimization Guide and Optimization Tree, etc.).