Unconstrained optimization I: Gradient-type methods


Unconstrained optimization I: Gradient-type methods
Antonio Frangioni, Department of Computer Science, University of Pisa
www.di.unipi.it/~frangio, frangio@di.unipi.it
Computational Mathematics for Learning and Data Analysis, Master in Computer Science, University of Pisa

Outline
- Unconstrained optimization
- Gradient method for quadratic functions
- Gradient method for general functions
- Exact Line Search: first-order approaches
- Exact Line Search: second-order approaches
- Exact Line Search: zeroth-order approaches
- Inexact Line Search: Armijo-Wolfe
- Really inexact Line Search: fixed stepsize

Optimization algorithms
- Iterative procedures (doh!): start from an initial guess x^0, then some process maps x^i into x^{i+1}
- Want the sequence { x^i } to go towards an optimal solution. Actually, three different forms:
  (strong) { x^i } → x*: the whole sequence converges to an optimal solution
  (weaker) all accumulation points of { x^i } (if any) are optimal solutions
  (weakest) at least one accumulation point of { x^i } (if any) is optimal
- X compact helps (accumulation points always exist), but here X = R^n
- f not convex ⇒ "optimal" has to be weakened to "stationary point"
- Two general forms of the process:
  line search: first choose d^i ∈ R^n (direction), then choose α_i ∈ R (stepsize) s.t. x^{i+1} ← x^i + α_i d^i
  trust region: first choose α_i (trust radius), then choose d^i
- In ML, α_i is often called the learning rate
- Crucial concept: the model of f used to construct the next iterate
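To fix ideas, the whole line-search template is one short loop; a minimal Python sketch of it (all names here are illustrative placeholders, not an algorithm from these slides):

    import numpy as np

    def line_search_method(f, grad, x0, choose_direction, choose_stepsize,
                           eps=1e-6, max_iter=10000):
        # Generic line-search template: direction first, stepsize second.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:    # approximate stationarity
                break
            d = choose_direction(x, g)      # e.g. d = -g (steepest descent)
            a = choose_stepsize(f, grad, x, d)
            x = x + a * d
        return x

Every method in the rest of the lecture is an instance of this loop: they differ only in the model of f behind choose_direction and choose_stepsize.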


First example of line search: gradient method
- Simplest idea: my model is linear. Best linear model of f at x^i:
  f_i(x) = f(x^i) + ∇f(x^i)( x − x^i ),   x^{i+1} ∈ argmin{ f_i(x) : x ∈ R^n }
- Except, of course, the argmin is empty: f_i is unbounded below on R^n
- "Go infinitely far" along the steepest descent direction d^i = −∇f(x^i)?
- But this clearly is trusting the model too much: f(x) ≠ f_i(x) far from x^i
- As you move along d^i, ∇f changes; soon ⟨∇f, d^i⟩ will no longer be negative
- Beware too long steps, as f will (probably) start growing after a while
- Too short steps are bad too: f will decrease, but only by too little
- The best step ever: α_i ∈ argmin{ f( x^i + αd^i ) : α ≥ 0 }, an exact line search (doh!)
- Then, x^{i+1} ← x^i + α_i d^i
- Exact line search is difficult in general, let's start simple
- Exercise: prove α_i > 0

Gradient method for quadratic functions
- Couldn't be simpler than f(x) = (1/2) x^T Q x + q^T x
- Think Q ⪰ 0, as otherwise f is surely unbounded below
- x* solves Qx = −q (if one exists), so this is linear algebra
- Inverting/factorizing Q is O(n^3) in practice: can we do better?
- d^i = −∇f(x^i) = −( Qx^i + q ) (O(n^2) to compute)
- Good news: the line search is easy, α_i = ‖d^i‖^2 / ( (d^i)^T Q d^i ) ⇒

  procedure x = SDQ( Q, q, x, ε ) {
    while( ‖∇f(x)‖ > ε ) do {
      d ← −∇f(x);  α ← ‖d‖^2 / ( d^T Q d );  x ← x + αd; } }

- Exercise: prove the formula for α_i
- Exercise: there is a glaring numerical problem in that procedure, fix it
- Exercise: something can go wrong with that formula: what does it mean? Improve the code to take that occurrence into account.
- Exercise: what happens if Q ⪰ 0 but Q is not ≻ 0? Does the (improved) code need be fixed?
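A runnable sketch of SDQ with numpy (the guard against d^T Q d ≤ 0 and the max_iter cap are my additions; the stopping test is on ‖d‖ = ‖∇f(x)‖ so that the gradient is computed only once per iteration):

    import numpy as np

    def sdq(Q, q, x, eps=1e-8, max_iter=100000):
        # Steepest descent, exact line search, on f(x) = 0.5 x'Qx + q'x.
        for _ in range(max_iter):
            d = -(Q @ x + q)                # d = -grad f(x), O(n^2)
            if np.linalg.norm(d) <= eps:    # (approximately) stationary
                break
            dQd = d @ Q @ d
            if dQd <= 0:                    # f unbounded below along d
                raise ValueError("Q is not positive definite along d")
            x = x + ((d @ d) / dQd) * d     # exact stepsize
        return x

For instance, sdq(np.diag([1000.0, 1.0]), np.zeros(2), np.ones(2)) converges to (0, 0).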

Gradient method: convergence
- "The gradient method works": what does this mean?
- Asymptotic analysis: ε = 0 ⇒ { x^i } is/contains a minimizing sequence
- Fundamental relationship: ⟨ ∇f(x^i), ∇f(x^{i+1}) ⟩ = 0
  Proof: ⟨ d^i, ∇f(x^{i+1}) ⟩ is the derivative of f along d^i at x^{i+1}, but x^{i+1} is a local minimum along d^i
- { x^i } → x̄ ⇒ ∇f(x̄) = 0
  Proof: lim_i ⟨ ∇f(x^i), ∇f(x^{i+1}) ⟩ = 0 = ⟨ ∇f(x̄), ∇f(x̄) ⟩ (why?)
- Any subsequence that converges does so at a stationary point (weaker)
- Do (sub)sequence(s) converge? X compact would help, but X = R^n
- ε > 0 ⇒ finite termination (why?), no convergence required
- Exercise: prove that if Q ≻ 0, then { x^i } → x*, the unique optimum

Gradient method: efficiency
- "The gradient method is (not) fast": what does this mean?
- How rapidly ‖x^i − x*‖ decreases... hard: { x^i } may not converge, and different subsequences may go to different optima (which x*?)
- Typically: how rapidly f(x^i) − f* decreases (eventually, it has to)
- Rate/order of convergence:
  lim_{i→∞} ( f(x^{i+1}) − f* ) / ( f(x^i) − f* )^p = R
  p = 1, R = 1 ⇒ sublinear convergence (e.g., 1/i, 1/i^2, ...)
  p = 1, R < 1 ⇒ linear convergence (e.g., γ^i, γ < 1)
  p = 1, R = 0 ⇒ superlinear (!) convergence (e.g., γ^{i^2}, γ < 1)
  p = 2, R > 0 ⇒ quadratic (!!!) convergence (e.g., γ^{2^i}, γ < 1)
- Linear convergence: in the tail, f(x^{i+1}) − f* ≤ R( f(x^i) − f* )
  ⇒ f(x^{i+1}) − f* ≤ ( f(x^1) − f* ) R^i, as fast as a negative exponential
- f(x^i) − f* ≤ ε for i ≥ log( ( f(x^1) − f* ) / ε ) / log( 1/R ):
  O( log( 1/ε ) ) iterations [good!], but the constant blows up as R → 1 [bad!]

Gradient method: efficiency
- The analysis is not obvious: one has to use properties of x* (unknown)
- In this case, a nifty trick:
  (1/2)( x − x* )^T Q ( x − x* ) = f(x) + (1/2) x*^T Q x* = f(x) − f*
  i.e., the error at x is (half the squared) distance between x and x* in the norm induced by Q
- Exercise: check the above formula (hint: remember Qx* + q = 0)
- One can then prove that, if Q ≻ 0, then
  f(x^{i+1}) − f* = [ 1 − ‖d^i‖^4 / ( ( (d^i)^T Q d^i )( (d^i)^T Q^{−1} d^i ) ) ] ( f(x^i) − f* )
  i.e., the error decreases by an exactly computable factor at each iteration
- Making sense of the above bound requires a bit of work
- Exercise: check the above formula (hint: for y^i = x^i − x*, d^i = −Q y^i)

Gradient method: efficiency (cont'd)
- Recall a few facts:
  Λ(Q) = { λ_1 ≥ ... ≥ λ_n > 0 }, the eigenvalues of Q
  ⇒ Λ(Q^{−1}) = { 1/λ_n ≥ ... ≥ 1/λ_1 > 0 }, the eigenvalues of Q^{−1}
  λ_n ‖x‖^2 ≤ x^T Q x ≤ λ_1 ‖x‖^2 for all x ∈ R^n
- Hence, ‖x‖^2 / ( x^T Q x ) ≥ 1/λ_1 and ‖x‖^2 / ( x^T Q^{−1} x ) ≥ λ_n (check)
  ⇒ for all x ∈ R^n:  ‖x‖^4 / ( ( x^T Q x )( x^T Q^{−1} x ) ) ≥ λ_n / λ_1
- A better estimate is possible (technical, just believe it; it is the Kantorovich inequality):
  ‖x‖^4 / ( ( x^T Q x )( x^T Q^{−1} x ) ) ≥ 4 λ_1 λ_n / ( λ_1 + λ_n )^2   for all x ∈ R^n
- A bit better indeed: with λ_1 = 1000 λ_n,
  λ_n / λ_1 = 0.001  <  4 λ_1 λ_n / ( λ_1 + λ_n )^2 ≈ 0.004

Gradient method: efficiency (wrap up)
- All in all:
  f(x^{i+1}) − f* ≤ [ ( λ_1 − λ_n ) / ( λ_1 + λ_n ) ]^2 ( f(x^i) − f* )
  the prototype of all linear convergence results
- Good news: the bound is dimension independent, i.e., it does not depend on n ⇒ it holds the same for very-large-scale problems
- Bad news: the bound depends badly on the conditioning of Q
- Example: λ_1 = 1000 λ_n ⇒ R ≈ 0.996, 1 / log( 1/R ) ≈ 576
  Note: with the coarser estimate, R ≈ 0.999 and 1 / log( 1/R ) ≈ 2301
- With f(x^1) − f* = 1 and ε = 10^{−6}, that requires ≈ 3500 iterations even for n = 2... but also for n = 10^8
- Dimension independence is liked a lot in ML, but R may → 1 as n grows
- More bad news: the behaviour in practice is close to the bound
- Intuitively, the algorithm zig-zags a lot when the level sets are very elongated
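These numbers are easy to reproduce; a small experiment of mine, using the exact-stepsize iteration above on a badly conditioned 2×2 quadratic:

    import numpy as np

    Q = np.diag([1000.0, 1.0])        # lambda_1 = 1000 * lambda_n
    f = lambda z: 0.5 * z @ Q @ z     # q = 0, so f* = 0 at x* = 0
    x, i = np.array([0.001, 1.0]), 0  # a worst-case starting point
    while f(x) > 1e-6:
        d = -(Q @ x)
        x = x + ((d @ d) / (d @ Q @ d)) * d
        i += 1
    print(i)  # ~3300 iterations: the error contracts by ~0.996 per step

(Here f(x^1) ≈ 0.5 rather than 1, which is why the count comes out slightly below the ≈ 3500 of the slide; the per-iteration factor matches the bound almost exactly.)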

En passant: the stopping criterion
- The stopping criterion ‖∇f(x^i)‖ ≤ ε is not what one would really want, which is rather
  f(x^i) − f* =: ε_A ≤ ε (absolute error)  or  ε_A / |f*| =: ε_R ≤ ε (relative error)
  (a more or less alternative version has f(x^i) at the denominator)
- Exercise: the definition of ε_R has a glaring numerical problem, fix it
- Exercise: explain exactly why ε_R is better than ε_A
- Except, f* is unknown (most often) and cannot be used on-line
- One needs a lower bound f̲ ≤ f*, tight at least towards termination; estimating f* could be considered "the true problem"
- Often f̲ is not there, hence ‖∇f(x^i)‖ ≤ ε is the only workable alternative
- But the relationship between the two ε is far from obvious
- Sometimes ∇f(x) has a physical meaning that can be used
- Exercise: for X = B(0, r) and f convex, estimate ε_A when ‖∇f(x^i)‖ ≤ ε


Gradient method: non-quadratic case
- What happens when f is a general nonlinear function?
- Good news: convergence is the same (we never used "f quadratic")
- The condition ⟨ ∇f(x^i), ∇f(x^{i+1}) ⟩ = 0 holds at local minima (but also at local maxima and saddle points), so convexity is not crucial
- Good/bad news: efficiency is basically the same. If f ∈ C^2 and x* is a local minimum such that ∇²f(x*) = Q ≻ 0, and { x^i } → x*, then { f(x^i) } → f(x*) linearly with the same R as in the quadratic case (depending on λ_1 and λ_n of Q)
- In the tail of the convergence process f ≈ its second-order model, so convergence is the same
- Fundamental issue: exact line search is difficult
- An algebraic solution (write down φ(α) = f( x − α∇f(x) ), find the roots of φ') is possible only in a limited set of cases
- One has to algorithmically search along the line for the right α_i (doh!)


Line Search: first-order approaches
- For φ(α) = f( x^i + αd^i ) : R → R,  φ'(α) = ⟨ ∇f( x^i + αd^i ), d^i ⟩
- Exercise: prove this using the chain rule: for f : R^m → R^k, g : R^n → R^m and h(x) = f(g(x)) : R^n → R^k, Jh(x) = Jf(g(x)) · Jg(x) (note that Jf ∈ R^{k×m}, Jg ∈ R^{m×n}, and in fact Jh ∈ R^{k×m} · R^{m×n} = R^{k×n})
- Find α_i s.t. φ'(α_i) = 0. ∇f continuous ⇒ φ' continuous (why?), and α_i must exist if ∃ ᾱ s.t. φ'(ᾱ) > 0
- Exercise: prove this (hint: use the intermediate value theorem)
- Obvious solution for finding such an ᾱ (see the Python sketch after the next slide):

  ᾱ ← 1;                           // or whatever value
  while( φ'(ᾱ) < 0 ) do ᾱ ← 2ᾱ;    // or whatever factor > 1

- Will work in practice for all "reasonable" functions
- Works if φ is coercive: lim_{α→∞} φ(α) = +∞ (e.g., f strongly convex)
- Exercise: construct an example where ᾱ exists but it is not found

Line Search: Bisection method
Pretty darn obvious:

  procedure α = LSBM( φ', ᾱ, ε ) {
    α⁻ ← 0;  α⁺ ← ᾱ;
    while( true ) do {
      α ← ( α⁺ + α⁻ ) / 2;  v ← φ'(α);
      if( |v| ≤ ε ) then break;
      if( v < 0 ) then α⁻ ← α; else α⁺ ← α; } }

- Asymptotic convergence: with ε = 0, { α_k } is an infinite sequence
  { α_k } ⊂ [ 0, ᾱ ] ⇒ there is a subsequence converging to some α* (why?)
  α* ∈ [ α⁻_k, α⁺_k ] for all k, and α⁺_k − α⁻_k = ᾱ 2^{−k} ⇒ { α_k } → α* (why?)
  ⇒ { φ'(α_k) } → φ'(α*) = 0 (why?) ⇒ finite termination for ε > 0
- Exercise: prove that φ' locally Lipschitz at α* ⇒ { |φ'(α_k)| } → 0 linearly (R?)
- Exercise: construct a counter-example (φ' not locally Lipschitz)
- Exercise: suggest assumptions on φ ensuring φ' locally Lipschitz ⇒ linear convergence
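A runnable sketch combining the doubling bracket of the previous slide with LSBM (it assumes φ'(0) < 0 and that the bracketing phase terminates, e.g. φ coercive):

    def ls_bisection(dphi, alpha_bar=1.0, eps=1e-8):
        # Bracketing: double until phi'(alpha_bar) >= 0.
        while dphi(alpha_bar) < 0:
            alpha_bar *= 2
        lo, hi = 0.0, alpha_bar            # phi'(lo) < 0 <= phi'(hi)
        while True:
            a = 0.5 * (lo + hi)
            v = dphi(a)
            if abs(v) <= eps:              # |phi'(a)| small enough
                return a
            if v < 0:
                lo = a                     # a zero of phi' lies to the right
            else:
                hi = a                     # ... or to the left

For the gradient method, dphi would be lambda a: grad(x + a * d) @ d.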

Improving the bisection method: interpolation
- Choosing α_{k+1} right in the middle is just the dumbest possible approach
- One knows a lot about φ: φ(α⁻), φ(α⁺), φ'(α⁻), φ'(α⁺) (the values need be computed, but they are usually free if one computes φ')
- Quadratic interpolation: q(α) = aα^2 + bα + c that agrees with φ at α⁺, α⁻
- Three parameters, four conditions: something's gotta give (three cases)
- Example, matching the two derivatives: 2aα⁺ + b = φ'(α⁺), 2aα⁻ + b = φ'(α⁻) ⇒
  a = ( φ'(α⁺) − φ'(α⁻) ) / ( 2( α⁺ − α⁻ ) ),   b = ( α⁺ φ'(α⁻) − α⁻ φ'(α⁺) ) / ( α⁺ − α⁻ )
- The minimum solves 2aα + b = 0 (c irrelevant):
  α = ( α⁻ φ'(α⁺) − α⁺ φ'(α⁻) ) / ( φ'(α⁺) − φ'(α⁻) )
  a convex combination of α⁺ and α⁻ (check)
- Exercise: develop the other cases of quadratic interpolation and discuss them

Improving the bisection method: more interpolation
- It can be proven (long and complicated) that, if φ ∈ C^3, then quadratic interpolation has convergence of order 1 < p < 2 (superlinear)
- For instance, the previous formula (a.k.a. the method of false position, or "secant formula") has p = ( 1 + √5 ) / 2 ≈ 1.618
- Exercise: propose a simple modification that guarantees (linear) convergence even if φ ∉ C^3, while changing the "normal" run as little as possible
- Four conditions ⇒ one can fit a cubic polynomial and use its minima
- Rather tedious to write down, analyse and implement, but it theoretically pays: cubic interpolation has quadratic convergence (p = 2)
- Seems to work pretty well in practice
- Exercise (not for the faint of heart): develop cubic interpolation
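Plugging the interpolation step into the bracketed scheme gives a false-position line search; a sketch (keeping the bracket is the safeguard that the pure secant step lacks):

    def ls_secant(dphi, alpha_bar=1.0, eps=1e-8, max_iter=100):
        while dphi(alpha_bar) < 0:       # same bracketing phase as before
            alpha_bar *= 2
        lo, hi = 0.0, alpha_bar
        dlo, dhi = dphi(lo), dphi(hi)    # dlo < 0 <= dhi
        a = hi
        for _ in range(max_iter):
            # quadratic-interpolation (secant) step from the slide
            a = (lo * dhi - hi * dlo) / (dhi - dlo)
            v = dphi(a)
            if abs(v) <= eps:
                break
            if v < 0:
                lo, dlo = a, v
            else:
                hi, dhi = a, v
        return a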


Line Search: second-order approaches
- More derivatives ⇒ same information with less points
- f ∈ C^2 ⇒ φ''(α) = d^T ∇²f( x + αd ) d, and it is continuous (why?)
- Exercise: prove this using the chain rule
- Computing ∇²f ⇒ quadratic convergence with only one point
- Newton's method (tangent method): first-order Taylor expansion of φ' at α_k:
  φ'(α) ≈ φ'(α_k) + φ''(α_k)( α − α_k ); solving φ'(α) = 0 gives
  α = α_k − φ'(α_k) / φ''(α_k)
- This is clearly a second-order approximation of φ
- Fantastically simple:

  procedure α = LSNM( φ', φ'', α, ε ) {
    while( |φ'(α)| > ε ) do α ← α − φ'(α) / φ''(α); }

- Extremely good convergence (under appropriate conditions)
- Clearly numerically delicate: what if φ''(α) ≈ 0?
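LSNM in runnable form (the guard against a vanishing φ'' is my addition, since the slide itself flags that weakness):

    def ls_newton(dphi, ddphi, alpha=1.0, eps=1e-8, max_iter=50):
        # Newton's method on phi'(alpha) = 0; needs a good starting point.
        for _ in range(max_iter):
            v = dphi(alpha)
            if abs(v) <= eps:
                break
            h = ddphi(alpha)
            if abs(h) < 1e-16:           # phi''(alpha) ~ 0: step undefined
                raise ZeroDivisionError("phi'' vanished at alpha")
            alpha -= v / h
        return alpha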

Analysis of Newton's method
- The theoretical analysis of Newton's method is instructive
- If φ ∈ C^3, φ'(α*) = 0 and φ''(α*) ≠ 0, then ∃ δ > 0 s.t., if Newton's method starts at some α_0 ∈ [ α* − δ, α* + δ ], then { α_k } → α* with p = 2
- Proof: the iteration gives
  α_{k+1} − α* = α_k − α* − ( φ'(α_k) − φ'(α*) ) / φ''(α_k)
               = [ φ'(α*) − φ'(α_k) + φ''(α_k)( α_k − α* ) ] / φ''(α_k)
  For some β ∈ [ α_k, α* ], Taylor gives
  φ'(α*) = φ'(α_k) + φ''(α_k)( α* − α_k ) + φ'''(β)( α* − α_k )^2 / 2
  ⇒ α_{k+1} − α* = [ φ'''(β) / ( 2 φ''(α_k) ) ] ( α_k − α* )^2
  ∃ δ > 0 s.t. |φ''(α)| ≥ k_2 > 0 (why?) and |φ'''(β)| ≤ k_1 < ∞ (why?) for α, β ∈ [ α* − δ, α* + δ ]
  ⇒ |α_{k+1} − α*| ≤ [ k_1 / ( 2 k_2 ) ] |α_k − α*|^2
  k_1 |α_k − α*| / ( 2 k_2 ) < 1 ⇒ |α_{k+1} − α*| < |α_k − α*|
  ⇒ { α_k } → α*, and the convergence is quadratic
- Convergence only if |α_0 − α*| is small enough: nontrivial to ensure in practice


Line Search: zeroth-order approaches
- Computing ∇f / ∇²f can be costly (d^T ∇²f d is O(n^2) already)
- Only use values of φ: fewer derivatives ⇒ more points
- Golden ratio search (assuming φ unimodal on [ 0, ᾱ ]):

  procedure α = LSGRM( φ, ᾱ, ε ) {
    α⁻ ← 0;  α⁺ ← ᾱ;  α'⁻ ← 0.382 ᾱ;  α'⁺ ← 0.618 ᾱ;
    while( α⁺ − α⁻ > ε ) do
      if( φ(α'⁻) > φ(α'⁺) ) then {
        α⁻ ← α'⁻;  α'⁻ ← α'⁺;  α'⁺ ← α⁻ + 0.618( α⁺ − α⁻ ); }
      else {
        α⁺ ← α'⁺;  α'⁺ ← α'⁻;  α'⁻ ← α⁻ + 0.382( α⁺ − α⁻ ); } }

- 0.618 ≈ ( √5 − 1 ) / 2 (the golden ratio), 0.382 = 1 − 0.618
- Property: r ≈ 0.618 satisfies r = ( 1 − r ) / r ≈ 0.382 / 0.618, i.e., r : 1 = ( 1 − r ) : r
- Can compute only one new value φ(α) per iteration
- Can do slightly better by using r_k = F_{n−k} / F_{n−k+1} (Fibonacci sequence)
- Exercise: picture out graphically how it works
- Exercise: analyse asymptotic and finite convergence of the approach
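A runnable version of LSGRM (a sketch assuming φ unimodal on [0, ᾱ]; note how one of the two interior values is re-used at every iteration, exactly as in the procedure above):

    def ls_golden(phi, alpha_bar, eps=1e-8):
        r = (5 ** 0.5 - 1) / 2               # ~0.618, the golden ratio
        lo, hi = 0.0, alpha_bar
        a = lo + (1 - r) * (hi - lo)         # the ~0.382 point
        b = lo + r * (hi - lo)               # the ~0.618 point
        fa, fb = phi(a), phi(b)
        while hi - lo > eps:
            if fa > fb:                      # minimum is in [a, hi]
                lo, a, fa = a, b, fb         # old b becomes the new a
                b = lo + r * (hi - lo)
                fb = phi(b)                  # the only new evaluation
            else:                            # minimum is in [lo, b]
                hi, b, fb = b, a, fa         # old a becomes the new b
                a = lo + (1 - r) * (hi - lo)
                fa = phi(a)                  # the only new evaluation
        return 0.5 * (lo + hi)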

Gradient method and (inexact) line search
- Is |φ'(α_i)| ≤ ε enough for convergence? It depends on ε (of course)
- Trick: d^i = −∇f(x^i) / ‖∇f(x^i)‖ ⇒ ‖d^i‖ = 1, φ'(0) = −‖∇f(x^i)‖
  φ'(α_i) = ⟨ d^i, ∇f(x^{i+1}) ⟩ = −⟨ ∇f(x^i) / ‖∇f(x^i)‖, ∇f(x^{i+1}) ⟩
- { x^i } → x̄ ⇒ lim_i ⟨ ∇f(x^i) / ‖∇f(x^i)‖, ∇f(x^{i+1}) ⟩ = ⟨ ∇f(x̄) / ‖∇f(x̄)‖, ∇f(x̄) ⟩ = ‖∇f(x̄)‖ ≤ ε (note: ‖∇f(x^i)‖ > ε along the way)
- ε > 0 and { x^i } → x̄ ⇒ for some finite i, x^i is an "approximate stationary point"
- Note: with d^i := −∇f(x^i) (unnormalized), use ε := ε ‖∇f(x^i)‖ instead
- Other assumptions on f are needed to ensure { x^i } → x̄ (R^n is not compact)
- A simple one: f coercive, i.e., lim_{‖x‖→∞} f(x) = +∞: f continuous and coercive ⇒ the sublevel set S(f, v) is compact for all v
- Exercise: prove f coercive (+ what else is needed) ⇒ the algorithm finitely stops
- Exercise: discuss how to get asymptotic convergence (ε = 0)
- Do we really need a close approximation to a point where ∇f(x) = 0?


Gradient method and (really) inexact line search
- We don't need to get a local minimum, just to decrease f "enough"
- Armijo condition: for 0 < m_1 < 1,
  (A) φ(α) ≤ φ(0) + m_1 α φ'(0)
  i.e., get at least a fraction m_1 (≪ 1) of the descent promised by the model
  Issue: arbitrarily short steps satisfy (A)
- Goldstein condition: for m_1 < m_2 < 1,
  (G) φ(α) ≥ φ(0) + m_2 α φ'(0)
  Issue: (A) ∧ (G) can easily exclude all local minima
- Wolfe condition: for m_1 < m_3 < 1,
  (W) φ'(α) ≥ m_3 φ'(0)
  the curvature has to be "a bit closer to 0" (but it can be ≫ 0)
- Strong Wolfe condition:
  (W') |φ'(α)| ≤ m_3 |φ'(0)| = −m_3 φ'(0)
  now φ'(α) cannot be ≫ 0, but (W') still captures all local minima (and maxima)
- Clearly, (W') ⇒ (W)
- (A) ∧ (W) / (W') typically captures all local minima... unless m_1 is too close to 1 (that's why m_1 ≈ 0.0001)
- [Figure on the original slide: the conditions (A), (G), (W) illustrated on the graph of f(x)]
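The conditions themselves are one-liners; a Python sketch, with dphi0 = φ'(0) < 0, m_1 = 0.0001 as in the slide, and m_3 = 0.9 (a common default in the literature, not from these slides):

    def armijo(phi, dphi0, alpha, m1=1e-4):
        # (A): at least a fraction m1 of the first-order predicted decrease
        return phi(alpha) <= phi(0.0) + m1 * alpha * dphi0

    def strong_wolfe(dphi, dphi0, alpha, m3=0.9):
        # (W'): derivative at alpha small enough in absolute value
        return abs(dphi(alpha)) <= -m3 * dphi0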

Armijo-Wolfe line search
- φ ∈ C^1 and φ(α) bounded below for α ≥ 0 ⇒ ∃ α s.t. (A) ∧ (W') holds
- Proof: let l(α) = φ(0) + m_1 α φ'(0) and d(α) = l(α) − φ(α)
  ⇒ d(0) = 0, d'(0) = ( m_1 − 1 ) φ'(0) > 0 (as m_1 < 1)
  ∄ ᾱ > 0 s.t. d(ᾱ) = 0 ⇒ φ stays below l ⇒ φ unbounded below (why?)
  Take the smallest ᾱ > 0 s.t. d(ᾱ) = 0: (A) is satisfied ∀ α ∈ ( 0, ᾱ ] (why?)
  Rolle-type argument: d'(ᾱ) < 0 ⇒ φ'(ᾱ) > m_1 φ'(0) ( > m_3 φ'(0) > φ'(0) )
  Intermediate value theorem (on φ'): ∃ α* ∈ ( 0, ᾱ ) s.t. φ'(α*) = m_3 φ'(0) ⇒ (W') also holds at α*
- But how do I actually find such a point?
- m_1 small enough s.t. local minima are not cut off ⇒ just go for the local minima and stop whenever (A) ∧ (W) / (W') holds
- Hard to say if m_1 is small enough, although m_1 = 0.0001 most often is
- A specialized line search can be constructed for the odd case it is not
- Basic idea: find an interval [ α̲, ᾱ ] that surely contains points satisfying (A) ∧ (W) / (W') (cf. the proof above), then restrict the search inside it
- Exercise (not for the faint of heart): develop the specialized line search

Convergence with Armijo-Wolfe line search
- ∇f Lipschitz continuous and (A) ∧ (W) always enforced ⇒ either f is unbounded below or { ‖∇f(x^i)‖ } → 0
- Proof: (W) ⇒ φ'(α_i) − φ'(0) ≥ ( 1 − m_3 )( −φ'(0) )
  ∇f Lipschitz ⇒ φ' Lipschitz, and L does not depend on x^i (check)
  ⇒ L α_i ≥ φ'(α_i) − φ'(0) ⇒ α_i ≥ ( 1 − m_3 )( −φ'(0) ) / L (check: where has ‖d^i‖ gone?)
  −φ'(0) = ‖∇f(x^i)‖ ≥ ε > 0 ⇒ α_i ≥ δ > 0
  (A) ⇒ f(x^{i+1}) ≤ f(x^i) − m_1 α_i ‖∇f(x^i)‖ ≤ f(x^i) − m_1 δ ε
  ⇒ { f(x^i) } → −∞ (or { ‖∇f(x^i)‖ } → 0)
- Usual stuff: { x^i } → x̄ ⇒ x̄ is a stationary point
- Hence, the algorithm finitely terminates with ε > 0
- Insight from the proof: (W) (+ Lipschitz) serves to ensure that α_i ≥ c ‖∇f(x^i)‖ for some c > 0
- Can we get the same in a simpler way?

Backtracking line search
Backtracking line search:

  procedure α = BLS( φ, φ', ᾱ, m_1, τ ) {
    α ← ᾱ;
    while( φ(α) > φ(0) + m_1 α φ'(0) ) do α ← τ α; }

- ∇f Lipschitz ⇒ the gradient method with BLS "works"
- Proof (for simplicity, ᾱ = 1 as input). Remember the previous proof: ∃ ᾱ' s.t. (A) holds ∀ α ∈ ( 0, ᾱ' ] and φ'(ᾱ') > m_1 φ'(0) > φ'(0)
  ⇒ L( ᾱ' − 0 ) ≥ φ'(ᾱ') − φ'(0) > ( 1 − m_1 )( −φ'(0) )
  ⇒ ᾱ' > ( 1 − m_1 ) ‖∇f(x^i)‖ / L (same as before)
  ‖∇f(x^i)‖ > ε ∀ i ⇒ ᾱ' > δ > 0 ∀ i
  h = min{ k : τ^k ≤ δ } ⇒ α_i ≥ τ^h > 0 ∀ i
  ⇒ f(x^{i+1}) ≤ f(x^i) − m_1 τ^h ε ⇒ { f(x^i) } → −∞ or finite termination
- Now, { x^i } → x̄ ⇒ x̄ stationary, blah blah
- Fundamental trick: α_i can → 0, but only as fast as ‖∇f(x^i)‖
- It would be simpler if α_i ≥ δ > 0 for good
- Exercise: remove the assumption ᾱ = 1 (input)
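BLS is equally short in Python (a sketch; φ(0) and φ'(0) are passed in so that they are computed only once):

    def bls(phi, phi0, dphi0, alpha=1.0, m1=1e-4, tau=0.5, max_iter=100):
        # Shrink alpha geometrically until the Armijo condition (A) holds.
        for _ in range(max_iter):
            if phi(alpha) <= phi0 + m1 * alpha * dphi0:
                break
            alpha *= tau
        return alpha

With d^i = −∇f(x^i), one would call it as bls(lambda a: f(x + a * d), f(x), -(d @ d)).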


Line Search: really really inexact
- ... no line search at all: a fixed stepsize
- Recall: ∇f Lipschitz ⇒ f(y) ≤ f(x) + ∇f(x)( y − x ) + (L/2) ‖y − x‖^2
- With y := x^{i+1}, x := x^i, y − x := −α∇f(x^i):
  f(x^{i+1}) ≤ f(x^i) + ( Lα^2/2 − α ) ‖∇f(x^i)‖^2 (check)
- Powerful idea: find the α that provides the best worst-case improvement:
  v(α) = Lα^2/2 − α, v'(α) = Lα − 1 = 0 ⇒ α* = 1/L, v(α*) = −1/(2L)
- All in all: f(x^{i+1}) ≤ f(x^i) − ‖∇f(x^i)‖^2 / (2L)
- Can't do better if you trust the quadratic upper estimate (which of course must not be trusted)
- In fact, α_i = 1/L is terrible in practice ⇒ use the previous methods
- Enticing because simple and inexpensive
- Selecting the parameters that lead to the best performance for a model is a very powerful idea in general
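No line search is left at all; a minimal sketch (assuming a known Lipschitz constant L of ∇f, which in practice is rarely available):

    import numpy as np

    def gd_fixed(grad, x0, L, eps=1e-6, max_iter=100000):
        # Gradient descent with the worst-case-optimal fixed step 1/L.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= eps:
                break
            x = x - g / L
        return x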

Fixed stepsize: convergence rate
- Once you have convergence, you can talk efficiency (easier with α fixed)
- We already know the error decreases, but how fast? Write Δ_{i+1} := f(x^{i+1}) − f(x*), Δ_i := f(x^i) − f(x*)
- Since x^i is an arbitrary point and f(x*) ≤ f(x^{i+1}), the descent estimate gives
  f(x*) ≤ f(x) − ‖∇f(x)‖^2 / (2L) for all x
- f convex ⇒ ∇f(x)( x − x* ) ≥ f(x) − f(x*) ≥ ‖∇f(x)‖^2 / (2L) for all x
- This proves that r_i := ‖x^i − x*‖ decreases:
  ( r_{i+1} )^2 = ‖ x^{i+1} − x* ‖^2 = ‖ x^i − x* − ∇f(x^i)/L ‖^2
  = ‖ x^i − x* ‖^2 − 2 ∇f(x^i)( x^i − x* )/L + ‖∇f(x^i)‖^2/L^2 ≤ ‖ x^i − x* ‖^2 = ( r_i )^2
- Hence, at the very least { x^i } → x* (no problem here)
- Technical step: ‖∇f(x^i)‖ ≥ ( r_i / r_1 ) ‖∇f(x^i)‖ ≥ ∇f(x^i)( x^i − x* ) / r_1 [Cauchy-Schwarz] ≥ ( f(x^i) − f(x*) ) / r_1 [convexity] = Δ_i / r_1
- Conclusion: Δ_{i+1} ≤ Δ_i − ‖∇f(x^i)‖^2/(2L) ≤ Δ_i − ( Δ_i )^2 / ( 2( r_1 )^2 L )
  = Δ_i ( 1 − Δ_i / ( 2( r_1 )^2 L ) )
  not linear convergence, as "R" is not constant: sublinear

Fixed stepsize: convergence rate (cont'd)
- What does this mean, exactly? From Δ_{i+1} ≤ Δ_i − ( Δ_i )^2 / ( 2( r_1 )^2 L ), divide by Δ_{i+1} Δ_i:
  1/Δ_i ≤ 1/Δ_{i+1} − ( Δ_i / Δ_{i+1} ) / ( 2( r_1 )^2 L )
  ⇒ 1/Δ_{i+1} ≥ 1/Δ_i + 1/( 2( r_1 )^2 L ) (why?)
- 1/Δ grows by at least a constant at each i ⇒ 1/Δ_{i+1} ≥ 1/Δ_1 + i/( 2( r_1 )^2 L )
  ⇒ Δ_{i+1} ≤ 2 Δ_1 ( r_1 )^2 L / ( 2( r_1 )^2 L + i Δ_1 )
- The error decreases as O( 1/i ) ⇒ O( 1/ε ) iterations (check the details)
- Exponentially worse than O( log( 1/ε ) )
- However, this comparison is unfair: in the quadratic analysis we used Q nonsingular, i.e., λ_n > 0
- Does it make a difference? You bet

Fixed stepsize: convergence rate with strong convexity
- "Basically, strong convexity": eigenvalues bounded both above and below,
  uI ⪯ ∇²f(x) ⪯ LI, with u > 0
- Taylor ⇒ f(x) ≥ f(x^i) + ∇f(x^i)( x − x^i ) + u ‖x − x^i‖^2 / 2 (why?)
- Minimize both sides independently over x ⇒
  f(x*) ≥ f(x^i) − ‖∇f(x^i)‖^2 / (2u) (check) ⇒ ‖∇f(x^i)‖^2 ≥ 2u( f(x^i) − f(x*) )
- Plug into f(x^{i+1}) − f(x*) ≤ f(x^i) − f(x*) − ‖∇f(x^i)‖^2 / (2L) ⇒
  f(x^{i+1}) − f(x*) ≤ ( f(x^i) − f(x*) )( 1 − u/L )
  with the exact step, funnily, the same rate as the coarse estimate, i.e., much worse than the fine one
- A small difference in f makes a big difference in convergence
- The properties of f are even more important than the algorithm
- O( 1/ε ) is not the best possible without strong convexity: O( 1/√ε ) is possible ("better", but still much worse than O( log( 1/ε ) )); hence better algorithms do count, and we'll work towards that
- However, O( 1/√ε ) is tight: one can't do better without strong convexity
- Algorithms can only get so far with nasty problems
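The gap is easy to observe numerically; an illustrative experiment of mine with the fixed step 1/L (f(x) = x^4/4 is convex but not strongly convex at its minimum, since f''(0) = 0, while the quadratic is strongly convex):

    import numpy as np

    # Not strongly convex at x* = 0: f(x) = x^4/4, f'(x) = x^3;
    # on [-1, 1], f''(x) = 3x^2 <= 3, so L = 3 there.
    x, i = 1.0, 0
    while x ** 4 / 4 > 1e-8:
        x -= x ** 3 / 3
        i += 1

    # Strongly convex: f(y) = (10 y1^2 + y2^2)/2, so u = 1, L = 10.
    y, j = np.array([1.0, 1.0]), 0
    while (10 * y[0] ** 2 + y[1] ** 2) / 2 > 1e-8:
        y -= np.array([10 * y[0], y[1]]) / 10
        j += 1

    print(i, j)  # thousands of iterations vs. less than a hundred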

Wrap up
- Gradient (descent direction) + line search ⇒ convergence
- The line search by no means has to be exact... but it cannot be too coarse either
- Many different practical line searches, up to no search at all
- Convergence of gradient methods can range from quite bad to horrible... in practice as well as in theory
- Something better is sorely needed