Unconstrained optimization I: Gradient-type methods
Antonio Frangioni, Department of Computer Science, University of Pisa
www.di.unipi.it/~frangio, frangio@di.unipi.it
Computational Mathematics for Learning and Data Analysis, Master in Computer Science, University of Pisa
Outline
- Unconstrained optimization
- Gradient method for quadratic functions
- Gradient method for general functions
- Exact Line Search: first-order approaches
- Exact Line Search: second-order approaches
- Exact Line Search: zeroth-order approaches
- Inexact Line Search: Armijo-Wolfe
- Really inexact Line Search: fixed stepsize
Optimization algorithms
- Iterative procedures (doh!): start from an initial guess x^0, some process maps x^i → x^{i+1}
- Want the sequence { x^i } to go towards an optimal solution; actually three different forms:
  (strong) { x^i } → x*: the whole sequence converges to an optimal solution
  (weaker) all accumulation points of { x^i } (if any) are optimal solutions
  (weakest) at least one accumulation point of { x^i } (if any) is optimal
- X compact helps (accumulation points always exist), but here X = R^n
- f not convex ⇒ "optimal" weakens to "stationary point"
- Two general forms of the process:
  line search: first choose d^i ∈ R^n (direction), then choose α_i ∈ R (stepsize) s.t. x^{i+1} = x^i + α_i d^i
  trust region: first choose α_i (trust radius), then choose d^i
- In ML, α_i is often called the learning rate
- Crucial concept: the model of f used to construct the next iterate
First example of line search: gradient method
- Simplest idea: my model is linear. Best linear model of f at x^i:
  f_i(x) = f(x^i) + ∇f(x^i)( x − x^i ),   x^{i+1} ∈ argmin{ f_i(x) : x ∈ R^n }
- Except, of course, the argmin is empty: f_i is unbounded below on R^n
- "Go infinitely far" along the steepest descent direction d^i = −∇f(x^i)
- But this clearly is trusting the model too much: f(x) ≉ f_i(x) far from x^i
- As you move along d^i, ∇f changes; soon the directional derivative of f along d^i will no longer be negative
- Beware too long steps, as f will (probably) start growing after a while
- Too short steps are bad too: f will decrease, but only too little
- The best step ever: α_i ∈ argmin{ f( x^i + αd^i ) : α ≥ 0 }: exact line search (doh!). Then x^{i+1} = x^i + α_i d^i
- Exact line search is difficult in general, let's start simple
- Exercise: prove α_i > 0
Gradient method for quadratic functions
- Couldn't be simpler than f(x) = (1/2) x^T Q x + q^T x
- Think Q ≻ 0, as otherwise f is surely unbounded below
- x* solves Qx = −q (if ∃), so this is linear algebra; inverting/factorizing Q is O(n^3) in practice, can we do better?
- d^i = −∇f(x^i) = −( Qx^i + q ) (O(n^2) to compute)
- Good news: the line search is easy: α_i = ‖d^i‖^2 / ( (d^i)^T Q d^i ) ⇒

  procedure x = SDQ( Q, q, x, ε ) {
    while( ‖∇f(x)‖ > ε ) do {
      d ← −∇f(x);  α ← ‖d‖^2 / ( d^T Q d );  x ← x + αd;
    }
  }

- Exercise: prove the formula for α_i
- Exercise: there is a glaring numerical problem in that procedure, fix it
- Exercise: something can go wrong with that formula: what does it mean? Improve the code to take that occurrence into account.
- Exercise: what happens if Q ⊁ 0? Does the (improved) code need be fixed?
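The SDQ procedure above can be sketched in Python with NumPy (function name illustrative; the numerical fixes asked for by the exercises are deliberately left out):

```python
import numpy as np

def sdq(Q, q, x, eps=1e-8, max_iter=10_000):
    """Steepest descent for f(x) = 1/2 x'Qx + q'x, Q > 0, with exact step."""
    for _ in range(max_iter):
        d = -(Q @ x + q)                 # d = -grad f(x) = -(Qx + q)
        if np.linalg.norm(d) <= eps:     # stop on ||grad f(x)|| <= eps
            break
        alpha = (d @ d) / (d @ (Q @ d))  # exact line search on a quadratic
        x = x + alpha * d
    return x
```

For instance, with Q = diag(2, 1) and q = (−2, −1) the iterates approach the solution of Qx = −q, i.e., x* = (1, 1).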
Gradient method: convergence
- "The gradient method works": what does this mean?
- Asymptotic analysis: ε = 0 ⇒ { x^i } is/contains a minimizing sequence
- Fundamental relationship: ⟨ ∇f(x^i), ∇f(x^{i+1}) ⟩ = 0
  Proof: ⟨ d^i, ∇f(x^{i+1}) ⟩ is the directional derivative of f along d^i at x^{i+1}, but x^{i+1} is a local minimum along d^i
- { x^i } → x̄ ⇒ ∇f(x̄) = 0
  Proof: lim_i ⟨ ∇f(x^i), ∇f(x^{i+1}) ⟩ = 0 = ⟨ ∇f(x̄), ∇f(x̄) ⟩ (why?)
- Any subsequence that converges does so at a stationary point (weaker)
- Do (sub)sequence(s) converge? X compact would help, but X = R^n
- ε > 0 ⇒ finite termination (why?), no convergence required
- Exercise: prove that if Q ≻ 0, then { x^i } → x*, the unique optimum
Gradient method: efficiency
- "The gradient method is (not) fast": what does this mean?
- How rapidly ‖x^i − x*‖ decreases... hard; { x^i } may not converge, different subsequences → different optima (which x*?)
- Typically, how rapidly f(x^i) − f* decreases (eventually, it has to)
- Rate/order of convergence: lim_i ( f(x^{i+1}) − f* ) / ( f(x^i) − f* )^p = R
  p = 1, R = 1 ⇒ sublinear convergence (1/i, 1/i^2, ...)
  p = 1, R < 1 ⇒ linear convergence (γ^i, γ < 1)
  p = 1, R = 0 ⇒ superlinear (!) convergence (γ^(i^2), γ < 1)
  p = 2, R > 0 ⇒ quadratic (!!!) convergence (γ^(2^i), γ < 1)
- Linear convergence: in the tail, f(x^{i+1}) − f* ≈ R( f(x^i) − f* ) ⇒ f(x^i) − f* ≈ ( f(x^1) − f* )R^i, as fast as a negative exponential
- f(x^i) − f* ≤ ε for i ≥ log( ( f(x^1) − f* ) / ε ) / log(1/R): O( log(1/ε) ) [good!], but the constant → ∞ as R → 1 [bad!]
Gradient method: efficiency
- The analysis is not obvious, have to use properties of x* (unknown)
- In this case, a nifty trick:
  (1/2)( x − x* )^T Q ( x − x* ) = f(x) + (1/2) x*^T Q x* = f(x) − f*
  the error at x is the distance between x and x* in the norm induced by Q
- Exercise: check the above formula (hint: remember Qx* + q = 0)
- One can then prove that if Q ≻ 0, then
  f(x^{i+1}) − f* = ( 1 − ‖d^i‖^4 / ( ((d^i)^T Q d^i)((d^i)^T Q^{-1} d^i) ) ) ( f(x^i) − f* )
  the error decreases by exactly a constant factor at each iteration
- Making sense of the above bound requires a bit of work
- Exercise: check the above formula (hint: for y^i = x^i − x*, d^i = −Qy^i)
Gradient method: efficiency (cont.d)
- Recall a few facts: Λ(Q) = { λ_1 ≥ ... ≥ λ_n > 0 } eigenvalues of Q ⇒ Λ(Q^{-1}) = { 1/λ_n ≥ ... ≥ 1/λ_1 > 0 } eigenvalues of Q^{-1}
- λ_n ‖x‖^2 ≤ x^T Q x ≤ λ_1 ‖x‖^2  ∀ x ∈ R^n
- Hence, ‖x‖^2 / x^T Q x ≥ 1/λ_1 and ‖x‖^2 / x^T Q^{-1} x ≥ λ_n (check)
  ⇒ ∀ x ∈ R^n:  ‖x‖^4 / ( (x^T Q x)(x^T Q^{-1} x) ) ≥ λ_n / λ_1
- A better estimate is possible (technical, just believe it):
  ∀ x ∈ R^n:  ‖x‖^4 / ( (x^T Q x)(x^T Q^{-1} x) ) ≥ 4λ_1λ_n / ( λ_1 + λ_n )^2
- A bit better indeed: with λ_1 = 1000λ_n,  λ_n / λ_1 = 0.001 < 4λ_1λ_n / ( λ_1 + λ_n )^2 ≈ 0.004
Gradient method: efficiency (wrap up)
- All in all:  f(x^{i+1}) − f* ≤ ( ( λ_1 − λ_n ) / ( λ_1 + λ_n ) )^2 ( f(x^i) − f* )
  the prototype of all linear convergence results
- Good news: the bound is dimension-independent: it does not depend on n ⇒ holds the same for very-large-scale problems
- Bad news: the bound depends badly on the conditioning of Q
- Example: λ_1 = 1000λ_n ⇒ R ≈ 0.996, 1/log(1/R) ≈ 576
  Note: with the coarser formula, R = 0.999, 1/log(1/R) ≈ 2301
- With f(x^1) − f* = 1, ε = 10^{-6} requires ≈ 3500 iterations even for n = 2... but also for n = 10^8
- Dimension independence is liked a lot in ML, but R may → 1 as n grows
- More bad news: the behaviour in practice is close to the bound
- Intuitively, the algorithm zig-zags a lot when level sets are very elongated
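The figures above are easy to reproduce; a small sketch (function name illustrative) computing R and the iteration estimate log( ( f(x^1) − f* ) / ε ) / log(1/R) from the bound:

```python
import math

def rate_and_iters(l1, ln, eps, gap0=1.0):
    """R = ((l1 - ln)/(l1 + ln))^2 and the number of iterations needed
    so that gap0 * R^i <= eps under the linear-rate bound of the slide."""
    R = ((l1 - ln) / (l1 + ln)) ** 2
    iters = math.ceil(math.log(gap0 / eps) / math.log(1.0 / R))
    return R, iters

R, iters = rate_and_iters(1000.0, 1.0, 1e-6)  # lambda_1 = 1000 lambda_n
```

This gives R ≈ 0.996 and roughly 3500 iterations, matching the slide up to rounding, for any n.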
En passant: the stopping criterion
- The stopping criterion is not what one would want, which is
  f(x^i) − f* ≤ ε_A ≈ ε (absolute error)   or   ( f(x^i) − f* ) / |f*| = ε_R ≤ ε (relative error)
  (a more or less alternative version has |f(x^i)| at the denominator)
- Exercise: the definition of ε_R has a glaring numerical problem, fix it
- Exercise: explain exactly why ε_R is better than ε_A
- Except, f* is unknown (most often) and cannot be used on-line
- Need a lower bound f ≤ f*, tight at least towards termination; estimating f* could be considered "the true problem"
- Often f is not there, hence ‖∇f(x^i)‖ ≤ ε is the only workable alternative; but the relationship between the two ε is far from obvious
- Sometimes ‖∇f(x)‖ has a physical meaning that can be used
- Exercise: for X = B(0, r) and f convex, estimate ε_A when ‖∇f(x^i)‖ ≤ ε
Gradient method: non-quadratic case
- What happens when f is a general nonlinear function?
- Good news: convergence is the same (we never used "f quadratic")
- The condition ⟨ ∇f(x^i), ∇f(x^{i+1}) ⟩ = 0 holds at local minima (but also at local maxima and saddle points), so convexity is not crucial
- Good/bad news: efficiency is basically the same. f ∈ C^2, x* a local minimum such that ∇^2 f(x*) = Q ≻ 0; if { x^i } → x*, then { f(x^i) } → f(x*) linearly with the same R as in the quadratic case (depending on λ_1 and λ_n of Q)
- In the tail of the convergence process f ≈ its second-order model, so convergence is the same
- Fundamental issue: exact line search is difficult. The algebraic solution (compute ϕ(α) = f( x − α∇f(x) ), find the roots of ϕ') is possible only in a limited set of cases
- One has to algorithmically search along the line for the right α_i (doh!)
Line Search: first-order approaches
- For ϕ(α) = f( x^i + αd^i ) : R → R,  ϕ'(α) = ⟨ ∇f( x^i + αd^i ), d^i ⟩
- Exercise: prove this using the chain rule: f : R^m → R^k, g : R^n → R^m, h(x) = f(g(x)) : R^n → R^k ⇒ Jh(x) = Jf(g(x)) Jg(x)
  (note that Jf ∈ R^{k×m}, Jg ∈ R^{m×n}, in fact Jh ∈ R^{k×m} · R^{m×n} = R^{k×n})
- Find α_i s.t. ϕ'(α_i) = 0;  ∇f continuous ⇒ ϕ' continuous (why?)
- α_i must exist if ∃ ᾱ s.t. ϕ'(ᾱ) > 0
- Exercise: prove this (hint: use the intermediate value theorem)
- Obvious solution for finding such an ᾱ:
  ᾱ ← 1;  // or whatever value
  while( ϕ'(ᾱ) < 0 ) do ᾱ ← 2ᾱ;  // or whatever factor > 1
- Will work in practice for all "reasonable" functions
- Works if ϕ is coercive: lim_{α→∞} ϕ(α) = +∞ (e.g., f strongly convex)
- Exercise: construct an example where ᾱ exists but it is not found
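The doubling loop above as a small Python sketch (names illustrative); it returns an ᾱ with ϕ'(ᾱ) ≥ 0, or fails after too many doublings (ϕ possibly unbounded below):

```python
def find_bracket(dphi, alpha=1.0, factor=2.0, max_iter=60):
    """Grow alpha geometrically until phi'(alpha) >= 0."""
    for _ in range(max_iter):
        if dphi(alpha) >= 0.0:
            return alpha
        alpha *= factor
    raise RuntimeError("no bracket found: phi may be unbounded below")
```

For ϕ(α) = (α − 1)^2, i.e., ϕ'(α) = 2(α − 1), starting from α = 0.25 the loop doubles twice and returns ᾱ = 1.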
Line Search: bisection method
- Pretty darn obvious:

  procedure α = LSBM( ϕ', α, ε ) {
    α⁻ ← 0; α⁺ ← α;
    while( true ) do {
      α ← ( α⁺ + α⁻ ) / 2;  v ← ϕ'(α);
      if( |v| ≤ ε ) then break;
      if( v < 0 ) then α⁻ ← α; else α⁺ ← α;
    }
  }

- Asymptotic convergence: ε = 0, { α^k } an infinite sequence
  { α^k } ⊂ [ 0, ᾱ ] ⇒ a subsequence converges to some α* (why?)
  α* ∈ [ α⁻^k, α⁺^k ] ∀k, α⁺^k − α⁻^k = ᾱ 2^{-k} ⇒ { α^k } → α* (why?)
  ⇒ { ϕ'(α^k) } → ϕ'(α*) = 0 (why?) ⇒ finite termination for ε > 0
- Exercise: prove: ϕ' locally Lipschitz at α* ⇒ { ϕ'(α^k) } → 0 linearly (R?)
- Exercise: construct a counter-example (ϕ' not locally Lipschitz)
- Exercise: suggest assumptions on ϕ for ϕ' locally Lipschitz ⇒ linear convergence
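LSBM transcribes directly to Python (a sketch; `dphi` is ϕ', `alpha_bar` an ᾱ with ϕ'(ᾱ) ≥ 0, and an iteration cap is added as a safeguard not in the slide):

```python
def ls_bisection(dphi, alpha_bar, eps=1e-10, max_iter=200):
    """Bisection on phi': maintains phi'(lo) < 0 <= phi'(hi)."""
    lo, hi = 0.0, alpha_bar
    for _ in range(max_iter):
        a = (lo + hi) / 2.0
        v = dphi(a)
        if abs(v) <= eps:
            break
        if v < 0.0:
            lo = a
        else:
            hi = a
    return a
```

On ϕ(α) = (α − 1)^2 over [0, 3] it halves the interval down to the minimizer α = 1.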
Improving the bisection method: interpolation
- Choosing α^{k+1} right in the middle is just the dumbest possible approach
- One knows a lot about ϕ: ϕ(α⁻), ϕ(α⁺), ϕ'(α⁻), ϕ'(α⁺) (they need be computed, but are usually free if one computes ϕ')
- Quadratic interpolation: aα^2 + bα + c that agrees with ϕ at α⁺, α⁻
- Three parameters, four conditions, something's gotta give (three cases)
- Example: 2aα⁺ + b = ϕ'(α⁺), 2aα⁻ + b = ϕ'(α⁻) ⇒
  a = ( ϕ'(α⁺) − ϕ'(α⁻) ) / ( 2( α⁺ − α⁻ ) ),  b = ( α⁺ϕ'(α⁻) − α⁻ϕ'(α⁺) ) / ( α⁺ − α⁻ )
- The minimum solves 2aα + b = 0 (c irrelevant):
  α = ( α⁻ϕ'(α⁺) − α⁺ϕ'(α⁻) ) / ( ϕ'(α⁺) − ϕ'(α⁻) )
  a convex combination of α⁺ and α⁻ (check)
- Exercise: develop the other cases of quadratic interpolation and discuss them
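The closed-form minimizer above, as code; on a quadratic ϕ (linear ϕ') one step lands exactly on the minimum:

```python
def quad_interp_step(a_lo, a_hi, d_lo, d_hi):
    """Minimizer of the quadratic matching phi'(a_lo) = d_lo < 0 and
    phi'(a_hi) = d_hi > 0: a convex combination of a_lo and a_hi."""
    return (a_lo * d_hi - a_hi * d_lo) / (d_hi - d_lo)
```

For ϕ(α) = (α − 1)^2, i.e., ϕ'(α) = 2(α − 1), bracketed by [0, 3] with slopes (−2, 4), the step returns α = 1 exactly.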
Improving the bisection method: more interpolation
- It can be proven (long and complicated) that, if ϕ ∈ C^3, then quadratic interpolation has convergence of order 1 < p < 2 (superlinear)
- For instance, the previous formula (a.k.a. "method of false position" or "secant formula") has p = ( 1 + √5 ) / 2 ≈ 1.618
- Exercise: propose a simple modification that guarantees (linear) convergence even if ϕ ∉ C^3, while changing the "normal" run as little as possible
- Four conditions ⇒ can fit a cubic polynomial and use its minima
- Rather tedious to write down, analyse and implement
- Theoretically it pays: cubic interpolation has quadratic convergence (p = 2); seems to work pretty well in practice
- Exercise (not for the faint of heart): develop cubic interpolation
Line Search: second-order approaches
- More derivatives ⇒ the same information with fewer points
- f ∈ C^2 ⇒ ϕ''(α) = d^T ∇^2 f( x + αd ) d, and it is continuous (why?)
- Exercise: prove this using the chain rule
- Computing ∇^2 f ⇒ quadratic convergence with only one point
- Newton's method (tangent method): first-order Taylor of ϕ' at α^k:
  ϕ'(α) ≈ ϕ'(α^k) + ϕ''(α^k)( α − α^k ), solve ϕ'(α) = 0 ⇒ α = α^k − ϕ'(α^k) / ϕ''(α^k)
- This is clearly a second-order approximation of ϕ
- Fantastically simple:
  procedure α = LSNM( ϕ', ϕ'', α, ε ) { while( |ϕ'(α)| > ε ) do α ← α − ϕ'(α) / ϕ''(α); }
- Extremely good convergence (under appropriate conditions)
- Clearly numerically delicate: what if ϕ''(α) ≈ 0?
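LSNM as a Python sketch; `dphi` and `ddphi` are ϕ' and ϕ'', and no safeguard against ϕ''(α) ≈ 0 is included:

```python
def ls_newton(dphi, ddphi, alpha=1.0, eps=1e-12, max_iter=50):
    """Newton's (tangent) method on phi'(alpha) = 0."""
    for _ in range(max_iter):
        if abs(dphi(alpha)) <= eps:
            break
        alpha -= dphi(alpha) / ddphi(alpha)
    return alpha
```

On a quadratic ϕ (e.g., ϕ(α) = (α − 2)^2, so ϕ'(α) = 2(α − 2), ϕ'' ≡ 2) a single Newton step is exact, consistently with the model being exact there.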
Analysis of Newton's method
- The theoretical analysis of Newton's method is instructive
- If ϕ ∈ C^3, ϕ'(α*) = 0 and ϕ''(α*) ≠ 0, then ∃ δ > 0 s.t. if Newton's method starts at α ∈ [ α* − δ, α* + δ ], then { α^k } → α* with p = 2
- Proof: the iteration gives
  α^{k+1} − α* = α^k − α* − ( ϕ'(α^k) − ϕ'(α*) ) / ϕ''(α^k) = [ ϕ''(α^k)( α^k − α* ) − ϕ'(α^k) + ϕ'(α*) ] / ϕ''(α^k)
  For some β ∈ [ α^k, α* ], Taylor gives ϕ'(α*) = ϕ'(α^k) + ϕ''(α^k)( α* − α^k ) + ϕ'''(β)( α* − α^k )^2 / 2
  ⇒ α^{k+1} − α* = [ ϕ'''(β) / 2ϕ''(α^k) ]( α^k − α* )^2
  ∃ δ > 0 s.t. |ϕ''(α)| ≥ k_2 > 0 (why?) and |ϕ'''(β)| ≤ k_1 < ∞ (why?) for α, β ∈ [ α* − δ, α* + δ ]
  ⇒ |α^{k+1} − α*| ≤ [ k_1 / 2k_2 ] |α^k − α*|^2;  k_1 |α^k − α*| / 2k_2 < 1 ⇒ |α^{k+1} − α*| < |α^k − α*|
  ⇒ { α^k } → α*, and the convergence is quadratic
- Convergence only if |α^1 − α*| is small enough: nontrivial to ensure in practice
Line Search: zeroth-order approaches
- Computing ∇f / ∇^2 f can be costly (d^T ∇^2 f d is O(n^2) already)
- Only use ϕ values: fewer derivatives ⇒ more points
- Golden ratio search, assuming ϕ(0) ≥ ϕ(α):

  procedure α = LSGRM( ϕ, α, ε ) {
    α⁻ ← 0; α⁺ ← α;
    α' ← α⁻ + 0.382( α⁺ − α⁻ );  α'' ← α⁻ + 0.618( α⁺ − α⁻ );
    while( α⁺ − α⁻ > ε ) do
      if( ϕ(α') > ϕ(α'') ) then { α⁻ ← α'; α' ← α''; α'' ← α⁻ + 0.618( α⁺ − α⁻ ); }
      else { α⁺ ← α''; α'' ← α'; α' ← α⁻ + 0.382( α⁺ − α⁻ ); }
  }

- 0.618 ≈ ( √5 − 1 ) / 2 (the golden ratio), 0.382 = 1 − 0.618
- Property: for r ≈ 0.618, r = ( 1 − r ) / r ≈ 0.382 / 0.618, i.e., 1 : r = r : ( 1 − r )
- Can compute only one new ϕ(α) per iteration
- Can do slightly better by using r_k = F_{n−k} / F_{n−k+1} (Fibonacci sequence)
- Exercise: picture out graphically how it works
- Exercise: analyse asymptotic and finite convergence of the approach
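LSGRM as a Python sketch (names illustrative); note how each iteration reuses one of the two interior values, so ϕ is evaluated only once per iteration:

```python
import math

def ls_golden(phi, alpha_bar, eps=1e-6):
    """Golden-section search for a minimum of phi on [0, alpha_bar];
    assumes phi is unimodal there."""
    r = (math.sqrt(5.0) - 1.0) / 2.0        # ~0.618
    lo, hi = 0.0, alpha_bar
    a = lo + (1.0 - r) * (hi - lo)          # ~0.382 point
    b = lo + r * (hi - lo)                  # ~0.618 point
    fa, fb = phi(a), phi(b)
    while hi - lo > eps:
        if fa > fb:                          # minimum lies in [a, hi]
            lo = a
            a, fa = b, fb                    # reuse the old 0.618 point
            b = lo + r * (hi - lo)
            fb = phi(b)                      # the only new evaluation
        else:                                # minimum lies in [lo, b]
            hi = b
            b, fb = a, fa                    # reuse the old 0.382 point
            a = lo + (1.0 - r) * (hi - lo)
            fa = phi(a)                      # the only new evaluation
    return (lo + hi) / 2.0
```

The reuse works because r^2 = 1 − r: after shrinking, the surviving interior point sits exactly at the other golden position of the new interval.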
Gradient method and (inexact) line search
- Is |ϕ'(α_i)| ≤ ε enough for convergence? It depends on ε (of course)
- Trick: d^i = −∇f(x^i) / ‖∇f(x^i)‖ ⇒ ‖d^i‖ = 1, ϕ'(0) = −‖∇f(x^i)‖
  ϕ'(α_i) = ⟨ d^i, ∇f(x^{i+1}) ⟩ = ⟨ −∇f(x^i) / ‖∇f(x^i)‖, ∇f(x^{i+1}) ⟩
- { x^i } → x̄ ⇒ lim_i ⟨ −∇f(x^i) / ‖∇f(x^i)‖, ∇f(x^{i+1}) ⟩ = ⟨ −∇f(x̄) / ‖∇f(x̄)‖, ∇f(x̄) ⟩ = −‖∇f(x̄)‖ ⇒ ‖∇f(x̄)‖ ≤ ε (note: ‖∇f(x^i)‖ > ε)
- ε > 0 and { x^i } → x̄ ⇒ for finite i, x^i is an approximate stationary point
- Note: with d^i := −∇f(x^i), use ε := ε‖∇f(x^i)‖
- Other assumptions on f are needed to ensure { x^i } → x̄ (R^n not compact)
- A simple one: f coercive, i.e., lim_{‖x‖→∞} f(x) = +∞;  f continuous and coercive ⇒ the level sets S(f, v) are compact ∀v
- Exercise: prove f coercive (+ what else needed) ⇒ the algorithm finitely stops
- Exercise: discuss how to get asymptotic convergence (ε = 0)
- Do we really need a close approximation to ∇f(x̄) = 0?
Gradient method and (really) inexact line search
- Don't need to get a local minimum, just decrease "enough" [figure: graph of ϕ(α) with the regions accepted by the conditions below]
- Armijo condition: 0 < m_1 < 1,  (A) ϕ(α) ≤ ϕ(0) + m_1 α ϕ'(0)
  accept a fraction m_1 (≪ 1) of the descent promised by ϕ'
- Issue: arbitrarily short steps satisfy (A)
- Goldstein condition: m_1 < m_2 < 1,  (G) ϕ(α) ≥ ϕ(0) + m_2 α ϕ'(0)
- Issue: (A) ∧ (G) can easily exclude all local minima
- Wolfe condition: m_1 < m_3 < 1,  (W) ϕ'(α) ≥ m_3 ϕ'(0)
  the curvature has to be a bit closer to 0 (but can be ≫ 0)
- Strong Wolfe: (W') |ϕ'(α)| ≤ m_3 |ϕ'(0)| = −m_3 ϕ'(0)
  cannot be ≫ 0, but still captures all local minima (and maxima)
- Clearly, (W') ⇒ (W)
- (A) ∧ (W) / (W') typically captures all local minima... unless m_1 is too close to 1 (that's why m_1 ≈ 0.0001)
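The three tests are one-liners; a sketch (default constants illustrative, in the ranges the slide requires):

```python
def armijo(phi0, dphi0, phi_a, a, m1=1e-4):
    """(A): accept at least a fraction m1 of the promised decrease."""
    return phi_a <= phi0 + m1 * a * dphi0

def wolfe(dphi0, dphi_a, m3=0.9):
    """(W): the slope at a must have moved towards 0 (or beyond)."""
    return dphi_a >= m3 * dphi0

def strong_wolfe(dphi0, dphi_a, m3=0.9):
    """(W'): |phi'(a)| <= -m3 phi'(0); implies (W)."""
    return abs(dphi_a) <= -m3 * dphi0
```

For ϕ(α) = (α − 1)^2 − 1 (so ϕ(0) = 0, ϕ'(0) = −2) the exact minimizer α = 1 satisfies all three, while the too-long step α = 2 (where ϕ is back to 0) violates (A).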
Armijo-Wolfe line search
- ϕ ∈ C^1 ∧ ϕ(α) bounded below for α ≥ 0 ⇒ ∃ α s.t. (A) ∧ (W') holds
- Proof: l(α) = ϕ(0) + m_1 α ϕ'(0), d(α) = l(α) − ϕ(α) ⇒ d(0) = 0, d'(0) = ( m_1 − 1 )ϕ'(0) > 0 (m_1 < 1)
  ∄ ᾱ > 0 s.t. d(ᾱ) = 0 ⇒ ϕ unbounded below (why?)
  Take the smallest ᾱ > 0 s.t. d(ᾱ) = 0: (A) is satisfied ∀ α ∈ (0, ᾱ] (why?)
  Rolle's theorem: d'(ᾱ) < 0 ⇒ ϕ'(ᾱ) > m_1 ϕ'(0) ( > m_3 ϕ'(0) > ϕ'(0) )
  Intermediate value theorem (on ϕ'): ∃ α* ∈ (0, ᾱ) s.t. ϕ'(α*) = m_3 ϕ'(0) ⇒ (W') also holds at α*
- But how do I actually find such a point?
- m_1 small enough s.t. local minima are not cut ⇒ just go for the local minima and stop whenever (A) ∧ (W) / (W') holds
- Hard to say if m_1 is small enough, although m_1 = 0.0001 most often is
- A specialized line search can be constructed for the odd case it is not
- Basic idea: find an interval [ α, ᾱ ] that surely contains points satisfying (A) ∧ (W) / (W') (cf. the proof above), restrict the search there inside
- Exercise (not for the faint of heart): develop the specialized line search
Convergence with Armijo-Wolfe line search
- ∇f Lipschitz continuous ∧ (A) ∧ (W) always hold ⇒ either f is unbounded below or { ‖∇f(x^i)‖ } → 0
- Proof: (W) ⇒ ϕ'(α_i) − ϕ'(0) ≥ ( 1 − m_3 )( −ϕ'(0) )
  ∇f Lipschitz ⇒ ϕ' Lipschitz, and L does not depend on x^i (check)
  ⇒ α_i ≥ ( 1 − m_3 )( −ϕ'(0) ) / L (check: where has ‖d^i‖ gone?)
  −ϕ'(0) = ‖∇f(x^i)‖ ≥ ε > 0 ⇒ α_i ≥ δ > 0
  (A) ⇒ f(x^{i+1}) ≤ f(x^i) − m_1 α_i ‖∇f(x^i)‖ ≤ f(x^i) − m_1 δε ⇒ { f(x^i) } → −∞ (or { ‖∇f(x^i)‖ } → 0)
- Usual stuff: { x^i } → x̄ ⇒ x̄ a stationary point
- Hence, the algorithm finitely terminates with ε > 0
- Insight from the proof: (W) (+ Lipschitz) serves to ensure that α_i ≥ c‖∇f(x^i)‖ for some c > 0
- Can we get the same in a simpler way?
Backtracking line search
- Backtracking line search:

  procedure α = BLS( ϕ, ϕ', α, m_1, τ ) {
    while( ϕ(α) > ϕ(0) + m_1 α ϕ'(0) ) do α ← τα;
  }

- ∇f Lipschitz ⇒ the gradient method with BLS works
- Proof: for simplicity, α = 1 (input). Remember the previous proof: ∃ ᾱ s.t. (A) holds ∀ α ∈ (0, ᾱ] and ϕ'(ᾱ) > m_1 ϕ'(0) > ϕ'(0)
  ⇒ L( ᾱ − 0 ) ≥ ϕ'(ᾱ) − ϕ'(0) > ( 1 − m_1 )( −ϕ'(0) ) ⇒ ᾱ > ( 1 − m_1 )‖∇f(x^i)‖ / L (same as before)
  ‖∇f(x^i)‖ > ε ∀i ⇒ ᾱ > δ > 0 ∀i;  h = min{ k : τ^k ≤ δ } ⇒ α_i ≥ τ^h > 0 ∀i
  ⇒ f(x^{i+1}) ≤ f(x^i) − m_1 τ^h ε ⇒ { f(x^i) } → −∞ or ...
  Now, { x^i } → x̄ ⇒ x̄ stationary, blah blah
- Fundamental trick: α_i can → 0, but only as fast as ‖∇f(x^i)‖
- Would be simpler if α_i ≥ δ > 0 for good
- Exercise: remove the assumption α = 1 (input)
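BLS in Python (a sketch; `dphi0` is ϕ'(0) < 0, and a maximum iteration count is added as a safeguard not in the slide):

```python
def bls(phi, dphi0, alpha=1.0, m1=1e-4, tau=0.5, max_iter=100):
    """Backtracking: shrink alpha by tau until Armijo's (A) holds."""
    phi0 = phi(0.0)
    for _ in range(max_iter):
        if phi(alpha) <= phi0 + m1 * alpha * dphi0:
            break
        alpha *= tau
    return alpha
```

On ϕ(α) = (α − 0.1)^2 with ϕ'(0) = −0.2, starting from α = 1 the step is halved three times before (A) holds, returning α = 0.125.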
Line Search: really really inexact... no line search at all: fixed stepsize
- Recall: ∇f Lipschitz ⇒ f(y) ≤ f(x) + ∇f(x)( y − x ) + (L/2)‖y − x‖^2
- y := x^{i+1}, x := x^i, y − x := −α∇f(x^i) ⇒ f(x^{i+1}) ≤ f(x^i) + ( Lα^2/2 − α )‖∇f(x^i)‖^2 (check)
- Powerful idea: find the α that provides the best worst-case improvement:
  v(α) = Lα^2/2 − α,  v'(α) = Lα − 1 = 0 ⇒ α* = 1/L,  v(α*) = −1/(2L)
- All in all: f(x^{i+1}) ≤ f(x^i) − ‖∇f(x^i)‖^2 / (2L)
- Can't do better if you trust the quadratic upper estimate (which of course must not be trusted)
- In fact, α_i = 1/L is terrible in practice ⇒ use the previous methods
- Enticing because simple and inexpensive
- Selecting the parameters that lead to the best performances for a model is a very powerful idea in general
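The resulting method is a one-liner per iteration; a NumPy sketch (names illustrative) using the worst-case-optimal α = 1/L:

```python
import numpy as np

def gd_fixed(grad, x, L, n_iter=200):
    """Gradient descent with fixed stepsize 1/L (L = Lipschitz constant of grad f)."""
    for _ in range(n_iter):
        x = x - grad(x) / L
    return x
```

For f(x) = (1/2) x^T Q x with Q = diag(2, 1), L = 2 and each coordinate contracts independently, the λ = 1 one by a factor 1/2 per step, so the iterates go to the optimum x* = 0.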
Fixed stepsize: convergence rate
- Once you have convergence, you can talk efficiency (easier with α fixed)
- Already know the error decreases, but how fast?
  Δ^{i+1} := f(x^{i+1}) − f(x*) ≤ Δ^i − ‖∇f(x^i)‖^2 / (2L)   ( Δ^i := f(x^i) − f(x*) )
- x^i = x arbitrary and f(x*) ≤ f(x^{i+1}) ⇒ f(x*) ≤ f(x) − ‖∇f(x)‖^2 / (2L)
- f convex ⇒ ∇f(x)( x − x* ) ≥ f(x) − f(x*) ≥ ‖∇f(x)‖^2 / (2L) ∀x
- This proves that r^i := ‖x^i − x*‖ decreases:
  ( r^{i+1} )^2 = ‖x^{i+1} − x*‖^2 = ‖x^i − x* − ∇f(x^i)/L‖^2
  = ‖x^i − x*‖^2 − 2∇f(x^i)( x^i − x* )/L + ‖∇f(x^i)‖^2/L^2 ≤ ‖x^i − x*‖^2 = ( r^i )^2
- Hence, at the very least { x^i } → x* (no problem here)
- Technical step: ‖∇f(x^i)‖ ≥ ( r^i / r^1 )‖∇f(x^i)‖ ≥ ∇f(x^i)( x^i − x* ) / r^1 [Cauchy-Schwarz] ≥ ( f(x^i) − f(x*) ) / r^1 [convexity] = Δ^i / r^1
- Conclusion: Δ^{i+1} ≤ Δ^i − ‖∇f(x^i)‖^2/(2L) ≤ Δ^i − ( Δ^i )^2 / ( 2( r^1 )^2 L ) = Δ^i ( 1 − Δ^i / ( 2( r^1 )^2 L ) )
- Not linear convergence, as "R" is not constant: sublinear
Fixed stepsize: convergence rate (cont'd)
- What does this mean, exactly? Δ^{i+1} ≤ Δ^i − ( Δ^i )^2 / ( 2( r^1 )^2 L ): divide by Δ^{i+1}Δ^i:
  1/Δ^i ≤ 1/Δ^{i+1} − Δ^i / ( Δ^{i+1} 2( r^1 )^2 L ) ⇒ 1/Δ^{i+1} ≥ 1/Δ^i + 1 / ( 2( r^1 )^2 L ) (why?)
- 1/Δ grows by a constant at each i ⇒ 1/Δ^{i+1} ≥ 1/Δ^1 + i / ( 2( r^1 )^2 L )
  ⇒ Δ^{i+1} ≤ 2Δ^1 ( r^1 )^2 L / ( 2( r^1 )^2 L + iΔ^1 )
- The error decreases as O(1/i) ⇒ O(1/ε) iterations (check the details)
- Exponentially worse than O( log(1/ε) )
- However, this is unfair: there we used Q nonsingular, λ_n > 0
- Does it make a difference? You bet
Fixed stepsize: convergence rate with strong convexity
- Basically, strong convexity ≡ eigenvalues bounded both above and below: uI ⪯ ∇^2 f(x) ⪯ LI, u > 0
- Taylor ⇒ f(x) ≥ f(x^i) + ∇f(x^i)( x − x^i ) + u‖x − x^i‖^2 / 2 (why?)
- Minimize both sides independently over x ⇒ f(x*) ≥ f(x^i) − ‖∇f(x^i)‖^2 / (2u) (check) ⇒ ‖∇f(x^i)‖^2 ≥ 2u( f(x^i) − f(x*) )
- Put it into f(x^{i+1}) − f(x*) ≤ f(x^i) − f(x*) − ‖∇f(x^i)‖^2 / (2L) ⇒
  f(x^{i+1}) − f(x*) ≤ ( f(x^i) − f(x*) )( 1 − u/L )
- With the exact step, funnily, same as with the coarse estimate, i.e., much worse
- A small difference in f makes a big difference in convergence: properties of f even more important than the algorithm
- O(1/ε) is not the best for f not strongly convex: O(1/√ε) is possible, better, but still much worse than O( log(1/ε) )
- Hence better algorithms do count, we'll work towards that
- However, O(1/√ε) is tight: can't do better without strong convexity; algorithms can only get so far with nasty problems
Wrap up
- Gradient (descent direction) + line search ⇒ convergence
- Line search by no means has to be exact... but not too coarse either
- Many different practical line searches, up to no search at all
- Convergence of gradient methods can be from quite bad to horrible... in practice as well as in theory
- Something better is sorely needed