Unconstrained minimization: assumptions


Unconstrained minimization

Outline:
- terminology and assumptions
- gradient descent method
- steepest descent method
- Newton's method
- self-concordant functions
- implementation

IOE 611: Nonlinear Programming, Fall 2017. 7. Unconstrained minimization

Unconstrained minimization: assumptions

minimize $f(x)$

- $f$ convex, twice continuously differentiable (hence $\operatorname{dom} f$ open)
- we assume the optimal value $p^\star = \inf_x f(x)$ is attained (and finite)

Unconstrained minimization methods
- produce a sequence of points $x^{(k)} \in \operatorname{dom} f$, $k = 0, 1, \ldots$, with $f(x^{(k)}) \to p^\star$ (commonly, $f(x^{(k)}) \downarrow p^\star$)
- in practice, terminated in finite time, when $f(x^{(k)}) - p^\star \le \epsilon$, where $\epsilon > 0$ is pre-specified
- can be interpreted as iterative methods for solving the optimality condition $\nabla f(x^\star) = 0$

Descent methods

$x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$ with $f(x^{(k+1)}) < f(x^{(k)})$
- other notation options: $x^+ = x + t\Delta x$, $x := x + t\Delta x$
- $\Delta x$ is the step, or search direction; $t$ is the step size, or step length
- from convexity, $f(x^+) < f(x)$ implies $\nabla f(x)^T \Delta x < 0$ (i.e., $\Delta x$ is a descent direction)

General descent method.
given a starting point $x \in \operatorname{dom} f$.
repeat
 1. Determine a descent direction $\Delta x$.
 2. Line search. Choose a step size $t > 0$.
 3. Update. $x := x + t\,\Delta x$.
until stopping criterion is satisfied.

Further assumptions: initial point and sublevel set

Algorithms in this chapter require a starting point $x^{(0)}$ such that
- $x^{(0)} \in \operatorname{dom} f$
- the sublevel set $S = \{x \mid f(x) \le f(x^{(0)})\}$ is closed

The 2nd condition is hard to verify, except when all sublevel sets are closed:
- equivalent to the condition that $\operatorname{epi} f$ is closed
- true if $\operatorname{dom} f = \mathbf{R}^n$
- true if $f(x) \to \infty$ as $x \to \operatorname{bd} \operatorname{dom} f$

examples of differentiable functions with closed sublevel sets:
$f(x) = \log\Big(\sum_{i=1}^m \exp(a_i^T x + b_i)\Big)$, $\qquad f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$

Strong convexity and implications

$f$ is strongly convex on $S$ if there exists an $m > 0$ such that
$\nabla^2 f(x) \succeq mI$ for all $x \in S$

Implications
 1. for $x, y \in S$: $f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|_2^2$
 2. Hence, $S$ is bounded: take any $y \in S$ and $x = x^\star \in S$ in the above
 3. $p^\star > -\infty$, and for $x \in S$: $f(x) - p^\star \le \frac{1}{2m}\|\nabla f(x)\|_2^2$ (useful as stopping criterion, if you know $m$)
 4. Also, for all $x \in S$, $\|x - x^\star\|_2 \le \frac{2}{m}\|\nabla f(x)\|_2$, and hence the optimal solution is unique.

Descent methods (reprise)
$x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$ with $f(x^{(k+1)}) < f(x^{(k)})$, via the general descent method stated above.
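Implication 3 can be checked numerically. A minimal sketch, using the hypothetical quadratic $f(x) = \frac12 x^T Q x$ (so $p^\star = 0$, $x^\star = 0$, and the strong convexity constant is $m = \lambda_{\min}(Q)$); the matrix and test points are arbitrary:

```python
import numpy as np

# Check implication 3: for m-strongly convex f,
#   f(x) - p* <= ||grad f(x)||_2^2 / (2m).
# Illustrative example: f(x) = 0.5 x^T Q x with Q > 0, so p* = 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)             # positive definite Hessian
m = np.linalg.eigvalsh(Q).min()     # strong convexity constant

for _ in range(100):
    x = rng.standard_normal(5)
    f_gap = 0.5 * x @ Q @ x         # f(x) - p*  (since p* = 0)
    grad = Q @ x
    bound = grad @ grad / (2 * m)
    assert f_gap <= bound + 1e-12
```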

Line search types

Exact line search: $t = \operatorname{argmin}_{t > 0} f(x + t\Delta x)$

Backtracking line search (with parameters $\alpha \in (0, 1/2)$, $\beta \in (0, 1)$)
- starting at $t = 1$, repeat $t := \beta t$ until
  $f(x + t\Delta x) < f(x) + \alpha t\,\nabla f(x)^T \Delta x$
- graphical interpretation: backtrack until $t \le t_0$
  (figure: $f(x + t\Delta x)$ together with the lines $f(x) + t\nabla f(x)^T\Delta x$ and $f(x) + \alpha t\nabla f(x)^T\Delta x$; the latter crosses $f(x + t\Delta x)$ at $t_0$)
- after backtracking, step size $t$ will satisfy $t = 1$ or $t \in (\beta t_0, t_0]$

Gradient descent method

general descent method with $\Delta x = -\nabla f(x)$

given a starting point $x \in \operatorname{dom} f$.
repeat
 1. $\Delta x := -\nabla f(x)$.
 2. Line search. Choose step size $t$ via exact or backtracking line search.
 3. Update. $x := x + t\,\Delta x$.
until stopping criterion is satisfied.

- stopping criterion usually of the form $\|\nabla f(x)\|_2 \le \epsilon$
- convergence result: for strongly convex $f$,
  $f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star)$,
  where $c \in (0, 1)$ depends on $m$, $x^{(0)}$, line search type
- will terminate in at most $\dfrac{\log((f(x^{(0)}) - p^\star)/\epsilon)}{\log(1/c)}$ iterations
- very simple, but often very slow; rarely used in practice
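The backtracking rule and the gradient descent loop can be sketched in Python; the test function, starting point, and parameter values below are illustrative choices, not from the notes:

```python
import numpy as np

# Gradient descent with backtracking line search
# (parameters alpha in (0, 1/2), beta in (0, 1)).
def grad_descent(f, grad, x0, alpha=0.25, beta=0.5, eps=1e-8, max_iter=10_000):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:        # stopping criterion ||grad f||_2 <= eps
            break
        dx = -g                             # descent direction
        t = 1.0
        # backtrack until f(x + t dx) < f(x) + alpha t grad f(x)^T dx
        while f(x + t * dx) >= f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
    return x

# Illustrative example: f(x) = log(e^{x1} + e^{-x1}) + x2^2, minimum at the origin
f = lambda x: np.logaddexp(x[0], -x[0]) + x[1] ** 2
grad = lambda x: np.array([np.tanh(x[0]), 2 * x[1]])
x = grad_descent(f, grad, np.array([2.0, -1.5]))
```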

Quadratic problem in $\mathbf{R}^2$

$f(x) = \frac12(x_1^2 + \gamma x_2^2)$ $\quad(\gamma > 0)$

with exact line search, starting at $x^{(0)} = (\gamma, 1)$:
$x_1^{(k)} = \gamma\left(\dfrac{\gamma - 1}{\gamma + 1}\right)^k$, $\qquad x_2^{(k)} = \left(-\dfrac{\gamma - 1}{\gamma + 1}\right)^k$
- very slow if $\gamma \gg 1$ or $\gamma \ll 1$
- example for $\gamma = 10$: (figure: iterates $x^{(0)}, x^{(1)}, \ldots$ zigzagging across the contour lines of $f$)

Nonquadratic example

$f(x_1, x_2) = e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} + e^{-x_1 - 0.1}$

(figures: iterates for backtracking line search and for exact line search)
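On a quadratic, exact line search has the closed form $t = g^T g / (g^T H g)$, so the stated iterates can be checked numerically for $\gamma = 10$:

```python
import numpy as np

# Verify the closed-form iterates of gradient descent with exact line
# search on f(x) = (1/2)(x1^2 + gamma x2^2), starting from (gamma, 1).
gamma = 10.0
H = np.diag([1.0, gamma])
x = np.array([gamma, 1.0])
r = (gamma - 1) / (gamma + 1)

for k in range(20):
    # closed form: x1^(k) = gamma * r^k, x2^(k) = (-r)^k
    assert np.allclose(x, [gamma * r**k, (-r) ** k])
    g = H @ x                      # gradient
    t = (g @ g) / (g @ H @ g)      # exact line search step on a quadratic
    x = x - t * g
```

With $\gamma = 10$ the contraction factor is $r = 9/11 \approx 0.82$ per iteration, which is the slow zigzagging visible in the figure.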

A problem in $\mathbf{R}^{100}$

$f(x) = c^T x - \sum_{i=1}^{500} \log(b_i - a_i^T x)$

(figure: $f(x^{(k)}) - p^\star$ versus $k$ on a semilog scale, for exact and backtracking line search)

linear convergence, i.e., a straight line on a semilog plot

Complexity analysis of gradient descent

Assumptions and observations
- Assumptions: $f \in C^2(D)$; $S = \{x : f(x) \le f(x^{(0)})\}$ is closed; $\exists m > 0$: $\nabla^2 f(x) \succeq mI$ for all $x \in S$
- Observations:
  - for all $x, y \in S$: $f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|_2^2$
  - $S$ is bounded
  - $p^\star > -\infty$, and for $x \in S$: $f(x) - p^\star \le \frac{1}{2m}\|\nabla f(x)\|_2^2$
  - $\exists M \ge m$: $\nabla^2 f(x) \preceq MI$ for all $x \in S$ (since $S$ is bounded). This implies:
    - for all $x, y \in S$: $f(y) \le f(x) + \nabla f(x)^T(y - x) + \frac{M}{2}\|y - x\|_2^2$
    - minimizing each side over $y$ gives: for all $x \in S$, $p^\star \le f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2$
- Geometric implication: if $S_\alpha$ is a sublevel set of $f$ with $\alpha \le f(x^{(0)})$, then $\operatorname{cond}(S_\alpha) \le M/m$, where $\operatorname{cond}(\cdot)$ is the square of the ratio of maximum to minimum widths of a convex set

Complexity analysis of gradient descent

Algorithm analysis. Let $x(t) = x - t\nabla f(x)$. Then:
- $f(x(t)) \le f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2$ for any $t$
- If exact line search is used:
  - $f(x^+) \le \min_t\big\{f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2\big\} = f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2 \le f(x) - \frac{1}{2M}\,2m(f(x) - p^\star)$
  - We conclude: $f(x^+) - p^\star \le (f(x) - p^\star)(1 - m/M)$. Linear convergence with $c = 1 - \frac{m}{M}$
- If backtracking (BT) line search is used:
  - BT exit condition is guaranteed for $t \in [0, 1/M]$ (for $\alpha < 1/2$)
  - BT terminates either with $t = 1$ or $t \ge \beta/M$
  - Using these bounds on $t$ and analysis similar to the above, get linear convergence with
    $c = 1 - \min\{2m\alpha,\ 2\beta\alpha m/M\} < 1$

Steepest descent method

Normalized steepest descent direction (at $x$, for general norm $\|\cdot\|$):
$\Delta x_{\mathrm{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\| = 1\}$
interpretation: for small $v$, $f(x + v) \approx f(x) + \nabla f(x)^T v$; direction $\Delta x_{\mathrm{nsd}}$ is the unit-norm step with most negative directional derivative

(Unnormalized) steepest descent direction
$\Delta x_{\mathrm{sd}} = \|\nabla f(x)\|_*\,\Delta x_{\mathrm{nsd}}$
satisfies $\nabla f(x)^T \Delta x_{\mathrm{sd}} = -\|\nabla f(x)\|_*^2$

Steepest descent method
- general descent method with $\Delta x = \Delta x_{\mathrm{sd}}$
- convergence properties similar to gradient descent

Examples
- Euclidean norm: $\Delta x_{\mathrm{sd}} = -\nabla f(x)$
- quadratic norm $\|x\|_P = (x^T P x)^{1/2}$ ($P \in \mathbf{S}^n_{++}$): $\Delta x_{\mathrm{sd}} = -P^{-1}\nabla f(x)$
- $\ell_1$-norm: $\Delta x_{\mathrm{sd}} = -(\partial f(x)/\partial x_i)e_i$, where $i$ is chosen so that $|\partial f(x)/\partial x_i| = \|\nabla f(x)\|_\infty$

(figure: unit balls and normalized steepest descent directions for a quadratic norm and the $\ell_1$-norm)

Choice of norm for steepest descent

(figures: steepest descent iterates with backtracking line search for two quadratic norms; ellipses show $\{x \mid \|x - x^{(k)}\|_P = 1\}$)
- equivalent interpretation of steepest descent with quadratic norm $\|\cdot\|_P$: gradient descent after the change of variables $\bar x = P^{1/2}x$
- shows choice of $P$ has strong effect on speed of convergence
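The change-of-variables interpretation can be verified numerically: one steepest descent step in $\|\cdot\|_P$ coincides with one gradient step after $\bar x = P^{1/2}x$, mapped back. The matrices $P$, $Q$ and the step size below are arbitrary illustrative choices:

```python
import numpy as np

# Steepest descent in ||.||_P  ==  gradient descent after xbar = P^{1/2} x.
rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
P = B @ B.T + np.eye(3)                   # P in S^n_++
w, V = np.linalg.eigh(P)
P_half = V @ np.diag(np.sqrt(w)) @ V.T    # symmetric square root P^{1/2}
P_half_inv = np.linalg.inv(P_half)

Q = np.diag([1.0, 4.0, 9.0])
grad_f = lambda x: Q @ x                  # f(x) = (1/2) x^T Q x

x = rng.standard_normal(3)
t = 0.1
# one steepest descent step in ||.||_P:  x+ = x - t P^{-1} grad f(x)
x_plus = x - t * np.linalg.solve(P, grad_f(x))
# one gradient step on fbar(xbar) = f(P^{-1/2} xbar), then map back;
# grad fbar(xbar) = P^{-1/2} grad f(P^{-1/2} xbar)
xbar = P_half @ x
xbar_plus = xbar - t * P_half_inv @ grad_f(P_half_inv @ xbar)
assert np.allclose(x_plus, P_half_inv @ xbar_plus)
```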

Newton step

$\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1}\nabla f(x)$

interpretations
- $x + \Delta x_{\mathrm{nt}}$ minimizes the second-order approximation
  $\hat f(x + v) = f(x) + \nabla f(x)^T v + \frac12 v^T \nabla^2 f(x)v$
- $x + \Delta x_{\mathrm{nt}}$ solves the linearized optimality condition
  $\nabla f(x + v) \approx \nabla\hat f(x + v) = \nabla f(x) + \nabla^2 f(x)v = 0$
(figures: $f$ and $\hat f$, and $f'$ and $\hat f'$, marking the points $(x, f(x))$ and $(x + \Delta x_{\mathrm{nt}}, f(x + \Delta x_{\mathrm{nt}}))$)

Newton step: steepest descent interpretation
- $\Delta x_{\mathrm{nt}}$ is the steepest descent direction at $x$ in the local Hessian norm
  $\|u\|_{\nabla^2 f(x)} = \big(u^T\nabla^2 f(x)u\big)^{1/2}$
(figure: dashed lines are contour lines of $f$; ellipse is $\{x + v \mid v^T\nabla^2 f(x)v = 1\}$; arrow shows $-\nabla f(x)$)
- Newton step is affine invariant, i.e., independent of affine change of coordinates

Newton decrement

$\lambda(x) = \big(\nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x)\big)^{1/2}$

a measure of the proximity of $x$ to $x^\star$

Properties
- gives an estimate of $f(x) - p^\star$, using the quadratic approximation $\hat f$:
  $f(x) - \inf_y \hat f(y) = f(x) - \hat f(x + \Delta x_{\mathrm{nt}}) = \frac12\lambda(x)^2$
- equal to the norm of the Newton step in the quadratic Hessian norm:
  $\lambda(x) = \big(\Delta x_{\mathrm{nt}}^T\nabla^2 f(x)\Delta x_{\mathrm{nt}}\big)^{1/2}$
- directional derivative in the Newton direction: $\nabla f(x)^T\Delta x_{\mathrm{nt}} = -\lambda(x)^2$
- affine invariant (unlike $\|\nabla f(x)\|_2$)

Newton's method

given a starting point $x \in \operatorname{dom} f$, tolerance $\epsilon > 0$.
repeat
 1. Compute the Newton step and decrement.
    $\Delta x_{\mathrm{nt}} := -\nabla^2 f(x)^{-1}\nabla f(x)$; $\quad\lambda^2 := \nabla f(x)^T\nabla^2 f(x)^{-1}\nabla f(x)$.
 2. Stopping criterion. quit if $\lambda^2/2 \le \epsilon$.
 3. Line search. Choose step size $t$ by backtracking line search.
 4. Update. $x := x + t\,\Delta x_{\mathrm{nt}}$.

Affine invariant, i.e., independent of linear changes of coordinates: Newton iterates for $\tilde f(y) = f(By)$ with starting point $y^{(0)} = B^{-1}x^{(0)}$ are $y^{(k)} = B^{-1}x^{(k)}$
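A minimal sketch of the method in Python, applied to the $\mathbf{R}^2$ example from the notes, $f(x) = e^{x_1+3x_2-0.1} + e^{x_1-3x_2-0.1} + e^{-x_1-0.1}$; the starting point and tolerance are illustrative choices:

```python
import numpy as np

def f(x):
    return (np.exp(x[0] + 3 * x[1] - 0.1) + np.exp(x[0] - 3 * x[1] - 0.1)
            + np.exp(-x[0] - 0.1))

def grad(x):
    a = np.exp(x[0] + 3 * x[1] - 0.1)
    b = np.exp(x[0] - 3 * x[1] - 0.1)
    c = np.exp(-x[0] - 0.1)
    return np.array([a + b - c, 3 * a - 3 * b])

def hess(x):
    a = np.exp(x[0] + 3 * x[1] - 0.1)
    b = np.exp(x[0] - 3 * x[1] - 0.1)
    c = np.exp(-x[0] - 0.1)
    return np.array([[a + b + c, 3 * a - 3 * b],
                     [3 * a - 3 * b, 9 * a + 9 * b]])

def newton(x, eps=1e-10, alpha=0.1, beta=0.7):
    iters = 0
    while True:
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)      # Newton step
        lam2 = -g @ dx                   # lambda(x)^2 = -grad f(x)^T dx_nt
        if lam2 / 2 <= eps:              # stopping criterion lambda^2/2 <= eps
            return x, iters
        t = 1.0                          # backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx
        iters += 1

x_star, iters = newton(np.array([-1.0, 1.0]))
```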

Classical convergence analysis

assumptions
- $f$ strongly convex on $S$ with constant $m$
- $\nabla^2 f$ is Lipschitz continuous on $S$, with constant $L > 0$:
  $\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\|x - y\|_2$
  ($L$ measures how well $f$ can be approximated by a quadratic function)

outline: there exist constants $\eta \in (0, m^2/L)$, $\gamma > 0$ such that
- if $\|\nabla f(x^{(k)})\|_2 \ge \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$
- if $\|\nabla f(x^{(k)})\|_2 < \eta$, then $t^{(k)} = 1$ and
  $\frac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \le \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2$

Two phases of Newton's method: $\eta \le 3(1 - 2\alpha)\frac{m^2}{L}$, $\quad\gamma = \alpha\beta\eta^2\frac{m}{M^2}$

Damped Newton phase ($\|\nabla f(x)\|_2 \ge \eta$)
- most iterations require backtracking steps
- function value decreases by at least $\gamma$
- if $p^\star > -\infty$, this phase ends after at most $(f(x^{(0)}) - p^\star)/\gamma$ iterations

Quadratically convergent phase ($\|\nabla f(x)\|_2 < \eta$)
- all iterations use step size $t = 1$
- $\|\nabla f(x)\|_2$ converges to zero quadratically: if $\|\nabla f(x^{(k)})\|_2 < \eta$, then for $l \ge k$,
  $\frac{L}{2m^2}\|\nabla f(x^{(l)})\|_2 \le \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^{2^{l-k}} \le \left(\frac12\right)^{2^{l-k}}$

Conclusion: number of iterations until $f(x) - p^\star \le \epsilon$ is bounded above by
$\frac{f(x^{(0)}) - p^\star}{\gamma} + \log_2\log_2(\epsilon_0/\epsilon)$
- $\gamma$, $\epsilon_0$ are constants that depend on $m$, $L$, $x^{(0)}$
- second term is small (of the order of 6) and almost constant for practical purposes
- in practice, constants $m$, $L$ (hence $\gamma$, $\epsilon_0$) are usually unknown
- provides qualitative insight into convergence properties (i.e., explains the two algorithm phases)

Examples

example in $\mathbf{R}^2$
(figure: $f(x^{(k)}) - p^\star$ versus $k$, dropping from about $10^5$ to $10^{-15}$ in 5 iterations)
- backtracking parameters $\alpha = 0.1$, $\beta = 0.7$
- converges in only 5 steps
- quadratic local convergence

Example in $\mathbf{R}^{100}$
(figures: $f(x^{(k)}) - p^\star$ and step size $t^{(k)}$ versus $k$, for exact and backtracking line search)
- backtracking parameters $\alpha = 0.01$, $\beta = 0.5$
- backtracking line search almost as fast as exact l.s. (and much simpler)
- clearly shows the two phases of the algorithm

Example in $\mathbf{R}^{10000}$ (with sparse $a_i$)
$f(x) = -\sum_{i=1}^{10000}\log(1 - x_i^2) - \sum_{i=1}^{100000}\log(b_i - a_i^T x)$
(figure: $f(x^{(k)}) - p^\star$ versus $k$; convergence in about 20 iterations)
- backtracking parameters $\alpha = 0.01$, $\beta = 0.5$
- performance similar to the small examples

Self-concordance

shortcomings of classical convergence analysis
- depends on unknown constants ($m$, $L$, ...)
- bound is not affinely invariant, although Newton's method is

convergence analysis via self-concordance (Nesterov and Nemirovski)
- does not depend on any unknown constants
- gives an affine-invariant bound
- applies to a special class of convex functions ("self-concordant" functions)
- developed to analyze polynomial-time interior-point methods for convex optimization

Self-concordant functions

definition
- convex $f : \mathbf{R} \to \mathbf{R}$ is self-concordant if $|f'''(x)| \le 2f''(x)^{3/2}$ for all $x \in \operatorname{dom} f$
- convex $f : \mathbf{R}^n \to \mathbf{R}$ is self-concordant if $g(t) = f(x + tv)$ is self-concordant for all $x \in \operatorname{dom} f$, $v \in \mathbf{R}^n$

examples on $\mathbf{R}$
- linear and quadratic functions
- negative logarithm $f(x) = -\log x$
- negative entropy plus negative logarithm: $f(x) = x\log x - \log x$

affine invariance: if $f : \mathbf{R} \to \mathbf{R}$ is s.c., then $\tilde f(y) = f(ay + b)$ is s.c.:
$\tilde f'''(y) = a^3 f'''(ay + b)$, $\quad\tilde f''(y) = a^2 f''(ay + b)$
be careful: if $f$ is s.c., $\alpha f$ may not be (self-concordance is preserved under scaling only for $\alpha \ge 1$)
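The negative logarithm example can be checked directly: $f(x) = -\log x$ has $f''(x) = 1/x^2$ and $f'''(x) = -2/x^3$, so $|f'''(x)| = 2f''(x)^{3/2}$, i.e., the self-concordance inequality holds with equality. A quick numerical confirmation:

```python
import numpy as np

# Check that f(x) = -log(x) satisfies |f'''(x)| <= 2 f''(x)^{3/2} on R_++,
# in fact with equality: f''(x) = 1/x^2 and f'''(x) = -2/x^3.
for x in np.linspace(0.1, 10.0, 100):
    f2 = 1.0 / x**2
    f3 = -2.0 / x**3
    assert abs(f3) <= 2 * f2**1.5 + 1e-9
    assert np.isclose(abs(f3), 2 * f2**1.5)
```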

Self-concordant calculus

properties
- preserved under positive scaling $\alpha \ge 1$, and sum
- preserved under composition with affine function
- if $g$ is convex with $\operatorname{dom} g = \mathbf{R}_{++}$ and $|g'''(x)| \le 3g''(x)/x$, then
  $f(x) = -\log(-g(x)) - \log x$ is self-concordant

examples: the properties can be used to show that the following are s.c.
- $f(x) = -\sum_{i=1}^m\log(b_i - a_i^T x)$ on $\{x \mid a_i^T x < b_i,\ i = 1, \ldots, m\}$
- $f(X) = -\log\det X$ on $\mathbf{S}^n_{++}$
- $f(x) = -\log(y^2 - x^T x)$ on $\{(x, y) \mid \|x\|_2 < y\}$

Convergence analysis for self-concordant functions

summary: there exist constants $\eta \in (0, 1/4]$, $\gamma > 0$ such that
- if $\lambda(x) > \eta$, then $f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma$
- if $\lambda(x) \le \eta$, then $2\lambda(x^{(k+1)}) \le \big(2\lambda(x^{(k)})\big)^2$
($\eta$ and $\gamma$ only depend on the backtracking parameters $\alpha$, $\beta$)

complexity bound: number of Newton iterations bounded by
$\frac{f(x^{(0)}) - p^\star}{\gamma} + \log_2\log_2(1/\epsilon)$
for $\alpha = 0.1$, $\beta = 0.8$, $\epsilon = 10^{-10}$, the bound evaluates to $375(f(x^{(0)}) - p^\star) + 6$

Bounding second derivatives of an SC function
- Let $\phi : \mathbf{R} \to \mathbf{R}$ be a strictly convex self-concordant (SCSC) function
- The SC condition can be restated as
  $\left|\frac{d}{d\tau}\,\phi''(\tau)^{-1/2}\right| \le 1$ for all $\tau \in \operatorname{dom}\phi$  (2)
- Assume $[0, t] \subseteq \operatorname{dom}\phi$. Integrating (2) between $0$ and $t$, we get
  $-t \le \int_0^t \frac{d}{d\tau}\,\phi''(\tau)^{-1/2}\,d\tau \le t$, i.e., $-t \le \phi''(t)^{-1/2} - \phi''(0)^{-1/2} \le t$  (3)
- Rearranging:
  $\frac{\phi''(0)}{(1 + t\,\phi''(0)^{1/2})^2} \le \phi''(t) \le \frac{\phi''(0)}{(1 - t\,\phi''(0)^{1/2})^2}$  (4)
- the lower bound is valid for any $0 \le t \in \operatorname{dom}\phi$, and the upper bound is valid for $0 \le t < \phi''(0)^{-1/2}$

Preliminaries for analysis of Newton's algorithm
- Newton step: $\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1}\nabla f(x)$
- Newton decrement: $\lambda(x) = \|\nabla f(x)\|_{\nabla^2 f(x)^{-1}}$
- Can show: $\lambda(x) = \sup_{v \ne 0} \dfrac{-v^T\nabla f(x)}{\|v\|_{\nabla^2 f(x)}}$
- Let $v^T\nabla f(x) < 0$ and define $\phi_v(t) = f(x + tv)$
- $\phi_v'(0) = \nabla f(x)^T v$ and $\phi_v''(0) = v^T\nabla^2 f(x)v$  (5)
- According to (5), $\lambda(x) \ge -\phi_v'(0)/\phi_v''(0)^{1/2}$ (with equality when $v = \Delta x_{\mathrm{nt}}$)
- If $v = \Delta x_{\mathrm{nt}}$, the exit condition for BT line search with parameters $\alpha < 1/2$ and $\beta$ is
  $\phi(t) \le \phi(0) + \alpha t\,\phi'(0) = \phi(0) - \alpha t\,\lambda(x)^2$  (6)
- (we will drop the subscript of $\phi$ when $v$ is the Newton direction)

Termination criterion and bound on suboptimality
- Integrate the lower bound in (4), applied to $\phi_v(\cdot)$, twice:
  $\phi_v(t) \ge \phi_v(0) + t\,\phi_v'(0) + t\,\phi_v''(0)^{1/2} - \log\big(1 + t\,\phi_v''(0)^{1/2}\big)$  (7)
- The right-hand side of (7) can be minimized analytically, so
  $\inf_{t \ge 0}\phi_v(t) \ge \phi_v(0) - \dfrac{\phi_v'(0)}{\phi_v''(0)^{1/2}} + \log\Big(1 + \dfrac{\phi_v'(0)}{\phi_v''(0)^{1/2}}\Big)$  (8)
- $u + \log(1 - u)$ is decreasing in $u$, and $u = -\phi_v'(0)/\phi_v''(0)^{1/2} \le \lambda(x)$, so we can bound (8) further:
  $\inf_{t \ge 0}\phi_v(t) \ge \phi_v(0) + \lambda(x) + \log(1 - \lambda(x))$  (9)
- If $v$ is the descent direction that leads from $x$ to $x^\star$:
  $p^\star \ge \phi_v(0) + \lambda(x) + \log(1 - \lambda(x)) \ge f(x) - \lambda(x)^2$ if $\lambda(x) \le 0.68$
- Conclusion:
  $f(x) - p^\star \le \lambda(x)^2$ for $\lambda(x) \le 0.68$,  (10)
  so the termination criterion $\lambda(x)^2 \le \epsilon$ results in a guaranteed suboptimality bound when applied to an SCSC function

One iteration of Newton's method on SC functions
- From now on, $v = \Delta x_{\mathrm{nt}}$, and we drop the subscript on $\phi(t)$
- We would like to show that there exist (problem-independent) parameters $\gamma > 0$ and $\eta \in (0, 1/4]$ such that
  - if $\lambda(x) > \eta$, then $f(x^+) - f(x) \le -\gamma$, and
  - if $\lambda(x) \le \eta$, then $t = 1$ and $2\lambda(x^+) \le \big(2\lambda(x)\big)^2$
- Useful derivation step: integrating the upper bound of (4) twice, we get
  $\phi(t) \le \phi(0) - t\lambda(x)^2 - t\lambda(x) - \log(1 - t\lambda(x))$ for $t \in [0, 1/\lambda(x))$  (11)

One iteration of Newton's method on SC functions

Analysis of the damped Newton phase
- Claim: $\hat t = \frac{1}{1 + \lambda(x)}$ satisfies the BT exit condition. Using (11):
  $\phi(\hat t) \le \phi(0) - \lambda(x) + \log(1 + \lambda(x))$
  $\le \phi(0) - \lambda(x)^2/(2(1 + \lambda(x)))$  (true since $\lambda(x) \ge 0$)
  $\le \phi(0) - \alpha\lambda(x)^2/(1 + \lambda(x))$  (true since $\alpha < 1/2$)
  $= \phi(0) - \alpha\lambda(x)^2\,\hat t$
- Hence, after BT, $t \ge \dfrac{\beta}{1 + \lambda(x)}$  (12)
- Combining $t \ge \frac{\beta}{1 + \lambda(x)}$ with (6), we get
  $\phi(t) - \phi(0) \le -\alpha\beta\dfrac{\lambda(x)^2}{1 + \lambda(x)}$,  (13)
  so, set $\gamma = \alpha\beta\dfrac{\eta^2}{1 + \eta}$

Analysis of the quadratically convergent phase
- The following values only depend on the BT settings:
  $\eta = (1 - 2\alpha)/4 < 1/4$ and $\gamma = \alpha\beta\dfrac{\eta^2}{1 + \eta}$  (14)
- Returning to (11): it implies that $t = 1 \in \operatorname{dom}\phi$ if $\lambda(x) < 1$
- Since in this phase $\lambda(x) \le \eta < (1 - 2\alpha)/2$, we have
  $\phi(1) \le \phi(0) - \lambda(x)^2 - \lambda(x) - \log(1 - \lambda(x))$
  $\le \phi(0) - \frac12\lambda(x)^2 + \lambda(x)^3$  (true since $0 \le \lambda(x) \le 0.81$)
  $\le \phi(0) - \alpha\lambda(x)^2$,
  so $t = 1$ satisfies the BT exit condition.
- Can show: $\lambda(x^+) \le \dfrac{\lambda(x)^2}{(1 - \lambda(x))^2}$ for $x^+ = x + \Delta x_{\mathrm{nt}}$ and SCSC $f$
- In particular, if $\lambda(x) \le 1/4$, we have $\lambda(x^+) \le 2\lambda(x)^2$

One iteration of Newton's method on SC functions

Final complexity bound
- Termination criterion: $\lambda(x)^2 \le \epsilon$
- The algorithm will perform as follows:
  - It will spend at most $\frac{f(x^{(0)}) - p^\star}{\gamma}$ iterations in the damped Newton phase.
  - Upon entering the quadratically convergent phase in iteration $k$, the Newton decrement at iteration $l \ge k$ will satisfy
    $2\lambda(x^{(l)}) \le \big(2\lambda(x^{(k)})\big)^{2^{l-k}} \le (2\eta)^{2^{l-k}} \le \left(\frac12\right)^{2^{l-k}}$
  - Thus, the termination criterion will be satisfied for any $l$ such that $\left(\frac12\right)^{2^{l-k+1}} \le \epsilon$, from which we conclude that at most $\log_2\log_2(1/\epsilon)$ iterations will be needed in this second phase

Numerical example: 150 randomly generated instances of
minimize $f(x) = -\sum_{i=1}^m\log(b_i - a_i^T x)$
(figure: number of iterations versus $f(x^{(0)}) - p^\star$; $\circ$: $m = 100$, $n = 50$; $\square$: $m = 1000$, $n = 500$; $\diamond$: $m = 1000$, $n = 50$)
- number of iterations much smaller than $375(f(x^{(0)}) - p^\star) + 6$
- bound of the form $c(f(x^{(0)}) - p^\star) + 6$ with smaller $c$ (empirically) valid

Implementation

main effort in each iteration: evaluate derivatives and solve the Newton system
$H\Delta x = -g$
where $H = \nabla^2 f(x)$, $g = \nabla f(x)$

via Cholesky factorization
$H = LL^T$, $\quad\Delta x_{\mathrm{nt}} = -L^{-T}L^{-1}g$, $\quad\lambda(x) = \|L^{-1}g\|_2$
- cost $(1/3)n^3$ flops for unstructured system
- cost $\ll (1/3)n^3$ if $H$ sparse, banded

Example of dense Newton system with structure
$f(x) = \sum_{i=1}^n \psi_i(x_i) + \psi_0(Ax + b)$, $\quad H = D + A^T H_0 A$
- assume $A \in \mathbf{R}^{p \times n}$, dense, with $p \ll n$
- $D$ diagonal with diagonal elements $\psi_i''(x_i)$; $H_0 = \nabla^2\psi_0(Ax + b)$

method 1: form $H$, solve via dense Cholesky factorization (cost $(1/3)n^3$)

method 2: factor $H_0 = L_0L_0^T$; write the Newton system as
$D\Delta x + A^TL_0w = -g$, $\quad L_0^TA\Delta x - w = 0$
eliminate $\Delta x$ from the first equation; compute $w$ and $\Delta x$ from
$(I + L_0^TAD^{-1}A^TL_0)w = -L_0^TAD^{-1}g$, $\quad D\Delta x = -g - A^TL_0w$
cost: $\approx 2p^2n$ (dominated by the computation of $L_0^TAD^{-1}A^TL_0$)