OPER 627: Nonlinear Optimization
Lecture 14: Mid-term Review
Department of Statistical Sciences and Operations Research, Virginia Commonwealth University
Oct 16, 2013
Exam begins now...
Try to find Professor Song's technical mistakes (not including typos) in his terrible slides:
- If you find one that nobody else could find, you get one extra point
- Maximum extra points: 5
- Submit the exam paper with these mistakes (I will allocate some space for you to fill out)
An overall summary
1. Theory: optimality conditions in various cases
- In general, FONC, SONC, and SOSC apply to functions defined on an open set (restated below for reference)
- Optimality conditions with convexity
2. Algorithms: line search and trust region
- Everything we study is a variant of Newton's method
- The algorithms in this class only guarantee convergence to a stationary point from any initial point.
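For quick reference, a compact restatement of the three conditions named above, in their standard form for unconstrained minimization of a twice continuously differentiable $f$ on an open set:

```latex
% FONC: if x^* is a local minimizer, then
\nabla f(x^*) = 0.
% SONC: if x^* is a local minimizer, then additionally
\nabla^2 f(x^*) \succeq 0 \quad (\text{positive semidefinite}).
% SOSC: if
\nabla f(x^*) = 0 \quad \text{and} \quad \nabla^2 f(x^*) \succ 0 \quad (\text{positive definite}),
% then x^* is a strict local minimizer.
```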
How do we use optimality conditions?
1. Use FONC to rule out non-stationary solutions
- FONC is used in all algorithms that have global convergence
2. Use SONC to rule out saddle points
3. Use SOSC to validate quadratic convergence of Newton's method
- SOSC is used to show fast convergence to a local minimizer
Convexity
1. First-order characterization of convex functions (restated below)
2. Second-order characterization of convex functions defined on an open set
3. Free lunch, free dinner, ultimate gift
4. Strongly convex: $\nabla^2 f(x) \succeq mI$ for some $m > 0$ (uniformly positive definite); why is this important?
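As a reminder, the two characterizations named in items 1–2, in their standard form for $f$ differentiable (respectively, twice differentiable) on an open convex set $C$:

```latex
% First-order characterization: f is convex on C iff
f(y) \ge f(x) + \nabla f(x)^{\top} (y - x) \quad \forall\, x, y \in C.
% Second-order characterization: f is convex on C iff
\nabla^2 f(x) \succeq 0 \quad \forall\, x \in C.
% Strong convexity with modulus m > 0:
\nabla^2 f(x) \succeq m I \quad \forall\, x \in C.
```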
Optimization algorithms
Motivation: choose the next step by (approximately) minimizing a model built from information at the current iterate:
$f(x_k + p_k) \approx m(p_k) := f(x_k) + \nabla f(x_k)^{\top} p_k + \tfrac{1}{2} p_k^{\top} B_k p_k$
1. Line search: if $B_k$ is nice, we can find a descent direction $p_k$ easily, and a suitable point along that direction becomes our next iterate
2. Trust region: $m(p_k)$ only approximates $f(x_k + p_k)$ well locally, so we look for the next iterate based on how much we trust $m(p_k)$ to approximate $f(x_k + p_k)$, and adjust that level of trust adaptively
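A minimal numerical sketch of the quadratic model above (the test function, point, and names are illustrative, not from the slides); on a quadratic $f$ with $B_k$ equal to the true Hessian, the model is exact:

```python
import numpy as np

def quadratic_model(f, grad, B, x_k):
    """Return m(p) = f(x_k) + grad(x_k)^T p + 0.5 p^T B p."""
    f_k, g_k = f(x_k), grad(x_k)
    return lambda p: f_k + g_k @ p + 0.5 * p @ B @ p

# Illustration on f(x) = 0.5 x^T A x, whose Hessian is A everywhere
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x_k = np.array([1.0, -1.0])
m = quadratic_model(f, grad, A, x_k)   # B_k = true Hessian
p = np.array([0.1, 0.2])
print(m(p), f(x_k + p))                # identical here, since f is quadratic
```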
Line search
1. Wolfe conditions (check sketched after this slide):
- Sufficient decrease: $\phi(\alpha) \le \phi(0) + c_1 \alpha \phi'(0)$
- Curvature: $\phi'(\alpha) \ge c_2 \phi'(0)$
- What are their purposes?
2. Fundamental result for line search (Zoutendijk): assume only that the search direction $p_k$ is a descent direction and that the Wolfe conditions are used for the line search; then
$\sum_{k \ge 0} \cos^2\theta_k \, \|\nabla f(x_k)\|^2 < \infty$, where $\cos\theta_k = \dfrac{-\nabla f(x_k)^{\top} p_k}{\|\nabla f(x_k)\|\,\|p_k\|}$
- How do we use this result to prove global convergence for steepest descent? For Newton?
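A small sketch of how one might check the two Wolfe conditions for a candidate step length (the function names and the values $c_1 = 10^{-4}$, $c_2 = 0.9$ are conventional choices, not prescribed by the slides):

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step length alpha along direction p.

    phi(a) = f(x + a p); phi'(0) = grad(x)^T p must be negative
    (p is a descent direction).
    """
    phi0, dphi0 = f(x), grad(x) @ p
    phi_a, dphi_a = f(x + alpha * p), grad(x + alpha * p) @ p
    sufficient_decrease = phi_a <= phi0 + c1 * alpha * dphi0
    curvature = dphi_a >= c2 * dphi0
    return sufficient_decrease and curvature

# Toy usage on f(x) = 0.5 ||x||^2 with the steepest-descent direction
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([2.0, -1.0])
p = -grad(x)
print(satisfies_wolfe(f, grad, x, p, alpha=1.0))   # True: alpha = 1 minimizes phi here
```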
Line search is all about Newton
Model function: $\min_{p_k} \; m(p_k) := f(x_k) + \nabla f(x_k)^{\top} p_k + \tfrac{1}{2} p_k^{\top} B_k p_k$
Approximate Hessian $B_k$: this is an unconstrained QP; when $B_k$ is PD, its unique stationary point is the global minimizer
1. Choice 1: $B_k = I$, corresponds to steepest descent, $p_k = -\nabla f(x_k)$
- Only first-order information is used
- Linear local convergence
- Convergence can be very slow if the condition number of the Hessian is large
2. Choice 2: $B_k = \nabla^2 f(x_k)$, corresponds to pure Newton, $p_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
- No line search is needed; the step size is always $\alpha_k = 1$
- Quadratic local convergence to $x^*$ if $x^*$ satisfies SOSC
- Fragile: may run into trouble if the Hessian is not PD
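A sketch of the two direction choices on a convex quadratic (the test problem is illustrative); with $B_k$ equal to the true Hessian, a single Newton step reaches the minimizer:

```python
import numpy as np

def search_direction(g, B=None):
    """Minimizer of the quadratic model: p = -B^{-1} g.

    B = None (i.e. B_k = I) gives steepest descent; B = Hessian gives pure Newton.
    """
    if B is None:
        return -g
    return -np.linalg.solve(B, g)

# Illustration on f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b
A = np.array([[10.0, 0.0], [0.0, 1.0]])      # condition number 10
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
x_k = np.array([5.0, 5.0])

p_sd = search_direction(grad(x_k))            # steepest descent: B_k = I
p_newton = search_direction(grad(x_k), A)     # Newton: B_k = true Hessian
print(x_k + p_newton)                          # one Newton step lands on A^{-1} b
```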
Line search is all about Newton (cont'd)
Model function: $\min_{p_k} \; m(p_k) := f(x_k) + \nabla f(x_k)^{\top} p_k + \tfrac{1}{2} p_k^{\top} B_k p_k$
1. Choice 3: Modified Newton, $B_k = \nabla^2 f(x_k) + E_k$
- If $\nabla^2 f(x_k)$ is PD, $E_k = 0$
- Otherwise, $E_k$ is big enough to ensure $B_k$ is PD
- Loses quadratic convergence in general, because we need a line search
2. Choice 4: Quasi-Newton, construct/update a PD matrix $B_k$ as we go (update sketched after this slide)
- The updating formula enforces the secant equation $B_{k+1} s_k = y_k$, so using $B_k$ to approximate the Hessian makes sense
- $B_k$ stays PD thanks to the Wolfe/curvature condition and the updating formula
- BFGS approximates the inverse Hessian by $H_k$
- Superlinear local convergence, but no global convergence guarantee in general
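A sketch of the BFGS update in its inverse-Hessian form, using the standard formula $H_{k+1} = (I - \rho s_k y_k^{\top}) H_k (I - \rho y_k s_k^{\top}) + \rho s_k s_k^{\top}$ with $\rho = 1/(y_k^{\top} s_k)$ (the variable values are illustrative):

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H_k.

    s = x_{k+1} - x_k, y = grad_{k+1} - grad_k; requires y^T s > 0,
    which the Wolfe curvature condition guarantees.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# The updated H satisfies the secant equation in its inverse form: H_{k+1} y = s
H = np.eye(2)
s = np.array([0.5, -0.2])
y = np.array([1.0, 0.3])
H_next = bfgs_inverse_update(H, s, y)
print(np.allclose(H_next @ y, s))   # True
```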
Trust region
All about solving the trust-region subproblem (TRP):
$\min_p \; m(p) := f(x_k) + \nabla f(x_k)^{\top} p + \tfrac{1}{2} p^{\top} B_k p \quad \text{s.t. } \|p\| \le \Delta_k$
1. Direct method: needs a matrix factorization and an iterative root-finding procedure
2. Cauchy points
3. Improved Cauchy points, dogleg methods
Cauchy points and dogleg
Cauchy point: the best solution along the steepest-descent direction within the trust region
A constrained step-size problem:
$\min_{\tau_k} \; f_k + \tau_k g_k^{\top} p_k^s + \tfrac{1}{2} \tau_k^2 (p_k^s)^{\top} B_k p_k^s \quad \text{s.t. } 0 \le \tau_k \le 1, \quad \text{where } p_k^s = -\Delta_k \dfrac{g_k}{\|g_k\|}$
Improvement: dogleg (only when $B_k$ is PD; sketched after this slide)
- Use two line segments to approximate the full trajectory from the global minimizer of the model, $p^B = -B_k^{-1} g_k$, to the minimizer along the steepest-descent direction, $p^U$
- Optimization over the two line segments is easy because of the monotone structure
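A compact sketch of a dogleg step under the assumption that $B_k$ is PD (the parametrization of the second leg by $\tau \in [0,1]$ is one common convention; the test matrix is illustrative):

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Dogleg approximation to the trust-region subproblem (assumes B is PD).

    Path: 0 -> p_U (model minimizer along -g) -> p_B (full step);
    return the point where the path first leaves the trust region.
    """
    p_B = -np.linalg.solve(B, g)                 # unconstrained model minimizer
    if np.linalg.norm(p_B) <= delta:
        return p_B
    p_U = -(g @ g) / (g @ B @ g) * g             # minimizer along -g
    if np.linalg.norm(p_U) >= delta:
        return -delta * g / np.linalg.norm(g)    # scaled steepest-descent step
    # Otherwise solve ||p_U + tau (p_B - p_U)||^2 = delta^2 for tau in [0, 1]
    d = p_B - p_U
    a, b, c = d @ d, 2 * (p_U @ d), p_U @ p_U - delta**2
    tau = (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p_U + tau * d

B = np.array([[2.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, 1.0])
p = dogleg_step(g, B, delta=1.0)
print(p, np.linalg.norm(p))                      # the step has norm exactly 1.0
```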
Line search vs. trust region
Line search:
1. First finds a direction, then chooses the step length
2. Has an easy problem to solve in each iteration
3. May not be able to use the true Hessian in the model function
Trust region:
1. First fixes a maximum step length (the radius), then chooses the direction
2. Has a hard TRP to solve in each iteration
3. Allows the true Hessian in the model function
Important concepts
Condition number
1. Condition number and convergence
- Steepest descent
- Newton/quasi-Newton
- Conjugate gradient (preconditioned CG)
2. Condition number and numerical stability
- Least squares: solving the normal equations $J^{\top} J x = J^{\top} y$ (see the sketch after this slide)
Wolfe conditions
1. Global convergence for inexact line search
2. Guarantee that quasi-Newton Hessian matrices stay PD
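A small numerical illustration of why the normal equations can be numerically risky: forming $J^{\top}J$ squares the condition number (the data below is randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((100, 5))
J[:, 4] = J[:, 3] + 1e-6 * rng.standard_normal(100)   # nearly dependent columns
y = rng.standard_normal(100)

# Forming J^T J squares the condition number ...
print(np.linalg.cond(J), np.linalg.cond(J.T @ J))      # roughly kappa vs kappa^2

# ... so solving the normal equations can be less accurate than a QR/SVD-based solve
x_normal = np.linalg.solve(J.T @ J, J.T @ y)
x_lstsq, *_ = np.linalg.lstsq(J, y, rcond=None)
print(np.linalg.norm(x_normal - x_lstsq))
```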
Optimization for large-scale problems
When problems get bigger, we have to compromise:
1. Quasi-Newton: the Hessian is hard to compute at large scale, so use first-order information to mimic its behavior
2. L-BFGS: use a limited-memory list of vectors $s_k, y_k$ to approximate the quasi-Newton matrix
3. Inexact Newton: compute the Newton direction via conjugate gradient (CG; inner loop sketched after this slide)
- CG is well suited here: it needs only Hessian-vector products, never the Hessian itself
- CG also allows inexact solutions, which is sufficient for superlinear convergence if the forcing tolerance $\eta_k \to 0$
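A sketch of the inexact (truncated) Newton-CG inner loop described in item 3; the stopping rule $\|B_k p + \nabla f_k\| \le \eta_k \|\nabla f_k\|$ is the forcing condition from the slide, while the fallback on negative curvature is one common safeguard (the names and test problem are illustrative):

```python
import numpy as np

def newton_cg_direction(g, hess_vec, eta):
    """Approximately solve B p = -g with CG, using only Hessian-vector products.

    Stops once ||B p + g|| <= eta * ||g||; if negative curvature is detected,
    fall back to the steepest-descent direction (when no progress was made yet).
    """
    n = len(g)
    p = np.zeros(n)
    r = g.copy()                    # residual of B p + g (p = 0 initially)
    d = -r
    tol = eta * np.linalg.norm(g)
    for _ in range(n):
        Bd = hess_vec(d)
        curvature = d @ Bd
        if curvature <= 0:
            return -g if np.allclose(p, 0) else p
        alpha = (r @ r) / curvature
        p = p + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) <= tol:
            return p
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return p

B = np.diag([1.0, 10.0, 100.0])
g = np.array([1.0, 1.0, 1.0])
p = newton_cg_direction(g, lambda v: B @ v, eta=1e-2)
print(np.linalg.norm(B @ p + g) <= 1e-2 * np.linalg.norm(g))   # True
```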
Choice of algorithms for unconstrained nonlinear optimization
If first-order information is not available, you need to take another course!
1. Second-order information is not available:
- Steepest descent (if you are lazy)
- Quasi-Newton: moderate dimensions, $n$ around 100
- Large-scale: L-BFGS, nonlinear CG, inexact quasi-Newton, etc.
2. Second-order information is available:
- Newton, if you know your problem is strongly convex, or you know you are very close to the optimum
- Trust-region methods
- Large-scale: Newton-CG, CG-trust
In practice, choose an implementation/software that best matches your application!
Good luck!