ORIE 6326: Convex Optimization. Quasi-Newton Methods


ORIE 6326: Convex Optimization. Quasi-Newton Methods
Professor Udell
Operations Research and Information Engineering, Cornell
April 10, 2017

Slides on steepest descent and analysis of Newton's method adapted from Stanford EE364a; slides on BFGS adapted from UCLA EE236C.

Outline
- Steepest descent
- Preconditioning
- Newton method
- Quasi-Newton methods
- Analysis

Steepest descent method

normalized steepest descent direction (at x, for norm ‖·‖):
  Δx_nsd = argmin{ ∇f(x)^T v : ‖v‖ = 1 }
interpretation: for small v, f(x+v) ≈ f(x) + ∇f(x)^T v, so Δx_nsd is the unit-norm step with the most negative directional derivative

(unnormalized) steepest descent direction:
  Δx_sd = ‖∇f(x)‖_* Δx_nsd
satisfies ∇f(x)^T Δx_sd = -‖∇f(x)‖_*² (‖·‖_* is the dual norm)

steepest descent method: the general descent method with Δx = Δx_sd; convergence properties are similar to gradient descent

examples
- Euclidean norm: Δx_sd = -∇f(x)
- quadratic norm ‖x‖_B = (x^T B x)^{1/2} (B ∈ S^n_++): Δx_sd = -B^{-1} ∇f(x)
- ℓ1-norm: Δx_sd = -(∂f(x)/∂x_i) e_i, where |∂f(x)/∂x_i| = ‖∇f(x)‖_∞

[figure: unit balls and normalized steepest descent directions for a quadratic norm and the ℓ1-norm]
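To make the three examples concrete, here is a minimal numpy sketch (not from the slides; the gradient g and matrix B are assumed test data) computing each steepest descent direction:

```python
import numpy as np

def sd_euclidean(g):
    # Euclidean norm: Δx_sd = -∇f(x)
    return -g

def sd_quadratic(g, B):
    # quadratic norm ||·||_B, B positive definite: Δx_sd = -B^{-1} ∇f(x)
    return -np.linalg.solve(B, g)

def sd_l1(g):
    # l1-norm: move along the coordinate with the largest |∂f/∂x_i|
    i = np.argmax(np.abs(g))
    d = np.zeros_like(g)
    d[i] = -g[i]
    return d

g = np.array([3.0, -1.0, 0.5])          # stand-in for ∇f(x)
B = np.diag([10.0, 2.0, 1.0])
print(sd_euclidean(g), sd_quadratic(g, B), sd_l1(g), sep="\n")
```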

choice of norm for steepest descent

[figure: steepest descent iterates x^(0), x^(1), x^(2) with backtracking line search for two quadratic norms; ellipses show {x : ‖x - x^(k)‖_B = 1}]

equivalent interpretation of steepest descent with quadratic norm ‖·‖_B: gradient descent after the change of variables x̄ = B^{1/2} x

shows that the choice of B has a strong effect on the speed of convergence


Recap: convergence analysis for gradient descent

minimize f(x)

recall: we say (twice-differentiable) f is α-strongly convex and β-smooth if αI ⪯ ∇²f(x) ⪯ βI

recall: if f is α-strongly convex and β-smooth, gradient descent converges linearly:
  f(x^(k)) - p* ≤ (β c^k / 2) ‖x^(0) - x*‖²,  where c = ((κ-1)/(κ+1))² and κ = β/α ≥ 1 is the condition number
⟹ want κ ≈ 1

idea: can we minimize another function with κ ≈ 1 whose solution will tell us the minimizer of f?

Preconditioning

for D ≻ 0, the two problems
  minimize f(x)   and   minimize f(Dz)
have solutions related by x* = Dz*
- gradient of f(Dz) is D^T ∇f(Dz)
- second derivative (Hessian) of f(Dz) is D^T ∇²f(Dz) D
- a gradient step on f(Dz) with step size t > 0 is z⁺ = z - t D^T ∇f(Dz), so
  x⁺ = Dz⁺ = Dz - t D D^T ∇f(Dz) = x - t D D^T ∇f(x)

from the previous analysis, gradient descent on z converges fastest if
  D^T ∇²f(Dz) D ≈ I,  i.e.,  D ≈ (∇²f(Dz))^{-1/2}
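A quick numerical check of the identity above (a sketch with an assumed quadratic test function, not from the slides): one gradient step on f(Dz), mapped back through x = Dz, matches the preconditioned step x - tDD^T∇f(x).

```python
import numpy as np

Q = np.diag([100.0, 1.0])                  # ill-conditioned quadratic f(x) = x^T Q x / 2
grad_f = lambda x: Q @ x
D = np.diag(np.diag(Q) ** -0.5)            # D ≈ (∇²f)^{-1/2}, so D^T ∇²f D = I

t = 1.0
x = np.array([1.0, 1.0])
z = np.linalg.solve(D, x)                  # z with x = Dz
z_plus = z - t * D.T @ grad_f(D @ z)       # gradient step on f(Dz)
x_plus = x - t * D @ D.T @ grad_f(x)       # preconditioned gradient step on f
print(np.allclose(D @ z_plus, x_plus))     # True
print(x_plus)                              # [0. 0.]: one step reaches x* when κ = 1
```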

Approximate inverse Hessian

B = DD^T is called the approximate inverse Hessian

can fix B or update it at every iteration:
- if B is constant: called a preconditioned method (e.g., preconditioned conjugate gradient)
- if B is updated: called a (quasi-)Newton method

how to choose B? want
- B ≈ ∇²f(x)^{-1}
- B easy to compute (and update)
- B fast to multiply by


Newton step

  Δx_nt = -∇²f(x)^{-1} ∇f(x)

interpretations:
- x + Δx_nt minimizes the second-order approximation
  f̂(x+v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v
- x + Δx_nt solves the linearized optimality condition
  ∇f(x+v) ≈ ∇f̂(x+v) = ∇f(x) + ∇²f(x) v = 0

[figure: f and its quadratic model f̂ at x, with the Newton iterate x + Δx_nt at the minimizer of f̂]

Δx_nt is the steepest descent direction at x in the local Hessian norm
  ‖u‖_{∇²f(x)} = (u^T ∇²f(x) u)^{1/2}

[figure: dashed lines are contour lines of f; ellipse is {x+v : v^T ∇²f(x) v = 1}; arrow shows -∇f(x); both x + Δx_nsd and x + Δx_nt are marked]

Newton decrement

  λ(x) = (∇f(x)^T ∇²f(x)^{-1} ∇f(x))^{1/2}

a measure of the proximity of x to x*

properties:
- gives an estimate of f(x) - p*, using the quadratic approximation f̂:
  f(x) - inf_y f̂(y) = (1/2) λ(x)²
- equal to the norm of the Newton step in the quadratic Hessian norm:
  λ(x) = (Δx_nt^T ∇²f(x) Δx_nt)^{1/2}
- directional derivative in the Newton direction: ∇f(x)^T Δx_nt = -λ(x)²
- affine invariant (unlike ‖∇f(x)‖₂)

Newton's method

Algorithm 1: Newton's method
given a starting point x ∈ dom f, tolerance ε > 0.
repeat:
  1. Compute the Newton step and decrement:
     Δx_nt := -∇²f(x)^{-1} ∇f(x);  λ² := ∇f(x)^T ∇²f(x)^{-1} ∇f(x).
  2. Stopping criterion: quit if λ²/2 ≤ ε.
  3. Line search: choose step size t by backtracking line search.
  4. Update: x := x + t Δx_nt.

affine invariant, i.e., independent of linear changes of coordinates: Newton iterates for f̃(y) = f(Ty) starting at y^(0) = T^{-1} x^(0) are y^(k) = T^{-1} x^(k)
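A minimal Python sketch of Algorithm 1 (assumed callables f, grad, hess and an assumed test problem; step 1 solves the Newton system rather than forming the inverse):

```python
import numpy as np

def newton(f, grad, hess, x, eps=1e-10, alpha=0.1, beta=0.7, max_iter=100):
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)               # 1. Newton step Δx_nt
        lam2 = -g @ dx                            #    λ² = ∇f(x)^T ∇²f(x)^{-1} ∇f(x)
        if lam2 / 2 <= eps:                       # 2. stopping criterion
            return x
        t = 1.0                                   # 3. backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x = x + t * dx                            # 4. update
    return x

# assumed test problem: f(x) = Σ_i (e^{x_i} + e^{-x_i}), minimized at x = 0
f = lambda x: np.sum(np.exp(x) + np.exp(-x))
grad = lambda x: np.exp(x) - np.exp(-x)
hess = lambda x: np.diag(np.exp(x) + np.exp(-x))
print(newton(f, grad, hess, np.array([2.0, -1.5])))   # ≈ [0, 0]
```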


Quasi-Newton method

Algorithm 2: Quasi-Newton method
given a starting point x ∈ dom f, H ≻ 0, tolerance ε > 0.
repeat:
  1. Compute the step and decrement:
     Δx_qn := -H^{-1} ∇f(x);  λ² := ∇f(x)^T H^{-1} ∇f(x).
  2. Stopping criterion: quit if λ²/2 ≤ ε.
  3. Line search: choose step size t by backtracking line search.
  4. Update x: x := x + t Δx_qn.
  5. Update H: depends on the specific method.

quasi-Newton methods are defined by the choice of the H update; can store and update B = H^{-1} instead

Broyden-Fletcher-Goldfarb-Shanno (BFGS) update

BFGS update:
  H⁺ = H + (y y^T)/(y^T s) - (H s s^T H)/(s^T H s)
where s = x⁺ - x, y = ∇f(x⁺) - ∇f(x)

inverse update:
  B⁺ = (I - (s y^T)/(y^T s)) B (I - (y s^T)/(y^T s)) + (s s^T)/(y^T s)

- note y^T s > 0 if f is strongly convex (monotonicity of the gradient)
- update to H is rank 2
- cost of update or inverse update is O(n²)
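A minimal sketch of the inverse update in numpy (the helper name is mine, not the deck's); expanding the product into outer products keeps the cost at O(n²), and the usage line checks the inverse secant condition B⁺y = s:

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """B+ = (I - s y^T/(y^T s)) B (I - y s^T/(y^T s)) + s s^T/(y^T s), in O(n^2)."""
    rho = 1.0 / (y @ s)                       # requires y^T s > 0
    By = B @ y
    return (B - rho * (np.outer(s, By) + np.outer(By, s))
            + (rho ** 2 * (y @ By) + rho) * np.outer(s, s))

B = np.eye(3)
s = np.array([1.0, 0.0, 0.5])
y = np.array([0.8, 0.1, 0.4])                 # y^T s = 1.0 > 0
print(np.allclose(bfgs_inverse_update(B, s, y) @ y, s))   # True: B+ y = s
```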

BFGS update preserves positive definiteness

if y^T s > 0, then the BFGS update preserves positive definiteness

proof: from the inverse update, for any v ∈ Rⁿ,
  v^T B⁺ v = (v - ((s^T v)/(y^T s)) y)^T B (v - ((s^T v)/(y^T s)) y) + (v^T s)²/(y^T s)
- the first term is nonnegative because B ≻ 0
- the second term is nonnegative because y^T s > 0

can show ∇f(x)^T Δx_qn = -∇f(x)^T H^{-1} ∇f(x) < 0, so Δx_qn is a descent direction:
- the second term is 0 iff v^T s = 0
- the first term is then 0 only if, in addition, v = 0
- taking v = ∇f(x), s = x⁺ - x, we see ∇f(x)^T B ∇f(x) > 0 unless ∇f(x) = 0, i.e., x is optimal

Secant condition

the BFGS update satisfies the secant condition H⁺s = y, i.e.,
  H⁺(x⁺ - x) = ∇f(x⁺) - ∇f(x)

interpretation: define the second-order approximation at x⁺
  f̂(z) = f(x⁺) + ∇f(x⁺)^T (z - x⁺) + (1/2)(z - x⁺)^T H⁺ (z - x⁺)
the secant condition ensures that the gradient of f̂ agrees with f at x:
  ∇f̂(x) = ∇f(x⁺) + H⁺(x - x⁺) = ∇f(x)

Secant method

for f : R → R, BFGS with unit step size gives the secant method
  x⁺ = x - f′(x)/H,  H = (f′(x) - f′(x⁻))/(x - x⁻)
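A tiny sketch of this correspondence (the test function f(x) = x⁴/4, so f′(x) = x³, is an assumption for illustration):

```python
# secant iteration for f'(x) = 0, equivalent to 1-D BFGS with unit step size
fp = lambda x: x ** 3                 # f(x) = x^4 / 4, minimizer x* = 0
x_prev, x = 1.0, 0.9
for _ in range(20):
    H = (fp(x) - fp(x_prev)) / (x - x_prev)   # secant estimate of f''
    x_prev, x = x, x - fp(x) / H
print(x)                              # ≈ 0
```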

Limited memory quasi-Newton methods

main disadvantage of quasi-Newton methods: need to store H or B

limited-memory BFGS (L-BFGS): don't store B explicitly!
- instead, store the m (say, m = 30) most recent values of
  s_j = x^(j) - x^(j-1),  y_j = ∇f(x^(j)) - ∇f(x^(j-1))
- evaluate Δx = -B_k ∇f(x^(k)) recursively (see the sketch below), using
  B_j = (I - (s_j y_j^T)/(y_j^T s_j)) B_{j-1} (I - (y_j s_j^T)/(y_j^T s_j)) + (s_j s_j^T)/(y_j^T s_j)
  assuming B_{k-m} = I

advantages:
- for each update, just apply a rank-1-plus-diagonal matrix to a vector: cost per update is O(n); cost per iteration is O(mn)
- storage is O(mn)
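Here is a minimal matrix-free sketch of that recursion (names are mine; with B_{k-m} = I this computes the same result as the standard L-BFGS two-loop recursion). S and Y hold the m most recent pairs, oldest first:

```python
import numpy as np

def lbfgs_apply(S, Y, v):
    """Return B_k v using the recursion above, without forming any n-by-n matrix."""
    if not S:                              # base case: B_{k-m} = I
        return v
    s, y = S[-1], Y[-1]
    rho = 1.0 / (y @ s)
    u = v - rho * (s @ v) * y              # u = (I - rho y s^T) v
    w = lbfgs_apply(S[:-1], Y[:-1], u)     # w = B_{j-1} u
    return w - rho * (y @ w) * s + rho * (s @ v) * s

# usage inside a quasi-Newton iteration: dx = -lbfgs_apply(S, Y, grad_f(x))
```

Each level does O(n) work, so applying B_k costs O(mn), matching the claim above.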

L-BFGS: interpretations

- only remembers curvature of the Hessian on the active subspace
  S_k = span{s_k, ..., s_{k-m}}
- hope: locally, ∇f(x^(k)) will approximately lie in the active subspace:
  ∇f(x^(k)) = g_S + g_S⊥,  g_S ∈ S_k,  g_S⊥ small
- L-BFGS assumes B_k ≈ I on S⊥, so B_k g_S⊥ ≈ g_S⊥; if g_S⊥ is small, this shouldn't matter much


Convergence of quasi-Newton methods

Global convergence: if f is strongly convex, Newton's method or BFGS with backtracking line search converges to the solution x* for any initial x and any H ≻ 0.

Local convergence of BFGS: if f is strongly convex and ∇²f(x) is Lipschitz continuous, local convergence is superlinear: for k sufficiently large,
  ‖x^(k+1) - x*‖₂ ≤ c_k ‖x^(k) - x*‖₂ → 0,  where c_k → 0

Local convergence of Newton's method: if f is strongly convex and ∇²f(x) is Lipschitz continuous, local convergence is quadratic: for k sufficiently large,
  ‖x^(k+1) - x*‖₂ ≤ c ‖x^(k) - x*‖₂² → 0

Classical convergence analysis of Newton's method

assumptions:
- f strongly convex on S with constant α
- ∇²f is Lipschitz continuous on S, with constant γ > 0:
  ‖∇²f(x) - ∇²f(y)‖₂ ≤ γ ‖x - y‖₂
  (γ measures how well f can be approximated by a quadratic function)

outline: there exist constants η ∈ (0, α²/γ), µ > 0 such that
- if ‖∇f(x)‖₂ ≥ η, then f(x^(k+1)) - f(x^(k)) ≤ -µ
- if ‖∇f(x)‖₂ < η, then
  (γ/(2α²)) ‖∇f(x^(k+1))‖₂ ≤ ((γ/(2α²)) ‖∇f(x^(k))‖₂)²

damped Newton phase (‖∇f(x)‖₂ ≥ η):
- most iterations require backtracking steps
- function value decreases by at least µ
- if p* > -∞, this phase ends after at most (f(x^(0)) - p*)/µ iterations

quadratically convergent phase (‖∇f(x)‖₂ < η):
- all iterations use step size t = 1
- ‖∇f(x)‖₂ converges to zero quadratically: if ‖∇f(x^(k))‖₂ < η, then for l ≥ k,
  (γ/(2α²)) ‖∇f(x^(l))‖₂ ≤ ((γ/(2α²)) ‖∇f(x^(k))‖₂)^{2^(l-k)} ≤ (1/2)^{2^(l-k)}

conclusion: the number of iterations until f(x) - p* ≤ ε is bounded above by
  (f(x^(0)) - p*)/µ + log₂ log₂(ε₀/ε)
- µ, ε₀ are constants that depend on α, γ, x^(0)
- the second term is small (of the order of 6) and almost constant for practical purposes
- in practice, the constants α, γ (hence µ, ε₀) are usually unknown
- provides qualitative insight into convergence properties (i.e., explains the two algorithm phases)

Examples

example in R²:

[figure: iterates x^(0), x^(1); f(x^(k)) - p* vs. k drops from 10⁵ to below 10⁻¹⁵ within 5 iterations]

- backtracking parameters α = 0.1, β = 0.7
- converges in only 5 steps
- quadratic local convergence

example in R¹⁰⁰:

[figure: f - p* vs. k for exact line search and backtracking; step size t^(k) vs. k]

- backtracking parameters α = 0.01, β = 0.5
- backtracking line search almost as fast as exact line search (and much simpler)
- clearly shows the two phases of the algorithm

example in R¹⁰⁰⁰⁰ (with sparse a_i):
  f(x) = -Σ_{i=1}^{10000} log(1 - x_i²) - Σ_{i=1}^{100000} log(b_i - a_i^T x)

[figure: f - p* vs. k, reaching high accuracy in about 20 iterations]

- backtracking parameters α = 0.01, β = 0.5
- performance similar to the small examples

Self-concordance

shortcomings of the classical convergence analysis:
- depends on unknown constants (α, γ, ...)
- the bound is not affinely invariant, although Newton's method is

convergence analysis via self-concordance (Nesterov and Nemirovski):
- does not depend on any unknown constants
- gives an affine-invariant bound
- applies to a special class of convex functions ("self-concordant" functions)
- developed to analyze polynomial-time interior-point methods for convex optimization

Self-concordant functions

definition:
- convex f : R → R is self-concordant if |f′′′(x)| ≤ 2 f′′(x)^{3/2} for all x ∈ dom f
- f : Rⁿ → R is self-concordant if g(t) = f(x + tv) is self-concordant for all x ∈ dom f, v ∈ Rⁿ

examples on R:
- linear and quadratic functions
- negative logarithm f(x) = -log x
- negative entropy plus negative logarithm: f(x) = x log x - log x

affine invariance: if f : R → R is self-concordant, then f̃(y) = f(ay + b) is self-concordant:
  f̃′′′(y) = a³ f′′′(ay + b),  f̃′′(y) = a² f′′(ay + b)
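For instance (a standard one-line check, not spelled out on the slide), the negative logarithm satisfies the defining inequality with equality:

```latex
f(x) = -\log x:\qquad
f''(x) = \frac{1}{x^2},\quad
f'''(x) = -\frac{2}{x^3},\quad
|f'''(x)| = \frac{2}{x^3} = 2\left(\frac{1}{x^2}\right)^{3/2} = 2\,f''(x)^{3/2}.
```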

Self-concordant calculus

properties:
- preserved under positive scaling α ≥ 1, and sum
- preserved under composition with affine function
- if g is convex with dom g = R₊₊ and |g′′′(x)| ≤ 3 g′′(x)/x, then
  f(x) = -log(-g(x)) - log x is self-concordant

examples: these properties can be used to show that the following are self-concordant:
- f(x) = -Σ_{i=1}^m log(b_i - a_i^T x) on {x : a_i^T x < b_i, i = 1, ..., m}
- f(X) = -log det X on S^n₊₊
- f(x) = -log(y² - x^T x) on {(x, y) : ‖x‖₂ < y}

Convergence analysis for self-concordant functions

summary: there exist constants η ∈ (0, 1/4], γ > 0 such that
- if λ(x) > η, then f(x^(k+1)) - f(x^(k)) ≤ -γ
- if λ(x) ≤ η, then 2λ(x^(k+1)) ≤ (2λ(x^(k)))²
(η and γ only depend on the backtracking parameters α, β)

complexity bound: the number of Newton iterations is bounded by
  (f(x^(0)) - p*)/γ + log₂ log₂(1/ε)

for α = 0.1, β = 0.8, ε = 10⁻¹⁰, the bound evaluates to 375(f(x^(0)) - p*) + 6

numerical example: 150 randomly generated instances of
  minimize f(x) = -Σ_{i=1}^m log(b_i - a_i^T x)

[figure: iterations vs. f(x^(0)) - p* for (m, n) = (100, 50), (1000, 500), (1000, 50); all instances take fewer than 25 iterations]

- number of iterations much smaller than the 375(f(x^(0)) - p*) + 6 bound
- a bound of the form c(f(x^(0)) - p*) + 6 with smaller c is (empirically) valid

Implementation

main effort in each iteration: evaluate derivatives and solve the Newton system
  H Δx = -g
where H = ∇²f(x), g = ∇f(x)

via Cholesky factorization H = LL^T:
  Δx_nt = -L^{-T} L^{-1} g,  λ(x) = ‖L^{-1} g‖₂
- cost (1/3)n³ flops for an unstructured system
- cost ≪ (1/3)n³ if H is sparse or banded
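A minimal sketch of this computation (the function name is mine; uses scipy's triangular solvers):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def newton_step_chol(H, g):
    """Solve H Δx = -g via Cholesky; also return the Newton decrement λ(x)."""
    L = cholesky(H, lower=True)                   # H = L L^T, ~(1/3)n^3 flops
    w = solve_triangular(L, g, lower=True)        # w = L^{-1} g
    dx = -solve_triangular(L.T, w, lower=False)   # Δx_nt = -L^{-T} L^{-1} g
    return dx, np.linalg.norm(w)                  # λ(x) = ||L^{-1} g||_2
```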

example of a dense Newton system with structure:
  f(x) = Σ_{i=1}^n ψ_i(x_i) + ψ₀(Ax + b),  H = D + A^T H₀ A
- assume A ∈ R^{p×n}, dense, with p ≪ n
- D diagonal with diagonal elements ψ_i″(x_i); H₀ = ∇²ψ₀(Ax + b)

method 1: form H, solve via dense Cholesky factorization (cost (1/3)n³)

method 2: factor H₀ = L₀L₀^T; write the Newton system as
  D Δx + A^T L₀ w = -g,  L₀^T A Δx - w = 0
eliminate Δx from the first equation; compute w and Δx from
  (I + L₀^T A D^{-1} A^T L₀) w = -L₀^T A D^{-1} g,  D Δx = -g - A^T L₀ w
cost: ≈ 2p²n flops (dominated by the computation of L₀^T A D^{-1} A^T L₀)
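A sketch of method 2 (all names and test data are assumptions; the final line checks the elimination against solving with H formed explicitly):

```python
import numpy as np
from scipy.linalg import cholesky

def structured_newton_step(d, A, H0, g):
    """Solve (D + A^T H0 A) Δx = -g with D = diag(d), A p-by-n, p << n."""
    L0 = cholesky(H0, lower=True)                 # H0 = L0 L0^T
    AL = A.T @ L0                                 # A^T L0, n-by-p
    M = AL.T / d                                  # L0^T A D^{-1}, p-by-n
    S = np.eye(H0.shape[0]) + M @ AL              # I + L0^T A D^{-1} A^T L0
    w = np.linalg.solve(S, -M @ g)                # small p-by-p system
    return (-g - AL @ w) / d                      # D Δx = -g - A^T L0 w

rng = np.random.default_rng(0)
n, p = 8, 2
A = rng.standard_normal((p, n)); H0 = np.eye(p)
d = rng.random(n) + 1.0; g = rng.standard_normal(n)
H = np.diag(d) + A.T @ H0 @ A
print(np.allclose(structured_newton_step(d, A, H0, g), np.linalg.solve(H, -g)))  # True
```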

References
- Nocedal and Wright, Numerical Optimization
- Lieven Vandenberghe, UCLA EE236C
- Stephen Boyd, Stanford EE364a