Quasi-Newton Methods Javier Peña Convex Optimization 10-725/36-725

Last time: primal-dual interior-point methods

Consider the problem

    min_x   f(x)
    subject to   Ax = b,  h(x) <= 0

Assume f, h_1, ..., h_m are convex and differentiable. Assume also that strong duality holds.

Central path equations:

    ∇f(x) + ∇h(x) u + A^T v = 0
    U h(x) + τ 1 = 0
    Ax - b = 0
    u > 0,  h(x) < 0.

Primal-dual interior-point algorithm

Let

    r(x, u, v) := [ ∇f(x) + ∇h(x) u + A^T v ]
                  [ U h(x) + τ 1            ]
                  [ Ax - b                  ]

Crux of each iteration:

    (x+, u+, v+) := (x, u, v) + θ (Δx, Δu, Δv)

where (Δx, Δu, Δv) is the Newton step:

    [ ∇²f(x) + Σ_i u_i ∇²h_i(x)   ∇h(x)   A^T ] [ Δx ]
    [ U ∇h(x)^T                   H(x)    0   ] [ Δu ]  =  - r(x, u, v)
    [ A                           0       0   ] [ Δv ]

Here U = Diag(u), H(x) = Diag(h(x)).

Outline

Today:
- Motivation for quasi-Newton methods
- Most popular updates: SR1, DFP, BFGS, Broyden class
- Superlinear convergence
- Limited memory BFGS
- Stochastic quasi-Newton methods

Gradient descent and Newton revisited

Consider the unconstrained, smooth optimization problem

    min_x  f(x)

where f is twice differentiable and dom(f) = R^n.

Gradient descent method:  x+ = x - t ∇f(x)

Newton's method:  x+ = x - t ∇²f(x)^{-1} ∇f(x)

Quasi-Newton methods

Two main steps in Newton's method:
- Compute the Hessian ∇²f(x)
- Solve the system of equations ∇²f(x) p = -∇f(x)

Each of these two steps could be expensive.

Quasi-Newton method: use instead

    x+ = x + t p,   where   B p = -∇f(x),

for some approximation B of ∇²f(x). Want B easy to compute and B p = g easy to solve.
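As a concrete reference point, here is a minimal Python (NumPy) sketch of one generic quasi-Newton step. The names quasi_newton_step, f, grad, and B are illustrative placeholders, not part of the slides; B is assumed symmetric positive definite so that p is a descent direction, and a simple backtracking rule chooses t.

import numpy as np

def quasi_newton_step(f, grad, x, B, t0=1.0, beta=0.5, c=1e-4):
    """One quasi-Newton step: solve B p = -grad f(x), then backtrack on t."""
    g = grad(x)
    p = np.linalg.solve(B, -g)        # cheap when B has convenient structure
    t = t0
    while f(x + t * p) > f(x) + c * t * (g @ p):   # sufficient-decrease test
        t *= beta
    return x + t * p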

A bit of history

In the mid 1950s W. Davidon was a physicist at Argonne National Lab. He was using a coordinate descent method to solve a long optimization calculation that kept crashing the computer before finishing. Davidon figured out a way to accelerate the computation: the first quasi-Newton method ever.

Although Davidon's contribution was a major breakthrough in optimization, his original paper was rejected. In 1991, after more than 30 years, it was published in the first issue of the SIAM Journal on Optimization.

In addition to his remarkable work in optimization, Davidon was a peace activist. Part of his story is nicely told in The Burglary, a book published shortly after he passed away in 2013.

Secant equation

We would like B_k to approximate ∇²f(x_k), that is,

    ∇f(x_k + s) ≈ ∇f(x_k) + B_k s.

Once x_{k+1} = x_k + s_k is computed, we would like a new B_{k+1}. Idea: since B_k already contains some information, make some suitable update.

Reasonable requirement for B_{k+1}:

    ∇f(x_{k+1}) = ∇f(x_k) + B_{k+1} s_k

or equivalently

    B_{k+1} s_k = ∇f(x_{k+1}) - ∇f(x_k).

Secant equation

The latter condition is called the secant equation and is written as

    B_{k+1} s_k = y_k,   or simply   B+ s = y,

where s_k = x_{k+1} - x_k and y_k = ∇f(x_{k+1}) - ∇f(x_k).

In addition to the secant equation, we would like
(i) B+ symmetric
(ii) B+ close to B
(iii) B positive definite ⇒ B+ positive definite

Symmetric rank-one update (SR1)

Try an update of the form

    B+ = B + a u u^T.

Observe

    B+ s = y   ⇔   (a u^T s) u = y - B s.

The latter can hold only if u is a multiple of y - Bs. It follows that the only symmetric rank-one update that satisfies the secant equation is

    B+ = B + (y - Bs)(y - Bs)^T / ((y - Bs)^T s).
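A short NumPy sketch of the SR1 update, with the usual safeguard of skipping the update when the denominator (y - Bs)^T s is nearly zero. The function name sr1_update and the skip threshold are choices made here, not part of the slides.

import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """SR1 update of the Hessian approximation B, given
    s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k)."""
    r = y - B @ s                      # residual of the secant equation
    denom = r @ s
    if abs(denom) < eps * np.linalg.norm(r) * np.linalg.norm(s):
        return B                       # skip the update when it is ill-defined
    return B + np.outer(r, r) / denom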

Sherman-Morrison-Woodbury formula

A low-rank update on a matrix corresponds to a low-rank update on its inverse.

Theorem (Sherman-Morrison-Woodbury formula). Assume A ∈ R^{n×n} and U, V ∈ R^{n×d}. Then A + U V^T is nonsingular if and only if I + V^T A^{-1} U is nonsingular. In that case

    (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}.

Thus for the SR1 update the inverse H of B is also easily updated:

    H+ = H + (s - Hy)(s - Hy)^T / ((s - Hy)^T y).

SR1 is simple but has two shortcomings: it may fail if (y - Bs)^T s ≈ 0, and it does not preserve positive definiteness.
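The inverse-form SR1 update can be sketched in the same way. The check at the end is only an illustration on random data (a symmetric B near the identity) that this update and the sr1_update sketched above are indeed inverses of each other, as the Sherman-Morrison-Woodbury formula guarantees; sr1_inverse_update is again an illustrative name.

import numpy as np

def sr1_inverse_update(H, s, y, eps=1e-8):
    """SR1 update applied directly to the inverse approximation H ≈ B^{-1}."""
    r = s - H @ y
    denom = r @ y
    if abs(denom) < eps * np.linalg.norm(r) * np.linalg.norm(y):
        return H
    return H + np.outer(r, r) / denom

# Illustrative consistency check against sr1_update from the previous sketch:
rng = np.random.default_rng(0)
B = np.eye(3) + 0.1 * rng.standard_normal((3, 3)); B = (B + B.T) / 2
s, y = rng.standard_normal(3), rng.standard_normal(3)
H = np.linalg.inv(B)
print(np.allclose(sr1_inverse_update(H, s, y),
                  np.linalg.inv(sr1_update(B, s, y))))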

Davidon-Fletcher-Powell (DFP) update

Try a rank-two update

    H+ = H + a u u^T + b v v^T.

The secant equation yields

    s - Hy = (a u^T y) u + (b v^T y) v.

Putting u = s, v = Hy, and solving for a, b we get

    H+ = H - H y y^T H / (y^T H y) + s s^T / (y^T s).

By Sherman-Morrison-Woodbury we get a rank-two update on B:

    B+ = B + [(y - Bs) y^T + y (y - Bs)^T] / (y^T s) - [(y - Bs)^T s / (y^T s)^2] y y^T
       = (I - y s^T / (y^T s)) B (I - s y^T / (y^T s)) + y y^T / (y^T s).
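In code, the DFP update of H is a two-line formula. This is a sketch only; dfp_update_H is an illustrative name and the curvature condition y^T s > 0 is assumed to hold.

import numpy as np

def dfp_update_H(H, s, y):
    """DFP update of the inverse Hessian approximation H."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)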

DFP update: alternate derivation

Find B+ closest to B in some norm so that B+ satisfies the secant equation and is symmetric:

    min_{B+}    ||B+ - B||_?
    subject to  B+ = (B+)^T
                B+ s = y

What norm to use?

Curvature condition

Observe: B+ positive definite and B+ s = y imply

    y^T s = s^T B+ s > 0.

The inequality y^T s > 0 is called the curvature condition.

Fact: if y, s ∈ R^n and y^T s > 0, then there exists M symmetric and positive definite such that M s = y.

DFP update again: solve

    min_{B+}    ||W^{-1}(B+ - B)W^{-T}||_F
    subject to  B+ = (B+)^T
                B+ s = y

where W ∈ R^{n×n} is nonsingular and such that W W^T s = y.

Broyden-Fletcher-Goldfarb-Shanno (BFGS) update

Same ideas as the DFP update but with the roles of B and H exchanged.

Secant equation:

    B+ s = y   ⇔   H+ y = s.

Closeness to H: solve

    min_{H+}    ||W^{-1}(H+ - H)W^{-T}||_F
    subject to  H+ = (H+)^T
                H+ y = s

where W ∈ R^{n×n} is nonsingular and W W^T y = s.

BFGS update

Swapping H and B, and y and s, in the DFP update we get

    B+ = B - B s s^T B / (s^T B s) + y y^T / (y^T s)

and

    H+ = H + [(s - Hy) s^T + s (s - Hy)^T] / (y^T s) - [(s - Hy)^T y / (y^T s)^2] s s^T
       = (I - s y^T / (y^T s)) H (I - y s^T / (y^T s)) + s s^T / (y^T s).

Both DFP and BFGS preserve positive definiteness: if B is positive definite and y^T s > 0, then B+ is positive definite. BFGS is more popular than DFP. It also has a self-correcting property.
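The inverse-form BFGS update from the last formula, as a sketch. The name bfgs_update_H is illustrative, and y^T s > 0 is assumed (as a Wolfe line search guarantees).

import numpy as np

def bfgs_update_H(H, s, y):
    """BFGS update of the inverse Hessian approximation H."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)        # (I - s y^T / (y^T s))
    return V @ H @ V.T + rho * np.outer(s, s)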

The Broyden class

SR1, DFP, and BFGS are some of many possible quasi-Newton updates. The Broyden class of updates is defined by

    B+ = (1 - φ) B+_BFGS + φ B+_DFP,   for φ ∈ R.

By putting

    v := y / (y^T s) - B s / (s^T B s)

we can rewrite the above as

    B+ = B - B s s^T B / (s^T B s) + y y^T / (y^T s) + φ (s^T B s) v v^T.

BFGS and DFP correspond to φ = 0 and φ = 1 respectively. SR1 corresponds to

    φ = y^T s / (y^T s - s^T B s).
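A sketch of the Broyden-class update in the rewritten form above; broyden_update_B is an illustrative name. Setting phi = 0 gives BFGS, phi = 1 gives DFP, and phi = y^T s / (y^T s - s^T B s) recovers SR1.

import numpy as np

def broyden_update_B(B, s, y, phi=0.0):
    """Broyden-class update of the Hessian approximation B."""
    Bs = B @ s
    sBs = s @ Bs
    ys = y @ s
    v = y / ys - Bs / sBs
    return (B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / ys
            + phi * sBs * np.outer(v, v))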

Superlinear convergence

Back to our main goal:

    min_x  f(x)

where f is twice differentiable and dom(f) = R^n.

Quasi-Newton method:
  Pick initial x_0 and B_0
  For k = 0, 1, ...
    Solve B_k p_k = -∇f(x_k)
    Pick t_k and let x_{k+1} = x_k + t_k p_k
    Update B_k to B_{k+1}
  end for
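The loop above, instantiated with the inverse-form BFGS update (bfgs_update_H sketched earlier) and a simple backtracking step, looks as follows. This is a sketch, not a production implementation: f, grad, and x0 are assumed user-supplied, H_0 = I is a default choice, and in practice a Wolfe line search (next slide) would be used so that the curvature condition holds automatically.

import numpy as np

def quasi_newton(f, grad, x0, max_iter=100, tol=1e-8):
    """Generic quasi-Newton loop with BFGS updates on H = B^{-1}."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                      # H_0 = I, i.e. B_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                          # same as solving B_k p_k = -grad f(x_k)
        t = 1.0                             # always try t = 1 first
        while f(x + t * p) > f(x) + 1e-4 * t * (g @ p):
            t *= 0.5                        # simple backtracking
        s = t * p
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if y @ s > 1e-10:                   # update only when curvature holds
            H = bfgs_update_H(H, s, y)
        x, g = x_new, g_new
    return x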

Superlinear convergence

Under suitable assumptions on the objective function f(x) and on the step size t_k we get superlinear convergence:

    lim_{k→∞} ||x_{k+1} - x*|| / ||x_k - x*|| = 0.

Step length (Wolfe conditions). Assume t is chosen (via backtracking) so that

    f(x + t p) ≤ f(x) + α_1 t ∇f(x)^T p

and

    ∇f(x + t p)^T p ≥ α_2 ∇f(x)^T p

for 0 < α_1 < α_2 < 1.
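A simple sketch of a backtracking search for a step satisfying the two Wolfe conditions above. It starts at t = 1 (important for superlinear convergence) and halves until both the sufficient-decrease and curvature conditions hold or it gives up; a production line search would also expand the step or interpolate. The name wolfe_line_search and the default constants are choices made here.

import numpy as np

def wolfe_line_search(f, grad, x, p, alpha1=1e-4, alpha2=0.9, t=1.0, max_iter=50):
    """Backtracking search for a step t satisfying the Wolfe conditions."""
    fx, gx_p = f(x), grad(x) @ p
    for _ in range(max_iter):
        if (f(x + t * p) <= fx + alpha1 * t * gx_p
                and grad(x + t * p) @ p >= alpha2 * gx_p):
            return t
        t *= 0.5
    return t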

Superlinear convergence

The crux of superlinear convergence is the following technical result.

Theorem (Dennis-Moré). Assume f is twice differentiable, x_k → x* such that ∇f(x*) = 0 and ∇²f(x*) is positive definite. If the search direction p_k satisfies

    lim_{k→∞} ||∇f(x_k) + ∇²f(x_k) p_k|| / ||p_k|| = 0        (1)

then there exists k_0 such that
(i) the step length t_k = 1 satisfies the Wolfe conditions for k ≥ k_0;
(ii) if t_k = 1 for k ≥ k_0, then x_k → x* superlinearly.

Under suitable assumptions, the DFP and BFGS updates ensure (1) holds for p_k = -H_k ∇f(x_k), and we get superlinear convergence.

Limited memory BFGS (LBFGS)

For large problems, exact quasi-Newton updates become too costly. An alternative is to maintain a compact approximation of the matrices: save only a few n × 1 vectors and compute the matrix implicitly.

The BFGS method computes the search direction

    p = -H ∇f(x)

where H is updated via

    H+ = (I - s y^T / (y^T s)) H (I - y s^T / (y^T s)) + s s^T / (y^T s).

LBFGS

Instead of computing and storing H, compute an implicit modified version of H by maintaining the last m pairs (y, s). Observe

    H+ g = p + (α - β) s

where

    α = s^T g / (y^T s),   q = g - α y,   p = H q,   β = y^T p / (y^T s).

Hence Hg can be computed via two loops of length k if H is obtained after k BFGS updates. For k > m, LBFGS limits the length of these loops to m.

LBFGS

LBFGS computation of p_k = -H_k ∇f(x_k):

1. q := -∇f(x_k)
2. for i = k-1, ..., max(k-m, 0)
       α_i := (s_i)^T q / ((y_i)^T s_i)
       q := q - α_i y_i
   end for
3. p := H_{0,k} q
4. for i = max(k-m, 0), ..., k-1
       β := (y_i)^T p / ((y_i)^T s_i)
       p := p + (α_i - β) s_i
   end for
5. return p

In step 3, H_{0,k} is the initial H. A popular choice is

    H_{0,k} := [(y_{k-1})^T s_{k-1} / ((y_{k-1})^T y_{k-1})] I.
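The two-loop recursion translates directly into code. A minimal sketch: lbfgs_direction is an illustrative name, s_list and y_list hold the most recent m pairs (s_i, y_i) oldest first, and H_{0,k} is taken as the popular scaled identity above (plain identity when no pairs are stored yet).

import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Two-loop recursion: returns p = -H_k grad_k for the LBFGS matrix H_k."""
    q = -np.asarray(grad_k, dtype=float)
    alphas = []
    for s, y in reversed(list(zip(s_list, y_list))):   # i = k-1 down to k-m
        alpha = (s @ q) / (y @ s)
        q -= alpha * y
        alphas.append(alpha)
    if s_list:                                         # H_{0,k} = (y^T s / y^T y) I
        s, y = s_list[-1], y_list[-1]
        p = ((y @ s) / (y @ y)) * q
    else:
        p = q
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):
        beta = (y @ p) / (y @ s)
        p += (alpha - beta) * s
    return p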

24 Stochastic quasi-newton methods Consider the problem min x E(f(x; ξ)) where f(x, ξ) depends on a random input ξ. Stochastic gradient descent (SGD) Use a draw of ξ to compute a random gradient of the objective function x k+1 = x k t k f(x k, ξ k ) Stochastic quasi-newton template Extend previous ideas x k+1 = x k t k H k f(x k, ξ k )

25 Stochastic quasi-newton methods Some challenges: Theoretical limitations: stochastic iteration cannot have faster than sublinear rate of convergence. Additional cost: a major advantage of SGD (and similar algorithms) is their low cost per iteration. Could the additional overhead of a quasi-newton method be compensated? Conditioning of scaling matrices: updates on H depend on consecutive gradient estimates. The noise in the random gradient estimates can be a hindrance in the updates.

Online BFGS

The most straightforward adaptation of quasi-Newton methods is to use BFGS (or LBFGS) with

    s_k = x_{k+1} - x_k,   y_k = ∇f(x_{k+1}, ξ_k) - ∇f(x_k, ξ_k).

Key: use the same ξ_k in the two stochastic gradients above. Maintain H_k via BFGS or LBFGS updates.

This approach, referred to as online BFGS, is due to Schraudolph-Yu-Günter. With proper tuning it can give some improvement over SGD.
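A schematic sketch of this idea, reusing the bfgs_update_H sketch from earlier: the same sample xi_k is used in both gradient evaluations that form y_k. This is only a skeleton in the spirit of the slide, not the tuned oBFGS algorithm of Schraudolph-Yu-Günter; grad_sample(x, xi) is again a hypothetical oracle.

import numpy as np

def online_bfgs(grad_sample, x0, steps, t=0.1, eps=1e-10, seed=0):
    """Online BFGS skeleton: BFGS updates from stochastic gradient differences."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        xi = rng.integers(10**6)            # one random sample per iteration
        g = grad_sample(x, xi)
        x_new = x - t * H @ g
        s = x_new - x
        y = grad_sample(x_new, xi) - g      # same xi in both stochastic gradients
        if y @ s > eps:                     # guard the curvature condition
            H = bfgs_update_H(H, s, y)
        x = x_new
    return x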

Other stochastic quasi-Newton methods

Byrd-Hansen-Nocedal-Singer propose a stochastic version of LBFGS but with two main changes:
- Perform the LBFGS update only every L iterations.
- For s and y use, respectively,

      s = x̄_t - x̄_{t-1},   where   x̄_t = (1/L) Σ_{i=k-L+1}^{k} x_i

  is the average of the last L iterates, and

      y = ∇²F(x̄_t) s,

  where ∇²F(x̄_t) is a Hessian approximation (based on a random subsample).
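A sketch of how the curvature pair (s, y) of this method could be formed. Both inputs are hypothetical: x_hist is assumed to hold at least the last 2L iterates, and hess_vec_subsample(x, v) is an assumed routine returning an approximate Hessian-vector product computed on a random subsample.

import numpy as np

def sqn_curvature_pair(x_hist, hess_vec_subsample, L):
    """Curvature pair (s, y) from averaged iterates and a subsampled Hessian."""
    x_bar_t = np.mean(x_hist[-L:], axis=0)          # average of last L iterates
    x_bar_prev = np.mean(x_hist[-2 * L:-L], axis=0)  # average of previous L iterates
    s = x_bar_t - x_bar_prev
    y = hess_vec_subsample(x_bar_t, s)               # subsampled Hessian times s
    return s, y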

Some results (Byrd-Hansen-Nocedal-Singer)

[Figure: SQN vs. SGD on a synthetic binary logistic regression problem with n = 50 and N = 7000; objective value f(x) versus accessed data points (adp). Curves: SGD with b = 50, β = 7; SQN with b = 50, β = 2, b_H = 300 and 600; a dotted black line marks the best function value obtained by the coordinate descent (CD) method. SQN with b_H = 300 and 600 outperforms SGD and obtains the same or better objective value than the coordinate descent method.]

More results (Byrd-Hansen-Nocedal-Singer)

The initial scaling parameter in the Hessian update was chosen as the average of the quotients s_i^T y_i / y_i^T y_i over the last M iterations. The figure compares SQN to oLBFGS on two realistic test problems in terms of accessed data points; SQN has overall better performance, which is more pronounced for smaller batch sizes.

[Figure 12: Comparison of oLBFGS (dashed lines) and SQN (solid lines) in terms of accessed data points (adp) on the RCV1 and SPEECH datasets. For RCV1 the gradient batch size is b = 50 or 300 for both methods; additional SQN settings are L = 20, b_H = 1000, M = 10. For SPEECH, b = 100 or 500; SQN settings are L = 10, b_H = 1000, M = 10.]

References and further reading

- J. Dennis and R. Schnabel (1996), Numerical Methods for Unconstrained Optimization and Nonlinear Equations.
- J. Nocedal and S. Wright (2006), Numerical Optimization, Chapters 6 and 7.
- N. Schraudolph, J. Yu, and S. Günter (2007), A stochastic quasi-Newton method for online convex optimization.
- L. Bottou, F. Curtis, and J. Nocedal (2016), Optimization methods for large-scale machine learning.