LBFGS. John Langford, Large Scale Machine Learning Class, February 5. (post presentation version)


We are still doing Linear Learning.
Features: a vector $x \in \mathbb{R}^n$. Label: $y \in \mathbb{R}$.
Goal: Learn $w \in \mathbb{R}^n$ such that $\hat{y}_w(x) = \sum_i w_i x_i$ is close to $y$.

But, this time in a batch fashion. Initialize $w$, then repeatedly:
1. Let $\hat{y}_w(x) = \sum_i w_i x_i$.
2. Let $g_i = \sum_{(x,y)} \frac{\partial L(\hat{y}_w(x), y)}{\partial w_i}$.
3. Compute update direction $d(g)$.
4. Update weights $w_i \leftarrow w_i + d_i(g)$.
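
A minimal sketch of this batch loop for squared loss (an assumption; the class works with general losses), with a pluggable `direction` function standing in for steps 3-4:

```python
import numpy as np

def batch_pass(X, y, direction, passes=10):
    """Batch linear learning on squared loss: X is (examples x features), y holds the labels."""
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        y_hat = X @ w                 # step 1: predictions for every example
        g = X.T @ (y_hat - y)         # step 2: gradient of the loss summed over (x, y)
        w = w + direction(g)          # steps 3-4: move along the chosen direction d(g)
    return w

# Plain gradient descent is direction = lambda g: -0.01 * g; BFGS/LBFGS instead
# supplies d(g) = D g for the direction matrix D developed below (with a suitable sign and step size).
```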

The BFGS Update: $d(g) = Dg$ for some direction matrix $D$. What is $D$?

$D$ is defined purely in terms of two empirical observations:
$\Delta g = g_{\text{new}} - g_{\text{prev}}$, $\Delta w = w_{\text{new}} - w_{\text{prev}}$.

Assertion 1: $\sum_i \Delta g_i \, \Delta w_i = \Delta g \cdot \Delta w$ should be positive for convex functions.

[Figure: a convex function of one parameter with gradients drawn at two points, illustrating that the change in weight times the change in gradient is positive.]

Assertion 2: $T_{kj} = \frac{\Delta g_k \, \Delta w_j}{\sum_i \Delta g_i \, \Delta w_i} = \frac{\Delta g \, \Delta w^\top}{\Delta g \cdot \Delta w}$ transforms direction $\Delta g$ to direction $\Delta w$ and vice versa.

A matrix is a linear function which transforms one vector into another.
$\sum_j T_{kj} v_j = \frac{\Delta g_k \sum_j \Delta w_j v_j}{\sum_i \Delta g_i \, \Delta w_i}$
$\sum_k T_{kj} v_k = \frac{\Delta w_j \sum_k \Delta g_k v_k}{\sum_i \Delta g_i \, \Delta w_i}$

[Figure: three vectors $v$, $\Delta w$, $\Delta g$; applying $T$ to $v$ takes the inner product $\langle v, \Delta w \rangle$ and returns it along the $\Delta g$ direction.]
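
A quick numerical illustration of Assertion 2 (a sketch; `dg` and `dw` are arbitrary stand-ins for $\Delta g$ and $\Delta w$):

```python
import numpy as np

dg = np.array([1.0, 2.0, -1.0])      # stands in for Δg
dw = np.array([0.5, 1.0, 0.2])       # stands in for Δw (chosen so Δg·Δw > 0)
T = np.outer(dg, dw) / (dg @ dw)      # T_kj = Δg_k Δw_j / Σ_i Δg_i Δw_i

v = np.array([2.0, 0.0, 1.0])
print(T @ v)    # Δg scaled by (Δw·v)/(Δg·Δw): v's Δw-component comes back along Δg
print(v @ T)    # Δw scaled by (Δg·v)/(Δg·Δw): the transpose action goes the other way
```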

Assertion 3: Let $\delta_{kj} = I(k = j)$, i.e. 1 if $k = j$ and 0 otherwise. $S_{kj} = \delta_{kj} - T_{kj}$ subtracts the transform $T_{kj}$ while keeping everything else.

$\sum_j S_{kj} v_j = v_k - \frac{\Delta g_k \sum_j \Delta w_j v_j}{\sum_i \Delta g_i \, \Delta w_i}$
$\sum_k S_{kj} v_k = v_j - \frac{\Delta w_j \sum_k \Delta g_k v_k}{\sum_i \Delta g_i \, \Delta w_i}$
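
Continuing the same example (a sketch): $S$ annihilates $\Delta g$ while leaving anything orthogonal to $\Delta w$ untouched, which is the property that makes the recursive update further below satisfy the secant condition.

```python
import numpy as np

dg = np.array([1.0, 2.0, -1.0])       # Δg, as in the sketch above
dw = np.array([0.5, 1.0, 0.2])        # Δw
S = np.eye(3) - np.outer(dg, dw) / (dg @ dw)   # S_kj = δ_kj - T_kj

print(np.allclose(S @ dg, 0.0))       # True: S removes the Δg direction entirely
u = np.array([2.0, -1.0, 0.0])        # u is orthogonal to Δw
print(np.allclose(S @ u, u))          # True: vectors orthogonal to Δw pass through unchanged
```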

Assertion 4: $F_{kj} = \frac{\Delta w_k \, \Delta w_j}{\sum_i \Delta g_i \, \Delta w_i} = \frac{\Delta w \, \Delta w^\top}{\Delta g \cdot \Delta w}$ is an estimate of the inverse Hessian.

$H_{kj} = \frac{\partial^2 L}{\partial w_k \, \partial w_j} \approx \frac{\Delta g_k}{\Delta w_j}$, so $H \Delta w \approx \Delta g$. So an inverse should satisfy $F \Delta g \approx \Delta w$.
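
And a matching check (same stand-in vectors as above) that $F$ sends $\Delta g$ exactly to $\Delta w$, i.e. that it behaves like an inverse Hessian along the observed pair:

```python
import numpy as np

dg = np.array([1.0, 2.0, -1.0])       # Δg
dw = np.array([0.5, 1.0, 0.2])        # Δw
F = np.outer(dw, dw) / (dg @ dw)       # F_kj = Δw_k Δw_j / Σ_i Δg_i Δw_i

print(np.allclose(F @ dg, dw))         # True: F Δg = Δw, the inverse-Hessian (secant) relation
```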

The BFGS direction: $D_{kj} \leftarrow \sum_{il} S_{ik} D_{il} S_{lj} + F_{kj}$, or in recursive matrix form: $D_t = S_t^\top D_{t-1} S_t + F_t$.

Unwinding, we get:
$D_t = S_t^\top S_{t-1}^\top \cdots S_1^\top D_0\, S_1 S_2 \cdots S_t + S_t^\top \cdots S_2^\top F_1\, S_2 \cdots S_t + \cdots + S_t^\top F_{t-1} S_t + F_t$

LBFGS is the low rank approximation:
$L_t = S_t^\top \cdots S_{t-m}^\top D_0\, S_{t-m} \cdots S_t + S_t^\top \cdots S_{t-m+1}^\top F_{t-m}\, S_{t-m+1} \cdots S_t + \cdots + S_t^\top F_{t-1} S_t + F_t$
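
A minimal sketch of applying the recursion $D_t g = S_t^\top D_{t-1}(S_t g) + F_t g$ to a gradient using the $m$ most recently stored $(\Delta w, \Delta g)$ pairs. This is not the production implementation (which typically uses the equivalent two-loop recursion); `d0` stands in for a diagonal $D_0$.

```python
import numpy as np

def lbfgs_apply(g, pairs, d0=None):
    """Compute D_t g via D_t = S_t^T D_{t-1} S_t + F_t, where `pairs` holds the m most
    recent (dw, dg) tuples, oldest first; older pairs are simply never stored."""
    if not pairs:
        return g if d0 is None else d0 * g             # base case: D_0 (identity or a diagonal)
    dw, dg = pairs[-1]                                  # most recent (Δw, Δg)
    rho = 1.0 / (dg @ dw)
    s_g = g - dg * (rho * (dw @ g))                     # S_t g     = g - Δg (Δw·g)/(Δg·Δw)
    inner = lbfgs_apply(s_g, pairs[:-1], d0)            # D_{t-1} (S_t g)
    return inner - dw * (rho * (dg @ inner)) + dw * (rho * (dw @ g))  # S_t^T(...) + F_t g
```

Because $S_t \Delta g_t = 0$ and $F_t \Delta g_t = \Delta w_t$, this product satisfies the secant condition $D_t \Delta g_t = \Delta w_t$; a descent step then moves along the negated, line-searched result.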

Questions

What is $D_0$? $\delta_{jk} \big/ \frac{\partial^2 L}{\partial w_j \partial w_j}$ (the inverse of the diagonal of the Hessian) is a reasonable choice.

How do you make it fast? All operations decompose into dense vector products.

How do you start? Seed $w$ with an online pass first. Initially, the step size may be crazy, so make a second pass computing the second derivative in the chosen direction.

What if loss goes up? Backstep along the previous direction.

How do you regularize? Regularized loss has the form $L'(\hat{y}, y) = L(\hat{y}, y) + \frac{c}{2} \sum_i w_i^2$. Imposing regularization is a once-per-pass dense operation.
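
As one concrete instance of the $D_0$ choice (a sketch under the squared-loss assumption): the Hessian diagonal of squared loss is $\sum_x x_j^2$, so the corresponding diagonal, usable as `d0` in the sketch above, is:

```python
import numpy as np

def inverse_diagonal_hessian(X, eps=1e-8):
    """D_0 as the inverse of the squared-loss Hessian diagonal, 1 / sum_x x_j^2.
    eps guards features that never appear; other losses contribute their own curvature term."""
    return 1.0 / (np.square(X).sum(axis=0) + eps)
```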

How do you restart with new data?

[Figure: the loss around the solution, with the curvature at the solution drawn.]

Compute and store $r_i = \frac{\partial^2 L}{\partial w_i \partial w_i}$. On resumption, regularize by $\sum_i r_i (w_i - o_i)^2$, where $o_i$ is the old weight value.
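
A sketch of that warm-start penalty and its gradient, assuming `r` holds the stored diagonal curvature $r_i$ and `w_old` the old weights $o_i$ (the names are illustrative):

```python
import numpy as np

def warmstart_penalty(w, w_old, r):
    """Penalty sum_i r_i (w_i - o_i)^2 added to the loss on the new data, plus its gradient."""
    diff = w - w_old
    return float(r @ (diff * diff)), 2.0 * r * diff
```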

Why LBFGS? Theorem: If $L$ is quadratic and an exact line search is done for the step size, a variant satisfies $e_t \le C\, 2^{-2t}$ for some $C$.

Of course, it's rarely quadratic and you never perform exact line search.

What happens here?

[Figure: the absolute value function $f(x) = |x - 1|$ on $x \in [0, 2]$.]

What happens to a true Newton step here?

References

[L] Nocedal, J., "Updating quasi-Newton matrices with limited storage," Mathematics of Computation, 35:773-782.
[B] Broyden, C., "The convergence of a class of double-rank minimization algorithms," Journal of the Institute of Mathematics and Its Applications, 6:76-90.
[F] Fletcher, R., "A new approach to variable metric algorithms," Computer Journal, 13(3):317-322.
[G] Goldfarb, D., "A family of variable metric updates derived by variational means," Mathematics of Computation, 24(109):23-26.
[S] Shanno, D., "Conditioning of quasi-Newton methods for function minimization," Mathematics of Computation, 24(111):647-656.

More References: Incremental LBFGS, Olivier Chapelle.