LBFGS
John Langford, Large Scale Machine Learning Class, February 5 (post-presentation version)
We are still doing Linear Learning.
Features: a vector $x \in \mathbb{R}^n$
Label: $y \in \mathbb{R}$
Goal: learn $w \in \mathbb{R}^n$ such that $\hat y_w(x) = \sum_i w_i x_i$ is close to $y$.
But, this time in a batch fashion:
Initialize $w$. Repeatedly:
1. Let $\hat y_w(x) = \sum_i w_i x_i$.
2. Let $g_i = \sum_{(x,y)} \frac{\partial L(\hat y_w(x), y)}{\partial w_i}$.
3. Compute update direction $d(g)$.
4. Update weights: $w_i \leftarrow w_i + d_i(g)$.
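A minimal sketch of this loop in Python/numpy, assuming squared loss; plain gradient descent stands in for the direction computation, which the following slides replace with the (L)BFGS direction (`X`, `y`, and `eta` are illustrative names, not from the slides):

```python
import numpy as np

def batch_learn(X, y, n_passes=100, eta=0.1):
    """One weight update per pass over all examples."""
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        y_hat = X @ w               # step 1: predictions for every x
        g = X.T @ (y_hat - y)       # step 2: full gradient of squared loss
        d = -eta * g                # step 3: direction; (L)BFGS computes a smarter d(g)
        w = w + d                   # step 4: update
    return w
```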
The BFGS Update
$d(g) = Dg$ for some direction matrix $D$. What is $D$?
$D$ is defined purely in terms of two empirical observations:
$\Delta g = g_{\text{new}} - g_{\text{prev}}$
$\Delta w = w_{\text{new}} - w_{\text{prev}}$
Assertion 1
$\sum_i \Delta g_i \Delta w_i = \Delta g \cdot \Delta w$ should be positive for convex functions.
[Figure: a convex function with gradients at two different parameter values, illustrating that the change in weight times the change in gradient is positive.]
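A quick numeric check of this assertion on an assumed convex quadratic loss (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + np.eye(5)            # positive definite Hessian => convex loss
grad = lambda w: H @ w             # gradient of (1/2) w^T H w

w_prev = rng.normal(size=5)
w_new = rng.normal(size=5)
dw = w_new - w_prev
dg = grad(w_new) - grad(w_prev)    # equals H @ dw
print(dg @ dw)                     # dw^T H dw, positive for any dw != 0
```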
Assertion 2
$T_{kj} = \frac{\Delta g_k \Delta w_j}{\sum_i \Delta g_i \Delta w_i} = \frac{\Delta g\, \Delta w^\top}{\Delta g \cdot \Delta w}$
transforms direction $\Delta g$ to direction $\Delta w$ and vice versa.
A matrix is a linear function which transforms one vector into another:
$\sum_j T_{kj} v_j = \sum_j \frac{\Delta g_k \Delta w_j v_j}{\sum_i \Delta g_i \Delta w_i} = \Delta g_k \frac{\sum_j \Delta w_j v_j}{\sum_i \Delta g_i \Delta w_i}$
$\sum_k v_k T_{kj} = \sum_k \frac{v_k \Delta g_k \Delta w_j}{\sum_i \Delta g_i \Delta w_i} = \Delta w_j \frac{\sum_k \Delta g_k v_k}{\sum_i \Delta g_i \Delta w_i}$
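Because $T$ is rank one, it never needs to be materialized; applying it costs two dot products. A sketch, assuming `dg` and `dw` hold $\Delta g$ and $\Delta w$:

```python
import numpy as np

def apply_T(v, dg, dw):
    """Right action sum_j T[k,j] v[j]: reads off v's component
    along dw and emits it along dg."""
    return dg * (dw @ v) / (dg @ dw)

def apply_T_left(v, dg, dw):
    """Left action sum_k v[k] T[k,j]: reads off v's component
    along dg and emits it along dw."""
    return dw * (dg @ v) / (dg @ dw)
```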
3 vectors: $v$, $\Delta w$, $\Delta g$
[Figure, built in three reveals: a vector $v$; its inner product $\langle v, \Delta w \rangle$ with $\Delta w$; and the output $\Delta g\,\langle v, \Delta w \rangle$ along $\Delta g$.]
Assertion 3
Let $\delta_{kj} = I(k = j)$: 1 if $k = j$ and 0 otherwise.
$S_{kj} = \delta_{kj} - T_{kj}$ subtracts the transform $T_{kj}$ while keeping everything else.
$\sum_j S_{kj} v_j = v_k - \Delta g_k \frac{\sum_j v_j \Delta w_j}{\sum_i \Delta g_i \Delta w_i}$
$\sum_k v_k S_{kj} = v_j - \Delta w_j \frac{\sum_k \Delta g_k v_k}{\sum_i \Delta g_i \Delta w_i}$
Assertion 4
$F_{kj} = \frac{\Delta w_k \Delta w_j}{\sum_i \Delta g_i \Delta w_i} = \frac{\Delta w\, \Delta w^\top}{\Delta g \cdot \Delta w}$ is an estimate of the inverse Hessian.
$H_{kj} = \frac{\partial^2 L}{\partial w_k \partial w_j} \approx \frac{\Delta g_k}{\Delta w_j}$
So $H \Delta w \approx \Delta g$. So an inverse should satisfy $F \Delta g \approx \Delta w$.
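Indeed, $F$ satisfies this exactly on the observed pair:
$$F \Delta g = \frac{\Delta w\,(\Delta w \cdot \Delta g)}{\Delta g \cdot \Delta w} = \Delta w.$$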
The BFGS direction
$D_{kj} \leftarrow \sum_{il} S_{ik} D_{il} S_{lj} + F_{kj}$
Or in recursive matrix form: $D_t = S_t^\top D_{t-1} S_t + F_t$
Since $S_t \Delta g = 0$ and $F_t \Delta g = \Delta w$, this gives $D_t \Delta g = \Delta w$: each update enforces the inverse-Hessian property on the latest observed pair.
Unwinding, we get:
$D_t = S_t^\top S_{t-1}^\top \cdots S_1^\top D_0\, S_1 S_2 \cdots S_t + S_t^\top \cdots S_2^\top F_1\, S_2 \cdots S_t + \cdots + S_t^\top F_{t-1} S_t + F_t$
LBFGS is the low-rank approximation:
$L_t = S_t^\top \cdots S_{t-m}^\top D_0\, S_{t-m} \cdots S_t + S_t^\top \cdots S_{t-m+1}^\top F_{t-m}\, S_{t-m+1} \cdots S_t + \cdots + S_t^\top F_{t-1} S_t + F_t$
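In practice $L_t$ is never formed as a matrix: the product $L_t g$ is computed directly from the last $m$ pairs $(\Delta w, \Delta g)$. A minimal sketch of the standard two-loop recursion used to implement LBFGS, assuming a scaled-identity $D_0 = \gamma I$ (a common choice; names are illustrative):

```python
import numpy as np

def lbfgs_direction(g, dws, dgs):
    """Compute L_t @ g from the m most recent (dw, dg) pairs
    (oldest first) via the two-loop recursion."""
    rhos = [1.0 / (dg @ dw) for dw, dg in zip(dws, dgs)]
    q = g.astype(float)
    alphas = []                                  # collected newest-first
    for dw, dg, rho in reversed(list(zip(dws, dgs, rhos))):
        alpha = rho * (dw @ q)
        alphas.append(alpha)
        q = q - alpha * dg
    # D_0 = gamma * I, scaled by the most recent curvature pair
    gamma = (dws[-1] @ dgs[-1]) / (dgs[-1] @ dgs[-1])
    r = gamma * q
    for (dw, dg, rho), alpha in zip(zip(dws, dgs, rhos), reversed(alphas)):
        beta = rho * (dg @ r)
        r = r + (alpha - beta) * dw
    return r                                     # descend along -r
```

Each call costs $O(mn)$: roughly $4m$ dot products and dense vector updates, never an $n \times n$ matrix.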
Questions
What is $D_0$? $\frac{\delta_{jk}}{\partial^2 L / \partial w_j \partial w_j}$ is a reasonable choice.
How do you make it fast? All operations decompose into dense vector products.
How do you start? Seed $w$ with an online pass first. Initially, the step size may be crazy, so make a second pass computing the second derivative in the chosen direction.
What if loss goes up? Backstep along the previous direction.
How do you regularize? Regularized loss has the form $L'(\hat y, y) = L(\hat y, y) + \frac{c}{2} \sum_i w_i^2$. Imposing regularization is a once-per-pass dense operation (see the sketch below).
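A sketch of the regularization gradient as a once-per-pass dense operation, assuming squared loss and the penalty above (`c` is the regularization constant):

```python
import numpy as np

def regularized_gradient(X, y, w, c):
    """Gradient of L'(y_hat, y) = L(y_hat, y) + (c/2) * sum_i w_i^2."""
    g = X.T @ (X @ w - y)    # data term, summed over all examples
    return g + c * w         # regularizer: one dense vector op per pass
```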
How do you restart with new data?
[Figure: the loss around the old solution and its quadratic curvature at that solution.]
Compute and store: $r_i = \frac{\partial^2 L}{\partial w_i \partial w_i}$
On resumption, regularize by $\sum_i r_i (w_i - o_i)^2$, where $o_i$ is the old weight value.
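A sketch of that restart regularizer, assuming `r` stores the curvatures $r_i$ and `w_old` the old solution $o$:

```python
import numpy as np

def restart_penalty(w, w_old, r):
    """Quadratic stand-in for the old data's loss around the old solution."""
    return np.sum(r * (w - w_old) ** 2)

def restart_penalty_gradient(w, w_old, r):
    """Added to the new data's gradient on resumption."""
    return 2.0 * r * (w - w_old)
```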
Why LBFGS?
Theorem: If $L$ is quadratic and an exact line search is done for the step size, a variant satisfies $e_t \le C\, 2^{-2t}$ for some $C$.
Of course, it's rarely quadratic and you never perform an exact line search.
What happens here?
[Figure: $f(x) = |x - 1|$, the absolute value function, which is nonsmooth at its minimizer.]
What happens to a true Newton step here?
References
[L] Nocedal, J., Updating quasi-Newton matrices with limited storage, Mathematics of Computation, 35:773-782.
[B] Broyden, C., The convergence of a class of double-rank minimization algorithms, Journal of the Institute of Mathematics and Its Applications, 6:76-90.
[F] Fletcher, R., A new approach to variable metric algorithms, Computer Journal, 13(3):317-322.
[G] Goldfarb, D., A family of variable metric updates derived by variational means, Mathematics of Computation, 24(109):23-26.
[S] Shanno, D., Conditioning of quasi-Newton methods for function minimization, Mathematics of Computation, 24(111):647-656.
More References Incremental LBFGS Olivier Chapelle