Stochastic Quasi-Newton Methods
1 Stochastic Quasi-Newton Methods. Donald Goldfarb, Department of IEOR, Columbia University. UCLA Distinguished Lecture Series, May 17-19.
2 Outline
- Stochastic approximation: stochastic gradient descent and variance reduction techniques.
- Newton-like and quasi-Newton methods for convex stochastic optimization problems using limited memory block BFGS updates.
- Numerical results on problems from machine learning.
- Quasi-Newton methods for nonconvex stochastic optimization problems using damped and modified limited memory BFGS updates.
3 Stochastic optimization
min f(x) = E[f(x, ξ)], where ξ is a random variable, or the finite sum (with f_i(x) := f(x, ξ_i) for i = 1, ..., n and very large n)
min f(x) = (1/n) Σ_{i=1}^n f_i(x)
f and ∇f are very expensive to evaluate; stochastic gradient descent (SGD) methods choose a random subset S ⊆ [n] and evaluate
f_S(x) = (1/|S|) Σ_{i∈S} f_i(x) and ∇f_S(x) = (1/|S|) Σ_{i∈S} ∇f_i(x).
Essentially, only noisy information about f, ∇f and ∇²f is available.
Challenge: how to smooth the variability of stochastic methods?
Challenge: how to design methods that take advantage of noisy 2nd-order information?
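A minimal sketch of the subsampled gradient ∇f_S described above, assuming toy quadratic components f_i and illustrative data (A, b and the batch size are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_fi(x, i):
    # gradient of the toy component f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def subsampled_grad(x, batch_size=32):
    # grad f_S(x) = (1/|S|) * sum_{i in S} grad f_i(x), with S drawn uniformly from [n]
    S = rng.choice(n, size=batch_size, replace=False)
    return np.mean([grad_fi(x, i) for i in S], axis=0)

x = np.zeros(d)
print(subsampled_grad(x).shape)  # (10,)
```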
4 Stochastic optimization
[Figure: iterate paths of a deterministic gradient method vs. a stochastic gradient method]
5 Stochastic Variance Reduced Gradients
Stochastic methods converge slowly near the optimum due to the variance of the gradient estimates ∇f_S(x), hence requiring a decreasing step size.
We use the control variates approach of Johnson and Zhang (2013) for an SGD method: SVRG. It uses
d = ∇f_S(x_t) − ∇f_S(w_k) + ∇f(w_k),
where w_k is a reference point, in place of ∇f_S(x_t).
w_k, and the full gradient, are computed after each full pass of the data, hence doubling the work of computing stochastic gradients.
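A minimal sketch of the SVRG outer/inner loop using the variance-reduced direction d above; the helpers grad_fi and full_grad and the parameter values are assumptions for illustration:

```python
import numpy as np

def svrg(grad_fi, full_grad, x0, n, eta=0.1, m=100, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        mu = full_grad(w)                 # full gradient at the reference point w_k
        x = w.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced direction d = grad f_i(x_t) - grad f_i(w_k) + grad f(w_k)
            d = grad_fi(x, i) - grad_fi(w, i) + mu
            x = x - eta * d
        w = x                             # new reference point after the inner loop
    return w
```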
6 Stochastic Average Gradient
At iteration t:
- Sample i from {1, ..., N}
- Update y_i^{t+1} = ∇f_i(x^t) and y_j^{t+1} = y_j^t for all j ≠ i
- Compute g^{t+1} = (1/N) Σ_{j=1}^N y_j^{t+1}
- Set x^{t+1} = x^t − α_{t+1} g^{t+1}
Provable linear convergence in expectation.
Other SGD variance reduction techniques have recently been proposed, including the methods SAGA, SDCA, and S2GD.
As a concrete illustration, a sketch of the update appears below.
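A minimal sketch of the SAG iteration above; grad_fi, the step size, and the iteration count are illustrative assumptions:

```python
import numpy as np

def sag(grad_fi, x0, n, alpha=0.01, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    y = np.zeros((n, x.size))     # y_j: last gradient seen for each component f_j
    g = np.zeros_like(x)          # running average g = (1/n) sum_j y_j
    for _ in range(iters):
        i = rng.integers(n)
        gi = grad_fi(x, i)
        g += (gi - y[i]) / n      # refresh the average with the new y_i
        y[i] = gi
        x = x - alpha * g
    return x
```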
7 Quasi-Newton Method for min f(x), f ∈ C¹
Gradient method: x_{k+1} = x_k − α_k ∇f(x_k)
Newton's method: x_{k+1} = x_k − α_k [∇²f(x_k)]^{-1} ∇f(x_k)
Quasi-Newton method: x_{k+1} = x_k − α_k B_k^{-1} ∇f(x_k), where B_k ≻ 0 approximates the Hessian matrix
Update: B_{k+1} s_k = y_k (secant equation), where s_k = x_{k+1} − x_k = α_k d_k and y_k = ∇f_{k+1} − ∇f_k
8 BFGS
BFGS quasi-Newton method:
B_{k+1} = B_k + y_k y_k^T / (s_k^T y_k) − B_k s_k s_k^T B_k / (s_k^T B_k s_k),
where s_k := x_{k+1} − x_k and y_k := ∇f(x_{k+1}) − ∇f(x_k).
B_{k+1} ≻ 0 if B_k ≻ 0 and s_k^T y_k > 0 (curvature condition).
The secant equation has a solution if s_k^T y_k > 0.
When f is strongly convex, s_k^T y_k > 0 holds automatically.
If f is nonconvex, use a line search to guarantee s_k^T y_k > 0.
H_{k+1} = (I − s_k y_k^T / (s_k^T y_k)) H_k (I − y_k s_k^T / (s_k^T y_k)) + s_k s_k^T / (s_k^T y_k)
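A minimal sketch of the inverse-Hessian form of the BFGS update H_{k+1} shown above (dense NumPy, for illustration only):

```python
import numpy as np

def bfgs_update_inverse(H, s, y):
    # H_{k+1} = (I - s y^T / s^T y) H (I - y s^T / s^T y) + s s^T / s^T y
    sy = s @ y
    assert sy > 0, "curvature condition s^T y > 0 required"
    I = np.eye(len(s))
    V = I - np.outer(s, y) / sy
    return V @ H @ V.T + np.outer(s, s) / sy
```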
9 Prior work on Quasi-Newton Methods for Stochastic Optimization
P1 N.N. Schraudolph, J. Yu and S. Günter. A stochastic quasi-Newton method for online convex optim. Int'l. Conf. AI & Stat., 2007.
Modifies BFGS and L-BFGS updates by reducing the step s_k and the last term in the update of H_k; uses step size α_k = β/k for small β > 0.
P2 A. Bordes, L. Bottou and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR vol. 10, 2009.
Uses a diagonal matrix approximation to [∇²f(·)]^{-1} which is updated (hence the name SGD-QN) on each iteration; α_k = 1/(k + α).
10 Prior work on Quasi-Newton Methods for Stochastic Optimization
P3 A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process., no. 10.
Replaces y_k by y_k − δ s_k for some δ > 0 in the BFGS update and also adds δI to the update; uses α_k = β/k; converges in expectation at the sub-linear rate E(f(x_k) − f*) ≤ C/k.
P4 A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. To appear in J. Mach. Learn. Res.
Uses L-BFGS without regularization and α_k = β/k; converges in expectation at the sub-linear rate E(f(x_k) − f*) ≤ C/k.
11 Prior work on Quasi-Newton Methods for Stochastic Optimization
P5 R.H. Byrd, S.L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optim. arXiv v2, 2015.
Averages iterates over L steps keeping H_k fixed; uses the averaged iterates to update H_k, using a subsampled Hessian to compute y_k; α_k = β/k; converges in expectation at a sub-linear rate E(f(x_k) − f*) ≤ C/k.
P6 P. Moritz, R. Nishihara, M.I. Jordan. A linearly-convergent stochastic L-BFGS algorithm, 2015, arXiv v1.
Combines [P5] with SVRG; uses a fixed step size α; converges in expectation at a linear rate.
12 Using Stochastic 2nd-order Information
Assumption: f(x) = (1/n) Σ_{i=1}^n f_i(x) is strongly convex and twice continuously differentiable.
Choose (compute) a sketching matrix S_k (the columns of S_k are a set of directions).
We do not use differences in noisy gradients to estimate curvature, but rather compute the action of the sub-sampled Hessian on S_k, i.e., compute
Y_k = (1/|T|) Σ_{i∈T} ∇²f_i(x) S_k, where T ⊆ [n].
13 Example of Hessian-Vector Computation
In a binary classification problem, the sample function (logistic loss) is
f(w; x_i, z_i) = −z_i log(c(w; x_i)) − (1 − z_i) log(1 − c(w; x_i)),
where c(w; x_i) = 1/(1 + exp(−x_i^T w)), x_i ∈ R^n, w ∈ R^n, z_i ∈ {0, 1}.
Gradient: ∇f(w; x_i, z_i) = (c(w; x_i) − z_i) x_i
Action of Hessian on s: ∇²f(w; x_i, z_i) s = c(w; x_i)(1 − c(w; x_i))(x_i^T s) x_i
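A short sketch of the gradient and Hessian-vector product formulas above, assuming the sigmoid form of c(w; x_i):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_grad(w, x_i, z_i):
    c = sigmoid(x_i @ w)
    return (c - z_i) * x_i                   # grad f = (c(w; x_i) - z_i) x_i

def logistic_hess_vec(w, x_i, s):
    c = sigmoid(x_i @ w)
    return c * (1.0 - c) * (x_i @ s) * x_i   # Hess f @ s = c (1 - c) (x_i^T s) x_i
```

Note that the Hessian action on a direction s costs only one inner product and one scaling of x_i, so the full d x d Hessian never needs to be formed.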
14 Block BFGS
The block BFGS method computes a least-change update to the current approximation H_k to the inverse Hessian matrix (∇²f(x))^{-1} at the current point x, by solving
min_H ‖H − H_k‖ s.t. H = H^T, H Y_k = S_k,
where ‖A‖ = ‖(∇²f(x_k))^{1/2} A (∇²f(x_k))^{1/2}‖_F (F = Frobenius norm).
This gives the updating formula (analogous to the updates derived by Broyden, Fletcher, Goldfarb and Shanno, 1970):
H_{k+1} = (I − S_k [S_k^T Y_k]^{-1} Y_k^T) H_k (I − Y_k [S_k^T Y_k]^{-1} S_k^T) + S_k [S_k^T Y_k]^{-1} S_k^T
or, by the Sherman-Morrison-Woodbury formula:
B_{k+1} = B_k − B_k S_k [S_k^T B_k S_k]^{-1} S_k^T B_k + Y_k [S_k^T Y_k]^{-1} Y_k^T
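A minimal dense sketch of the block BFGS inverse-Hessian update written above (illustrative only; the limited-memory variant on the next slides never forms H explicitly):

```python
import numpy as np

def block_bfgs_update(H, S, Y):
    # H_{k+1} = (I - S L Y^T) H (I - Y L S^T) + S L S^T, with L = (S^T Y)^{-1};
    # S^T Y = S^T (Hess) S is symmetric when Y is the subsampled Hessian action on S.
    Lam = np.linalg.inv(S.T @ Y)
    d = H.shape[0]
    V = np.eye(d) - S @ Lam @ Y.T
    W = np.eye(d) - Y @ Lam @ S.T
    return V @ H @ W + S @ Lam @ S.T
```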
15 Limited Memory Block BFGS
After M block BFGS steps starting from H_{k+1−M}, one can express H_{k+1} as
H_{k+1} = V_k H_k V_k^T + S_k Λ_k S_k^T
        = V_k V_{k−1} H_{k−1} V_{k−1}^T V_k^T + V_k S_{k−1} Λ_{k−1} S_{k−1}^T V_k^T + S_k Λ_k S_k^T
        = V_{k:k+1−M} H_{k+1−M} V_{k:k+1−M}^T + Σ_{i=k+1−M}^{k} V_{k:i+1} S_i Λ_i S_i^T V_{k:i+1}^T,
where V_k = (I − S_k Λ_k Y_k^T), Λ_k = (S_k^T Y_k)^{-1}, and V_{k:i} = V_k V_{k−1} · · · V_i.
16 Limited Memory Block BFGS Intuition
Hence, when the number of variables d is large, instead of storing the d × d matrix H_k, we store the previous M block curvature triples
(S_{k+1−M}, Y_{k+1−M}, Λ_{k+1−M}), ..., (S_k, Y_k, Λ_k).
Then, analogously to the standard L-BFGS method, for any vector v ∈ R^d, H_k v can be computed efficiently using a two-loop block recursion (in O(Mp(d + p) + p³) operations), if all S_i ∈ R^{d×p}.
The limited memory, least-change aspect of BFGS is important.
Each block update acts like a sketching procedure.
17 Two-Loop Recursion
[Algorithm: block two-loop recursion for computing H_k v from the stored triples (S_i, Y_i, Λ_i); the statement is not reproduced in the transcription]
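The recursion itself did not survive the transcription; the following is a reconstruction, based on the compact representation on the previous slide, of a block analogue of the standard L-BFGS two-loop recursion. The initial matrix H_{k+1−M} is approximated here by a scaled identity, which is an assumption for illustration:

```python
import numpy as np

def block_two_loop(v, blocks, H0_scale=1.0):
    # blocks = [(S_1, Y_1, Lam_1), ..., (S_M, Y_M, Lam_M)], oldest first,
    # with Lam_i = (S_i^T Y_i)^{-1}; returns H v for the limited-memory block BFGS H.
    q = np.asarray(v, dtype=float).copy()
    alphas = []
    for S, Y, Lam in reversed(blocks):        # backward pass: newest block first
        a = Lam @ (S.T @ q)
        alphas.append(a)
        q = q - Y @ a
    r = H0_scale * q                          # apply the initial matrix H_{k+1-M}
    for (S, Y, Lam), a in zip(blocks, reversed(alphas)):   # forward pass: oldest first
        b = Lam @ (Y.T @ r)
        r = r + S @ (a - b)
    return r
```

With S_i ∈ R^{d×p}, each pass costs O(Mp(d + p)) operations, matching the count quoted on the previous slide (plus p³ work to form each Λ_i).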
18 Choices for the Sketching Matrix S_k
We employ one of the following strategies:
- Gaussian: S_k ~ N(0, I) has Gaussian entries sampled i.i.d. at each iteration.
- Previous search directions s_i, delayed: store the previous L search directions S_k = [s_{k+1−L}, ..., s_k], then update H_k only once every L iterations.
- Self-conditioning: sample the columns of the Cholesky factor L_k of H_k (i.e., L_k L_k^T = H_k) uniformly at random. Fortunately we can maintain and update L_k efficiently with limited memory.
The matrix S is a sketching matrix in the sense that we are sketching the (possibly very large) equation ∇²f(x) H = I, whose solution is the inverse Hessian. Right multiplying by S compresses/sketches the equation, yielding ∇²f(x) H S = S.
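A small sketch of the three choices above; the function names and the uniform column sampling details are illustrative assumptions:

```python
import numpy as np

def sketch_gaussian(d, p, rng):
    # S_k with i.i.d. Gaussian entries, redrawn every iteration
    return rng.standard_normal((d, p))

def sketch_prev_directions(recent_steps):
    # S_k = [s_{k+1-L}, ..., s_k], built from the stored previous search directions
    return np.column_stack(recent_steps)

def sketch_self_conditioning(L_chol, p, rng):
    # sample p columns of the Cholesky factor L_k of H_k (L_k L_k^T = H_k)
    cols = rng.choice(L_chol.shape[1], size=p, replace=False)
    return L_chol[:, cols]
```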
19 The Basic Algorithm
[Algorithm statement: not reproduced in the transcription]
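Since the algorithm box did not survive the transcription, the following is only a rough reconstruction of how the preceding pieces could fit together: an SVRG-style outer loop whose inner steps are preconditioned by the limited-memory block BFGS matrix, with curvature pairs formed from the subsampled Hessian action on a Gaussian sketch. It reuses the block_two_loop helper from the sketch above; all parameter values and batch sizes are illustrative assumptions, not the authors' choices:

```python
import numpy as np

def stochastic_block_lbfgs(grad_fi, hess_vec_fi, full_grad, x0, n, eta=0.1,
                           m=100, batch=32, p=5, M=5, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(x0, dtype=float).copy()
    blocks = []                                   # stored (S_i, Y_i, Lambda_i) triples
    for _ in range(epochs):
        mu = full_grad(w)                         # full gradient at the reference point
        x = w.copy()
        for _ in range(m):
            S_idx = rng.choice(n, size=batch, replace=False)
            # SVRG-style variance-reduced gradient estimate
            g = np.mean([grad_fi(x, i) - grad_fi(w, i) for i in S_idx], axis=0) + mu
            d = block_two_loop(g, blocks) if blocks else g    # precondition by H_k
            x = x - eta * d
        # refresh curvature: Y = (1/|T|) sum_{i in T} Hess f_i(x) S, S a Gaussian sketch
        T_idx = rng.choice(n, size=batch, replace=False)
        S = rng.standard_normal((x.size, p))
        Y = np.mean([np.column_stack([hess_vec_fi(x, i, S[:, j]) for j in range(p)])
                     for i in T_idx], axis=0)
        blocks = (blocks + [(S, Y, np.linalg.inv(S.T @ Y))])[-M:]
        w = x
    return w
```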
20 Convergence - Assumptions
There exist constants λ, Λ ∈ R_+ such that
f is λ-strongly convex: f(w) ≥ f(x) + ∇f(x)^T (w − x) + (λ/2) ‖w − x‖²₂,
f is Λ-smooth: f(w) ≤ f(x) + ∇f(x)^T (w − x) + (Λ/2) ‖w − x‖²₂.
These assumptions imply that
λI ⪯ ∇²f_S(w) ⪯ ΛI for all w ∈ R^d, S ⊆ [n],
from which we can prove that there exist constants γ, Γ ∈ R_+ such that for all k we have
γI ⪯ H_k ⪯ ΓI.
21 Bounds on the Spectrum of H_k
Lemma. Assume there exist 0 < λ < Λ such that λI ⪯ ∇²f_T(x) ⪯ ΛI for all x ∈ R^d and T ⊆ [n]. Then γI ⪯ H_k ⪯ ΓI, where
1/(1 + MΛ) ≤ γ and Γ ≤ (1 + √κ)^{2M} (1 + 1/(λ(2√κ + κ))), with κ = Λ/λ.
The corresponding bounds in MNJ depend on the problem dimension:
1/((d + M)Λ) ≤ γ and Γ ≤ [(d + M)Λ]^{d+M−1} / λ^{d+M} ∼ (dκ)^{d+M}.
22 Linear Convergence
Theorem. Suppose that the Assumptions hold. Let w_* be the unique minimizer of f(w). Then in our Algorithm, we have for all k ≥ 0 that
E[f(w_k)] − f(w_*) ≤ ρ^k (E[f(w_0)] − f(w_*)),
where the convergence rate is given by
ρ = (1/(2mη) + ηΓ²Λ(Λ − λ)) / (γλ − ηΓ²Λ²) < 1,
assuming we have chosen η < γλ/(2Γ²Λ²) and that we choose m large enough to satisfy
m ≥ 1 / (2η(γλ − ηΓ²Λ(2Λ − λ))),
which is a positive lower bound given our restriction on η.
23 Numerical Experiments
Empirical Risk Minimization test problems: logistic loss with l2 regularizer
min_w Σ_{i=1}^n log(1 + exp(−y_i ⟨a_i, w⟩)) + L ‖w‖²₂
given data A = [a_1, a_2, ..., a_n] ∈ R^{d×n}, y ∈ {0, 1}^n.
For each method, we chose the step size η ∈ {1, .5, .1, .05, ..., 10^{−8}} that gave the best results.
The full gradient was computed after each full data pass.
Vertical axis in the figures below: log(relative error).
24 [Figure: gisette-scale (d = 5,000, n = 6,000): relative error vs. time (s) for gauss_18_m_5, prev_18_m_5, fact_18_m_3, MNJ_bH_330, SVRG]
25 [Figure: covtype-libsvm-binary (d = 54, n = 581,012): relative error vs. time (s) for gauss_8_m_5, prev_8_m_5, fact_8_m_3, MNJ_bH_3815, SVRG]
26 [Figure: Higgs (d = 28, n = 11,000,000): relative error vs. time (s) for gauss_4_m_5, prev_4_m_5, fact_4_m_3, MNJ_bH_16585, SVRG]
27 [Figure: SUSY (d = 18, n = 3,548,466): relative error vs. time (s) for gauss_5_m_5, prev_5_m_5, fact_5_m_3, MNJ_bH_9420, SVRG]
28 [Figure: epsilon-normalized (d = 2,000, n = 400,000): relative error vs. time (s) for gauss_45_m_5, prev_45_m_5, fact_45_m_3, MNJ_bH_3165, SVRG]
29 [Figure: rcv1-training (d = 47,236, n = 20,242): relative error vs. time (s) for gauss_10_m_5, prev_10_m_5, fact_10_m_3, MNJ_bH_715, SVRG]
30 [Figure: url-combined (d = 3,231,961, n = 2,396,130): relative error vs. time (s) for gauss_2_m_5, prev_2_m_5, fact_2_m_3, MNJ_bH_7740, SVRG]
31 Contributions
- New metric learning framework: a block BFGS framework for gradually learning the metric of the underlying function using a sketched form of the subsampled Hessian matrix.
- New limited memory block BFGS method, which may also be of interest for non-stochastic optimization.
- Several sketching matrix possibilities.
- More reasonable bounds on the eigenvalues of H_k, and more reasonable conditions on the step size.
32 Nonconvex stochastic optimization
Most stochastic quasi-Newton optimization methods are for strongly convex problems; this is needed to ensure a curvature condition required for the positive definiteness of B_k (H_k).
This is not possible for problems min f(x) ≡ E[F(x, ξ)], where f is nonconvex.
In the deterministic setting, one can do a line search to guarantee the curvature condition, and hence the positive definiteness of B_k (H_k).
Line search is not possible for stochastic optimization.
To address these issues we develop a stochastic damped and a stochastic modified L-BFGS method.
33 Stochastic Damped BFGS (Wang, Ma, G, Liu, 2015)
Let y_k = (1/m) Σ_{i=1}^m (∇f(x_{k+1}, ξ_{k,i}) − ∇f(x_k, ξ_{k,i})) and define ȳ_k = θ_k y_k + (1 − θ_k) B_k s_k, where
θ_k = 1, if s_k^T y_k ≥ 0.25 s_k^T B_k s_k,
θ_k = (0.75 s_k^T B_k s_k) / (s_k^T B_k s_k − s_k^T y_k), if s_k^T y_k < 0.25 s_k^T B_k s_k.
Update H_k (replace y_k by ȳ_k):
H_{k+1} = (I − ρ_k s_k ȳ_k^T) H_k (I − ρ_k ȳ_k s_k^T) + ρ_k s_k s_k^T, where ρ_k = 1/(s_k^T ȳ_k).
Implemented in a limited memory version.
Work in progress: combine with variance reduced stochastic gradients (SVRG).
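A minimal sketch of the damping rule and the resulting update above; it is written with an explicit dense B_k for illustration, whereas the limited memory implementation never forms B_k:

```python
import numpy as np

def damped_y(B, s, y):
    # ybar_k = theta_k y_k + (1 - theta_k) B_k s_k, guaranteeing s^T ybar > 0
    sBs = s @ B @ s
    sy = s @ y
    theta = 1.0 if sy >= 0.25 * sBs else 0.75 * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * (B @ s)

def damped_bfgs_update_inverse(H, B, s, y):
    # H_{k+1} = (I - rho s ybar^T) H (I - rho ybar s^T) + rho s s^T, rho = 1/(s^T ybar)
    ybar = damped_y(B, s, y)
    rho = 1.0 / (s @ ybar)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, ybar)
    return V @ H @ V.T + rho * np.outer(s, s)
```

With the damped θ_k, one can check that s_k^T ȳ_k ≥ 0.25 s_k^T B_k s_k > 0, so the update stays well defined even when f is nonconvex.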
34 Convergence of Stochastic Damped BFGS Method
Assumptions:
[AS1] f ∈ C¹, bounded below, ∇f is L-Lipschitz continuous.
[AS2] For any iteration k, the stochastic gradient satisfies
E_{ξ_k}[∇f(x_k, ξ_k)] = ∇f(x_k) and E_{ξ_k}[‖∇f(x_k, ξ_k) − ∇f(x_k)‖²] ≤ σ².
Theorem (Global convergence). Assume AS1-AS2 hold (and α_k = β/k ≤ γ/(LΓ²) for all k); then there exist positive constants γ, Γ such that γI ⪯ H_k ⪯ ΓI for all k, and
lim inf_{k→∞} ‖∇f(x_k)‖ = 0, with probability 1.
Under the additional assumption E_{ξ_k}[‖∇f(x_k, ξ_k)‖²] ≤ M,
lim_{k→∞} ‖∇f(x_k)‖ = 0, with probability 1.
We do not need to assume convexity of f.
35 Block L-BFGS Method for Non-Convex Stochastic Optimization
Block update:
H_{k+1} = (I − S_k Λ_k^{-1} Y_k^T) H_k (I − Y_k Λ_k^{-1} S_k^T) + S_k Λ_k^{-1} S_k^T,
where Λ_k = S_k^T Y_k = S_k^T ∇²f(x_k) S_k.
In the non-convex case Λ_k = Λ_k^T may not be positive definite.
Indefiniteness of Λ_k is discovered while computing its Cholesky (LDL^T) factorization.
If during the factorization d_j ≥ δ or |(LD^{1/2})_{ij}| ≤ β is not satisfied, d_j is increased by τ_j, i.e., (Λ_k)_{jj} ← (Λ_k)_{jj} + τ_j.
This has the effect of moving the search direction −H_{k+1} ∇f(x_{k+1}) toward one of negative curvature.
A modification based on Gershgorin discs is also possible.
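A rough sketch of the pivot-modification idea above: an LDL^T factorization of Λ_k that increases any pivot d_j falling below a threshold δ. The bound |(LD^{1/2})_{ij}| ≤ β on the off-diagonal growth is omitted here for brevity, so this is an illustration of the mechanism rather than the full modified factorization:

```python
import numpy as np

def modified_ldl(Lam, delta=1e-8):
    # LDL^T factorization of Lam with pivots bumped up to at least delta;
    # returns L, the pivot vector d, and the (possibly modified) matrix Lam_mod.
    p = Lam.shape[0]
    L, d = np.eye(p), np.zeros(p)
    Lam_mod = Lam.copy().astype(float)
    for j in range(p):
        dj = Lam_mod[j, j] - L[j, :j] @ (d[:j] * L[j, :j])
        if dj < delta:                       # pivot too small or negative:
            tau = delta - dj                 # increase (Lambda_k)_{jj} by tau_j
            Lam_mod[j, j] += tau
            dj = delta
        d[j] = dj
        for i in range(j + 1, p):
            L[i, j] = (Lam_mod[i, j] - L[i, :j] @ (d[:j] * L[j, :j])) / dj
    return L, d, Lam_mod
```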