Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35

Outline
- Stochastic approximation
- Stochastic gradient descent
- Variance reduction techniques
- Newton-like and quasi-Newton methods for convex stochastic optimization problems using limited memory block BFGS updates
- Numerical results on problems from machine learning
- Quasi-Newton methods for nonconvex stochastic optimization problems using damped and modified limited memory BFGS updates
2 / 35

Stochastic optimization
Stochastic optimization: $\min f(x) = \mathbb{E}[f(x,\xi)]$, where $\xi$ is a random variable;
or a finite sum (with $f_i(x) \equiv f(x,\xi_i)$ for $i = 1,\dots,n$ and very large $n$):
$\min f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$.
$f$ and $\nabla f$ are very expensive to evaluate; stochastic gradient descent (SGD) methods choose a random subset $S \subseteq [n]$ and evaluate
$f_S(x) = \frac{1}{|S|}\sum_{i\in S} f_i(x)$ and $\nabla f_S(x) = \frac{1}{|S|}\sum_{i\in S} \nabla f_i(x)$.
Essentially, only noisy information about $f$, $\nabla f$ and $\nabla^2 f$ is available.
Challenge: how to smooth the variability of stochastic methods?
Challenge: how to design methods that take advantage of noisy second-order information?
3 / 35

Stochastic optimization
[Figure: iterate trajectories of a deterministic gradient method vs. a stochastic gradient method.]
4 / 35

Stochastic Variance Reduced Gradients
Stochastic methods converge slowly near the optimum due to the variance of the gradient estimates $\nabla f_S(x)$, hence requiring a decreasing step size.
We use the control variates approach of Johnson and Zhang (2013) for an SGD method: SVRG.
It uses $d = \nabla f_S(x_t) - \nabla f_S(w_k) + \nabla f(w_k)$, where $w_k$ is a reference point, in place of $\nabla f_S(x_t)$.
$w_k$, and the full gradient, are computed after each full pass of the data, hence doubling the work of computing stochastic gradients.
5 / 35
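
As a concrete illustration, here is a minimal NumPy sketch of the variance-reduced direction described above. The names `grad_i` (gradient of the $i$-th component function), `batch` (the sampled set $S$), and `full_grad_wk` are placeholders, not code from the talk.

```python
import numpy as np

def svrg_direction(grad_i, x_t, w_k, full_grad_wk, batch):
    """Variance-reduced direction d = grad_S(x_t) - grad_S(w_k) + grad f(w_k).

    grad_i(x, i) is assumed to return the gradient of the i-th component
    function at x; full_grad_wk is the full gradient at the reference
    point w_k, recomputed once per pass over the data.
    """
    g_xt = np.mean([grad_i(x_t, i) for i in batch], axis=0)
    g_wk = np.mean([grad_i(w_k, i) for i in batch], axis=0)
    return g_xt - g_wk + full_grad_wk
```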

Stochastic Average Gradient
At iteration $t$:
- Sample $i$ from $\{1,\dots,N\}$
- Update $y_i^{t+1} = \nabla f_i(x^t)$ and $y_j^{t+1} = y_j^t$ for all $j \neq i$
- Compute $g^{t+1} = \frac{1}{N}\sum_{j=1}^N y_j^{t+1}$
- Set $x^{t+1} = x^t - \alpha_{t+1} g^{t+1}$
Provable linear convergence in expectation.
Other SGD variance reduction techniques have recently been proposed, including the methods SAGA, SDCA, and S2GD.
6 / 35
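
A minimal sketch of one SAG iteration as described above, again assuming a placeholder helper `grad_i(x, i)`; the table `y` of stored component gradients and the running average `g_avg` are the extra memory SAG trades for its linear rate.

```python
import numpy as np

def sag_step(x, y, g_avg, grad_i, alpha):
    """One SAG iteration: refresh one stored gradient, step along the average.

    y is an (n, d) table holding the most recent gradient of each component
    function; g_avg is the mean of the rows of y, updated in O(d) per step.
    """
    n = y.shape[0]
    i = np.random.randint(n)
    g_new = grad_i(x, i)
    g_avg = g_avg + (g_new - y[i]) / n
    y[i] = g_new
    return x - alpha * g_avg, y, g_avg
```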

Quasi-Newton Method for $\min f(x)$, $f \in C^1$
Gradient method: $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
Newton's method: $x_{k+1} = x_k - \alpha_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
Quasi-Newton method: $x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)$, where $B_k \succ 0$ approximates the Hessian matrix.
Update: $B_{k+1} s_k = y_k$ (secant equation), where $s_k = x_{k+1} - x_k = \alpha_k d_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$.
7 / 35

BFGS
BFGS quasi-Newton method:
$B_{k+1} = B_k + \frac{y_k y_k^T}{s_k^T y_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$,
where $s_k := x_{k+1} - x_k$ and $y_k := \nabla f(x_{k+1}) - \nabla f(x_k)$.
$B_{k+1} \succ 0$ if $B_k \succ 0$ and $s_k^T y_k > 0$ (curvature condition).
The secant equation has a solution if $s_k^T y_k > 0$.
When $f$ is strongly convex, $s_k^T y_k > 0$ holds automatically.
If $f$ is nonconvex, use a line search to guarantee $s_k^T y_k > 0$.
$H_{k+1} = \left(I - \frac{s_k y_k^T}{s_k^T y_k}\right) H_k \left(I - \frac{y_k s_k^T}{s_k^T y_k}\right) + \frac{s_k s_k^T}{s_k^T y_k}$
8 / 35
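
A hedged NumPy transcription of the inverse Hessian form of the update above; it assumes the curvature condition $s_k^T y_k > 0$ has already been enforced (e.g., by a line search).

```python
import numpy as np

def bfgs_update_H(H, s, y):
    """BFGS update of the inverse Hessian approximation (sketch).

    Implements H_{k+1} = (I - s y^T/(s^T y)) H (I - y s^T/(s^T y)) + s s^T/(s^T y),
    which is well defined only when the curvature condition s^T y > 0 holds.
    """
    sy = s @ y
    assert sy > 0, "curvature condition s^T y > 0 violated"
    I = np.eye(len(s))
    V = I - np.outer(s, y) / sy
    return V @ H @ V.T + np.outer(s, s) / sy
```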

Prior work on Quasi-Newton Methods for Stochastic Optimization
P1: N.N. Schraudolph, J. Yu and S. Günter. A stochastic quasi-Newton method for online convex optimization. Int'l. Conf. AI & Stat., 2007.
Modifies the BFGS and L-BFGS updates by reducing the step $s_k$ and the last term in the update of $H_k$; uses step size $\alpha_k = \beta/k$ for small $\beta > 0$.
P2: A. Bordes, L. Bottou and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR vol. 10, 2009.
Uses a diagonal matrix approximation to $[\nabla^2 f(\cdot)]^{-1}$ which is updated (hence the name SGD-QN) on each iteration; $\alpha_k = 1/(k + \alpha)$.
9 / 35

Prior work on Quasi-Newton Methods for Stochastic Optimization
P3: A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process., no. 10, 2014.
Replaces $y_k$ by $y_k - \delta s_k$ for some $\delta > 0$ in the BFGS update and also adds $\delta I$ to the update; uses $\alpha_k = \beta/k$; converges in expectation at a sub-linear rate $\mathbb{E}[f(x_k) - f^*] \leq C/k$.
P4: A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. To appear in J. Mach. Learn. Res., 2015.
Uses L-BFGS without regularization and $\alpha_k = \beta/k$; converges in expectation at a sub-linear rate $\mathbb{E}[f(x_k) - f^*] \leq C/k$.
10 / 35

Prior work on Quasi-Newton Methods for Stochastic Optimization
P5: R.H. Byrd, S.L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. arXiv:1401.7020v2, 2015.
Averages iterates over $L$ steps keeping $H_k$ fixed; uses the averaged iterates to update $H_k$, using a subsampled Hessian to compute $y_k$; $\alpha_k = \beta/k$; converges in expectation at a sub-linear rate $\mathbb{E}[f(x_k) - f^*] \leq C/k$.
P6: P. Moritz, R. Nishihara, M.I. Jordan. A linearly-convergent stochastic L-BFGS algorithm, 2015. arXiv:1508.02087v1.
Combines [P5] with SVRG; uses a fixed step size $\alpha$; converges in expectation at a linear rate.
11 / 35

Using Stochastic 2nd-order Information
Assumption: $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ is strongly convex and twice continuously differentiable.
Choose (compute) a sketching matrix $S_k$ (the columns of $S_k$ are a set of directions).
We do not use differences in noisy gradients to estimate curvature, but rather compute the action of the sub-sampled Hessian on $S_k$, i.e., compute
$Y_k = \frac{1}{|T|}\sum_{i\in T} \nabla^2 f_i(x)\, S_k$, where $T \subseteq [n]$.
12 / 35
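
A sketch of how $Y_k$ could be formed from Hessian-vector products only, assuming a placeholder `hess_vec_i(x, i, s)` that returns $\nabla^2 f_i(x)\, s$; no $d \times d$ Hessian is ever materialized.

```python
import numpy as np

def sketched_hessian_action(hess_vec_i, x, S, T):
    """Y = (1/|T|) * sum_{i in T} (Hessian of f_i at x) @ S  (sketch).

    hess_vec_i(x, i, s) is assumed to return the action of the i-th
    component Hessian on the vector s.
    """
    d, p = S.shape
    Y = np.zeros((d, p))
    for i in T:
        for j in range(p):
            Y[:, j] += hess_vec_i(x, i, S[:, j])
    return Y / len(T)
```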

Example of Hessian-Vector Computation
In a binary classification problem, the sample function (logistic loss) is
$f(w; x_i, z_i) = z_i \log(c(w; x_i)) + (1 - z_i)\log(1 - c(w; x_i))$,
where $c(w; x_i) = \frac{1}{1 + \exp(-x_i^T w)}$, $x_i \in \mathbb{R}^n$, $w \in \mathbb{R}^n$, $z_i \in \{0, 1\}$.
Gradient: $\nabla f(w; x_i, z_i) = (c(w; x_i) - z_i)\, x_i$
Action of the Hessian on $s$: $\nabla^2 f(w; x_i, z_i)\, s = c(w; x_i)(1 - c(w; x_i))(x_i^T s)\, x_i$
13 / 35
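
The identity above translates directly into code; a minimal sketch for a single sample $x_i$ stored as a dense vector:

```python
import numpy as np

def logistic_hessian_vector(w, x_i, s):
    """Action of the Hessian of the logistic loss on a direction s (sketch).

    Uses nabla^2 f(w; x_i, z_i) s = c(w; x_i) (1 - c(w; x_i)) (x_i^T s) x_i,
    so no d x d matrix is ever formed.
    """
    c = 1.0 / (1.0 + np.exp(-x_i @ w))
    return c * (1.0 - c) * (x_i @ s) * x_i
```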

Block BFGS
The block BFGS method computes a least-change update to the current approximation $H_k$ of the inverse Hessian matrix $[\nabla^2 f(x)]^{-1}$ at the current point $x$, by solving
$\min \|H - H_k\|$  s.t.  $H = H^T$, $H Y_k = S_k$,
where $\|A\| = \|(\nabla^2 f(x_k))^{1/2} A (\nabla^2 f(x_k))^{1/2}\|_F$ (F = Frobenius).
This gives the updating formula (analogous to the updates derived by Broyden, Fletcher, Goldfarb and Shanno, 1970):
$H_{k+1} = (I - S_k [S_k^T Y_k]^{-1} Y_k^T)\, H_k\, (I - Y_k [S_k^T Y_k]^{-1} S_k^T) + S_k [S_k^T Y_k]^{-1} S_k^T$,
or, by the Sherman-Morrison-Woodbury formula:
$B_{k+1} = B_k - B_k S_k [S_k^T B_k S_k]^{-1} S_k^T B_k + Y_k [S_k^T Y_k]^{-1} Y_k^T$
14 / 35
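
A direct NumPy transcription of the block update of $H$ above, for illustration only; in practice the update is applied in the limited-memory form described next, without ever storing $H$.

```python
import numpy as np

def block_bfgs_update_H(H, S, Y):
    """Block BFGS update of the inverse Hessian approximation (sketch).

    H_{k+1} = (I - S (S^T Y)^{-1} Y^T) H (I - Y (S^T Y)^{-1} S^T)
              + S (S^T Y)^{-1} S^T,
    where the columns of S are the sketch directions and Y is the action
    of the subsampled Hessian on S.
    """
    d, p = S.shape
    Lam = np.linalg.inv(S.T @ Y)   # p x p with p << d
    V = np.eye(d) - S @ Lam @ Y.T
    return V @ H @ V.T + S @ Lam @ S.T
```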

Limited Memory Block BFGS
After $M$ block BFGS steps starting from $H_{k+1-M}$, one can express $H_{k+1}$ as
$H_{k+1} = V_k H_k V_k^T + S_k \Lambda_k S_k^T$
$\quad = V_k V_{k-1} H_{k-1} V_{k-1}^T V_k^T + V_k S_{k-1} \Lambda_{k-1} S_{k-1}^T V_k^T + S_k \Lambda_k S_k^T$
$\quad = V_{k:k+1-M} H_{k+1-M} V_{k:k+1-M}^T + \sum_{i=k+1-M}^{k} V_{k:i+1} S_i \Lambda_i S_i^T V_{k:i+1}^T$,
where $V_k = I - S_k \Lambda_k Y_k^T$, $\Lambda_k = (S_k^T Y_k)^{-1}$, and $V_{k:i} = V_k \cdots V_i$.
15 / 35

Limited Memory Block BFGS Intuition
Hence, when the number of variables $d$ is large, instead of storing the $d \times d$ matrix $H_k$, we store the previous $M$ block curvature triples $(S_{k+1-M}, Y_{k+1-M}, \Lambda_{k+1-M}), \dots, (S_k, Y_k, \Lambda_k)$.
Then, analogously to the standard L-BFGS method, for any vector $v \in \mathbb{R}^d$, $H_k v$ can be computed efficiently using a two-loop block recursion (in $O(Mp(d + p) + p^3)$ operations), if all $S_i \in \mathbb{R}^{d \times p}$.
The limited memory, least-change aspect of BFGS is important.
Each block update acts like a sketching procedure.
16 / 35

Two-Loop Recursion
[The slide shows the block two-loop recursion algorithm; not captured in the transcription.]
17 / 35
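
The algorithm itself is not reproduced in the transcription. The following is a sketch of a block two-loop recursion derived from the compact form $H_{k+1} = V_k H_k V_k^T + S_k \Lambda_k S_k^T$ above, assuming stored triples $(S_i, Y_i, \Lambda_i)$ with $\Lambda_i = (S_i^T Y_i)^{-1}$ and taking $H_{k+1-M}$ to be a scaled identity.

```python
import numpy as np

def block_two_loop(v, triples, H0_scale=1.0):
    """Apply the limited-memory block BFGS matrix to v (sketch).

    triples is a list of (S, Y, Lam) ordered oldest -> newest, with
    Lam = inv(S^T Y).  The recursion follows
    H_{k+1} = V_k H_k V_k^T + S_k Lam_k S_k^T,  V_k = I - S_k Lam_k Y_k^T,
    with the oldest H taken as H0_scale * I.
    """
    alphas = []
    for S, Y, Lam in reversed(triples):                      # newest to oldest
        a = Lam @ (S.T @ v)
        v = v - Y @ a
        alphas.append(a)
    r = H0_scale * v
    for (S, Y, Lam), a in zip(triples, reversed(alphas)):    # oldest to newest
        r = r + S @ (a - Lam @ (Y.T @ r))
    return r
```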

Choices for the Sketching Matrix S_k
We employ one of the following strategies:
- Gaussian: $S_k \sim \mathcal{N}(0, I)$ has Gaussian entries sampled i.i.d. at each iteration.
- Previous search directions $s_i$, delayed: store the previous $L$ search directions $S_k = [s_{k+1-L}, \dots, s_k]$, then update $H_k$ only once every $L$ iterations.
- Self-conditioning: sample the columns of the Cholesky factor $L_k$ of $H_k$ (i.e., $L_k L_k^T = H_k$) uniformly at random. Fortunately, we can maintain and update $L_k$ efficiently with limited memory.
The matrix $S_k$ is a sketching matrix in the sense that we are sketching the possibly very large equation $\nabla^2 f(x)\, H = I$, whose solution is the inverse Hessian. Right multiplying by $S_k$ compresses/sketches the equation, yielding $\nabla^2 f(x)\, H S_k = S_k$.
18 / 35

The Basic Algorithm
[The slide shows the stochastic block BFGS algorithm; not captured in the transcription.]
19 / 35
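
As a rough illustration of how the preceding pieces fit together, here is a hedged sketch of an SVRG-style outer/inner loop whose directions are preconditioned by the limited-memory block BFGS metric, reusing the `sketched_hessian_action` and `block_two_loop` sketches above. The placeholder names (`grad_i`, `hess_vec_i`, `L_update`, batch sizes) and the exact update schedule are illustrative assumptions, not the talk's algorithm verbatim.

```python
import numpy as np

def stochastic_block_bfgs(x0, grad_i, hess_vec_i, n, eta, m, M, p, L_update,
                          n_outer, batch_size):
    """Hedged sketch: variance-reduced gradients + block BFGS metric."""
    rng = np.random.default_rng()
    d = len(x0)
    w = x0.copy()
    triples = []                                              # stored (S, Y, Lam)
    for _ in range(n_outer):
        mu = np.mean([grad_i(w, i) for i in range(n)], axis=0)    # full gradient
        x = w.copy()
        for t in range(m):
            batch = rng.choice(n, size=batch_size, replace=False)
            g = np.mean([grad_i(x, i) - grad_i(w, i) for i in batch], axis=0) + mu
            x = x - eta * block_two_loop(g, triples)          # H_k times VR gradient
            if (t + 1) % L_update == 0:                       # refresh curvature block
                T = rng.choice(n, size=batch_size, replace=False)
                S = rng.standard_normal((d, p))               # Gaussian sketch
                Y = sketched_hessian_action(hess_vec_i, x, S, T)
                triples.append((S, Y, np.linalg.inv(S.T @ Y)))
                triples = triples[-M:]                        # keep the last M blocks
        w = x
    return w
```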

Convergence - Assumptions
There exist constants $\lambda, \Lambda \in \mathbb{R}_+$ such that
- $f$ is $\lambda$-strongly convex: $f(w) \geq f(x) + \nabla f(x)^T (w - x) + \frac{\lambda}{2}\|w - x\|_2^2$,   (2)
- $f$ is $\Lambda$-smooth: $f(w) \leq f(x) + \nabla f(x)^T (w - x) + \frac{\Lambda}{2}\|w - x\|_2^2$.   (3)
These assumptions imply that
$\lambda I \preceq \nabla^2 f_S(w) \preceq \Lambda I$ for all $w \in \mathbb{R}^d$ and $S \subseteq [n]$,   (4)
from which we can prove that there exist constants $\gamma, \Gamma \in \mathbb{R}_+$ such that for all $k$ we have
$\gamma I \preceq H_k \preceq \Gamma I$.   (5)
20 / 35

Bounds on Spectrum of H_k
Lemma. Assume there exist $0 < \lambda < \Lambda$ such that for all $x \in \mathbb{R}^d$ and $T \subseteq [n]$,
$\lambda I \preceq \nabla^2 f_T(x) \preceq \Lambda I$.
Then $\gamma I \preceq H_k \preceq \Gamma I$, where
$\gamma \geq \frac{1}{1 + M\Lambda}$, $\quad \Gamma \leq (1 + \sqrt{\kappa})^{2M}\left(1 + \frac{1}{\lambda(2\sqrt{\kappa} + \kappa)}\right)$, and $\kappa = \Lambda/\lambda$.
The corresponding bounds in MNJ depend on the problem dimension:
$\gamma \geq \frac{1}{(d + M)\Lambda}$ and $\Gamma \leq \frac{[(d + M)\Lambda]^{d+M-1}}{\lambda^{d+M}}$, i.e., of order $(d\kappa)^{d+M}$.
21 / 35

Linear Convergence
Theorem. Suppose that the Assumptions hold. Let $w_*$ be the unique minimizer of $f(w)$. Then in our Algorithm, for all $k \geq 0$,
$\mathbb{E}[f(w_k)] - f(w_*) \leq \rho^k\,(\mathbb{E}[f(w_0)] - f(w_*))$,
where the convergence rate is given by
$\rho = \frac{1/(2m\eta) + \eta\Gamma^2\Lambda(\Lambda - \lambda)}{\gamma\lambda - \eta\Gamma^2\Lambda^2} < 1$,
assuming we have chosen $\eta < \gamma\lambda/(2\Gamma^2\Lambda^2)$ and that we choose $m$ large enough to satisfy
$m \geq \frac{1}{2\eta(\gamma\lambda - \eta\Gamma^2\Lambda(2\Lambda - \lambda))}$,
which is a positive lower bound given our restriction on $\eta$.
22 / 35

Numerical Experiments: Empirical Risk Minimization Test Problems
Logistic loss with $\ell_2$ regularizer:
$\min_w \sum_{i=1}^n \log(1 + \exp(-y_i \langle a_i, w\rangle)) + L\|w\|_2^2$,
given data $A = [a_1, a_2, \dots, a_n] \in \mathbb{R}^{d \times n}$, $y \in \{0, 1\}^n$.
For each method, we chose the step size $\eta \in \{1, 0.5, 0.1, 0.05, \dots, 5\times 10^{-8}, 10^{-8}\}$ that gave the best results.
The full gradient was computed after each full data pass.
Vertical axis in the figures below: log(relative error).
23 / 35
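
For reference, a sketch of the objective and gradient used in these experiments. The slide lists labels $y \in \{0,1\}^n$, while the loss $\log(1+\exp(-y_i\langle a_i, w\rangle))$ is normally stated with labels in $\{-1,+1\}$; the sketch below assumes the $\pm 1$ convention.

```python
import numpy as np

def logistic_objective_grad(w, A, y, L):
    """Regularized logistic loss f(w) = sum_i log(1 + exp(-y_i <a_i, w>)) + L ||w||^2.

    A has shape (n, d) with the a_i as rows and y in {-1, +1}^n (assumed here).
    """
    margins = y * (A @ w)
    f = np.sum(np.logaddexp(0.0, -margins)) + L * (w @ w)
    g = A.T @ (-y / (1.0 + np.exp(margins))) + 2.0 * L * w
    return f, g
```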

gisette-scale: d = 5,000, n = 6,000
[Figure: log(relative error) vs. time (s) for gauss_18_m_5, prev_18_m_5, fact_18_m_3, MNJ_bH_330, SVRG.]
24 / 35

covtype-libsvm-binary: d = 54, n = 581,012
[Figure: log(relative error) vs. time (s) for gauss_8_m_5, prev_8_m_5, fact_8_m_3, MNJ_bH_3815, SVRG.]
25 / 35

Higgs: d = 28, n = 11,000,000
[Figure: log(relative error) vs. time (s) for gauss_4_m_5, prev_4_m_5, fact_4_m_3, MNJ_bH_16585, SVRG.]
26 / 35

SUSY: d = 18, n = 3,548,466
[Figure: log(relative error) vs. time (s) for gauss_5_m_5, prev_5_m_5, fact_5_m_3, MNJ_bH_9420, SVRG.]
27 / 35

epsilon-normalized: d = 2,000, n = 400,000
[Figure: log(relative error) vs. time (s) for gauss_45_m_5, prev_45_m_5, fact_45_m_3, MNJ_bH_3165, SVRG.]
28 / 35

rcv1-training: d = 47,236, n = 20,242
[Figure: log(relative error) vs. time (s) for gauss_10_m_5, prev_10_m_5, fact_10_m_3, MNJ_bH_715, SVRG.]
29 / 35

url-combined: d = 3,231,961, n = 2,396,130
[Figure: log(relative error) vs. time (s) for gauss_2_m_5, prev_2_m_5, fact_2_m_3, MNJ_bH_7740, SVRG.]
30 / 35

Contributions
- New metric learning framework: a block BFGS framework for gradually learning the metric of the underlying function using a sketched form of the subsampled Hessian matrix.
- New limited memory block BFGS method, which may also be of interest for non-stochastic optimization.
- Several possibilities for the sketching matrix.
- More reasonable bounds on the eigenvalues of $H_k$ and more reasonable conditions on the step size.
31 / 35

Nonconvex Stochastic Optimization
Most stochastic quasi-Newton optimization methods are for strongly convex problems; this is needed to ensure a curvature condition required for the positive definiteness of $B_k$ ($H_k$).
This is not possible for problems $\min f(x) \equiv \mathbb{E}[F(x, \xi)]$ where $f$ is nonconvex.
In the deterministic setting, one can do a line search to guarantee the curvature condition, and hence the positive definiteness of $B_k$ ($H_k$); line search is not possible for stochastic optimization.
To address these issues we develop a stochastic damped and a stochastic modified L-BFGS method.
32 / 35

Stochastic Damped BFGS (Wang, Ma, G., Liu, 2015)
Let $y_k = \frac{1}{m}\sum_{i=1}^m (\nabla f(x_{k+1}, \xi_{k,i}) - \nabla f(x_k, \xi_{k,i}))$ and define $\bar{y}_k = \theta_k y_k + (1 - \theta_k) B_k s_k$, where
$\theta_k = \begin{cases} 1, & \text{if } s_k^T y_k \geq 0.25\, s_k^T B_k s_k,\\ \frac{0.75\, s_k^T B_k s_k}{s_k^T B_k s_k - s_k^T y_k}, & \text{if } s_k^T y_k < 0.25\, s_k^T B_k s_k.\end{cases}$
Update $H_k$ (replace $y_k$ by $\bar{y}_k$):
$H_{k+1} = (I - \rho_k s_k \bar{y}_k^T) H_k (I - \rho_k \bar{y}_k s_k^T) + \rho_k s_k s_k^T$, where $\rho_k = 1/(s_k^T \bar{y}_k)$.
Implemented in a limited memory version.
Work in progress: combine with variance-reduced stochastic gradients (SVRG).
33 / 35
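
The damping rule translates directly into code. A sketch assuming $B_k$ is available explicitly; in the limited-memory implementation one would work with products $B_k s_k$ instead.

```python
import numpy as np

def damped_y(s, y, B):
    """Powell-style damping used in the stochastic damped BFGS update (sketch).

    Returns y_bar = theta*y + (1 - theta)*B s with theta chosen so that
    s^T y_bar >= 0.25 * s^T B s, keeping the BFGS update positive definite.
    """
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y
    theta = 1.0 if sy >= 0.25 * sBs else 0.75 * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * Bs
```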

Convergence of Stochastic Damped BFGS Method
Assumptions:
[AS1] $f \in C^1$, bounded below, and $\nabla f$ is $L$-Lipschitz continuous.
[AS2] For any iteration $k$, the stochastic gradient satisfies
$\mathbb{E}_{\xi_k}[\nabla f(x_k, \xi_k)] = \nabla f(x_k)$ and $\mathbb{E}_{\xi_k}[\|\nabla f(x_k, \xi_k) - \nabla f(x_k)\|^2] \leq \sigma^2$.
Theorem (Global convergence). Assume AS1-AS2 hold (and $\alpha_k = \beta/k \leq \gamma/(L\Gamma^2)$ for all $k$). Then there exist positive constants $\gamma, \Gamma$ such that $\gamma I \preceq H_k \preceq \Gamma I$ for all $k$, and
$\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$, with probability 1.
Under the additional assumption $\mathbb{E}_{\xi_k}[\|\nabla f(x_k, \xi_k)\|^2] \leq M$,
$\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$, with probability 1.
We do not need to assume convexity of $f$.
34 / 35

Block L-BFGS Method for Non-Convex Stochastic Optimization
Block update:
$H_{k+1} = (I - S_k \Lambda_k^{-1} Y_k^T) H_k (I - Y_k \Lambda_k^{-1} S_k^T) + S_k \Lambda_k^{-1} S_k^T$,
where $\Lambda_k = S_k^T Y_k = S_k^T \nabla^2 f(x_k) S_k$.
In the non-convex case $\Lambda_k = \Lambda_k^T$ may not be positive definite.
$\Lambda_k \not\succ 0$ is discovered while computing the Cholesky factorization $LDL^T$ of $\Lambda_k$.
If during the factorization $d_j \geq \delta$ or $|(LD^{1/2})_{ij}| \leq \beta$ is not satisfied, $d_j$ is increased by $\tau_j$, i.e., $(\Lambda_k)_{jj} \leftarrow (\Lambda_k)_{jj} + \tau_j$; this has the effect of moving the search direction $-H_{k+1}\nabla f(x_{k+1})$ toward one of negative curvature.
A modification based on Gershgorin discs is also possible.
35 / 35
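
A hedged sketch of the kind of modified $LDL^T$ factorization described above (in the spirit of Gill and Murray); the thresholds `delta` and `beta` and the exact boosting rule for $\tau_j$ are illustrative assumptions, not the talk's specific choices.

```python
import numpy as np

def modified_ldl(A, delta=1e-8, beta=1e2):
    """Modified LDL^T factorization: A + diag(tau) = L D L^T (sketch).

    d_j is boosted whenever d_j >= delta or |(L D^{1/2})_{ij}| <= beta would
    otherwise fail; tau records the diagonal increments applied to A.
    """
    p = A.shape[0]
    L = np.eye(p)
    d = np.zeros(p)
    tau = np.zeros(p)
    for j in range(p):
        c_jj = A[j, j] - np.sum(d[:j] * L[j, :j] ** 2)
        c_col = A[j + 1:, j] - L[j + 1:, :j] @ (d[:j] * L[j, :j])
        theta = np.max(np.abs(c_col)) if j < p - 1 else 0.0
        d[j] = max(abs(c_jj), (theta / beta) ** 2, delta)   # boosted diagonal entry
        tau[j] = d[j] - c_jj
        L[j + 1:, j] = c_col / d[j]
    return L, d, tau
```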