Incremental Quasi-Newton Methods with Local Superlinear Convergence Rate
Aryan Mokhtari, Mark Eisen, and Alejandro Ribeiro
Department of Electrical and Systems Engineering, University of Pennsylvania
Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 8th, 2017
Large-scale empirical risk minimization
We aim to solve the expected risk minimization problem $\min_{w \in \mathbb{R}^p} \mathbb{E}_\theta[f(w, \theta)]$
The distribution is unknown. We have access to $N$ independent realizations of $\theta$
We settle for solving the Empirical Risk Minimization (ERM) problem
$\min_w F(w) := \min_{w \in \mathbb{R}^p} \frac{1}{N} \sum_{i=1}^N f(w, \theta_i)$
Large-scale optimization or machine learning: large $N$, large $p$
$N$: number of observations (inputs). $p$: number of parameters in the model
Many (most) machine learning algorithms reduce to ERM problems
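As one concrete instance (an illustration, not from the slides), $\theta_i$ can be a feature-label pair $(x_i, y_i)$ and $f$ a squared loss; a minimal NumPy sketch of the resulting empirical risk $F$, with the data and the sizes $N$ and $p$ chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 1000, 100                                   # N observations, p model parameters
    X = rng.standard_normal((N, p))                    # features x_i
    y = rng.standard_normal(N)                         # labels y_i, so theta_i = (x_i, y_i)

    def F(w):
        """Empirical risk (1/N) sum_i f(w, theta_i) with f(w, (x, y)) = (x^T w - y)^2 / 2."""
        return 0.5 * np.mean((X @ w - y) ** 2)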
Incremental Gradient Descent
Objective function gradients $s(w) := \nabla F(w) = \frac{1}{N}\sum_{i=1}^N \nabla f(w, \theta_i)$
(Deterministic) gradient descent iteration $w_{t+1} = w_t - \epsilon_t s(w_t)$
Evaluation of (deterministic) gradients is not computationally affordable
Incremental/Stochastic gradient: sample average in lieu of expectations
$\hat{s}(w, \theta) = \frac{1}{L}\sum_{l=1}^L \nabla f(w, \theta_l)$, with $\theta = [\theta_1; \ldots; \theta_L]$
Functions are chosen cyclically or at random, with or without replacement
Incremental gradient descent iteration $w_{t+1} = w_t - \epsilon_t \hat{s}(w_t, \theta_t)$
(Incremental) gradient descent is (very) slow. Newton is impractical
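A minimal sketch of the incremental gradient iteration above for the least-squares example, using single-sample gradients ($L = 1$) and cyclic index selection; the constant step size is an illustrative assumption rather than a tuned choice:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 1000, 100
    X, y = rng.standard_normal((N, p)), rng.standard_normal(N)

    def grad_i(w, i):
        """Gradient of the i-th sample loss f(w, theta_i) = (x_i^T w - y_i)^2 / 2."""
        return (X[i] @ w - y[i]) * X[i]

    w, eps = np.zeros(p), 1e-3                         # illustrative constant step size
    for t in range(10 * N):                            # 10 effective passes over the data
        i_t = t % N                                    # cyclic selection (random is similar)
        w = w - eps * grad_i(w, i_t)                   # incremental gradient step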
BFGS quasi-Newton method
Approximate the function's curvature with a Hessian approximation matrix $B_t$
$w_{t+1} = w_t - \epsilon_t B_t^{-1} s(w_t)$
Make $B_t$ close to $H(w_t) := \nabla^2 F(w_t)$. Broyden, DFP, BFGS
Variable variation: $v_t = w_{t+1} - w_t$. Gradient variation: $r_t = s(w_{t+1}) - s(w_t)$
Matrix $B_{t+1}$ satisfies the secant condition $B_{t+1} v_t = r_t$. Underdetermined
Resolve indeterminacy by making $B_{t+1}$ closest to the previous approximation $B_t$
Using Gaussian relative entropy as proximity condition yields the update
$B_{t+1} = B_t + \frac{r_t r_t^T}{v_t^T r_t} - \frac{B_t v_t v_t^T B_t}{v_t^T B_t v_t}$
Superlinear convergence ⇒ close enough to the quadratic rate of Newton
BFGS requires gradients ⇒ use stochastic/incremental gradients
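A minimal sketch of the deterministic BFGS iteration and matrix update above, on a toy strongly convex quadratic; the unit step size, the curvature-dominating initialization of $B$, and the absence of a line search are simplifying assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 50
    A = np.diag(rng.uniform(1.0, 10.0, p))              # toy objective F(w) = w^T A w / 2 + b^T w
    b = rng.standard_normal(p)
    grad = lambda w: A @ w + b                           # s(w) = grad F(w)

    w, B = np.zeros(p), 10.0 * np.eye(p)                 # B initialized to dominate the curvature
    for t in range(50):
        g = grad(w)
        if np.linalg.norm(g) < 1e-10:                    # stop once (numerically) converged
            break
        w_next = w - np.linalg.solve(B, g)               # w_{t+1} = w_t - B_t^{-1} s(w_t), eps_t = 1
        v, r = w_next - w, grad(w_next) - g              # variable and gradient variations
        B = (B + np.outer(r, r) / (v @ r)                # BFGS update from the slide
             - (B @ np.outer(v, v) @ B) / (v @ B @ v))
        w = w_next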
Stochastic/Incremental quasi-Newton methods
Online (o)BFGS & Online Limited-Memory (oL)BFGS [Schraudolph et al 07]
oBFGS may diverge because the Hessian approximation gets close to singular
Regularized Stochastic BFGS (RES) [Mokhtari-Ribeiro 14]
oLBFGS does (surprisingly) converge [Mokhtari-Ribeiro 15]
Problem solved? Alas. RES and oLBFGS have sublinear convergence
Same as stochastic gradient descent. Asymptotically not better
Variance-reduced stochastic L-BFGS (SVRG+oLBFGS) [Moritz et al 16]
Linear convergence rate. But this is not better than SAG, SAGA, SVRG
Goal: a computationally feasible quasi-Newton method with superlinear convergence
Incremental aggregated gradient method
Utilize memory to reduce the variance of the stochastic gradient approximation
[Figure: gradient memory table $\nabla f_1^t, \ldots, \nabla f_N^t$; entry $i_t$ is refreshed with $\nabla f_{i_t}(w^{t+1})$ to form $\nabla f_1^{t+1}, \ldots, \nabla f_N^{t+1}$]
Descend along the incremental aggregated gradient
$w^{t+1} = w^t - \frac{\alpha}{N}\sum_{i=1}^N \nabla f_i^t = w^t - \frac{\alpha}{N} g^t$
Select update index $i_t$ cyclically. Uniformly at random is similar
Update the gradient corresponding to function $f_{i_t}$: $\nabla f_{i_t}^{t+1} = \nabla f_{i_t}(w^{t+1})$
Sum easy to compute: $g^{t+1} = g^t - \nabla f_{i_t}^t + \nabla f_{i_t}^{t+1}$. Converges linearly
Stochastic (SAG) [Schmidt et al 13], Cyclic (IAG) [Gurbuzbalaban et al 15]
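A minimal sketch of the aggregated-gradient recursion above for the least-squares example, with random index selection (SAG-style; cycling through the indices gives IAG); the conservative step size $1/(16 L_{\max})$, with $L_{\max}$ the largest per-sample smoothness constant, is an assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 1000, 100
    X, y = rng.standard_normal((N, p)), rng.standard_normal(N)
    grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]       # gradient of the i-th sample loss

    w = np.zeros(p)
    G = np.array([grad_i(w, i) for i in range(N)])       # gradient memory: row i holds grad f_i^t
    g = G.sum(axis=0)                                    # running sum g^t of the stored gradients
    alpha = 1.0 / (16 * np.max(np.sum(X ** 2, axis=1)))  # conservative step size (assumption)
    for t in range(10 * N):
        w = w - (alpha / N) * g                          # descend along the aggregated gradient
        i_t = int(rng.integers(N))                       # random index (SAG); cycling gives IAG
        new_grad = grad_i(w, i_t)
        g = g - G[i_t] + new_grad                        # cheap update of the running sum
        G[i_t] = new_grad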
Incremental BFGS method
Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
Functions indexed by $i$. Time indexed by $t$. Select function $f_{i_t}$ at time $t$
[Figure: memory tables $z_1^t, \ldots, z_N^t$; $B_1^t, \ldots, B_N^t$; $\nabla f_1^t, \ldots, \nabla f_N^t$ feeding the computation of $w^{t+1}$]
All gradients, matrices, and variables used to update $w^{t+1}$
Incremental BFGS method
Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
Functions indexed by $i$. Time indexed by $t$. Select function $f_{i_t}$ at time $t$
[Figure: same memory tables; the new iterate $w^{t+1}$ produces the refreshed gradient $\nabla f_{i_t}(w^{t+1})$]
Updated variable $w^{t+1}$ used to update the gradient $\nabla f_{i_t}^{t+1} = \nabla f_{i_t}(w^{t+1})$
Incremental BFGS method
Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
Functions indexed by $i$. Time indexed by $t$. Select function $f_{i_t}$ at time $t$
[Figure: same memory tables; $w^{t+1}$ and the refreshed gradient produce the updated matrix $B_{i_t}^{t+1}$]
Update $B_{i_t}^t$ to satisfy the secant condition for function $f_{i_t}$, for the variable variation $z_{i_t}^t \to w^{t+1}$ and gradient variation $\nabla f_{i_t}^{t+1} - \nabla f_{i_t}^t$ (more later)
Incremental BFGS method
Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
Functions indexed by $i$. Time indexed by $t$. Select function $f_{i_t}$ at time $t$
[Figure: memory tables updated from $(z_i^t, B_i^t, \nabla f_i^t)$ to $(z_i^{t+1}, B_i^{t+1}, \nabla f_i^{t+1})$; only the entries for $i = i_t$ change]
Update variable, Hessian approximation, and gradient memory for function $f_{i_t}$
Update of Hessian approximation matrices
Variable variation at time $t$ for function $f_i = f_{i_t}$: $v_i^t := z_i^{t+1} - z_i^t$
Gradient variation at time $t$ for function $f_i = f_{i_t}$: $r_i^t := \nabla f_i^{t+1} - \nabla f_i^t$
Update $B_i^t = B_{i_t}^t$ to satisfy the secant condition for the variations $v_i^t$ and $r_i^t$
$B_i^{t+1} = B_i^t + \frac{r_i^t (r_i^t)^T}{(v_i^t)^T r_i^t} - \frac{B_i^t v_i^t (v_i^t)^T B_i^t}{(v_i^t)^T B_i^t v_i^t}$
We want $B_i^t$ to approximate the Hessian of the function $f_i = f_{i_t}$
A naive (in hindsight) incremental BFGS method
The key is in the update of $w^t$. Use memory in stochastic quantities
$w^{t+1} = w^t - \Big(\frac{1}{N}\sum_{i=1}^N B_i^t\Big)^{-1} \Big(\frac{1}{N}\sum_{i=1}^N \nabla f_i^t\Big)$
It doesn't work. Better than incremental gradient but not superlinear
Optimization updates are solutions of function approximations
In this particular update we are minimizing the quadratic form
$f(w) \approx \frac{1}{N}\sum_{i=1}^N \Big[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2}(w - w^t)^T B_i^t (w - w^t) \Big]$
Gradients evaluated at $z_i^t$. Secant condition verified at $z_i^t$
The quadratic form is centered at $w^t$. Not a reasonable Taylor series
A proper Taylor series expansion
Each individual function $f_i$ is being approximated by the quadratic
$f_i(w) \approx f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2}(w - w^t)^T B_i^t (w - w^t)$
To have a proper expansion we have to recenter the quadratic form at $z_i^t$
$f_i(w) \approx f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2}(w - z_i^t)^T B_i^t (w - z_i^t)$
I.e., we approximate $f(w)$ with the aggregate quadratic function
$f(w) \approx \frac{1}{N}\sum_{i=1}^N \Big[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2}(w - z_i^t)^T B_i^t (w - z_i^t) \Big]$
This is now a reasonable Taylor series that we use to derive an update
Incremental BFGS
Solving this quadratic program (setting the gradient of the aggregate quadratic to zero) yields the update for the IQN method
$w^{t+1} = \Big(\frac{1}{N}\sum_{i=1}^N B_i^t\Big)^{-1} \Big[\frac{1}{N}\sum_{i=1}^N B_i^t z_i^t - \frac{1}{N}\sum_{i=1}^N \nabla f_i(z_i^t)\Big]$
Looks difficult to implement but is more similar to BFGS than apparent
As in BFGS, it can be implemented with $O(p^2)$ operations
Write as a rank-2 update, use the matrix inversion lemma
Cost independent of $N$. True incremental method.
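A minimal sketch of the IQN recursion above on a toy finite sum of strongly convex quadratics. It maintains the three memories and the three aggregate sums, but solves the linear system directly ($O(p^3)$ per step) instead of the $O(p^2)$ rank-2 / matrix-inversion-lemma implementation mentioned on the slide; the problem data, sizes, initialization, and the $v^T r$ safeguard are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 50, 20                                        # small illustrative problem sizes
    A = [np.diag(rng.uniform(1.0, 10.0, p)) for _ in range(N)]   # f_i(w) = w^T A_i w / 2 + b_i^T w
    b = [rng.uniform(0.0, 1.0, p) for _ in range(N)]
    grad_i = lambda w, i: A[i] @ w + b[i]                # gradient of f_i

    w = rng.standard_normal(p)
    z = [w.copy() for _ in range(N)]                     # variable memory z_i^t
    B = [10.0 * np.eye(p) for _ in range(N)]             # matrix memory B_i^t (dominates each A_i)
    g = [grad_i(w, i) for i in range(N)]                 # gradient memory grad f_i(z_i^t)
    sum_B = sum(B)                                       # aggregates appearing in the update
    sum_Bz = sum(Bi @ zi for Bi, zi in zip(B, z))
    sum_g = sum(g)

    for t in range(20 * N):
        w = np.linalg.solve(sum_B, sum_Bz - sum_g)       # IQN step (the 1/N factors cancel)
        i = t % N                                        # cyclic index selection
        g_new = grad_i(w, i)
        v, r = w - z[i], g_new - g[i]                    # variable and gradient variations
        if v @ r > 1e-12:                                # BFGS update of B_i (skip a degenerate pair)
            B_new = (B[i] + np.outer(r, r) / (v @ r)
                     - (B[i] @ np.outer(v, v) @ B[i]) / (v @ B[i] @ v))
        else:
            B_new = B[i]
        sum_B = sum_B + (B_new - B[i])                   # refresh the aggregates in O(p^2)
        sum_Bz = sum_Bz + (B_new @ w - B[i] @ z[i])
        sum_g = sum_g + (g_new - g[i])
        B[i], z[i], g[i] = B_new, w, g_new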
Superlinear convergence rate
Assumptions: the functions $f_i$ are $m$-strongly convex, the gradients $\nabla f_i$ are $M$-Lipschitz continuous, and the Hessians $\nabla^2 f_i$ are $L$-Lipschitz continuous
Theorem: The sequence of residuals $\|w^t - w^\star\|$ in the IQN method converges to zero at a superlinear rate,
$\lim_{t\to\infty} \frac{\|w^t - w^\star\|}{(1/N)\big(\|w^{t-1} - w^\star\| + \cdots + \|w^{t-N} - w^\star\|\big)} = 0$
Incremental method with small cost per iteration converging at a superlinear rate
Resulting from the use of memory to reduce stochastic variances
Numerical results
Quadratic programming: $f(w) := \frac{1}{N}\sum_{i=1}^N \big(w^T A_i w / 2 + b_i^T w\big)$
$A_i \in \mathbb{R}^{p\times p}$ is a diagonal positive definite matrix
$b_i \in \mathbb{R}^p$ is a random vector from the box $[0, 10^3]^p$
$N = 1000$, $p = 100$, and condition number $\in \{10^2, 10^4\}$
Relative error $\|w^t - w^\star\| / \|w^0 - w^\star\|$ of SAG, SAGA, IAG, and IQN
[Figure: normalized error vs. number of effective passes for SAG, SAGA, IAG, and IQN; (a) small condition number, (b) large condition number]
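A sketch of how a test problem of this form and the plotted quantity could be set up (not the authors' exact experiment code); drawing the diagonal entries of $A_i$ log-uniformly in $[1, \text{cond}]$ to impose the condition number, and starting from $w^0 = 0$, are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p, cond = 1000, 100, 1e2                          # use cond = 1e4 for the ill-conditioned case
    A = [np.diag(10 ** rng.uniform(0.0, np.log10(cond), p)) for _ in range(N)]   # diagonal PD A_i
    b = [rng.uniform(0.0, 1e3, p) for _ in range(N)]     # b_i drawn from the box [0, 10^3]^p

    # Exact minimizer of f(w) = (1/N) sum_i (w^T A_i w / 2 + b_i^T w)
    w_star = -np.linalg.solve(sum(A) / N, sum(b) / N)

    w0 = np.zeros(p)                                     # assumed common starting point
    normalized_error = lambda w: np.linalg.norm(w - w_star) / np.linalg.norm(w0 - w_star)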
Conclusions
We considered large-scale ERM problems ⇒ finite-sum minimization
An incremental quasi-Newton BFGS method (IQN) was presented
IQN only computes the information of a single function at each step ⇒ low computation cost
IQN aggregates variable, gradient, and BFGS approximation information ⇒ reduces the noise
Superlinear convergence
References
N. N. Schraudolph, J. Yu, and S. Gunter, "A Stochastic Quasi-Newton Method for Online Convex Optimization," in AISTATS, vol. 7, pp. 436-443, 2007.
A. Mokhtari and A. Ribeiro, "RES: Regularized Stochastic BFGS Algorithm," IEEE Trans. on Signal Processing, vol. 62, no. 23, pp. 6089-6104, December 2014.
A. Mokhtari and A. Ribeiro, "Global Convergence of Online Limited Memory BFGS," Journal of Machine Learning Research, vol. 16, pp. 3151-3181, 2015.
P. Moritz, R. Nishihara, and M. I. Jordan, "A Linearly-Convergent Stochastic L-BFGS Algorithm," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249-258, 2016.
M. Schmidt, N. Le Roux, and F. Bach, "Minimizing Finite Sums with the Stochastic Average Gradient," Mathematical Programming (2013): 1-30.
M. Gurbuzbalaban, A. Ozdaglar, and P. Parrilo, "On the Convergence Rate of Incremental Aggregated Gradient Algorithms," arXiv preprint arXiv:1506.02081.
A. Mokhtari, M. Eisen, and A. Ribeiro, "IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate," arXiv preprint arXiv:1702.00709.