Incremental Quasi-Newton methods with local superlinear convergence rate

Incremental Quasi-Newton Methods with Local Superlinear Convergence Rate
Aryan Mokhtari, Mark Eisen, and Alejandro Ribeiro
Department of Electrical and Systems Engineering, University of Pennsylvania
Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 8th, 2017

Large-scale empirical risk minimization
We aim to solve the expected risk minimization problem $\min_{w \in \mathbb{R}^p} \mathbb{E}_\theta[f(w, \theta)]$
The distribution is unknown, but we have access to $N$ independent realizations of $\theta$
We settle for solving the Empirical Risk Minimization (ERM) problem
$\min_{w \in \mathbb{R}^p} F(w) := \min_{w \in \mathbb{R}^p} \frac{1}{N} \sum_{i=1}^{N} f(w, \theta_i)$
Large-scale optimization or machine learning: large $N$, large $p$
$N$: number of observations (inputs); $p$: number of parameters in the model
Many (most) machine learning algorithms reduce to ERM problems
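
As a concrete (assumed) instance of the ERM problem, a minimal sketch with a least-squares loss; the synthetic data and the loss choice are illustrations, not part of the slides:

```python
import numpy as np

# synthetic data: N observations theta_i = (a_i, b_i), p model parameters
rng = np.random.default_rng(0)
N, p = 1000, 100
A = rng.standard_normal((N, p))    # feature vectors a_i
b = rng.standard_normal(N)         # targets b_i

def f_i(w, i):
    """Individual loss f(w, theta_i) = 0.5 * (a_i^T w - b_i)^2 (assumed)."""
    return 0.5 * (A[i] @ w - b[i]) ** 2

def F(w):
    """Empirical risk F(w) = (1/N) * sum_i f(w, theta_i)."""
    return np.mean([f_i(w, i) for i in range(N)])
```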

Incremental Gradient Descent
Objective function gradient: $s(w) := \nabla F(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla f(w, \theta_i)$
(Deterministic) gradient descent iteration: $w_{t+1} = w_t - \epsilon_t s(w_t)$
Evaluation of (deterministic) gradients is not computationally affordable
Incremental/Stochastic gradient: sample average in lieu of expectation, $\hat{s}(w, \theta) = \frac{1}{L} \sum_{l=1}^{L} \nabla f(w, \theta_l)$ with $\theta = [\theta_1; \ldots; \theta_L]$
Functions are chosen cyclically or at random, with or without replacement
Incremental gradient descent iteration: $w_{t+1} = w_t - \epsilon_t \hat{s}(w_t, \theta_t)$
(Incremental) gradient descent is (very) slow. Newton is impractical
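
A sketch of the incremental iteration above on the least-squares example from the previous snippet (single-function samples, cyclic order, constant step size; all of these choices are illustrative assumptions):

```python
def grad_f_i(w, i):
    """Gradient of the individual least-squares loss f(w, theta_i)."""
    return (A[i] @ w - b[i]) * A[i]

def incremental_gradient_descent(w0, step=1e-3, n_passes=10):
    """w_{t+1} = w_t - eps_t * s_hat(w_t, theta_t), one function per iteration."""
    w = w0.copy()
    for _ in range(n_passes):
        for i in range(N):                  # functions chosen cyclically
            w = w - step * grad_f_i(w, i)
    return w
```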

BFGS quasi-newton method Approximate function s curvature wh Hessian approximation matrix B 1 t w t+1 = w t ɛ t B 1 t s(w t) Make B t close to H(w t) := 2 F (w t). Broyden, DFP, BFGS Variable variation: v t = w t+1 w t. Gradient variation: r t = s(w t+1) s(w t) Matrix B t+1 satisfies secant condion B t+1v t = r t. Underdetermined Resolve indeterminacy making B t+1 closest to previous approximation B t Using Gaussian relative entropy as proximy condion yields update B t+1 = B t + rtrt t v T Btvtvt T B t t r t v T t B tv t Superlinear convergence Close enough to quadratic rate of Newton BFGS requires gradients Use stochastic/incremental gradients Aryan Mokhtari Incremental Quasi-Newton Methods wh Local Superlinear Convergence Rate 4

Stochastic/Incremental quasi-newton methods Online (o)bfgs & Online Limed-Memory (ol)bfgs [Schraudolph et al 07] obfgs may diverge because Hessian approximation gets close to singular Regularized Stochastic BFGS (RES) [Mokhtari-Ribeiro 14] olbfgs does (surprisingly) converge [Mokhtari-Ribeiro 15] Problem solved? Alas. RES and olbfgs have sublinear convergence Same as stochastic gradient descent. Asymptotically not better Variance reduced stochastic L-BFGS (SVRG+oLBFGS) [Mortiz et al 16] Linear convergence rate. But this is not better than SAG, SAGA, SVRG Computationally feasible quasi Newton method wh superlinear convergence Aryan Mokhtari Incremental Quasi-Newton Methods wh Local Superlinear Convergence Rate 5

Incremental aggregated gradient method
Utilize memory to reduce the variance of the stochastic gradient approximation: keep a table of stored gradients $\nabla f_1^t, \ldots, \nabla f_N^t$
Descend along the aggregated incremental gradient: $w^{t+1} = w^t - \frac{\alpha}{N} \sum_{i=1}^{N} \nabla f_i^t = w^t - \frac{\alpha}{N} g^t$
Select the update index $i_t$ cyclically. Uniformly at random is similar
Update the gradient corresponding to function $f_{i_t}$: $\nabla f_{i_t}^{t+1} = \nabla f_{i_t}(w^{t+1})$
The sum is easy to compute: $g^{t+1} = g^t - \nabla f_{i_t}^t + \nabla f_{i_t}^{t+1}$. Converges linearly
Stochastic (SAG) [Schmidt et al 13], Cyclic (IAG) [Gurbuzbalaban et al 15]
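
A sketch of this aggregated-gradient scheme on the least-squares example from above (cyclic index selection; the step size is an illustrative assumption):

```python
def iag(w0, step=1e-3, n_passes=10):
    """Incremental aggregated gradient: store the last gradient seen for each
    f_i and descend along their running sum g."""
    w = w0.copy()
    grad_table = np.array([grad_f_i(w, i) for i in range(N)])   # stored gradients
    g = grad_table.sum(axis=0)                                   # g^t = sum_i grad f_i^t
    for _ in range(n_passes):
        for i in range(N):                                       # cyclic index i_t
            w = w - (step / N) * g                               # w^{t+1} = w^t - (alpha/N) g^t
            new_grad = grad_f_i(w, i)                            # grad f_{i_t}(w^{t+1})
            g = g - grad_table[i] + new_grad                     # cheap update of the sum
            grad_table[i] = new_grad
    return w
```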

Incremental BFGS method
Keep memory of variables $z_i^t$, Hessian approximations $B_i^t$, and gradients $\nabla f_i^t$
Functions indexed by $i$. Time indexed by $t$. Select function $f_{i_t}$ at time $t$
[Diagram, built up over four slides: memory tables $z_1^t, \ldots, z_N^t$, $B_1^t, \ldots, B_N^t$, and $\nabla f_1^t, \ldots, \nabla f_N^t$ feeding the computation of $w^{t+1}$]
All gradients, matrices, and variables are used to update $w^{t+1}$
The updated variable $w^{t+1}$ is then used to update the gradient: $\nabla f_{i_t}^{t+1} = \nabla f_{i_t}(w^{t+1})$
Update $B_{i_t}^t$ to satisfy the secant condition for function $f_{i_t}$, with variable variation $z_{i_t}^t \to w^{t+1}$ and gradient variation $\nabla f_{i_t}^{t+1} - \nabla f_{i_t}^t$ (more later)
Update the variable, Hessian approximation, and gradient memory for function $f_{i_t}$

Update of Hessian approximation matrices
Variable variation at time $t$ for function $f_i = f_{i_t}$: $v_i^t := z_i^{t+1} - z_i^t$
Gradient variation at time $t$ for function $f_i = f_{i_t}$: $r_i^t := \nabla f_i^{t+1} - \nabla f_i^t$
Update $B_i^t = B_{i_t}^t$ to satisfy the secant condition for the variations $v_i^t$ and $r_i^t$:
$B_i^{t+1} = B_i^t + \frac{r_i^t (r_i^t)^T}{(r_i^t)^T v_i^t} - \frac{B_i^t v_i^t (v_i^t)^T B_i^t}{(v_i^t)^T B_i^t v_i^t}$
We want $B_i^t$ to approximate the Hessian of the function $f_i = f_{i_t}$
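
A sketch of the per-function memory refresh this slide describes, reusing `grad_f_i` and `bfgs_update` from the earlier sketches; the array-of-memories layout is an assumed implementation detail:

```python
def update_memory(i, w_new, z, grads, B):
    """Refresh z_i, grad f_i, and B_i after the new iterate w_new is computed.

    z     : (N, p) array of per-function iterates  z_i^t
    grads : (N, p) array of stored gradients       grad f_i(z_i^t)
    B     : (N, p, p) array of Hessian approximations B_i^t
    """
    new_grad = grad_f_i(w_new, i)
    v = w_new - z[i]                   # variable variation v_i^t
    r = new_grad - grads[i]            # gradient variation r_i^t
    B[i] = bfgs_update(B[i], v, r)     # secant update of B_i^t
    z[i] = w_new
    grads[i] = new_grad
```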

A naive (in hindsight) incremental BFGS method
The key is in the update of $w_t$. Use memory in the stochastic quantities:
$w_{t+1} = w_t - \left( \frac{1}{N} \sum_{i=1}^{N} B_i^t \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} \nabla f_i^t \right)$
It doesn't work. Better than incremental gradient, but not superlinear
Optimization updates are solutions of function approximations
In this particular update we are minimizing the quadratic form
$F(w) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2} (w - w_t)^T B_i^t (w - w_t) \right]$
Gradients evaluated at $z_i^t$. Secant condition verified at $z_i^t$
The quadratic form is centered at $w_t$. Not a reasonable Taylor series
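
For completeness, a short derivation sketch (not spelled out on the slide): setting the gradient of this $w_t$-centered quadratic to zero recovers the naive update, which makes the mismatch explicit, since the curvature terms are anchored at $w_t$ while the gradients are anchored at the stale iterates $z_i^t$:

$$\frac{1}{N}\sum_{i=1}^{N}\left[\nabla f_i(z_i^t) + B_i^t (w - w_t)\right] = 0 \;\;\Longrightarrow\;\; w_{t+1} = w_t - \left(\frac{1}{N}\sum_{i=1}^{N} B_i^t\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(z_i^t)\right)$$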

A proper Taylor series expansion
Each individual function $f_i$ is being approximated by the quadratic
$f_i(w) \approx f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2} (w - w_t)^T B_i^t (w - w_t)$
To have a proper expansion we have to recenter the quadratic form at $z_i^t$:
$f_i(w) \approx f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2} (w - z_i^t)^T B_i^t (w - z_i^t)$
I.e., we approximate $F(w)$ with the aggregate quadratic function
$F(w) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + \frac{1}{2} (w - z_i^t)^T B_i^t (w - z_i^t) \right]$
This is now a reasonable Taylor series that we use to derive an update
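
Setting the gradient of the recentered aggregate quadratic to zero (a one-line derivation sketch, not on the slide) gives exactly the update stated on the next slide:

$$\frac{1}{N}\sum_{i=1}^{N}\left[\nabla f_i(z_i^t) + B_i^t (w - z_i^t)\right] = 0 \;\;\Longrightarrow\;\; w^{t+1} = \left(\frac{1}{N}\sum_{i=1}^{N} B_i^t\right)^{-1}\left[\frac{1}{N}\sum_{i=1}^{N} B_i^t z_i^t - \frac{1}{N}\sum_{i=1}^{N}\nabla f_i(z_i^t)\right]$$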

Incremental BFGS
Solving this quadratic program yields the update for the IQN method:
$w^{t+1} = \left( \frac{1}{N} \sum_{i=1}^{N} B_i^t \right)^{-1} \left[ \frac{1}{N} \sum_{i=1}^{N} B_i^t z_i^t - \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(z_i^t) \right]$
Looks difficult to implement, but it is more similar to BFGS than is apparent
As in BFGS, it can be implemented with $O(p^2)$ operations:
Write it as a rank-2 update and use the matrix inversion lemma
Independent of $N$. A true incremental method.
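
A minimal sketch of the full IQN iteration on the least-squares example, reusing `grad_f_i` and `bfgs_update` from above. For clarity it maintains the three aggregate sums explicitly and solves a p-by-p system each step (O(p^3)), rather than the O(p^2) rank-2 / matrix-inversion-lemma implementation the slide refers to; the initialization and the cyclic schedule are assumptions:

```python
def iqn(w0, n_passes=10):
    """Incremental quasi-Newton (IQN) sketch with cyclic index selection."""
    w = w0.copy()
    z = np.tile(w, (N, 1))                                  # per-function iterates z_i^t
    grads = np.array([grad_f_i(w, i) for i in range(N)])    # stored gradients
    B = np.array([np.eye(p) for _ in range(N)])             # Hessian approximations B_i^t
    B_sum = B.sum(axis=0)                                   # sum_i B_i^t
    Bz_sum = np.einsum('ijk,ik->j', B, z)                   # sum_i B_i^t z_i^t
    g_sum = grads.sum(axis=0)                               # sum_i grad f_i(z_i^t)
    for _ in range(n_passes):
        for i in range(N):                                  # cyclic i_t
            # w^{t+1} = (sum B_i)^{-1} [sum B_i z_i - sum grad f_i]; the 1/N factors cancel
            w = np.linalg.solve(B_sum, Bz_sum - g_sum)
            new_grad = grad_f_i(w, i)
            v, r = w - z[i], new_grad - grads[i]
            B_new = bfgs_update(B[i], v, r)
            # refresh the aggregates with the new memory of function i only
            B_sum += B_new - B[i]
            Bz_sum += B_new @ w - B[i] @ z[i]
            g_sum += new_grad - grads[i]
            z[i], grads[i], B[i] = w, new_grad, B_new
    return w
```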

Superlinear convergence rate
Assumptions: the functions $f_i$ are $m$-strongly convex, the gradients $\nabla f_i$ are $M$-Lipschitz continuous, and the Hessians $\nabla^2 f_i$ are $L$-Lipschitz continuous
Theorem. The sequence of residuals $\|w^t - w^*\|$ in the IQN method converges to zero at a superlinear rate,
$\lim_{t \to \infty} \frac{\|w^t - w^*\|}{(1/N)\left( \|w^{t-1} - w^*\| + \cdots + \|w^{t-N} - w^*\| \right)} = 0$
An incremental method with small cost per iteration converging at a superlinear rate
Resulting from the use of memory to reduce stochastic variance
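
As an illustrative aside (not from the slides), the ratio appearing in the theorem can be monitored numerically along a run; here `residuals` is a hypothetical list of recorded errors $\|w^t - w^*\|$:

```python
def superlinear_ratio(residuals, N):
    """||w^t - w*|| divided by the average of the previous N residuals."""
    t = len(residuals) - 1
    return residuals[t] / np.mean(residuals[t - N:t])
```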

Numerical results
Quadratic programming: $F(w) := \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2} w^T A_i w + b_i^T w \right)$
$A_i \in \mathbb{R}^{p \times p}$ is a diagonal positive definite matrix; $b_i \in \mathbb{R}^p$ is a random vector from the box $[0, 10^3]^p$
$N = 1000$, $p = 100$, and condition number $10^2$ (small) or $10^4$ (large)
Relative error $\|w^t - w^*\| / \|w^0 - w^*\|$ of SAG, SAGA, IAG, and IQN
[Figure: normalized error vs. number of effective passes for SAG, SAGA, IAG, and IQN; (a) small condition number, (b) large condition number]
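
A sketch of how the quadratic test problem above might be generated; the distribution of the diagonal entries used to control the condition number is an assumption, as the slide does not specify it:

```python
def make_quadratic_problem(N=1000, p=100, cond=1e2, seed=0):
    """Quadratics f_i(w) = 0.5 * w^T A_i w + b_i^T w, with A_i diagonal positive
    definite and b_i drawn uniformly from the box [0, 1e3]^p."""
    rng = np.random.default_rng(seed)
    # assumed: diagonal entries drawn log-uniformly in [1, cond]
    A_diag = np.exp(rng.uniform(0.0, np.log(cond), size=(N, p)))
    b = rng.uniform(0.0, 1e3, size=(N, p))
    # minimizer of F(w) = (1/N) sum_i f_i(w): mean(A) w* + mean(b) = 0
    w_star = -b.mean(axis=0) / A_diag.mean(axis=0)
    return A_diag, b, w_star
```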

Conclusions
We considered large-scale ERM problems: finite-sum minimization
An incremental quasi-Newton BFGS method (IQN) was presented
IQN only computes the information of a single function at each step: low computation cost
IQN aggregates variable, gradient, and BFGS approximation memories to reduce the noise
Superlinear convergence

References
N. N. Schraudolph, J. Yu, and S. Gunter, A Stochastic Quasi-Newton Method for Online Convex Optimization, in AISTATS, vol. 7, pp. 436-443, 2007.
A. Mokhtari and A. Ribeiro, RES: Regularized Stochastic BFGS Algorithm, IEEE Trans. on Signal Processing, vol. 62, no. 23, pp. 6089-6104, December 2014.
A. Mokhtari and A. Ribeiro, Global Convergence of Online Limited Memory BFGS, Journal of Machine Learning Research, vol. 16, pp. 3151-3181, 2015.
P. Moritz, R. Nishihara, and M. I. Jordan, A Linearly-Convergent Stochastic L-BFGS Algorithm, in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249-258, 2016.
M. Schmidt, N. Le Roux, and F. Bach, Minimizing Finite Sums with the Stochastic Average Gradient, Mathematical Programming (2013): 1-30.
M. Gurbuzbalaban, A. Ozdaglar, and P. Parrilo, On the Convergence Rate of Incremental Aggregated Gradient Algorithms, arXiv preprint arXiv:1506.02081.
A. Mokhtari, M. Eisen, and A. Ribeiro, IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate, arXiv preprint arXiv:1702.00709.