High Order Methods for Empirical Risk Minimization

High Order Methods for Empirical Risk Minimization
Alejandro Ribeiro
Department of Electrical and Systems Engineering, University of Pennsylvania
aribeiro@seas.upenn.edu
IPAM Workshop of Emerging Wireless Systems, February 7th, 2017

Introduction
- Introduction
- Incremental quasi-Newton algorithms
- Adaptive sample size algorithms
- Conclusions

Optimal resource allocation in wireless systems
- Wireless channels are characterized by random fading coefficients h
- Want to assign power p(h) as a function of the fading to:
  - Satisfy prescribed constraints, e.g., average power, target SINRs
  - Optimize given criteria, e.g., maximize capacity, minimize power
- Two challenges:
  - The resulting optimization problems are infinite dimensional (h is)
  - In most cases the problems are not convex
- However, the duality gap is null under mild conditions (Ribeiro-Giannakis '10)
- And in the dual domain the problem is finite dimensional and convex
- Motivates the use of stochastic optimization algorithms that
  - Have manageable computational complexity per iteration
  - Use sample channel realizations instead of channel distributions

Large-scale empirical risk minimization
- We aim to solve the expected risk minimization problem

    min_{w ∈ R^p} E_θ [ f(w, θ) ]

- The distribution is unknown. We have access to N independent realizations of θ
- We settle for solving the Empirical Risk Minimization (ERM) problem

    min_{w ∈ R^p} F(w) := min_{w ∈ R^p} (1/N) Σ_{i=1}^N f(w, θ_i)

- Large-scale optimization or machine learning: large N, large p
  - N: number of observations (inputs)
  - p: number of parameters in the model
- Not just wireless. Many (most) machine learning algorithms reduce to ERM problems
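As an aside (not part of the slides), the ERM objective above is straightforward to write down in code; the least-squares loss and the random data below are placeholder assumptions used only for illustration.

import numpy as np

def empirical_risk(w, loss, samples):
    # F(w) = (1/N) * sum_i f(w, theta_i) for a generic per-sample loss f
    return np.mean([loss(w, theta) for theta in samples])

# Placeholder example: least-squares loss f(w, (x, y)) = 0.5 * (x'w - y)^2
rng = np.random.default_rng(0)
samples = [(rng.standard_normal(5), rng.standard_normal()) for _ in range(100)]
loss = lambda w, s: 0.5 * (s[0] @ w - s[1]) ** 2
print(empirical_risk(np.zeros(5), loss, samples))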

Optimization methods
- Stochastic methods: a subset of samples is used at each iteration
  - SGD is the most popular; however, it is slow because of
    - Noise of stochasticity ⇒ variance reduction (SAG, SAGA, SVRG, ...)
    - Poor curvature approximation ⇒ stochastic QN (SGD-QN, RES, oLBFGS, ...)
- Decentralized methods: samples are distributed over multiple processors
  - Primal methods: DGD, Acc. DGD, NN, ...
  - Dual methods: DDA, DADMM, DQM, EXTRA, ESOM, ...
- Adaptive sample size methods: start with a subset of samples and increase the size of the training set at each iteration
  - Ada Newton
  - The solutions are close when the numbers of samples are close

Incremental quasi-Newton algorithms
- Introduction
- Incremental quasi-Newton algorithms
- Adaptive sample size algorithms
- Conclusions

Incremental Gradient Descent
- Objective function gradients: s(w) := ∇F(w) = (1/N) Σ_{i=1}^N ∇f(w, θ_i)
- (Deterministic) gradient descent iteration: w_{t+1} = w_t − ε_t s(w_t)
- Evaluation of (deterministic) gradients is not computationally affordable
- Incremental/stochastic gradient ⇒ sample average in lieu of expectations

    ŝ(w, θ) = (1/L) Σ_{l=1}^L ∇f(w, θ_l),    θ = [θ_1; ...; θ_L]

- Functions are chosen cyclically or at random, with or without replacement
- Incremental gradient descent iteration: w_{t+1} = w_t − ε_t ŝ(w_t, θ_t)
- (Incremental) gradient descent is (very) slow. Newton is impractical
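A minimal sketch of the incremental iteration above (not from the slides), with cyclic selection and batch size L = 1; the least-squares gradients and the step-size schedule are illustrative assumptions.

import numpy as np

def incremental_gradient_descent(grads, w0, step, n_epochs=50):
    # w_{t+1} = w_t - eps_t * grad f_{i_t}(w_t), cycling over the N functions
    w, t = w0.copy(), 0
    for _ in range(n_epochs):
        for g in grads:            # cyclic choice of the index i_t
            w -= step(t) * g(w)    # single-sample (L = 1) incremental step
            t += 1
    return w

# Illustrative least-squares setup: f_i(w) = 0.5 * (x_i'w - y_i)^2
rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 5)), rng.standard_normal(200)
grads = [lambda w, x=x, yi=yi: (x @ w - yi) * x for x, yi in zip(X, y)]
w = incremental_gradient_descent(grads, np.zeros(5), step=lambda t: 0.05 / (1 + 0.01 * t))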

BFGS quasi-newton method Approximate function s curvature with Hessian approximation matrix B 1 t w t+1 = w t ɛ t B 1 t s(w t) Make B t close to H(w t) := 2 F (w t). Broyden, DFP, BFGS Variable variation: v t = w t+1 w t. Gradient variation: r t = s(w t+1) s(w t) Matrix B t+1 satisfies secant condition B t+1v t = r t. Underdetermined Resolve indeterminacy making B t+1 closest to previous approximation B t Using Gaussian relative entropy as proximity condition yields update B t+1 = B t + rtrt t v T Btvtvt T B t t r t v T t B tv t Superlinear convergence Close enough to quadratic rate of Newton BFGS requires gradients Use stochastic/incremental gradients Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 8

Stochastic/Incremental quasi-newton methods Online (o)bfgs & Online Limited-Memory (ol)bfgs [Schraudolph et al 07] obfgs may diverge because Hessian approximation gets close to singular Regularized Stochastic BFGS (RES) [Mokhtari-Ribeiro 14] olbfgs does (surprisingly) converge [Mokhtari-Ribeiro 15] Problem solved? Alas. RES and olbfgs have sublinear convergence Same as stochastic gradient descent. Asymptotically not better Variance reduced stochastic L-BFGS (SVRG+oLBFGS) [Mortiz et al 16] Linear convergence rate. But this is not better than SAG, SAGA, SVRG Computationally feasible quasi Newton method with superlinear convergence Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 9

Incremental aggregated gradient method
- Utilize memory to reduce the variance of the stochastic gradient approximation
- [Diagram: table of stored gradients ∇f_1^t, ..., ∇f_{i_t}^t, ..., ∇f_N^t; entry i_t is replaced by ∇f_{i_t}(w_{t+1}) to form the table at time t+1]
- Descend along the incremental aggregated gradient

    w_{t+1} = w_t − (α/N) Σ_{i=1}^N ∇f_i^t =: w_t − α g^t

- Select the update index i_t cyclically. Uniformly at random is similar
- Update the gradient corresponding to function f_{i_t}: ∇f_{i_t}^{t+1} = ∇f_{i_t}(w_{t+1})
- The sum is easy to compute: g^{t+1} = g^t − (1/N) ∇f_{i_t}^t + (1/N) ∇f_{i_t}^{t+1}. Converges linearly
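A sketch of this aggregated-gradient scheme (not the authors' code): the gradient table, the cyclic index, and the running average mirror the update above, while the step size α and the gradient callables are assumptions of the sketch.

import numpy as np

def iag(grads, w0, alpha, n_iters=5000):
    # Incremental aggregated gradient: keep the latest gradient of every f_i,
    # descend along their average, and refresh only one table entry per iteration.
    N, w = len(grads), w0.copy()
    table = [g(w) for g in grads]            # stored gradients grad f_i^t
    g_avg = np.mean(table, axis=0)           # g^t = (1/N) sum_i grad f_i^t
    for t in range(n_iters):
        w = w - alpha * g_avg                # w_{t+1} = w_t - alpha * g^t
        i = t % N                            # cyclic selection of i_t
        new_grad = grads[i](w)               # grad f_{i_t}^{t+1} = grad f_{i_t}(w_{t+1})
        g_avg = g_avg + (new_grad - table[i]) / N
        table[i] = new_grad
    return w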

Incremental BFGS method
- Keep memory of variables z_i^t, Hessian approximations B_i^t, and gradients ∇f_i^t
- Functions indexed by i. Time indexed by t. Select function f_{i_t} at time t
- [Diagram: memory tables z_1^t ... z_N^t, B_1^t ... B_N^t, ∇f_1^t ... ∇f_N^t feeding the update of w_{t+1}]
- All gradients, matrices, and variables are used to update w_{t+1}
- The updated variable w_{t+1} is used to update the gradient: ∇f_{i_t}^{t+1} = ∇f_{i_t}(w_{t+1})
- Update B_{i_t}^t to satisfy the secant condition for function f_{i_t}, with variable variation w_{t+1} − z_{i_t}^t and gradient variation ∇f_{i_t}^{t+1} − ∇f_{i_t}^t (more later)
- Update the variable, Hessian approximation, and gradient memory for function f_{i_t}

Update of Hessian approximation matrices
- Variable variation at time t for function f_i = f_{i_t}: v_i^t := z_i^{t+1} − z_i^t
- Gradient variation at time t for function f_i = f_{i_t}: r_i^t := ∇f_{i_t}^{t+1} − ∇f_{i_t}^t
- Update B_i^t = B_{i_t}^t to satisfy the secant condition for the variations v_i^t and r_i^t

    B_i^{t+1} = B_i^t + (r_i^t r_i^{tT})/(v_i^{tT} r_i^t) − (B_i^t v_i^t v_i^{tT} B_i^t)/(v_i^{tT} B_i^t v_i^t)

- We want B_i^t to approximate the Hessian of the function f_i = f_{i_t}

A naive (in hindsight) incremental BFGS method
- The key is in the update of w_t. Use memory in the stochastic quantities

    w_{t+1} = w_t − ( (1/N) Σ_{i=1}^N B_i^t )^{-1} ( (1/N) Σ_{i=1}^N ∇f_i^t )

- It doesn't work ⇒ better than incremental gradient but not superlinear
- Optimization updates are solutions of function approximations
- In this particular update we are minimizing the quadratic form

    f(w) ≈ (1/N) Σ_{i=1}^N [ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − w_t) + (1/2)(w − w_t)^T B_i^t (w − w_t) ]

- Gradients are evaluated at z_i^t. The secant condition is verified at z_i^t
- But the quadratic form is centered at w_t. Not a reasonable Taylor series

A proper Taylor series expansion
- Each individual function f_i is being approximated by the quadratic

    f_i(w) ≈ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − w_t) + (1/2)(w − w_t)^T B_i^t (w − w_t)

- To have a proper expansion we have to recenter the quadratic form at z_i^t

    f_i(w) ≈ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − z_i^t) + (1/2)(w − z_i^t)^T B_i^t (w − z_i^t)

- I.e., we approximate f(w) with the aggregate quadratic function

    f(w) ≈ (1/N) Σ_{i=1}^N [ f_i(z_i^t) + ∇f_i(z_i^t)^T (w − z_i^t) + (1/2)(w − z_i^t)^T B_i^t (w − z_i^t) ]

- This is now a reasonable Taylor series that we use to derive an update

Incremental BFGS
- Solving this quadratic program yields the update for the IQN method

    w_{t+1} = ( (1/N) Σ_{i=1}^N B_i^t )^{-1} [ (1/N) Σ_{i=1}^N B_i^t z_i^t − (1/N) Σ_{i=1}^N ∇f_i(z_i^t) ]

- Looks difficult to implement, but it is more similar to BFGS than apparent
- As in BFGS, it can be implemented with O(p^2) operations
  - Write it as a rank-2 update and use the matrix inversion lemma
  - Independently of N ⇒ a true incremental method
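A naive sketch of the IQN iteration (not the authors' implementation): it keeps the per-function memory described above but solves the aggregate p×p system directly, so it costs O(p^3) per step rather than the O(p^2) achievable with the rank-2 / matrix-inversion-lemma trick mentioned on the slide. Identity initialization of the B_i and cyclic index selection are assumptions of the sketch.

import numpy as np

def iqn(grads, w0, n_iters=2000):
    # Per-function memory: iterates z_i, Hessian approximations B_i, gradients grad f_i(z_i)
    N, p = len(grads), len(w0)
    z = [w0.copy() for _ in range(N)]
    B = [np.eye(p) for _ in range(N)]
    g = [grads[i](z[i]) for i in range(N)]
    B_sum = sum(B)                                     # sum_i B_i
    Bz_sum = sum(Bi @ zi for Bi, zi in zip(B, z))      # sum_i B_i z_i
    g_sum = sum(g)                                     # sum_i grad f_i(z_i)
    for t in range(n_iters):
        w = np.linalg.solve(B_sum, Bz_sum - g_sum)     # IQN update (the 1/N factors cancel)
        i = t % N                                      # cyclic selection of i_t
        v, new_g = w - z[i], grads[i](w)               # variable / gradient variations
        r = new_g - g[i]
        Bv = B[i] @ v
        B_new = B[i] + np.outer(r, r) / (v @ r) - np.outer(Bv, Bv) / (v @ Bv)
        B_sum += B_new - B[i]                          # refresh the aggregates in O(p^2)
        Bz_sum += B_new @ w - B[i] @ z[i]
        g_sum += new_g - g[i]
        z[i], B[i], g[i] = w, B_new, new_g
    return np.linalg.solve(B_sum, Bz_sum - g_sum)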

Superlinear convergence rate
Assumptions: the functions f_i are m-strongly convex, the gradients ∇f_i are M-Lipschitz continuous, and the Hessians ∇²f_i are L-Lipschitz continuous.

Theorem. The sequence of residuals ‖w_t − w*‖ in the IQN method converges to zero at a superlinear rate,

    lim_{t→∞} ‖w_t − w*‖ / [ (1/N)( ‖w_{t−1} − w*‖ + ... + ‖w_{t−N} − w*‖ ) ] = 0.

- An incremental method with small cost per iteration converging at a superlinear rate
- Resulting from the use of memory to reduce stochastic variances
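The rate in the theorem can be checked empirically on a toy problem (an illustration, not from the slides) by logging the iterates and computing the ratio that appears in the limit:

import numpy as np

def superlinear_ratios(iterates, w_star, N):
    # ratio_t = ||w_t - w*|| / ((1/N) * sum_{i=1..N} ||w_{t-i} - w*||); it should tend to 0
    errs = [np.linalg.norm(w - w_star) for w in iterates]
    return [errs[t] / (np.mean(errs[t - N:t]) + 1e-16) for t in range(N, len(errs))]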

Numerical results
- Quadratic programming: f(w) := (1/N) Σ_{i=1}^N ( w^T A_i w / 2 + b_i^T w )
  - A_i ∈ R^{p×p} is a diagonal positive definite matrix
  - b_i ∈ R^p is a random vector from the box [0, 10^3]^p
  - N = 1000, p = 100, and condition numbers 10^2 and 10^4
- Relative error ‖w_t − w*‖ / ‖w_0 − w*‖ of SAG, SAGA, IAG, and IQN
- [Figure: normalized error vs number of effective passes for SAG, SAGA, IAG, and IQN; (a) small condition number, (b) large condition number]
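A sketch of how such a synthetic instance can be generated (the slides only specify diagonal positive definite A_i, b_i drawn from [0, 10^3]^p, and the problem sizes; the log-uniform diagonal used to set the condition number is an assumption):

import numpy as np

def make_quadratic_problem(N=1000, p=100, cond=1e2, seed=0):
    # f(w) = (1/N) sum_i ( w'A_i w / 2 + b_i'w ) with diagonal positive definite A_i
    rng = np.random.default_rng(seed)
    A = [np.diag(np.exp(rng.uniform(0.0, np.log(cond), size=p))) for _ in range(N)]
    b = [rng.uniform(0.0, 1e3, size=p) for _ in range(N)]
    grads = [lambda w, Ai=Ai, bi=bi: Ai @ w + bi for Ai, bi in zip(A, b)]
    w_star = np.linalg.solve(sum(A) / N, -sum(b) / N)   # minimizer of the average quadratic
    return grads, w_star

The grads list and w_star can then be fed to the IAG and IQN sketches above to plot a normalized-error-vs-passes comparison.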

Adaptive sample size algorithms
- Introduction
- Incremental quasi-Newton algorithms
- Adaptive sample size algorithms
- Conclusions

Back to the ERM problem
- Our original goal was to solve the statistical loss problem

    w* := argmin_{w ∈ R^p} L(w) = argmin_{w ∈ R^p} E[ f(w, Z) ]

- But since the distribution of Z is unknown we settle for the ERM problem

    w_N† := argmin_{w ∈ R^p} L_N(w) = argmin_{w ∈ R^p} (1/N) Σ_{k=1}^N f(w, z_k)

- where the samples z_k are drawn from a common distribution
- ERM approximates the actual problem ⇒ we don't need a perfect solution

Regularized ERM problem
- From statistical learning we know that there exists a constant V_N such that

    sup_w | L(w) − L_N(w) | ≤ V_N,   w.h.p.

- V_N = O(1/√N) from the CLT. V_N = O(1/N) sometimes [Bartlett et al '06]
- There is no need to minimize L_N(w) beyond accuracy O(V_N)
- This is well known. In fact, this is why we can add regularizers to ERM

    w_N* := argmin_w R_N(w) = argmin_w L_N(w) + (c V_N / 2) ‖w‖^2

- Adding the term (c V_N / 2) ‖w‖^2 moves the optimum of the ERM problem
- But the optimum w_N* is still in a ball of order V_N around w*
- Goal: minimize the risk R_N within its statistical accuracy V_N

Adaptive sample size methods
- ERM problem R_n for a subset of n ≤ N uniformly chosen samples

    w_n* := argmin_w R_n(w) = argmin_w L_n(w) + (c V_n / 2) ‖w‖^2

- Solutions w_m* for m samples and w_n* for n samples are close
- Find an approximate solution w_m for the risk R_m with m samples
- Increase the sample size to n > m samples
- Use w_m as a warm start to find an approximate solution w_n for R_n
- If m < n, it is easier to solve R_m than R_n since
  - The condition number of R_m is smaller than that of R_n
  - The required accuracy V_m is larger than V_n
  - The computational cost of solving R_m is lower than that of R_n

Adaptive sample size Newton method
- Ada Newton is a specific adaptive sample size method using Newton steps
  - Find w_m that solves R_m to its statistical accuracy V_m
  - Apply a single Newton iteration

      w_n = w_m − ∇²R_n(w_m)^{-1} ∇R_n(w_m)

  - If m and n are close, we have w_n within the statistical accuracy of R_n
- This works if the statistical accuracy ball of R_m is within the Newton quadratic convergence ball of R_n
  - Then w_m is within the Newton quadratic convergence ball of R_n
  - A single Newton iteration yields w_n within the statistical accuracy of R_n
- Question: how should we choose the sample size growth factor α (with n = αm)?

Assumptions
- The functions f(w, z) are convex
- The gradients ∇f(w, z) are M-Lipschitz continuous:

    ‖∇f(w, z) − ∇f(w', z)‖ ≤ M ‖w − w'‖,   for all z

- The functions f(w, z) are self-concordant with respect to w for all z

Theoretical result

Theorem. Consider w_m as a V_m-optimal solution of R_m, i.e., R_m(w_m) − R_m(w_m*) ≤ V_m, and let n = αm. If two inequalities relating m, n, the accuracies V_m, V_n, V_{n−m}, and the constants M, c, and ‖w*‖ are satisfied [Mokhtari et al '16], then w_n has sub-optimality error V_n w.h.p., i.e.,

    R_n(w_n) − R_n(w_n*) ≤ V_n,   w.h.p.

- Condition 1 ⇒ w_m is in the Newton quadratic convergence ball of R_n
- Condition 2 ⇒ w_n is within the statistical accuracy of R_n
- Condition 2 becomes redundant for large m

Doubling the size of the training set

Proposition. Consider a learning problem in which the statistical accuracy satisfies V_m ≤ α V_n for n = αm and lim_{n→∞} V_n = 0. If c is chosen so that

    (2αM / c)^{1/2} + (2(α−1) / (αc))^{1/2} ≤ 1/4,

then there exists a sample size m̃ such that the conditions in Theorem 1 are satisfied for all m > m̃ and n = αm.

- We can double the size of the training set (α = 2)
  - If the size of the training set is large enough
  - If the constant c satisfies c > 16(2√M + 1)^2
- We achieve the statistical accuracy of the full training set in about 2 passes over the data
- After inverting about 3.32 log_10 N ≈ log_2 N Hessians

Adaptive sample size Newton (Ada Newton)

Parameters: α_0 = 2 and 0 < β < 1.
Initialize: n = m_0 and w_n = w_{m_0} with R_n(w_n) − R_n(w_n*) ≤ V_n
while n < N do
    Update w_m = w_n and m = n. Reset factor α = α_0
    repeat  [sample size backtracking loop]
        1: Increase sample size: n = min{αm, N}
        2: Compute gradient: ∇R_n(w_m) = (1/n) Σ_{k=1}^n ∇f(w_m, z_k) + c V_n w_m
        3: Compute Hessian: ∇²R_n(w_m) = (1/n) Σ_{k=1}^n ∇²f(w_m, z_k) + c V_n I
        4: Update the variable: w_n = w_m − ∇²R_n(w_m)^{-1} ∇R_n(w_m)
        5: Backtrack the sample size increase: α = βα
    until R_n(w_n) − R_n(w_n*) ≤ V_n
end while
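Below is a minimal Python sketch of this loop (not the authors' code), assuming V_n = 1/n. The per-sample gradient/Hessian callables, the zero initialization, and the backtracking guard are assumptions of the sketch; since the exit test R_n(w_n) − R_n(w_n*) ≤ V_n is not directly computable, the sketch substitutes the strong-convexity bound ‖∇R_n(w_n)‖² / (2 c V_n) ≤ V_n, which is a sufficient check.

import numpy as np

def ada_newton(grad_f, hess_f, z, p, c=20.0, m0=124, alpha0=2.0, beta=0.9):
    # grad_f(w, z_k), hess_f(w, z_k): per-sample gradient / Hessian of f(., z_k); V_n = 1/n
    N = len(z)
    def grad_R(w, n):
        return np.mean([grad_f(w, zk) for zk in z[:n]], axis=0) + (c / n) * w
    def hess_R(w, n):
        return np.mean([hess_f(w, zk) for zk in z[:n]], axis=0) + (c / n) * np.eye(p)

    n, w_n = m0, np.zeros(p)
    for _ in range(20):                                # crude initialization: solve R_{m0} approximately
        w_n = w_n - np.linalg.solve(hess_R(w_n, n), grad_R(w_n, n))
    while n < N:
        w_m, m, alpha = w_n, n, alpha0                 # warm start; reset the growth factor
        while True:                                    # sample size backtracking loop
            n = min(int(alpha * m), N)                 # 1: increase sample size
            g = grad_R(w_m, n)                         # 2: gradient of R_n at w_m
            H = hess_R(w_m, n)                         # 3: Hessian of R_n at w_m
            w_n = w_m - np.linalg.solve(H, g)          # 4: single Newton step
            alpha *= beta                              # 5: backtrack the sample-size increase
            if np.linalg.norm(grad_R(w_n, n)) ** 2 * n / (2 * c) <= 1.0 / n:
                break                                  # surrogate for R_n(w_n) - R_n(w_n*) <= V_n
            if alpha * m <= m:                         # guard (not on the slide): stop shrinking below m
                break
    return w_n

For a logistic regression problem, grad_f and hess_f would be the per-sample logistic gradient and Hessian.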

Numerical results
- Logistic regression (LR) on the protein homology dataset (KDD Cup 2004)
- Number of samples N = 145,751, dimension p = 74
- Parameters: V_n = 1/n, c = 20, m_0 = 124, and α = 2
- [Figure: suboptimality R_N(w) − R_N* vs number of passes for SGD, SAGA, Newton, and Ada Newton]
- Ada Newton achieves the statistical accuracy of the full training set with about two passes over the dataset

Numerical results
- We use the A9A and SUSY datasets to train an LR problem
  - A9A: N = 32,561 samples with dimension p = 123
  - SUSY: N = 5,000,000 samples with dimension p = 18
- The green line shows the iteration at which Ada Newton reached convergence on the test set
- [Figure: suboptimality R_N(w) − R_N* vs number of effective passes for Ada Newton, Newton, and SAGA. A9A (left) and SUSY (right)]
- Ada Newton achieves the accuracy R_N(w) − R_N* < 1/N in less than 2.3 passes over the full training set

Numerical results
- We use the A9A and SUSY datasets to train an LR problem
  - A9A: N = 32,561 samples with dimension p = 123
  - SUSY: N = 5,000,000 samples with dimension p = 18
- The green line shows the iteration at which Ada Newton reached convergence on the test set
- [Figure: suboptimality R_N(w) − R_N* vs runtime for Ada Newton, Newton, and SAGA. A9A (left) and SUSY (right)]

Numerical results
- We use the A9A and SUSY datasets to train an LR problem
  - A9A: N = 32,561 samples with dimension p = 123
  - SUSY: N = 5,000,000 samples with dimension p = 18
- [Figure: test error vs number of effective passes for Ada Newton, Newton, and SAGA. A9A (left) and SUSY (right)]

Ada Newton and the challenges for Newton in ERM
- There are four reasons why it is impractical to use Newton's method in ERM:
  - It is costly to compute Hessians (and gradients): order O(N p^2) operations
  - It is costly to invert Hessians: order O(p^3) operations
  - A line search is needed to moderate the stepsize outside of the quadratic region
  - Quadratic convergence is advantageous close to the optimum, but we don't want to optimize beyond statistical accuracy
- Ada Newton (mostly) overcomes these four challenges:
  - Hessians are computed for a subset of samples ⇒ two passes over the dataset
  - Hessians are inverted only a logarithmic number of times. But each inversion still costs O(p^3)
  - There is no line search
  - We enter quadratic regions without going beyond statistical accuracy

Conclusions
- Introduction
- Incremental quasi-Newton algorithms
- Adaptive sample size algorithms
- Conclusions

Conclusions
- We studied different approaches to solving large-scale ERM problems
- An incremental quasi-Newton BFGS method (IQN) was presented
  - IQN only computes the information of a single function at each step ⇒ low computation cost
  - IQN aggregates variable, gradient, and BFGS approximation memory ⇒ reduces the noise ⇒ superlinear convergence
- Ada Newton resolves the drawbacks of Newton-type methods
  - Unit stepsize ⇒ no line search
  - Not sensitive to the initial point
  - Fewer Hessian inversions
  - Exploits the quadratic convergence of Newton's method at each iteration
- Ada Newton achieves statistical accuracy with about two passes over the data

References

N. N. Schraudolph, J. Yu, and S. Günter, "A Stochastic Quasi-Newton Method for Online Convex Optimization," in AISTATS, vol. 7, pp. 436-443, 2007.

A. Mokhtari and A. Ribeiro, "RES: Regularized Stochastic BFGS Algorithm," IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6089-6104, December 2014.

A. Mokhtari and A. Ribeiro, "Global Convergence of Online Limited Memory BFGS," Journal of Machine Learning Research, vol. 16, pp. 3151-3181, 2015.

P. Moritz, R. Nishihara, and M. I. Jordan, "A Linearly-Convergent Stochastic L-BFGS Algorithm," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249-258, 2016.

A. Mokhtari, M. Eisen, and A. Ribeiro, "An Incremental Quasi-Newton Method with a Local Superlinear Convergence Rate," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, March 5-9, 2017.

A. Mokhtari, H. Daneshmand, A. Lucchi, T. Hofmann, and A. Ribeiro, "Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy," in Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016.