Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal


Transcription:

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal, Northwestern University. Goldman Lecture, September 2016.

Collaborators: Raghu Bollapragada (Northwestern University) and Richard Byrd (University of Colorado).

Optimization, Statistics, Machine Learning. There has been a dramatic breakdown of the barriers between optimization and statistics, partly stimulated by the rise of machine learning. Learning processes and learning measures are defined using continuous optimization models, both convex and non-convex. Nonlinear optimization algorithms are used to train statistical models, a great challenge with the advent of high-dimensional models and huge data sets. The stochastic gradient method currently plays a central role.

Stochastic Gradient Method. Robbins-Monro (1951). 1. Why has it risen to such prominence? 2. What is the main mechanism that drives it? 3. What can we say about its behavior in convex and nonconvex cases? 4. What ideas have been proposed to improve upon SG? Reference: "Optimization Methods for Large-Scale Machine Learning", Bottou, Curtis, Nocedal (2016).

Problem statement. Given a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$ and a loss function $\ell(z, y)$ (hinge loss, logistic, ...), find a prediction function $h(x; w)$ (linear, DNN, ...) by solving $\min_w \; \frac{1}{n}\sum_{i=1}^n \ell(h(x_i; w), y_i)$. Notation: $f_i(w) = \ell(h(x_i; w), y_i)$; the empirical risk is $R_n(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$; with the random variable $\xi = (x, y)$, the expected risk is $F(w) = \mathbb{E}[f(w; \xi)]$.
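
To make the notation concrete, here is a minimal sketch (not from the slides) of $f_i$ and $R_n$ for one common instance of this setup: binary labels $y_i \in \{-1, +1\}$, a linear predictor $h(x; w) = w^T x$, and the logistic loss; all function names are illustrative.

```python
import numpy as np

def logistic_loss(z, y):
    # l(z, y) = log(1 + exp(-y z)) for labels y in {-1, +1}
    return np.log1p(np.exp(-y * z))

def f_i(w, x_i, y_i):
    # f_i(w) = l(h(x_i; w), y_i) with the linear predictor h(x; w) = w^T x
    return logistic_loss(x_i @ w, y_i)

def empirical_risk(w, X, y):
    # R_n(w) = (1/n) * sum_i f_i(w), with data rows stacked in X and labels in y
    return np.mean(logistic_loss(X @ w, y))
```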

Stochastic Gradient Method. We first present algorithms for empirical risk minimization, $R_n(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$: $w_{k+1} = w_k - \alpha_k \nabla f_i(w_k)$, with $i \in \{1,\dots,n\}$ chosen at random. A very cheap, noisy iteration: the gradient is taken with respect to just one data point. It is not a gradient descent method; it is a stochastic process dependent on the choice of $i$, and it gives descent only in expectation.
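
A minimal sketch of the SG iteration under the logistic-regression setup sketched above; the step length, iteration count, and function names are illustrative, not from the slides.

```python
import numpy as np

def grad_f_i(w, x_i, y_i):
    # gradient of the logistic loss at a single data point (labels in {-1, +1})
    s = 1.0 / (1.0 + np.exp(-y_i * (x_i @ w)))
    return -(1.0 - s) * y_i * x_i

def sgd(X, y, alpha=0.1, num_iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        i = rng.integers(n)                       # choose i in {1,...,n} at random
        w = w - alpha * grad_f_i(w, X[i], y[i])   # w_{k+1} = w_k - alpha_k * grad f_i(w_k)
    return w
```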

Batch Optimization Methods. The batch gradient method: $w_{k+1} = w_k - \alpha_k \nabla R_n(w_k) = w_k - \frac{\alpha_k}{n}\sum_{i=1}^n \nabla f_i(w_k)$. A more expensive, more accurate step; one can choose among a wide range of optimization algorithms, and there are opportunities for parallelism. Why, then, has SG emerged as the preeminent method? Because of the computational trade-offs between stochastic and batch methods, and the ability to minimize $F$ (the generalization error).

Practical Experience. [Figure: empirical risk $R_n$ vs. number of accessed data points (about 10 epochs) for logistic regression on speech data, comparing SGD with a batch L-BFGS method.] Fast initial progress of SG is followed by a drastic slowdown. Can we explain this?

Intuition. SG employs information more efficiently than batch methods. Argument 1: suppose the data consists of 10 copies of a set S. An iteration of the batch method is then 10 times more expensive, while SG performs the same computations.

Computational complexity. Total work to obtain $R_n(w_k) \le R_n(w^*) + \epsilon$: batch gradient method, $n\log(1/\epsilon)$; stochastic gradient method, $1/\epsilon$. Think of $\epsilon = 10^{-3}$: which one is better? More precisely, batch: $n\, d\, \kappa \log(1/\epsilon)$; SG: $d\, \nu\, \kappa^2 / \epsilon$.
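
For illustration (numbers not from the slides), plugging $n = 10^6$ and $\epsilon = 10^{-3}$ into the simplified bounds gives

\[
\text{batch: } n \log(1/\epsilon) = 10^{6}\,\ln(10^{3}) \approx 6.9 \times 10^{6},
\qquad
\text{SG: } 1/\epsilon = 10^{3},
\]

so the per-accuracy work of the batch method scales with $n$ while SG's does not; SG wins by orders of magnitude for large $n$, at the price of a much worse dependence on $\epsilon$ (and on $\kappa$) in the refined bounds.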

Example by Bertsekas. $R_n(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$. [Figure: a one-dimensional sum of quadratics; the individual minimizers $w_{i,*}$ are spread over a "region of confusion" around the minimizer of $R_n$.] Note that this is a geographical argument. Analysis: given $w_k$, what is the expected decrease in the objective function $R_n$ as we choose one of the quadratics at random?

A fundamental observation: $\mathbb{E}[R_n(w_{k+1}) - R_n(w_k)] \le -\alpha_k \|\nabla R_n(w_k)\|_2^2 + \alpha_k^2\, \mathbb{E}\|\nabla f_{i_k}(w_k)\|^2$. Initially the gradient-decrease term dominates; later the variance in the gradient hinders progress. To ensure convergence, $\alpha_k \to 0$ in the SG method to control the variance. Sub-sampled Newton methods directly control the noise given in the last term. What can we say when $\alpha_k = \alpha$ is constant?

With a constant steplength, SG converges only to a neighborhood of the optimal value.
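
An illustrative numeric check (not from the slides) of this behavior on a toy one-dimensional finite sum: with a constant steplength, the SG iterates settle into a noise floor around $w^*$ whose size shrinks as $\alpha$ is reduced.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(size=100)          # f_i(w) = 0.5 * (w - b_i)^2, so R_n is minimized at w* = mean(b)
w_star = b.mean()

def average_tail_distance(alpha, num_iters=20000):
    # run constant-step SG and average |w_k - w*| over the second half of the run
    w, total = 5.0, 0.0
    for k in range(num_iters):
        i = rng.integers(b.size)
        w -= alpha * (w - b[i])   # stochastic gradient of the randomly chosen f_i
        if k >= num_iters // 2:
            total += abs(w - w_star)
    return total / (num_iters // 2)

for alpha in (0.5, 0.05, 0.005):
    print(f"alpha = {alpha:5.3f}   average |w_k - w*| ~ {average_tail_distance(alpha):.4f}")
```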


Deep Neural Networks. Although much is understood about the SG method, some great mysteries remain: why is it so much better than batch methods on DNNs?

Sharp and flat minima. Keskar et al. (2016). [Figure, in the style of Goodfellow et al.: $R_n$ observed along the line from the SG solution to the batch solution, for a deep convolutional neural net on CIFAR-10.] SG: mini-batch of size 256, ADAM optimizer. Batch: 10% of the training set.

Testing accuracy and sharpness. Keskar et al. (2016). [Figures: sharpness of the minimizer vs. batch size, and testing accuracy vs. batch size.] Sharpness: the maximum of $R_n$ in a small box around the minimizer.

Drawback of the SG method: distributed computing.

Sub-sampled Newton Methods

Iteration. For $R_n(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$, choose $S \subset \{1,\dots,n\}$ and $X \subset \{1,\dots,n\}$ uniformly and independently, solve $\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k)$, and set $w_{k+1} = w_k + \alpha_k p$. Sub-sampled gradient and Hessian: $\nabla F_X(w_k) = \frac{1}{|X|}\sum_{i \in X} \nabla f_i(w_k)$, $\nabla^2 F_S(w_k) = \frac{1}{|S|}\sum_{i \in S} \nabla^2 f_i(w_k)$. The true Newton method is impractical in large-scale machine learning; we will not achieve scale invariance or quadratic convergence. But the stochastic nature of the objective creates opportunities: coordinate the Hessian sample $S$ and the gradient sample $X$ for optimal complexity.
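
A minimal sketch of one such iteration for the logistic-regression objective used earlier, assuming independent uniform samples $S$ and $X$; the sample sizes, the small ridge term that keeps the sampled Hessian invertible, and all names are illustrative, and the Hessian is formed explicitly only because $d$ is assumed small.

```python
import numpy as np

def subsampled_newton_step(w, X, y, S_size, X_size, alpha=1.0, ridge=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = rng.choice(n, size=S_size, replace=False)    # Hessian sample
    G = rng.choice(n, size=X_size, replace=False)    # gradient sample, drawn independently

    # Sub-sampled gradient: (1/|X|) * sum_{i in X} grad f_i(w)
    sg = 1.0 / (1.0 + np.exp(-y[G] * (X[G] @ w)))
    grad = (-((1.0 - sg) * y[G])[:, None] * X[G]).mean(axis=0)

    # Sub-sampled Hessian: (1/|S|) * sum_{i in S} hess f_i(w) = (1/|S|) * X_S^T D X_S
    sh = 1.0 / (1.0 + np.exp(-(X[S] @ w)))
    D = sh * (1.0 - sh)
    H = (X[S] * D[:, None]).T @ X[S] / S_size + ridge * np.eye(d)

    p = np.linalg.solve(H, -grad)                    # solve hess F_S(w) p = -grad F_X(w)
    return w + alpha * p
```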

Active research area: Friedlander and Schmidt (2011); Byrd, Chin, Neveitt, Nocedal (2011); Royset (2012); Erdogdu and Montanari (2015); Roosta-Khorasani and Mahoney (2016); Agarwal, Bullins and Hazan (2016); Pilanci and Wainwright (2015); Pasupathy, Glynn, Ghosh, Hashemi (2015); Xu, Yang, Roosta-Khorasani, Ré, Mahoney (2016).

Linear convergence. Consider $\nabla^2 F_{S_k}(w_k)\, p = -\nabla F_{X_k}(w_k)$, $w_{k+1} = w_k + \alpha p$. The following result is well known for strongly convex objectives. Theorem: under standard assumptions, if (a) $\alpha = \mu / L$, (b) $|S_k| =$ constant, and (c) $|X_k| = \eta^k$ with $\eta > 1$ (geometric growth), then $\mathbb{E}[\|w_k - w^*\|] \to 0$ at a linear rate, and the work complexity matches that of the stochastic gradient method. Here $\mu$ is the smallest eigenvalue of any sub-sampled Hessian and $L$ is the largest eigenvalue of the Hessian of $F$.
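
A small sketch of the sample-size schedule in conditions (b) and (c), with illustrative constants: the Hessian sample stays fixed while the gradient sample grows geometrically, capped at $n$.

```python
def sample_schedule(n, num_iters, S_const=50, X0=10, eta=1.5):
    # (|S_k|, |X_k|) with |S_k| constant and |X_k| = X0 * eta^k (geometric growth), capped at n
    return [(S_const, min(n, int(X0 * eta ** k))) for k in range(num_iters)]

# sample_schedule(100000, 5) -> [(50, 10), (50, 15), (50, 22), (50, 33), (50, 50)]
```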

Local superlinear convergence. Objective: the expected risk $F$. We can show the linear-quadratic result $\mathbb{E}_k[\|w_{k+1} - w^*\|] \le C_1 \|w_k - w^*\|^2 + \frac{\sigma}{\mu\sqrt{|S_k|}}\|w_k - w^*\| + \frac{v}{\mu\sqrt{|X_k|}}$. To obtain superlinear convergence: (i) $|S_k| \to \infty$; (ii) $|X_k|$ must increase faster than geometrically.

Closer look at the constants. In the linear-quadratic result $\mathbb{E}_k[\|w_{k+1} - w^*\|] \le C_1 \|w_k - w^*\|^2 + \frac{\sigma}{\mu\sqrt{|S_k|}}\|w_k - w^*\| + \frac{v}{\mu\sqrt{|X_k|}}$, the constants satisfy $\mathbb{E}[\|\nabla^2 F_i(w) - \nabla^2 F(w)\|^2] \le \sigma^2$ and $\mathrm{tr}\,(\mathrm{Cov}(\nabla f_i(w))) \le v^2$.

Achieving faster convergence. To obtain the linear-quadratic result above we used the bound $\mathbb{E}_k\big[\|(\nabla^2 F_{S_k}(w_k) - \nabla^2 F(w_k))(w_k - w^*)\|\big] \le \frac{\sigma}{\sqrt{|S_k|}}\,\|w_k - w^*\|$, not matrix concentration inequalities.

Formal superlinear convergence. Theorem: under the conditions just stated, there is a neighborhood of $w^*$ such that for the sub-sampled Newton method with $\alpha_k = 1$, $\mathbb{E}[\|w_k - w^*\|] \to 0$ superlinearly.

Observations on quasi-Newton methods (Dennis-Moré). Consider $B_k p = -\nabla F_X(w_k)$, $w_{k+1} = w_k + \alpha_k p$, and suppose $w_k \to w^*$. If $\frac{\|[B_k - \nabla^2 f(w^*)]\, p_k\|}{\|p_k\|} \to 0$, the convergence rate is superlinear. Note: $B_k$ need not converge to $\nabla^2 f(w^*)$; it only needs to be a good approximation along the search directions.

A hacker's definition of a second-order method: (1) one for which unit steplength is acceptable (yields sufficient reduction) most of the time. (2) Why? How can one learn the scale of the search directions without having learned the curvature of the function in some relevant spaces? (3) This "definition" is problem dependent.

Inexact Methods. $\nabla^2 F_S(w_k)\, p = -\nabla F_X(w_k)$, $w_{k+1} = w_k + \alpha_k p$. Exact method: the Hessian approximation is inverted (e.g., Newton-sketch). Inexact method: solve the linear system inexactly with an iterative solver, either conjugate gradient or stochastic gradient; both require only Hessian-vector products. Newton-CG chooses a fixed sample $S$ and applies CG to $q_k(p) = F(w_k) + \nabla F(w_k)^T p + \frac{1}{2} p^T \nabla^2 F_S(w_k)\, p$.
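
A sketch of the Newton-CG variant for the logistic-regression setup used earlier: CG is applied to the sub-sampled quadratic model, and only Hessian-vector products $\nabla^2 F_S(w_k)\, v$ are ever formed (the CG loop, iteration cap, and tolerance are illustrative; for logistic loss with $\pm 1$ labels the Hessian does not depend on the labels).

```python
import numpy as np

def hess_vec(w, XS, v):
    # Hessian-vector product of the sub-sampled logistic objective, without forming the matrix
    s = 1.0 / (1.0 + np.exp(-(XS @ w)))
    return XS.T @ ((s * (1.0 - s)) * (XS @ v)) / XS.shape[0]

def newton_cg_direction(w, XS, grad, max_cg_iters=20, tol=1e-6):
    # Approximately solve hess F_S(w) p = -grad with the conjugate gradient method
    p = np.zeros_like(w)
    r = -grad.copy()                  # residual b - H p at p = 0, with b = -grad
    q = r.copy()
    rs_old = r @ r
    for _ in range(max_cg_iters):
        Hq = hess_vec(w, XS, q)
        gamma = rs_old / (q @ Hq)
        p += gamma * q
        r -= gamma * Hq
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        q = r + (rs_new / rs_old) * q
        rs_old = rs_new
    return p
```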

Newton-SGI (stochastic gradient iteration). If we apply the standard gradient method to $q_k(p) = F(w_k) + \nabla F(w_k)^T p + \frac{1}{2} p^T \nabla^2 F(w_k)\, p$, we obtain the iteration $p_k^{i+1} = p_k^i - \nabla q_k(p_k^i) = (I - \nabla^2 F(w_k))\, p_k^i - \nabla F(w_k)$. Consider instead the semi-stochastic gradient iteration: (1) choose an index $j$ at random; (2) $p_k^{i+1} = (I - \nabla^2 F_j(w_k))\, p_k^i - \nabla F(w_k)$, i.e., change the sample Hessian at each inner iteration. This is a gradient method using the exact gradient but an estimate of the Hessian; the method is implicit in Agarwal, Bullins, Hazan (2016).
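
A sketch of the semi-stochastic inner iteration, assuming a per-example Hessian-vector product `hess_i_vec(w, j, v)` $= \nabla^2 F_j(w)\, v$ and the gradient `grad` $= \nabla F(w_k)$ are supplied by the caller (both hypothetical); the plain iteration is contractive only if the per-example Hessians have eigenvalues in $(0, 2)$, so in practice a scaling or step length is introduced.

```python
import numpy as np

def newton_sgi_direction(w, grad, hess_i_vec, n, inner_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    p = np.zeros_like(w)
    for _ in range(inner_iters):
        j = rng.integers(n)                 # fresh random index: change the sample Hessian each inner step
        p = p - hess_i_vec(w, j, p) - grad  # p <- (I - hess F_j(w)) p - grad F(w)
    return p
```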

Comparing Newton-CG and Newton-SGI. Number of Hessian-vector products needed to achieve $\|w_{k+1} - w^*\| \le \tfrac{1}{2}\|w_k - w^*\|$ (*): Newton-SGI, $O\big((\hat\kappa_l^{\max})^2\, \hat\kappa_l \log(\hat\kappa_l)\log(d)\big)$; Newton-CG, $O\big((\hat\kappa_l^{\max})^2\, \sqrt{\hat\kappa^{\max}}\, \log(\hat\kappa_l^{\max})\big)$. The results in Agarwal, Bullins and Hazan (2016) and Xu, Yang, Ré, Roosta-Khorasani, Mahoney (2016) obtain the decrease (*) at each step with probability $1-p$; our results give convergence of the whole sequence in expectation. The complexity bounds are very pessimistic, particularly for CG.

Final Remarks. The search for effective optimization algorithms for machine learning is ongoing, in spite of the present total dominance of the SG method on very large-scale applications. SG does not parallelize well, and it is a first-order method that is affected by ill conditioning. A method that is noisy at the start and gradually becomes more accurate seems attractive.