Stochastic Optimization Methods for Machine Learning. Jorge Nocedal


Stochastic Optimization Methods for Machine Learning. Jorge Nocedal, Northwestern University. SIAM CSE, March 2017.

Collaborators: Richard Byrd (University of Colorado), R. Bollapragada (Northwestern), N. Keskar (Northwestern), D. Mudigere, P. Tang, M. Smelyanskiy (Intel).

Roger Fletcher (1939-2016). His imagination, originality and humility will be an example for future generations.

Initial Remarks
1. Continuous optimization plays a central role in applied mathematics (B. Stoufflet)
2. It also plays a central role in statistics
3. I will talk about optimization algorithms that are good learning algorithms, i.e. ones that generalize well
4. I will illustrate with a concrete example: training deep neural networks
5. I will contrast classical gradient-based methods with the stochastic gradient method
6. For decades, nonlinear optimization research focused on descent methods (line search or trust region). How else can one obtain (deterministic) convergence guarantees?
7. In large-scale machine learning applications, it is best to require only descent in expectation

Initial Remarks
Algorithms whose iterates are random variables and that are allowed to wander around:
1. Perform a more effective exploration of the data
2. A Markov process has been shown to be particularly effective: there is randomness at each iteration, but it is independent of previous decisions
3. Such behavior allows the optimization algorithm to produce solutions (prediction functions) that generalize well
4. This is different from simulated annealing and genetic methods

Deep Neural Networks
They have produced quite a stir since 2011. How they work so well is still not well understood; see Zhang, Bengio, Hardt, Recht, Vinyals (2017).
A highly nonlinear and non-convex predictor, parameterized by w.
Input: images, acoustic frames, text. Output: image classification, speech recognition, translation.

Example: Speech Recognition
Observe features X in acoustic frames; predict the word or sentence, e.g. "FREE SPEECH".

Capacity of Deep Neural Networks
The network can be viewed as a function with great expressive capacity: it can reproduce large classes of functions. For instance, it can be a piecewise polynomial in w of degree 7 in 10 million variables, or it can be seen as a composition of logistic regression units.
When training the DNN: many minimizers, degenerate due to overcapacity.

Our Observations Apply to the Dominant Architectures
Feed-forward, convolutional, and recurrent neural networks.

Next
- Illustrate how two optimization algorithms that give equally good solutions on the training problem produce solutions with different generalization properties
- Discuss the notable properties of the stochastic gradient method and how it dominates the classical gradient method; the bottleneck is parallelism
- Search for new optimization algorithms suited for machine learning: sub-sampled Newton methods

Some Pictures
Training deep neural networks with:
- the stochastic gradient method
- a gradient-based batch method (L-BFGS)

Sharp and Wide Minima (Keskar et al., 2016)
[Figure: training error and testing error along the line from the SG solution to the batch solution. Deep convolutional neural net, CIFAR-10. SG solution: mini-batch of size 256, ADAM optimizer. Batch (full-gradient) solution: batch of 10% of the training set.]

Accuracy: Correct Classification (Keskar et al., 2016)
[Figure: classification accuracy along the line from the SG solution to the batch solution.]

Another Example (Keskar et al., 2016)
[Figure: the same plot for another network, SG solution vs. batch solution.]

What Is Going On?
The (batch) gradient method over-fits. We need to back up:
- define the setting of supervised training
- describe the optimization methods and their properties

Supervised Learning
Given a sizable training set of size n, each example i consists of
x_i: feature information
y_i: correct label

Supervised Learning (continued)
Define a prediction function h that depends on an unknown parameter w,
$h(w; x) = w^T x$ (linear) or $h(w; x)$ nonlinear,
that makes good predictions on unseen data $(\hat{x}, \hat{y})$, i.e. $y = h(w; \hat{x})$ with $y \approx \hat{y}$.
Choose a loss function $\ell(y, \hat{y})$ and solve the optimization problem
$$\min_w \; F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(w; x_i), y_i)$$
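
A minimal Python sketch of this empirical-risk objective. The linear predictor, the squared loss, and the toy data are illustrative assumptions, not from the slides:

```python
import numpy as np

def h(w, x):
    # prediction function h(w; x) = w^T x (the linear case from the slide)
    return w @ x

def loss(y_pred, y):
    # squared-error loss; any differentiable loss l(y_pred, y) fits the same template
    return 0.5 * (y_pred - y) ** 2

def F(w, X, Y):
    # empirical risk: F(w) = (1/n) * sum_i l(h(w; x_i), y_i)
    n = len(Y)
    return sum(loss(h(w, X[i]), Y[i]) for i in range(n)) / n

# toy data (illustrative only): n = 100 examples, d = 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
print("F(0) =", F(np.zeros(5), X, Y))
```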

Loss Functions
Logistic regression: $\ell = \log(1 + \exp(-y \, w^T x))$
For multi-class classification, with C the set of classes:
$$F(w) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(w_{y_i}^T x_i)}{\sum_{j \in C} \exp(w_j^T x_i)}$$
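
As a concrete illustration, a small Python sketch of the multi-class loss above; the variable names and the log-sum-exp stabilization are my own, the slide only gives the formula:

```python
import numpy as np

def multiclass_logistic_loss(W, X, y):
    # W: (C, d) weight vectors w_j; X: (N, d) features; y: (N,) integer labels
    scores = X @ W.T                                     # logits w_j^T x_i, shape (N, C)
    scores = scores - scores.max(axis=1, keepdims=True)  # log-sum-exp stabilization
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # average negative log-likelihood: -(1/N) sum_i log softmax(scores_i)[y_i]
    return -log_probs[np.arange(len(y)), y].mean()
```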

Training Error vs. Testing Error: Learning Algorithms
$(x, y) \in Z$ denotes the input-output pairs, with distribution $P(x, y)$. Define $f(w; x_i, y_i) = \ell(h(w; x_i), y_i)$.
Expected risk: $F(w) = \int f(w; x, y) \, dP(x, y)$
Empirical risk (finite-sum problem): $R_n(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; x_i, y_i) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$,
where $f_i(w) \equiv f(w; x_i, y_i)$ denotes the loss associated with the i-th data point.

Optimization Problem
A speech example (A. Kingsbury): 7 hidden layers, 2000 units per layer, 9000 label classes; 6-44 million total parameters w; 3.6-360 million examples.
The problem
$$\min_w \; F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(w; x_i), y_i)$$
is therefore a formidable optimization problem (training time roughly 2 days to 1 week) and deserves the respect of the CSE audience.

Stochastic Gradient Method
For empirical risk minimization, $R_n(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$:
$$w_{k+1} = w_k - \alpha_k \nabla f_i(w_k), \qquad i \in \{1, \ldots, n\} \text{ chosen at random}$$
A very cheap, noisy iteration: the gradient is taken with respect to just one data point. It is not a gradient descent method; it is a stochastic process dependent on the choice of i.
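
A minimal Python sketch of this iteration. The per-example squared-loss gradient and the constant step size are illustrative assumptions:

```python
import numpy as np

def grad_fi(w, x_i, y_i):
    # gradient of the per-example loss f_i(w) = 0.5 * (w^T x_i - y_i)^2
    return (w @ x_i - y_i) * x_i

def sgd(X, Y, alpha=0.01, num_iters=10_000, seed=0):
    # w_{k+1} = w_k - alpha_k * grad f_i(w_k), with i drawn uniformly at random
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        i = rng.integers(n)          # one data point per iteration: cheap, noisy step
        w -= alpha * grad_fi(w, X[i], Y[i])
    return w
```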

Batch Optimization Methods (SAA)
Gradient method:
$$w_{k+1} = w_k - \alpha_k \nabla R_n(w_k) = w_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} \nabla f_i(w_k)$$
A more expensive but accurate step; one can choose among a wide range of optimization algorithms, with opportunities for parallelism.
Why has SG emerged as the preeminent method?
- Computational trade-offs between stochastic and batch methods
- Ability to generalize: minimize F

Efficiency
[Figure: empirical risk $R_n$ vs. accessed data points (10 epochs) for logistic regression on speech data; curves for SGD and batch L-BFGS. Fast initial progress of SG, followed by a drastic slowdown.]
Can we explain this?

Intuition
SG employs information more efficiently than batch methods.
Argument 1: Suppose the data consists of 10 copies of a set S. An iteration of the batch method is then 10 times more expensive, while SG performs the same computations.

Example by Bertsekas
$R_n(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$
[Figure: one-dimensional example; the minimizers of the individual $f_i$ surround the minimizer of $R_n$ in a "region of confusion".]
Note that this is a geographical argument. Analysis: given $w_k$, what is the expected decrease in the objective function $R_n$ as we choose one of the quadratics randomly?

Computational Complexity
Total work to obtain $R_n(w_k) \le R_n(w_*) + \epsilon$:
- Batch gradient method: $n d \kappa \log(1/\epsilon)$
- Stochastic gradient method: $d \nu \kappa^2 / \epsilon$
Think of $\epsilon = 10^{-3}$. Here n is the number of training points, d the number of variables, and $\kappa$ the condition number. (Srebro et al.; Bottou et al.)
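
To make the two bounds concrete, a back-of-the-envelope comparison in Python; the values of n, d, κ, and ν are made up for illustration and all constants are dropped:

```python
import math

# assumed illustrative values
n, d = 10**6, 10**4      # training points, variables
kappa, nu = 100.0, 1.0   # condition number, gradient-noise constant
eps = 1e-3

batch_work = n * d * kappa * math.log(1 / eps)  # n * d * kappa * log(1/eps)
sg_work = d * nu * kappa**2 / eps               # d * nu * kappa^2 / eps
print(f"batch ~ {batch_work:.1e}, SG ~ {sg_work:.1e}")
# SG's bound is independent of n, which is why it wins at this scale.
```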

A Fundamental Inequality
$$E_k[R_n(w_{k+1}) - R_n(w_k)] \le -\alpha_k \|\nabla R_n(w_k)\|_2^2 + \alpha_k^2 \, E_k\|\nabla f_{i_k}(w_k)\|^2$$
Initially the gradient-decrease term dominates; then the variance in the gradient hinders progress. To ensure convergence, $\alpha_k \to 0$ in the SG method to control the noise. Variance reduction methods directly control the noise given in the last term. The variant in which $\alpha_k = \alpha$ is constant has also been thoroughly studied and yields convergence to a neighborhood of the solution at a linear rate. Bottou, Curtis, Nocedal (2016), prepared for SIAM Review.

Learning Algorithms
Let us look more closely at the training of deep neural networks. It has been known for a long time that batch methods are inferior.

Accuracy Is Lost with an Increase in Batch Size
ADAM optimizer: mini-batch of 256 (small batch) vs. 10% of the training set (large batch). Six network configurations were studied.
[Figure: testing accuracy for the six configurations.]

Training and Testing Accuracy
SB: small batch; LB: large batch. No problems in training!

Network configurations

Sharp and Flat Minima (Keskar et al., 2016)
[Figure: observing R along the line from the SG solution to the batch solution, following Goodfellow et al. Deep convolutional neural net, CIFAR-10. SG solution: mini-batch of size 256, ADAM optimizer. Batch solution: batch of 10% of the training set.]

Sharp Minima: the Issue!
[Figure: schematic of a flat minimum (small-batch solution) and a sharp minimum (large-batch solution).]

Testing Accuracy and Sharpness (Keskar et al., 2016)
[Figures: sharpness of the minimizer vs. batch size, and testing accuracy vs. batch size. Sharpness: the maximum of R in a small box around the minimizer.]

Sharpness Metric
Given a minimizer $w_*$ and a box $B$ of width $\epsilon$ centered at $w_*$, we define the sharpness of $w_*$ as
$$\frac{\max_{\Delta w \in B} f(w_* + \Delta w) - f(w_*)}{1 + f(w_*)}$$
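
A rough Python sketch of estimating this metric. The slides (and Keskar et al.) solve a maximization subproblem over the box; uniform random probing here is a simplification I am assuming for illustration:

```python
import numpy as np

def sharpness(f, w_star, eps=1e-3, num_samples=1000, seed=0):
    # estimate (max_{dw in B} f(w* + dw) - f(w*)) / (1 + f(w*)), B a box of width eps
    rng = np.random.default_rng(seed)
    f_star = f(w_star)
    worst = f_star
    for _ in range(num_samples):
        dw = rng.uniform(-eps / 2, eps / 2, size=w_star.shape)  # random point in the box
        worst = max(worst, f(w_star + dw))
    return (worst - f_star) / (1 + f_star)
```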

Robust Optimization View

Hot starts of the LB method

Can one understand the reasons for convergence to sharp minima?
What is the relative frequency of flat and sharp minima?
Is there anything inherent to the architecture of the network that causes this behavior?
Can one steer the training method away from sharp minima?
Is a robust formulation feasible?

Drawback of the SG method: distributed computing

Part II: Beyond the Stochastic Gradient Method
Let's consider an algorithm that is more general.

Subsampled Newton Methods
$R_n(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$
Choose $S \subset \{1, \ldots, n\}$ and $X \subset \{1, \ldots, n\}$ uniformly and independently, and iterate
$$\nabla^2 F_S(w_k) \, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k p$$
with the sub-sampled gradient and Hessian
$$\nabla F_X(w_k) = \frac{1}{|X|} \sum_{i \in X} \nabla f_i(w_k), \qquad \nabla^2 F_S(w_k) = \frac{1}{|S|} \sum_{i \in S} \nabla^2 f_i(w_k)$$
The focus is on the true objective, the expected risk F. The stochastic nature of the objective creates opportunities: coordinate the Hessian sample S and the gradient sample X for optimal complexity.
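
A minimal Python sketch of one subsampled Newton iteration for a least-squares model. The per-example gradient/Hessian, the sample sizes, the damping, and the direct linear solve are illustrative assumptions; in practice one uses Hessian-vector products and an iterative solver, as discussed below:

```python
import numpy as np

def subsampled_newton_step(w, X, Y, grad_size=256, hess_size=32, alpha=1.0, rng=None):
    # One step for the least-squares loss f_i(w) = 0.5 * (x_i^T w - y_i)^2, so that
    # grad f_i = (x_i^T w - y_i) x_i and hess f_i = x_i x_i^T.
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    idx_g = rng.choice(n, size=grad_size, replace=False)  # gradient sample X_k
    idx_h = rng.choice(n, size=hess_size, replace=False)  # independent Hessian sample S_k
    Xg, Yg = X[idx_g], Y[idx_g]
    g = Xg.T @ (Xg @ w - Yg) / grad_size                  # subsampled gradient
    Xh = X[idx_h]
    H = Xh.T @ Xh / hess_size + 1e-6 * np.eye(d)          # subsampled (damped) Hessian
    p = np.linalg.solve(H, -g)                            # Newton system: H_S p = -g_X
    return w + alpha * p
```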

An Active Research Area
- Friedlander and Schmidt (2011)
- Byrd, Chin, Neveitt, N. (2011)
- Erdogdu and Montanari (2015)
- Roosta-Khorasani and Mahoney (2016)
- Agarwal, Bullins and Hazan (2016)
- Pilanci and Wainwright (2015)
- Pasupathy, Glynn, Ghosh, Hashemi (2015)
- Xu, Yang, Roosta-Khorasani, Re, Mahoney (2016)
- Byrd, Bollapragada, N. (2016)

Linear Convergence
Focus on the expected risk F:
$$\nabla^2 F_{S_k}(w_k) \, p = -\nabla F_{X_k}(w_k), \qquad w_{k+1} = w_k + \alpha p$$
The following result is well known for a strongly convex objective.
Theorem (Byrd, Chin, N., Wu, 2012): Under standard assumptions, if
a) $\alpha = \mu / L$
b) $|S_k|$ = constant
c) $|X_k| = \eta^k$, $\eta > 1$ (geometric growth)
then $E[\|w_k - w_*\|] \to 0$ at a linear rate, and the work complexity matches that of the stochastic gradient method.
Here $\mu$ is the smallest eigenvalue of any subsampled Hessian and L is the largest eigenvalue of the Hessian of F.

Local Superlinear Convergence
We can show the linear-quadratic result
$$E_k[\|w_{k+1} - w_*\|] \le C_1 \|w_k - w_*\|^2 + \frac{\sigma}{\mu \sqrt{|S_k|}} \|w_k - w_*\| + \frac{v}{\mu \sqrt{|X_k|}}$$
To obtain superlinear convergence:
i) $|S_k| \to \infty$
ii) $|X_k|$ must increase faster than geometrically

Formal Superlinear Convergence
Theorem: Under the conditions just stated, there is a neighborhood of $w_*$ such that for the sub-sampled Newton method with $\alpha_k = 1$, $E[\|w_k - w_*\|] \to 0$ superlinearly.

Inexact Methods: Which Iterative Solver to Use?
$$\nabla^2 F_S(w_k) \, p = -\nabla F_X(w_k), \qquad w_{k+1} = w_k + \alpha_k p$$
1. Inexact methods: conjugate gradient or stochastic gradient applied to the Newton system
2. Both require only Hessian-vector products
Newton-CG chooses a fixed sample S and applies CG to
$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2} p^T \nabla^2 F_S(w_k) \, p$$
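
A minimal sketch of the CG piece using only Hessian-vector products. The least-squares Hessian-vector product and the fixed iteration/tolerance settings are illustrative assumptions:

```python
import numpy as np

def cg(hess_vec, g, max_iter=10, tol=1e-8):
    # Conjugate gradient for H p = -g, touching H only through hess_vec(v) = H v.
    p = np.zeros_like(g)
    r = -g.copy()                 # residual of H p = -g at p = 0
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hd = hess_vec(d)
        step = rs / (d @ Hd)
        p += step * d
        r -= step * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

def make_hess_vec(X_S):
    # Hessian-vector product for the least-squares loss on a fixed sample S:
    # H_S v = (1/|S|) X_S^T (X_S v); the full matrix is never formed.
    return lambda v: X_S.T @ (X_S @ v) / X_S.shape[0]
```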

Newton-SGI (Stochastic Gradient Iteration)
If we apply the standard gradient method to
$$q_k(p) = F(w_k) + \nabla F(w_k)^T p + \tfrac{1}{2} p^T \nabla^2 F(w_k) \, p$$
we obtain the iteration
$$p^k_{i+1} = p^k_i - \nabla q_k(p^k_i) = (I - \nabla^2 F(w_k)) \, p^k_i - \nabla F(w_k)$$
Consider instead the semi-stochastic gradient iteration:
1. Choose an index j at random;
2. $p^k_{i+1} = (I - \nabla^2 F_j(w_k)) \, p^k_i - \nabla F(w_k)$
i.e., change the sampled Hessian at each inner iteration. This method is implicit in Agarwal, Bullins, Hazan (2016).
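
A minimal sketch of this semi-stochastic inner iteration, again for a least-squares loss so that $\nabla^2 f_j(w) = x_j x_j^T$. The inner iteration count, the unit inner step, and the assumption that the data is scaled so the recursion does not diverge are all illustrative:

```python
import numpy as np

def newton_sgi_direction(grad, X, num_inner=100, rng=None):
    # Inner loop p_{i+1} = (I - H_j) p_i - grad, with H_j = x_j x_j^T resampled
    # at every inner iteration (least-squares loss; assumes suitably scaled data).
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    p = np.zeros(d)
    for _ in range(num_inner):
        j = rng.integers(n)                  # fresh index for the Hessian estimate
        x_j = X[j]
        p = p - (x_j @ p) * x_j - grad       # (I - x_j x_j^T) p - grad
    return p
```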

Comparing Newton-CG and Newton-SGI
Number of Hessian-vector products to achieve $\|w_{k+1} - w_*\| \le \tfrac{1}{2} \|w_k - w_*\|$ (*):
- Newton-SGI: $O\big((\hat\kappa^{\max}_l)^2 \, \hat\kappa_l \log(\hat\kappa_l) \log(d)\big)$ (Agarwal, Bullins and Hazan (2016); Xu, Yang, Re, Roosta-Khorasani, Mahoney (2016))
- Newton-CG: $O\big((\hat\kappa^{\max}_l)^2 \sqrt{\hat\kappa^{\max}} \log(\hat\kappa^{\max}_l)\big)$ (Byrd, Bollapragada, N. (2016))
The decrease (*) is obtained at each step with probability 1-p; our results give convergence of the whole sequence in expectation. These complexity bounds are very pessimistic, particularly for CG.

Final Remarks
- The search for effective optimization algorithms for machine learning is ongoing, in spite of the total dominance of the SG method at present on very large-scale applications
- SG does not parallelize well
- SG is a first-order method and is affected by ill conditioning
- SG is too noisy when high accuracy is desired
- A method that is noisy at the start and gradually becomes more accurate seems attractive
- Generalization properties are vital