Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
J. Nocedal, with N. Keskar (Northwestern University), D. Mudigere (Intel), P. Tang (Intel), and M. Smelyanskiy (Intel)

Initial Remarks
- SGD (and its variants) is the method of choice.
- We take another look at batch methods for training DNNs, because they have the potential to parallelize.
- It is widely accepted that batch methods overfit; we revisit this in the non-convex case of DNNs with multiple minimizers.
- We performed an exploration using ADAM in which the gradient sample is increased from the stochastic to the batch regime.
- Methods were run until no measurable progress is made in training.
- Does the batch method converge to a shallower minimizer?

Testing accuracy is lost with an increase in batch size
ADAM optimizer: mini-batch of 256 (small batch) vs. 10% of the training set (large batch). This behavior has been observed by others. We studied 6 network configurations.
[Figure: testing accuracy for the six network configurations; a sketch of this comparison follows below.]
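A minimal sketch of this kind of comparison, assuming PyTorch and two hypothetical placeholders, a model factory make_model() and a dataset train_set; it trains the same architecture with ADAM twice, once with a mini-batch of 256 and once with a batch of 10% of the training set:

```python
# Sketch of the small-batch vs. large-batch comparison.
# Assumes hypothetical helpers make_model() and train_set.
import torch
from torch.utils.data import DataLoader

def train_with_batch_size(batch_size, epochs=100):
    model = make_model()                      # hypothetical model factory
    opt = torch.optim.Adam(model.parameters())
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):                   # "until no measurable progress in training"
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

sb_model = train_with_batch_size(256)                   # small batch (SB)
lb_model = train_with_batch_size(len(train_set) // 10)  # large batch (LB): 10% of training set
```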

Training and Testing Accuracy (SB: small batch, LB: large batch)
No problems in training!
[Figure: training and testing accuracy for SB and LB.]

Network configurations
[Table of the network configurations studied.]

Early stopping would not help large-batch methods
[Figure: testing error for the large-batch method.]
Batch methods somehow do not employ information improperly; this remains to be described mathematically in this context!

Methods converge to different types of minimizers
Next: plot the geometry of the loss function along the line joining the small-batch solution and the large-batch solution (in the spirit of Goodfellow et al.). Plot the loss and test functions; a sketch of this interpolation follows below.
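One way to produce such a plot is to evaluate the loss at convex combinations w(alpha) = (1 - alpha) * w_SB + alpha * w_LB. A minimal sketch, assuming the sb_model and lb_model from the previous sketch plus hypothetical helpers evaluate_loss(model, loader), train_loader, and test_loader:

```python
# Parametric plot along the line joining the SB and LB solutions:
#   w(alpha) = (1 - alpha) * w_SB + alpha * w_LB
import copy
import torch

def loss_along_line(sb_model, lb_model, loader, alphas):
    sb_params = [p.detach().clone() for p in sb_model.parameters()]
    lb_params = [p.detach().clone() for p in lb_model.parameters()]
    probe = copy.deepcopy(sb_model)
    losses = []
    for alpha in alphas:
        with torch.no_grad():
            for p, ps, pl in zip(probe.parameters(), sb_params, lb_params):
                p.copy_((1 - alpha) * ps + alpha * pl)
        losses.append(evaluate_loss(probe, loader))   # hypothetical helper
    return losses

alphas = [a / 20 for a in range(-5, 26)]   # extend a bit beyond both endpoints
train_curve = loss_along_line(sb_model, lb_model, train_loader, alphas)
test_curve = loss_along_line(sb_model, lb_model, test_loader, alphas)
```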

What is going on? The gradient (batch) method over-fits.
We need to back up and define the setting of supervised training.
[Figure: convolutional neural net on CIFAR-10 (LeCun, private communication); small-batch solution: SG with mini-batch of size 256; large-batch solution: batch of 10% of the training set.]

Accuracy: correct classification
[Figure: accuracy along the line between the SG (small-batch) solution and the batch (large-batch) solution.]

Combined
[Figure: loss and accuracy along the same line, combined in one plot.]

We observe this over and over. Has this been observed by others?
Hochreiter and Schmidhuber, "Flat Minima," 1997.

What are Sharp and Wide Minima?
Volume. Free energy. Robust solution. Instead we use the following: given a parameter $w^*$ and a box $B$ of width $\epsilon$ centered at $w^*$, we define the sharpness of $w^*$ as
$$\max_{\Delta w \in B} \; \frac{f(w^* + \Delta w) - f(w^*)}{1 + f(w^*)}.$$
1. Maximum sensitivity
2. Observed sharp solutions are wide in most of the space
3. Computed with an optimization solver (inexactly)
4. Verified through random sampling
5. Also minimized/sampled in random subspaces
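The exact maximization over the box B is done inexactly with an optimization solver; the random-sampling verification (in the full space or in a random subspace) can be sketched as follows, assuming a trained PyTorch model at w* and the hypothetical evaluate_loss(model, loader) helper from the earlier sketch:

```python
# Random-sampling estimate of the sharpness ratio
#   max_{dw in B} ( f(w* + dw) - f(w*) ) / ( 1 + f(w*) ),
# where B is a box of width eps around w*.
import copy
import torch

def estimate_sharpness(model, loader, eps=1e-3, n_samples=100, subspace_dim=None):
    base_loss = evaluate_loss(model, loader)              # f(w*)
    params = [p.detach().clone() for p in model.parameters()]
    n = sum(p.numel() for p in params)
    A = torch.randn(n, subspace_dim) if subspace_dim else None  # optional random subspace
    probe = copy.deepcopy(model)
    worst = 0.0
    for _ in range(n_samples):
        if A is None:
            dw = (2 * torch.rand(n) - 1) * eps             # uniform sample in the box B
        else:
            z = (2 * torch.rand(subspace_dim) - 1) * eps
            dw = A @ z                                      # perturbation restricted to the subspace
        offset = 0
        with torch.no_grad():
            for p, p0 in zip(probe.parameters(), params):
                k = p0.numel()
                p.copy_(p0 + dw[offset:offset + k].view_as(p0))
                offset += k
        worst = max(worst, (evaluate_loss(probe, loader) - base_loss) / (1 + base_loss))
    return worst
```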

Sharpness: small-batch solution (SB) vs. large-batch solution (LB)
[Table: measured sharpness of the small-batch and large-batch solutions.]

Sampling in a subspace

Sharp and Wide Minima: an illusion?
1. It is tempting to conclude that convergence to sharp minima explains why batch methods do not generalize well.
2. Perturbation analysis in parameter space refers to the training problem.
3. But the geometry of the loss function depends on the basis used in parameter space; one can alter it in various ways without changing prediction capability.
4. Dinh et al. (2017), "Sharp Minima Can Generalize": construct two identical predictors, one at a sharp minimum, the other not (illustrated by rescaling weights as $\frac{1}{\alpha}w$ and $\alpha w$; see the sketch below).
5. Neyshabur et al.: Path-SGD (2015).
6. Chaudhari et al.: "Entropy-SGD: Biasing gradient descent into wide valleys" (2016).
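A small numerical illustration of this rescaling point for a two-layer ReLU network: replacing (W1, W2) by (alpha * W1, W2 / alpha) leaves the predictor unchanged, while the parameter-space geometry around the minimum (and hence any box-based sharpness measure) changes with alpha. This is a self-contained sketch, not the construction of Dinh et al. itself:

```python
# ReLU positive homogeneity: (W1, W2) -> (alpha*W1, W2/alpha) defines the
# same predictor, yet the loss geometry around the two parameter vectors differs.
import torch

torch.manual_seed(0)
W1 = torch.randn(64, 32)   # first-layer weights
W2 = torch.randn(10, 64)   # second-layer weights
x = torch.randn(5, 32)     # a few random inputs

def predict(W1, W2, x):
    return torch.relu(x @ W1.t()) @ W2.t()

alpha = 10.0
y_original = predict(W1, W2, x)
y_rescaled = predict(alpha * W1, W2 / alpha, x)

# Identical predictions up to floating-point error, despite very different
# parameter-space neighborhoods.
print(torch.allclose(y_original, y_rescaled, rtol=1e-4, atol=1e-4))
```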

Nevertheless, our observations require an explanation
1. Sharpness grows as the batch-optimization iterations progress.
2. Controlled experiments: start with SGD and switch to batch; the method can get trapped in sharp minima (a sketch of such an experiment follows below).
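A minimal sketch of such a controlled experiment, reusing the hypothetical make_model(), train_set, and estimate_sharpness() from the earlier sketches (learning rates and epoch counts are placeholders): train with small-batch SGD for a while, then hand the same parameters to a large-batch run and continue.

```python
# Controlled experiment sketch: warm-start a large-batch run from a
# small-batch SGD run and watch whether it gets trapped in a sharp minimum.
import torch
from torch.utils.data import DataLoader

model = make_model()                      # hypothetical model factory
loss_fn = torch.nn.CrossEntropyLoss()

def run(optimizer, batch_size, epochs):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

# Phase 1: small-batch SGD.
run(torch.optim.SGD(model.parameters(), lr=0.1), batch_size=256, epochs=20)

# Phase 2: switch to the large-batch regime on the same parameters.
run(torch.optim.SGD(model.parameters(), lr=0.1), batch_size=len(train_set) // 10, epochs=80)

# Periodically calling estimate_sharpness(model, ...) during phase 2 tracks
# how sharpness evolves as the batch iterations progress.
```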

Remarks
- Convergence to sharp/wide minima seems to be persistent.
- A plausible cause: the effect of noise in SGD and the fact that the steplength is selected to give good testing error (noise adjustment).
- But it is not clear how to properly define sharp/wide minima so that they relate to generalization.
- We need a mathematical explanation of the generalization properties of batch methods in the context of DNNs (not the convex case),
- and of the convergence of the optimization on the training functions.
- A batch method with good generalization properties could make use of parallel platforms.