Advanced Training Techniques. Prajit Ramachandran

Outline Optimization Regularization Initialization

Optimization

Optimization Outline Gradient Descent Momentum RMSProp Adam Distributed SGD Gradient Noise

Gradient Descent Goal: optimize parameters to minimize loss Step along the direction of steepest descent (negative gradient)

Gradient Descent (figure from Andrew Ng's Machine Learning Course)

https://www.cs.cmu.edu/~ggordon/10725-f12/slides/05-gd-revisited.pdf

2nd order Taylor series approximation around x:

f(y) ≈ f(x) + ∇f(x)ᵀ(y − x) + ½ (y − x)ᵀ H (y − x)

The Hessian H measures curvature (figure: Wikipedia).

Approximate the Hessian with a scaled identity, H ≈ (1/η) I:

f(y) ≈ f(x) + ∇f(x)ᵀ(y − x) + (1/(2η)) ‖y − x‖²

Set the gradient of this function to 0 to get the minimum. Take the gradient:

∇f(x) + (1/η)(y − x) = 0

Solve for y:

y = x − η ∇f(x)

Same equation as the gradient descent update!

Computing the gradient Use backpropagation to compute gradients efficiently Need a differentiable function Can't use functions like argmax or hard binarization, unless using a different way to compute gradients

Stochastic Gradient Descent Gradient over entire dataset is impractical Better to take quick, noisy steps Estimate gradient over a mini-batch of examples
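A minimal sketch of this loop in numpy, assuming a hypothetical `grad_fn(params, batch)` that returns the gradient of the loss on a mini-batch:

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.1, batch_size=64, epochs=10):
    """Mini-batch SGD: quick, noisy steps from shuffled mini-batches."""
    n = len(data)
    for _ in range(epochs):
        perm = np.random.permutation(n)           # reshuffle every epoch
        for i in range(0, n, batch_size):
            batch = data[perm[i:i + batch_size]]  # data: numpy array of examples
            params = params - lr * grad_fn(params, batch)
    return params
```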

Mini-batch tips Use as large a batch as possible Increasing the batch size on a GPU is essentially free up to a point Crank up the learning rate when increasing the batch size Trick: use small batches for small datasets

How to pick the learning rate?

Too big a learning rate: https://www.cs.cmu.edu/~ggordon/10725-f12/slides/05-gd-revisited.pdf

Too small a learning rate: https://www.cs.cmu.edu/~ggordon/10725-f12/slides/05-gd-revisited.pdf

How to pick the learning rate? Too big = diverge, too small = slow convergence No one learning rate to rule them all Start from a high value and keep cutting by half if model diverges Learning rate schedule: decay learning rate over time
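One way to realize the "keep cutting by half" and decay-schedule advice above; the halving factor and decay interval here are illustrative assumptions:

```python
def step_decay(base_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Step decay: cut the learning rate by `drop` every `epochs_per_drop` epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))

# base_lr = 0.1 -> 0.1 for epochs 0-9, 0.05 for 10-19, 0.025 for 20-29, ...
```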

http://cs231n.github.io/assets/nn3/learningrates.jpeg

Optimization Outline Gradient Descent Momentum RMSProp Adam Distributed SGD Gradient Noise

What will SGD do? Start

Zig-zagging http://dsdeepdive.blogspot.com/2016/03/optimizations-of-gradient-descent.html

What we would like Avoid sliding back and forth along directions of high curvature Go fast along the consistent direction

Momentum update: v ← μv − η∇f(x), x ← x + v. The gradient term is the same as in vanilla gradient descent.

SGD with Momentum Move faster in directions with consistent gradient Damps oscillating gradients in directions of high curvature Friction / momentum hyperparameter μ typically set to {0.50, 0.90, 0.99} Nesterov's Accelerated Gradient is a variant
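A sketch of one common formulation of the momentum update (sign conventions and exact forms vary between libraries):

```python
def momentum_step(params, grad, velocity, lr=0.01, mu=0.9):
    """Classical momentum: accumulate a velocity vector and step along it."""
    velocity = mu * velocity - lr * grad  # consistent directions build up speed,
    params = params + velocity            # oscillating directions cancel out
    return params, velocity
```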

Momentum Cancels out oscillation Gathers speed in direction that matters

Per parameter learning rate Gradients of different layers have different magnitudes Different units have different firing rates Want different learning rates for different parameters Infeasible to set all of them by hand

Adagrad Gradient update depends on the history of gradient magnitudes Parameters with small / sparse updates get larger learning rates The square root is important for good performance More tolerance for the choice of learning rate Duchi et al. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
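A sketch of the Adagrad update, keeping a running sum of squared gradients per parameter (epsilon is a small constant for numerical stability):

```python
import numpy as np

def adagrad_step(params, grad, hist, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter learning rates scaled by the gradient history."""
    hist = hist + grad ** 2                              # full history of squared gradients
    params = params - lr * grad / (np.sqrt(hist) + eps)  # the square root matters in practice
    return params, hist
```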

What happens as t increases?

Adagrad learning rate goes to 0 Maintains the entire history of gradients The sum of squared gradients is always increasing Forces the learning rate to 0 over time Hard to compensate for in advance

Don't maintain all history Monotonically increasing because we hold all the history Instead, forget gradients far in the past In practice, downweight previous gradients exponentially

RMSProp Only cares about recent gradients Good property because optimization landscape changes Otherwise like Adagrad Standard gamma is 0.9 Hinton et al. 2012, http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
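A sketch of RMSProp, which swaps Adagrad's full history for an exponential moving average (gamma = 0.9 as on the slide):

```python
import numpy as np

def rmsprop_step(params, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: exponentially decayed average of squared gradients."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2     # forget old gradients
    params = params - lr * grad / (np.sqrt(avg_sq) + eps)
    return params, avg_sq
```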

[Momentum and RMSProp update rules, side by side]

Adam Essentially, combine RMSProp and Momentum Includes bias correction term from initializing m and v to 0 Default parameters are surprisingly good Trick: learning rate decay still helps Trick: Adam first then SGD
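A sketch of the Adam update combining the two ideas, including the bias-correction terms (defaults follow the paper):

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment + RMSProp-style second moment."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)              # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```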

What to use SGD + momentum and Adam are good first steps Just use default parameters for Adam Learning rate decay always good

[Figure: optimizer comparison animations by Alec Radford]

Optimization Outline Gradient Descent Momentum RMSProp Adam Distributed SGD Gradient Noise

How to scale beyond 1 GPU: Model parallelism: partition model across multiple GPUs Dean et al. 2012. Large scale distributed deep learning.

Hogwild! Lock-free update of parameters across multiple threads Fast for sparse updates Surprisingly can work for dense updates Niu et al. 2011. Hogwild! A lock-free approach to parallelizing stochastic gradient descent

Data Parallelism and Async SGD Dean et al. 2012. Large scale distributed deep learning.

Async SGD Trivial to scale up Robust to individual worker failures Equally partition variables across parameter server shards Trick: at the start of training, slowly add more workers

Stale Gradients These are gradients computed for a stale copy of the parameters, not the current parameters

Stale Gradients Each worker has a different copy of the parameters Old gradients are used to update new parameters Staleness grows as more workers are added Hack: reject gradients that are too stale

Sync SGD Wait for all gradients before update Chen et al. 2016. Revisiting Distributed Synchronous SGD

https://github.com/tensorflow/mod els/tree/master/inception

Sync SGD Equivalent to increasing the batch size N times, but faster Crank up the learning rate Problem: have to wait for the slowest worker Solution: add extra backup workers, and update when N gradients are received Chen et al. 2016. Revisiting Distributed Synchronous SGD

https://research.googleblog.com/2016/04/announcing-tensorflow-08-now-with.html

Gradient Noise Add Gaussian noise to each gradient Can be a savior for exotic models Neelakantan et al. 2016 Adding gradient noise improves learning for very deep networks Anandkumar and Ge, 2016 Efficient approaches for escaping higher order saddle points in non-convex optimization http://www.offconvex.org/2016/03/22/saddlepoints/
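A sketch of annealed Gaussian gradient noise in the spirit of Neelakantan et al.; the decay schedule sigma² = eta / (1 + t)^gamma and the constants shown follow the paper's suggested ranges but should be treated as assumptions:

```python
import numpy as np

def noisy_grad(grad, t, eta=0.3, gamma=0.55):
    """Add zero-mean Gaussian noise whose variance decays with the step count t."""
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)
    return grad + np.random.normal(0.0, sigma, size=grad.shape)
```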

Regularization

Regularization Outline Early stopping L1 / L2 Auxiliary classifiers Penalizing confident output distributions Dropout Batch normalization + variants

Early stopping [figure: Stephen Marsland]

L1 / L2 regularization

L1 / L2 regularization L1 encourages sparsity L2 discourages large weights (equivalent to a Gaussian prior on the weights)
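A sketch of how the two penalties show up in the gradient of the regularized loss (the coefficients are illustrative):

```python
import numpy as np

def regularized_grad(grad, weights, l1=0.0, l2=1e-4):
    """Gradient of loss + l1*|w| + l2*w^2: L1 pushes weights to exactly zero,
    L2 (weight decay) shrinks large weights."""
    return grad + l1 * np.sign(weights) + 2.0 * l2 * weights
```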

Auxiliary Classifiers Lee et al. 2015. Deeply Supervised Nets

Regularization Outline Early stopping L1 / L2 Auxiliary classifiers Penalizing confident output distributions Dropout Batch normalization + variants

Penalizing confident distributions Do not want an overconfident model Prefer a smoother output distribution Entropy is invariant to model parameterization (1) Train towards a smoother target distribution (label smoothing) (2) Penalize low-entropy (confident) output distributions Pereyra et al. 2017 Regularizing neural networks by penalizing confident output distributions
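A sketch of both options, assuming the model outputs a probability vector `probs`; the smoothing amount epsilon and penalty weight beta are illustrative:

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """(1) Mix the one-hot target with the uniform distribution."""
    k = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / k

def confidence_penalty(probs, beta=0.1, eps=1e-12):
    """(2) Term added to the loss that punishes low-entropy (confident) outputs."""
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    return -beta * entropy.mean()
```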

[Figure: the one-hot true label, the baseline target, and the mixed (label-smoothed) target] Szegedy et al. 2015 Rethinking the Inception architecture for computer vision

When is the uniform distribution a good choice of smoothing target? When is it a bad one? Szegedy et al. 2015 Rethinking the Inception architecture for computer vision

Enforce the entropy to be above some threshold [figure: entropy of a Bernoulli distribution as a function of p]

Regularization Outline Early stopping L1 / L2 Auxiliary classifiers Penalizing confident output distributions Dropout Batch normalization + variants

Dropout Srivastava et al. 2014. Dropout: a simple way to prevent neural networks from overfitting

Dropout Complex co-adaptations probably do not generalize Forces hidden units to derive useful features on their own Sampling from 2^n possible related networks Srivastava et al. 2014. Dropout: a simple way to prevent neural networks from overfitting
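A sketch of inverted dropout on a layer's activations; rescaling by the keep probability at training time means the layer can be used unchanged at test time:

```python
import numpy as np

def dropout(x, keep_prob=0.5, training=True):
    """Inverted dropout: randomly zero units and rescale the survivors."""
    if not training:
        return x  # identity at test time
    mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
    return x * mask
```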

Srivastava et al. 2014. Dropout: a simple way to prevent neural networks from overfitting

Bayesian interpretation of dropout Variational inference for Gaussian processes Monte Carlo integration over GP posterior http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html

Dropout for RNNs Can apply dropout to the layer-to-layer connections as normal Recurrent connections use the same dropout mask over time Or drop out a specific portion of the recurrent cell Zaremba et al. 2014. Recurrent neural network regularization Gal 2015. A theoretically grounded application of dropout in recurrent neural networks Semeniuta et al. 2016. Recurrent dropout without memory loss

Regularization Outline Early stopping L1 / L2 Auxiliary classifiers Penalizing confident output distributions Dropout Batch normalization + variants

Internal Covariate Shift Distribution of inputs to a layer is changing during training Harder to train: smaller learning rate, careful initialization Easier if distribution of inputs stayed same How to enforce same distribution? Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift

Fighting internal covariate shift Whitening would be a good first step Would remove nasty correlations Problems with whitening? Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift

Problems with whitening Slow (have to do PCA for every layer) Cannot backprop through whitening Next best alternative? Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift

Normalization Make mean = 0 and standard deviation = 1 Doesn t eliminate correlations Fast and can backprop through it How to compute the statistics? Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift

How to compute the statistics Going over the entire dataset is too slow Idea: the batch is an approximation of the dataset Compute statistics over the batch Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift

For a mini-batch of size m: mean μ = (1/m) Σᵢ xᵢ, variance σ² = (1/m) Σᵢ (xᵢ − μ)², normalize x̂ᵢ = (xᵢ − μ) / √(σ² + ε) Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift
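A minimal sketch of the batch statistics and normalization; `gamma` and `beta` are the learned scale and shift discussed on the following slides:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta, mean, var   # return stats for running averages
```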

Distribution of an activation before normalization

Distribution of an activation after normalization

Not all distributions should be normalized A rare feature should not be forced to fire 50% of the time Let the model decide how the distribution should look, via a learned scale γ and shift β (y = γx̂ + β) It can even undo the normalization if needed

Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift

Test time batch normalization Want deterministic inference Different test batches will give different results Solution: precompute mean and variance on training set and use for inference Practically: maintain running average of statistics during training Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift
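A sketch of the running-average bookkeeping during training and the deterministic statistics used at inference; the momentum value is an assumption:

```python
import numpy as np

def update_running_stats(run_mean, run_var, batch_mean, batch_var, momentum=0.99):
    """Exponential moving averages of the batch statistics, updated during training."""
    run_mean = momentum * run_mean + (1 - momentum) * batch_mean
    run_var = momentum * run_var + (1 - momentum) * batch_var
    return run_mean, run_var

def batch_norm_eval(x, gamma, beta, run_mean, run_var, eps=1e-5):
    """Deterministic inference: normalize with the precomputed statistics."""
    return gamma * (x - run_mean) / np.sqrt(run_var + eps) + beta
```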

Advantages Enables higher learning rate by stabilizing gradients More resilient to parameter scale Regularizes model, making dropout unnecessary Most SOTA CNN models use BN Ioffe and Szegedy, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift

Batch Norm for RNNs? Naive application doesn't work Compute different statistics for different time steps? Ideally should be able to reuse existing architectures like LSTM Laurent et al. 2015 Batch normalized recurrent neural networks

Cooijmans et al. 2016. Recurrent Batch Normalization

Cooijmans et al. 2016. Recurrent Batch Normalization

Recurrent Batch Normalization Maintain independent statistics for the first T steps t > T uses the statistics from time T Have to initialize γ to ~0.1 Cooijmans et al. 2016. Recurrent Batch Normalization

[Figure: Batch Normalization normalizes each hidden unit across the batch dimension; Layer Normalization normalizes each example across the hidden dimension] Ba et al. 2016. Layer Normalization

Advantages of LayerNorm Don't have to worry about normalizing across time Don't have to worry about batch size
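A sketch of layer normalization: statistics are computed over the hidden units of each example, so nothing depends on the batch size or the time step:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its feature dimension, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```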

Practical tips for regularization Batch normalization for feedforward structures Dropout still gives good performance for RNNs Entropy regularization is good for reinforcement learning Don't go crazy with regularization

Initialization

Initialization Outline Basic initialization Smarter initialization schemes Pretraining

Baseline Initialization Weights cannot be initialized to the same value because all the gradients would be the same Instead, draw from some distribution Uniform from [-0.1, 0.1] is a reasonable starting point Biases may need special constant initialization
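A sketch of this baseline scheme; the uniform range and zero biases follow the slide, anything else is an assumption:

```python
import numpy as np

def uniform_init(n_in, n_out, scale=0.1):
    """Small random weights break symmetry; biases start at a constant."""
    w = np.random.uniform(-scale, scale, size=(n_in, n_out))
    b = np.zeros(n_out)
    return w, b
```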

Initialization Outline Basic initialization Smarter initialization schemes Pretraining

He initialization for ReLU networks Call variance of input Var[y0] and of last layer activations Var[yL] If Var[yL] >> Var[y0]? If Var[yL] << Var[y0]? He et al. 2015. Delving deep into rectifiers: surpassing human level performance on ImageNet classification

He initialization for ReLU networks Call the variance of the input Var[y0] and of the last layer's activations Var[yL] If Var[yL] >> Var[y0], activations explode and training diverges If Var[yL] << Var[y0], activations shrink and gradients vanish Key idea: keep Var[yL] = Var[y0] He et al. 2015. Delving deep into rectifiers: surpassing human level performance on ImageNet classification

He initialization: Var[w] = 2 / n, where n is the number of inputs to the neuron He et al. 2015. Delving deep into rectifiers: surpassing human level performance on ImageNet classification
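A sketch of the fan-in variant of He initialization for a fully connected ReLU layer:

```python
import numpy as np

def he_init(n_in, n_out):
    """Draw weights from N(0, 2 / n_in) so that activation variance is preserved."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```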

Identity RNN Basic RNN with ReLU as nonlinearity (instead of tanh) Initialize hidden-to-hidden matrix to identity matrix Le et al. 2015. A simple way to initialize recurrent neural networks of rectified linear units

Initialization Outline Basic initialization Smarter initialization schemes Pretraining

Pretraining Initialize with weights from a network trained for another task / dataset Much faster convergence and better generalization Can either freeze or finetune the pretrained weights

Zeiler and Fergus, 2013. Visualizing and Understanding Convolutional Networks

Pretraining for CNNs in vision (http://cs231n.github.io/transfer-learning/):
- Pretrained dataset similar to the new dataset, new dataset small: freeze the weights and train a linear classifier on the top-level features
- Pretrained dataset similar to the new dataset, new dataset large: fine-tune all layers (pretraining gives faster convergence and better generalization)
- Pretrained dataset different from the new dataset, new dataset small: freeze the weights and train a linear classifier on non-top-level features
- Pretrained dataset different from the new dataset, new dataset large: fine-tune all layers (pretraining gives improved convergence speed)
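A hypothetical PyTorch-style sketch of the "freeze and train a linear classifier" case from the list above; the model, pretraining dataset, and `num_classes` are illustrative assumptions, not from the slides:

```python
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # size of the new task's label set (assumption)

# Load a network pretrained on another dataset (here ImageNet).
model = models.resnet18(pretrained=True)

# Freeze the pretrained weights...
for p in model.parameters():
    p.requires_grad = False

# ...and train only a new linear classifier on the top-level features.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```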

Razavian et al. 2014. CNN features off-the-shelf: an astounding baseline for recognition

Pretraining for Seq2Seq Ramachandran et al. 2016. Unsupervised pretraining for sequence to sequence learning

Progressive networks Rusu et al 2016. Progressive Neural Networks

Key Takeaways Adam and SGD + momentum address key issues of SGD, and are good baseline optimization methods to use Batch norm, dropout, and entropy regularization should be used for improved performance Use smart initialization schemes when possible

Questions?

Appendix

Proof of He initialization
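The proof itself is not reproduced in this transcription; a brief sketch of the usual argument, stated under the paper's assumptions (zero-mean weights, symmetric pre-activation distributions, ReLU activations):

```latex
% Layer l has n_l inputs, pre-activations y_l = W_l x_l, and x_l = \max(0, y_{l-1}).
\operatorname{Var}[y_l]
  = n_l \operatorname{Var}[w_l]\,\mathbb{E}[x_l^2]
  = \tfrac{1}{2}\, n_l \operatorname{Var}[w_l]\,\operatorname{Var}[y_{l-1}],
% since E[\max(0, y)^2] = Var[y]/2 for zero-mean, symmetric y.
% Requiring Var[y_l] = Var[y_{l-1}] at every layer gives
\tfrac{1}{2}\, n_l \operatorname{Var}[w_l] = 1
\;\Longrightarrow\;
\operatorname{Var}[w_l] = \frac{2}{n_l}.
```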