
Batch Normalization. Devin Willmott, University of Kentucky. October 23, 2017.

Overview: 1. Internal Covariate Shift, 2. Batch Normalization, 3. Implementation, 4. Experiments.

Covariate Shift. Suppose we have two distributions, P and Q, where the conditional distributions are equal: P(y | x) = Q(y | x), but the marginal distributions are not equal: P(x) ≠ Q(x). This situation is called covariate shift. Example: two datasets where x_i is a coin and y_i is whether that coin is real or counterfeit. Dataset 1 has 50 nickels, 50 dimes, 50 quarters; Dataset 2 has 25 nickels, 100 dimes, 25 quarters. In each dataset, 90% of coins are real and 10% are counterfeit.

Internal Covariate Shift. Let N be an L-layer neural network, and θ_N be its parameters. Let N_1 and N_2 be two smaller neural networks created by splitting N at layer l: N_1 is layers 1 through l − 1, with parameters θ_N1; N_2 is layers l through L, with parameters θ_N2. Let P(y | x) be the target distribution for N. Then N_1 has a target distribution of P(h^(l) | x), and N_2 has a target distribution of P(y | h^(l)).

Internal Covariate Shift. Suppose we train N using (entire) batch training, and consider this learning task from the perspective of N_2 at two consecutive training iterations i and i + 1. At iteration i, N_2 receives the hidden layer h^(l) from N_1. It computes ŷ, backpropagates to find the gradient of the loss with respect to θ_N2, and updates its parameters θ_N2. At iteration i + 1, N_2 again receives the hidden layer h^(l) from N_1. But h^(l) is different than it was at iteration i, because N_1 also updated its parameters θ_N1. In the language of distributions: N_2's target distribution, P(y | h^(l)), remains the same, but its input distribution, P(h^(l)), is changing. This change is called internal covariate shift.

(Why Is) Internal Covariate Shift (A Problem?) At first, this may not seem like an issue: we just want N_2 to learn P(y | h^(l)), so why does it matter if the distribution of h^(l) changes? The problem is that N_2 can only represent a certain family of distributions, specified by θ_N2. During training, it will try to adjust θ_N2 to perform best in the densest regions of its input, and those regions keep moving. Another way to think about it: when training a model, our goal is to find the conditional distribution P(y | x), but we train our machine using the joint distribution P(x, y).

Solving Internal Covariate Shift: Data Whitening. Data whitening: shifting and linearly transforming a set of data to have mean 0, variance 1, and decorrelated variables. One possible solution to internal covariate shift is to whiten the minibatch between each layer of the network. This would prevent each layer's parameters from having to change to match the mean and variance of the output of the previous layer.

Solving Internal Covariate Shift: Data Whitening. For a data point x = (x^(1), ..., x^(n)) ∈ R^n, we can whiten x with the transformation x̂ := Cov[x]^(−1/2) (x − E[x]), where Cov[x] = E[xx^T] − E[x]E[x]^T is the covariance matrix. We would need to compute and invert this matrix for every forward pass, and would need to compute its gradient for every backward pass.
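To make the cost concrete, here is a minimal NumPy sketch of this whitening transformation (not from the slides; the function name and the eigendecomposition used to form Cov[x]^(−1/2) are choices of the sketch):

```python
import numpy as np

def whiten(X, eps=1e-5):
    """Whiten a data matrix X of shape (m, n): zero mean, (approximately) identity covariance."""
    mu = X.mean(axis=0)                                # E[x]
    Xc = X - mu                                        # x - E[x]
    cov = Xc.T @ Xc / X.shape[0]                       # Cov[x]
    evals, evecs = np.linalg.eigh(cov)                 # covariance is symmetric PSD, so eigh applies
    cov_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T  # Cov[x]^(-1/2)
    return Xc @ cov_inv_sqrt                           # x_hat = Cov[x]^(-1/2) (x - E[x])
```

The eigendecomposition (or any other route to a matrix inverse square root) is what makes per-layer whitening expensive.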

Solving Internal Covariate Shift: Normalization. Less expensive would be shifting and scaling so that each dimension of x (each hidden unit) has mean 0 and variance 1, without decorrelating the variables: x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε) for each k in 1, ..., n. This requires many fewer operations, and still provides the benefit of a stable mean and variance. This is the operation we will use to perform batch normalization.
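A minimal NumPy sketch of this per-dimension normalization over a minibatch (examples as rows is an assumption of the sketch):

```python
import numpy as np

def normalize_per_dim(X, eps=1e-5):
    """Normalize each dimension (column) of the minibatch X to mean 0 and variance 1."""
    mean = X.mean(axis=0)                  # E[x^(k)], estimated over the minibatch
    var = X.var(axis=0)                    # Var[x^(k)], estimated over the minibatch
    return (X - mean) / np.sqrt(var + eps) # no decorrelation, no matrix inversion
```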

Solving Covariate Shift. It may be that a distribution with mean 0 and variance 1 isn't the optimal input for the next layer. We can add extra parameters γ and β to modify the mean and variance after normalizing along each dimension: y^(k) = γ^(k) x̂^(k) + β^(k) = γ^(k) (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε) + β^(k), or equivalently, y = γ ⊙ x̂ + β (elementwise).

Redundancy. We normalized each dimension to have a specific mean and variance, and then added parameters that modify the mean and variance. Isn't this redundant? The idea: we have separated the parameters that pay attention to the mean and variance of each hidden unit (β and γ, respectively) from the parameters that pay attention to interactions among hidden units (the weight matrix W), instead of having W and b perform both of those functions.

The Batch Normalization Function. Putting all of this together, the batch normalization function for each dimension x := x^(k), computed over a minibatch B = {x_1, ..., x_m}, is:
μ_B = (1/m) Σ_i x_i
σ_B² = (1/m) Σ_i (x_i − μ_B)²
x̂_i = (x_i − μ_B) / √(σ_B² + ε)
y_i = γ x̂_i + β
This function is placed in between each layer.
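A NumPy sketch of this forward transform for a whole minibatch at once (the returned cache is a convenience for the backward pass, not something the slides specify):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Batch normalization forward pass for a minibatch X of shape (m, n)."""
    mu = X.mean(axis=0)                      # mu_B: minibatch mean, per hidden unit
    var = X.var(axis=0)                      # sigma_B^2: minibatch variance, per hidden unit
    x_hat = (X - mu) / np.sqrt(var + eps)    # normalize each dimension
    y = gamma * x_hat + beta                 # scale and shift with learned gamma, beta
    cache = (X, x_hat, mu, var, gamma, eps)  # saved for the backward pass
    return y, cache
```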

The Batch Normalization Function's Derivatives. We will also need partial derivatives of this function to backpropagate; with ℓ the loss, these are given by:
∂ℓ/∂x̂_i = ∂ℓ/∂y_i · γ
∂ℓ/∂σ_B² = Σ_i ∂ℓ/∂x̂_i · (x_i − μ_B) · (−1/2)(σ_B² + ε)^(−3/2)
∂ℓ/∂μ_B = Σ_i ∂ℓ/∂x̂_i · (−1/√(σ_B² + ε)) + ∂ℓ/∂σ_B² · (1/m) Σ_i (−2(x_i − μ_B))
∂ℓ/∂x_i = ∂ℓ/∂x̂_i / √(σ_B² + ε) + ∂ℓ/∂σ_B² · 2(x_i − μ_B)/m + ∂ℓ/∂μ_B / m
∂ℓ/∂γ = Σ_i ∂ℓ/∂y_i · x̂_i
∂ℓ/∂β = Σ_i ∂ℓ/∂y_i
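A matching NumPy sketch of the backward pass, applying the chain rule through μ_B and σ_B² exactly as in the derivatives above (a sketch, not code from the talk):

```python
import numpy as np

def batchnorm_backward(dY, cache):
    """Gradients of the loss with respect to X, gamma, and beta, given dY = dL/dy."""
    X, x_hat, mu, var, gamma, eps = cache
    m = X.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dgamma = np.sum(dY * x_hat, axis=0)                            # dL/dgamma
    dbeta = np.sum(dY, axis=0)                                     # dL/dbeta
    dx_hat = dY * gamma                                            # dL/dx_hat
    dvar = np.sum(dx_hat * (X - mu) * -0.5 * inv_std**3, axis=0)   # dL/dsigma_B^2
    dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.mean(-2.0 * (X - mu), axis=0)
    dX = dx_hat * inv_std + dvar * 2.0 * (X - mu) / m + dmu / m    # dL/dx_i
    return dX, dgamma, dbeta
```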

Batch Normalization at Testing. Normalizing each minibatch means the output for each sample depends on the other examples in the minibatch, but we want the machine to be deterministic at testing time. To solve this, we keep track of the average mean μ and variance σ² of the minibatches during training, and at test time we replace the expectations in the normalization step:
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε)   becomes   x̂^(k) = (x^(k) − μ^(k)) / √((σ^(k))² + ε)
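One common way to keep these averages is an exponential moving average of the minibatch statistics, as in the sketch below (the momentum constant and the class layout are assumptions of the sketch; the averaging scheme itself can vary by implementation):

```python
import numpy as np

class BatchNormRunningStats:
    """Track running estimates of mu and sigma^2 so that inference is deterministic."""
    def __init__(self, n, momentum=0.9, eps=1e-5):
        self.running_mean = np.zeros(n)
        self.running_var = np.ones(n)
        self.momentum = momentum
        self.eps = eps

    def update(self, mu, var):
        # exponential moving average of the statistics of each training minibatch
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var

    def transform(self, X, gamma, beta):
        # test-time normalization: batch statistics replaced by the stored averages
        x_hat = (X - self.running_mean) / np.sqrt(self.running_var + self.eps)
        return gamma * x_hat + beta
```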

Implementation Details. Works best when training examples are shuffled between minibatches, so an example does not always appear with the same companions. We add batch normalization before the activation function, g(BN(Wx + b)), which makes the bias b redundant (its effect is absorbed by β), so we use g(BN(Wx)).
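A sketch of one such layer, g(BN(Wx)), with a sigmoid as the activation g (the activation choice and shapes are illustrative assumptions):

```python
import numpy as np

def dense_bn_sigmoid(X, W, gamma, beta, eps=1e-5):
    """One dense layer with batch normalization inserted before the activation."""
    Z = X @ W                                    # no bias term: BN's beta absorbs it
    mu, var = Z.mean(axis=0), Z.var(axis=0)      # minibatch statistics of the pre-activations
    A = gamma * (Z - mu) / np.sqrt(var + eps) + beta
    return 1.0 / (1.0 + np.exp(-A))              # g = sigmoid
```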

Benefits: larger initial learning rate; quicker learning rate decay; smaller L2-regularization coefficient; no need for dropout.

Hidden Activations on MNIST. Feedforward neural network with three hidden layers of 100 units each, sigmoid activation, trained on MNIST.

Inception & ImageNet Classification. Inception: a large network (~10^7 parameters) that performs the ImageNet classification task (1000 possible classes).

Inception & ImageNet Classification

Questions?