
Batch Normalization. Devin Willmott, University of Kentucky. October 23, 2017.

Overview: 1. Internal Covariate Shift, 2. Batch Normalization, 3. Implementation, 4. Experiments.

Covariate Shift. Suppose we have two distributions, P and Q, where the conditional distributions are equal: P(y | x) = Q(y | x), but the marginal distributions are not equal: P(x) ≠ Q(x). This situation is called covariate shift. Example: two datasets where x_i is a coin and y_i is whether that coin is real or counterfeit. Dataset 1 has 50 nickels, 50 dimes, 50 quarters; Dataset 2 has 25 nickels, 100 dimes, 25 quarters. In each dataset, 90% of coins are real and 10% are counterfeit.

Internal Covariate Shift. Let N be an L-layer neural network, and θ_N be its parameters. Let N_1 and N_2 be two smaller neural networks created by splitting N at layer l: N_1 is layers 1 through l − 1, with parameters θ_N1; N_2 is layers l through L, with parameters θ_N2. Let P(y | x) be the target distribution for N. Then N_1 has a target distribution of P(h^(l) | x), and N_2 has a target distribution of P(y | h^(l)).

Internal Covariate Shift. Suppose we train N using (entire) batch training, and consider this learning task from the perspective of N_2 at two consecutive training iterations i and i + 1. At iteration i, N_2 receives the hidden layer h^(l) from N_1. It computes ŷ, backpropagates to find the gradient of the loss with respect to θ_N2, and updates its parameters θ_N2. At iteration i + 1, N_2 again receives the hidden layer h^(l) from N_1. But h^(l) is different than it was at iteration i, because N_1 also updated its parameters θ_N1. In the language of distributions: N_2's target distribution, P(y | h^(l)), remains the same, but its input distribution, P(h^(l)), is changing. This change is called internal covariate shift.

(Why Is) Internal Covariate Shift (A Problem?) At first, this may not seem like an issue: we just want N_2 to learn P(y | h^(l)), so why does it matter if the distribution of h^(l) changes? The problem is that N_2 can only represent a certain family of distributions, specified by θ_N2. During training, it will try to adjust θ_N2 to perform best in the densest regions of its input, and those regions keep moving. Another way to think about it: when training a model, our goal is to find the conditional distribution P(y | x), but we train our machine using the joint distribution P(x, y).

Solving Internal Covariate Shift: Data Whitening. Data whitening: shifting and linearly transforming a set of data to have mean 0, variance 1, and decorrelated variables. One possible solution to internal covariate shift is to whiten the minibatch between each layer of the network. This would prevent each layer's parameters from having to change to match the mean and variance of the output of the previous layer.

Solving Internal Covariate Shift: Data Whitening. For a data point x = (x^(1), ..., x^(n)) ∈ R^n, we can whiten x with the transformation x̂ := Cov[x]^(−1/2) (x − E[x]), where Cov[x] = E[xx^T] − E[x]E[x]^T is the covariance matrix. We would need to compute and invert this matrix for every forward pass, and would need to compute its gradient for every backward pass.
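To make the cost concrete, here is a minimal NumPy sketch of this whitening transformation (not from the slides; the function name and the eigendecomposition used to form Cov[x]^(−1/2) are choices of the sketch):

```python
import numpy as np

def whiten(X, eps=1e-5):
    """Whiten a data matrix X of shape (m, n): zero mean, (approximately) identity covariance."""
    mu = X.mean(axis=0)                                # E[x]
    Xc = X - mu                                        # x - E[x]
    cov = Xc.T @ Xc / X.shape[0]                       # Cov[x]
    evals, evecs = np.linalg.eigh(cov)                 # covariance is symmetric PSD, so eigh applies
    cov_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T  # Cov[x]^(-1/2)
    return Xc @ cov_inv_sqrt                           # x_hat = Cov[x]^(-1/2) (x - E[x])
```

The eigendecomposition (or any other route to a matrix inverse square root) is what makes per-layer whitening expensive.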

Solving Internal Covariate Shift: Normalization. Less expensive would be shifting and scaling so that each dimension of x (each hidden unit) has mean 0 and variance 1, without decorrelating the variables: x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε) for each k in 1, ..., n. This requires many fewer operations, and still provides the benefit of a stable mean and variance. This is the operation we will use to perform batch normalization.
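A minimal NumPy sketch of this per-dimension normalization over a minibatch (examples as rows is an assumption of the sketch):

```python
import numpy as np

def normalize_per_dim(X, eps=1e-5):
    """Normalize each dimension (column) of the minibatch X to mean 0 and variance 1."""
    mean = X.mean(axis=0)                  # E[x^(k)], estimated over the minibatch
    var = X.var(axis=0)                    # Var[x^(k)], estimated over the minibatch
    return (X - mean) / np.sqrt(var + eps) # no decorrelation, no matrix inversion
```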

Solving Covariate Shift. It may be that a distribution with mean 0 and variance 1 isn't the optimal input for the next layer. We can add extra parameters γ and β to modify the mean and variance after normalizing along each dimension: y^(k) = γ^(k) x̂^(k) + β^(k) = γ^(k) (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε) + β^(k), or equivalently, y = γ ⊙ x̂ + β (elementwise).

Redundancy. We normalized each dimension to have a specific mean and variance, and then added parameters that modify the mean and variance. Isn't this redundant? The idea: we have separated the parameters that pay attention to the mean and variance of each hidden unit (β and γ, respectively) from the parameters that pay attention to interactions among hidden units (the weight matrix W), instead of having W and b perform both of those functions.

The Batch Normalization Function. Putting all of this together, the batch normalization function for each dimension x := x^(k), computed over a minibatch B = {x_1, ..., x_m}, is:
μ_B = (1/m) Σ_i x_i
σ_B² = (1/m) Σ_i (x_i − μ_B)²
x̂_i = (x_i − μ_B) / √(σ_B² + ε)
y_i = γ x̂_i + β
This function is placed in between each layer.
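A NumPy sketch of this forward transform for a whole minibatch at once (the returned cache is a convenience for the backward pass, not something the slides specify):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Batch normalization forward pass for a minibatch X of shape (m, n)."""
    mu = X.mean(axis=0)                      # mu_B: minibatch mean, per hidden unit
    var = X.var(axis=0)                      # sigma_B^2: minibatch variance, per hidden unit
    x_hat = (X - mu) / np.sqrt(var + eps)    # normalize each dimension
    y = gamma * x_hat + beta                 # scale and shift with learned gamma, beta
    cache = (X, x_hat, mu, var, gamma, eps)  # saved for the backward pass
    return y, cache
```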

The Batch Normalization Function's Derivatives. We will also need partial derivatives of this function to backpropagate; with ℓ the loss, these are given by:
∂ℓ/∂x̂_i = ∂ℓ/∂y_i · γ
∂ℓ/∂σ_B² = Σ_i ∂ℓ/∂x̂_i · (x_i − μ_B) · (−1/2)(σ_B² + ε)^(−3/2)
∂ℓ/∂μ_B = Σ_i ∂ℓ/∂x̂_i · (−1/√(σ_B² + ε)) + ∂ℓ/∂σ_B² · (1/m) Σ_i (−2(x_i − μ_B))
∂ℓ/∂x_i = ∂ℓ/∂x̂_i / √(σ_B² + ε) + ∂ℓ/∂σ_B² · 2(x_i − μ_B)/m + ∂ℓ/∂μ_B / m
∂ℓ/∂γ = Σ_i ∂ℓ/∂y_i · x̂_i
∂ℓ/∂β = Σ_i ∂ℓ/∂y_i
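A matching NumPy sketch of the backward pass, applying the chain rule through μ_B and σ_B² exactly as in the derivatives above (a sketch, not code from the talk):

```python
import numpy as np

def batchnorm_backward(dY, cache):
    """Gradients of the loss with respect to X, gamma, and beta, given dY = dL/dy."""
    X, x_hat, mu, var, gamma, eps = cache
    m = X.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dgamma = np.sum(dY * x_hat, axis=0)                            # dL/dgamma
    dbeta = np.sum(dY, axis=0)                                     # dL/dbeta
    dx_hat = dY * gamma                                            # dL/dx_hat
    dvar = np.sum(dx_hat * (X - mu) * -0.5 * inv_std**3, axis=0)   # dL/dsigma_B^2
    dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.mean(-2.0 * (X - mu), axis=0)
    dX = dx_hat * inv_std + dvar * 2.0 * (X - mu) / m + dmu / m    # dL/dx_i
    return dX, dgamma, dbeta
```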

Batch Normalization at Testing. Normalizing each minibatch means the output for each sample depends on the other examples in the minibatch, but we want the machine to be deterministic at testing time. To solve this, we keep track of the average mean μ and variance σ² of the minibatches during training, and at test time we replace the expectations in the normalization step:
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε)   becomes   x̂^(k) = (x^(k) − μ^(k)) / √((σ^(k))² + ε)
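One common way to keep these averages is an exponential moving average of the minibatch statistics, as in the sketch below (the momentum constant and the class layout are assumptions of the sketch; the averaging scheme itself can vary by implementation):

```python
import numpy as np

class BatchNormRunningStats:
    """Track running estimates of mu and sigma^2 so that inference is deterministic."""
    def __init__(self, n, momentum=0.9, eps=1e-5):
        self.running_mean = np.zeros(n)
        self.running_var = np.ones(n)
        self.momentum = momentum
        self.eps = eps

    def update(self, mu, var):
        # exponential moving average of the statistics of each training minibatch
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var

    def transform(self, X, gamma, beta):
        # test-time normalization: batch statistics replaced by the stored averages
        x_hat = (X - self.running_mean) / np.sqrt(self.running_var + self.eps)
        return gamma * x_hat + beta
```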

Implementation Details. Works best when training examples are shuffled between minibatches, so an example does not always appear with the same companions. We add batch normalization before the activation function, g(BN(Wx + b)), which makes the bias b redundant (its effect is absorbed by β), so we use g(BN(Wx)).
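A sketch of one such layer, g(BN(Wx)), with a sigmoid as the activation g (the activation choice and shapes are illustrative assumptions):

```python
import numpy as np

def dense_bn_sigmoid(X, W, gamma, beta, eps=1e-5):
    """One dense layer with batch normalization inserted before the activation."""
    Z = X @ W                                    # no bias term: BN's beta absorbs it
    mu, var = Z.mean(axis=0), Z.var(axis=0)      # minibatch statistics of the pre-activations
    A = gamma * (Z - mu) / np.sqrt(var + eps) + beta
    return 1.0 / (1.0 + np.exp(-A))              # g = sigmoid
```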

Benefits: larger initial learning rate; quicker learning rate decay; smaller L2-regularization coefficient; no need for dropout.

Hidden Activations on MNIST. Feedforward neural network with three hidden layers of 100 units each, sigmoid activation, trained on MNIST.

Inception & ImageNet Classification. Inception: a large network (~10^7 parameters) that performs the ImageNet classification task (1000 possible classes).

Inception & ImageNet Classification

Questions?