Normalization Techniques


Normalization Techniques. Devansh Arpit.

Table of Contents: 1. Introduction; 2. Motivation; 3. Batch Normalization; 4. Normalization Propagation; 5. Weight Normalization; 6. Layer Normalization; 7. Conclusion.

Introduction. Task: train deep networks using gradient descent based methods. Goal: improve optimization for faster convergence.

Motivation 1/2: Covariate Shift (Shimodaira, 2000). Transfer a mapping from domain 1 to domain 2: X -> P(Y|X) -> Y. [Figure: P(X) differs between Domain 1 and Domain 2, while P(Y|X) is identical for both.] We learn the best approximation P_θ(Y|X) ≈ P(Y|X). Maximum likelihood focuses on the high-density regions of X, so the low-density regions of X become artifacts in P_θ(Y|X). [Figure: the fitted P_θ(Y|X) for Domain 1 and for Domain 2.]

Motivation 1/2: Covariate Shift (Shimodaira, 2000), simulation. Design: P(X) is different across the two domains but P(Y|X) is identical. Sample x ~ P(X) separately from the 2 domains, then generate y for each x using P(Y|X). [Figure: the sampled points for each domain lie along the part of the curve contained in the red box.]

Motivation 1/2: Covariate Shift (Shimodaira, 2000), simulation continued. Learn P_θ(Y|X) separately for each domain using its sampled (x, y) pairs. [Figure: the learned P_θ(Y|X) for both domains are shown by dotted curves.] The optimal P_θ(Y|X) are different ⇒ P_θ(Y|X) is not transferable between domains.

Motivation 1/2: Covariate Shift -> Internal Covariate Shift (Ioffe and Szegedy, 2015). A multi-layer end-to-end learning model is a chain X_0 -> P(X_1|X_0) -> X_1 -> P(X_2|X_1) -> X_2 -> P(X_3|X_2) -> X_3, and each layer learns the best approximation P_θ(X_{i+1}|X_i) ≈ P(X_{i+1}|X_i). During SGD updates the hidden-layer distribution P(X_i) keeps shifting ⇒ the target P_θ(X_{i+1}|X_i) keeps shifting ⇒ learning P_θ(X_{i+1}|X_i) with SGD is slow.

Motivation 2/2: Optimization Perspective (LeCun, 1998). What happens if the input samples are not zero-mean? Let h_i = σ(a_i), where a_i = w_i^T x + b_i. Consider the SGD weight update equation: w_i^{t+1} = w_i^t - η ∂L/∂w_i = w_i^t - η (∂L/∂a_i)(∂a_i/∂w_i) = w_i^t - η δ x, with δ = ∂L/∂a_i. If all entries of x are positive ⇒ all weight components get updated with the same sign! Any bias in x introduces a bias in the weight updates, which is bad: learning slows down because the path to the optimal w becomes longer.
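
As a quick numerical check of this point, here is a minimal NumPy sketch (the sigmoid unit, data, and targets are invented for illustration): with strictly positive inputs, every per-sample gradient δ·x has the same sign in all coordinates, so no single update can move two weight components in opposite directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# All-positive inputs (e.g., raw pixel intensities in [0, 1]).
x = rng.uniform(0.0, 1.0, size=(1000, 5))
w, b = rng.normal(size=5), 0.0
y = rng.integers(0, 2, size=1000)            # arbitrary binary targets

a = x @ w + b                                # pre-activations
h = 1.0 / (1.0 + np.exp(-a))                 # sigmoid unit
delta = h - y                                # dL/da for the logistic loss

grad_per_sample = delta[:, None] * x         # dL/dw = delta * x for each sample
same_sign = np.all(np.sign(grad_per_sample) == np.sign(delta)[:, None], axis=1)
print("fraction of samples whose gradient has one sign across all weights:",
      same_sign.mean())                      # 1.0 whenever every x_j > 0
```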

Motivation 2/2: Optimization Perspective (LeCun, 1998). What happens if the input samples are not zero-mean and unit-variance? A more concrete special case: consider the loss L(W) = (1/2) E_{x,y}[||y - Wx||²] with update rule W_i^{t+1} = W_i^t - η ∂L/∂W_i. Using a Taylor expansion around a minimum W_i*: L(W_i) = L(W_i*) + (1/2)(W_i - W_i*)^T (∂²L/∂W_i²)(W_i - W_i*). Taking the derivative on both sides gives ∂L/∂W_i = (∂²L/∂W_i²)(W_i - W_i*), where ∂²L/∂W_i² = E_x[xx^T]. Thus the update rule becomes W_i^{t+1} = W_i^t - η E_x[xx^T](W_i^t - W_i*). [Figure: data distribution and loss curvature; the step is dominated by the largest eigenvector of E_x[xx^T], with eigenvalues λ_max ≥ ... ≥ λ_min.] A large mean and unequal variances ⇒ λ_max >> λ_min. Each dimension of W_i^t then needs a different learning rate, yet the single rate η is capped by 1/λ_max (λ_max being the largest eigenvalue of E_x[xx^T]).
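
The effect of a large mean and unequal variances on the Hessian E_x[xx^T] can be seen directly. The sketch below (with made-up means and scales) compares the eigenvalue spread of raw, centered, and standardized inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs with a large mean and unequal per-dimension variances.
x = rng.normal(loc=5.0, scale=[0.5, 1.0, 3.0], size=(10000, 3))

def hessian_eigs(data):
    """Eigenvalues of E[x x^T], the Hessian of the quadratic loss above."""
    H = data.T @ data / len(data)
    return np.sort(np.linalg.eigvalsh(H))

raw = hessian_eigs(x)
centered = hessian_eigs(x - x.mean(axis=0))
standardized = hessian_eigs((x - x.mean(axis=0)) / x.std(axis=0))

for name, eigs in [("raw", raw), ("centered", centered), ("standardized", standardized)]:
    print(f"{name:13s} lambda_min={eigs[0]:8.2f} lambda_max={eigs[-1]:8.2f} "
          f"condition={eigs[-1] / eigs[0]:8.1f}")
```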

Motivation, take-home message: at least have zero mean (avoid a single large eigenvalue); equal variance across dimensions is better (though not very useful if the mean ≠ 0); uncorrelated (spherical) representations are best (but expensive!). And maintain these properties at all hidden layers during training!
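
The three levels of this message map onto standard preprocessing steps. Below is a small NumPy sketch (function names and the epsilon are my own choices): per-dimension standardization is cheap, while ZCA whitening gives the uncorrelated, spherical representation at cubic cost in the dimension.

```python
import numpy as np

def standardize(x, eps=1e-5):
    """Zero mean and unit variance per dimension (cheap, no decorrelation)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def zca_whiten(x, eps=1e-5):
    """Zero mean, identity covariance (uncorrelated / spherical, but O(d^3))."""
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / len(xc)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return xc @ W

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 4)) @ rng.normal(size=(4, 4)) + 3.0   # correlated, biased data
print(np.round(np.cov(zca_whiten(x), rowvar=False), 2))          # ~ identity matrix
```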

Batch Normalization (Ioffe and Szegedy, 2015). To avoid covariate shift we would like to normalize the entire distribution at every layer at every iteration, which is simply infeasible. Alleviate it by normalizing only the first two moments? That needs whitening over the entire dataset at every iteration, still very expensive. Key idea: compute the statistics over the mini-batch, and compute them unit-wise (note the mean is still optimal per batch). Finally something we can actually do :)

Batch Normalization (Ioffe and Szegedy, 2015). Consider any hidden unit's pre-activation a_i. The normalized pre-activation under BN is BN(a_i) = (a_i - E_B[a_i]) / sqrt(var_B(a_i)), where the statistics are taken over the mini-batch B. [Figure: the data covariance after BN is pretty close to the unit sphere; the approximation is worst when the principal components are maximally away from the axes.]

Batch Normalization (Ioffe and Szegedy, 2015). A traditional vs. a batch-normalized (BN) ReLU layer. Traditional: x -> a_i = W_i^T x + β_i -> h = ReLU(a) -> o = h. BN: x -> â_i = γ_i · W_i^T(x - E_B[x]) / sqrt(var_B(W_i^T x)) + β_i -> h = ReLU(â) -> o = h. Here γ_i and β_i are learnable scale and bias parameters. Notice the bias β_i sits outside the normalization (why?). Why not apply BN post-activation? (hint: Gaussianity). At test time, use running-average estimates of the mean and standard deviation.
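
A minimal NumPy sketch of such a BN ReLU layer is below. It follows the formula on this slide, but the layer sizes, initialization, momentum, and epsilon are illustrative choices rather than values from the slides, and no backward pass is shown.

```python
import numpy as np

class BNReLULayer:
    """Batch-normalized ReLU layer: h = ReLU(gamma * (Wx - mu_B)/sigma_B + beta)."""

    def __init__(self, d_in, d_out, momentum=0.9, eps=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_in, d_out))
        self.gamma = np.ones(d_out)      # learnable scale
        self.beta = np.zeros(d_out)      # learnable bias (outside the normalization)
        self.run_mean = np.zeros(d_out)  # running statistics, used at test time
        self.run_var = np.ones(d_out)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        a = x @ self.W                                   # pre-activations, no bias term
        if training:
            mu, var = a.mean(axis=0), a.var(axis=0)      # per-unit mini-batch statistics
            self.run_mean = self.momentum * self.run_mean + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:
            mu, var = self.run_mean, self.run_var        # running averages at test time
        a_hat = self.gamma * (a - mu) / np.sqrt(var + self.eps) + self.beta
        return np.maximum(a_hat, 0.0)

layer = BNReLULayer(8, 4)
x = np.random.default_rng(1).normal(size=(32, 8))
print(layer.forward(x, training=True).shape)             # (32, 4)
```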

Batch Normalization (Ioffe and Szegedy, 2015). How to extend BN to convolutional layers? Simply treat each depth (channel) vector at every spatial location as a separate sample. Also back-propagate through the normalization; otherwise normalizing the representations changes the prediction.
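
For completeness, a small sketch of the convolutional case (assuming an NCHW layout, which is my assumption and not stated in the slides): statistics are computed per channel over the batch and spatial axes.

```python
import numpy as np

def batchnorm_conv(feat, gamma, beta, eps=1e-5):
    """BN for conv features of shape (N, C, H, W): every spatial position of every
    sample is treated as one sample, so statistics are per channel over (N, H, W)."""
    mu = feat.mean(axis=(0, 2, 3), keepdims=True)
    var = feat.var(axis=(0, 2, 3), keepdims=True)
    feat_hat = (feat - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * feat_hat + beta[None, :, None, None]

feat = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=(16, 8, 5, 5))
out = batchnorm_conv(feat, gamma=np.ones(8), beta=np.zeros(8))
print(np.round(out.mean(axis=(0, 2, 3)), 3))   # ~0 per channel
print(np.round(out.std(axis=(0, 2, 3)), 3))    # ~1 per channel
```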

Batch Normalization (Ioffe and Szegedy, 2015). Effects of batch normalization. Parameter scaling: representations become scale invariant, BN(cWx) = BN(Wx), and gradients become inversely proportional to the parameter scale, ∂BN(cWx)/∂(cW) = (1/c) ∂BN(Wx)/∂W, which allows large learning rates. Regularization: each sample gets a different representation depending on its mini-batch. An awesome trick, but also puzzling: what if two different samples in two different batches get the same representation after BN?
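
The scale-invariance claim is easy to verify numerically; a tiny sketch with arbitrary shapes and an arbitrary constant c:

```python
import numpy as np

def bn(a, eps=1e-8):
    # Per-unit mini-batch normalization of the pre-activations.
    return (a - a.mean(axis=0)) / (a.std(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))
W = rng.normal(size=(10, 3))
c = 100.0

print(np.allclose(bn(x @ W), bn(x @ (c * W)), atol=1e-5))   # True: BN(cWx) == BN(Wx)
```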

Figures taken from Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." ICML (2015).

Normalization Propagation (Arpit et al., 2016). Key idea: instead of recomputing statistics at every layer, exploit the normalization of the input data by propagating it to the hidden layers. Trick: if x has zero mean and identity covariance, then Wx also has (approximately) zero mean and identity covariance, provided ||W_i||_2 = 1 and W is incoherent. Coherence = max_{W_i, W_j, i ≠ j} |W_i^T W_j| / (||W_i||_2 ||W_j||_2); incoherence means the pairwise angles between the weight vectors are large. [Figure: a data covariance close to the unit sphere stays close to the unit sphere after multiplying by an incoherent W with unit-norm rows.] Assumption: the pre-activations Wx are approximately Gaussian, so σ(W_i^T x) has a fixed mean and variance.
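
A quick numerical sanity check of this trick (dimensions and sample counts are arbitrary): random Gaussian rows rescaled to unit norm are typically fairly incoherent, and the resulting Wx stays close to zero mean and identity covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200

x = rng.normal(size=(20000, d))                       # zero mean, identity covariance
W = rng.normal(size=(d, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)         # unit-norm rows

G = W @ W.T
coherence = np.max(np.abs(G - np.diag(np.diag(G))))   # max |<W_i, W_j>| for i != j
y = x @ W.T                                           # pre-activations Wx

cov = np.cov(y, rowvar=False)
off = cov - np.diag(np.diag(cov))
print("coherence (max off-diag of W W^T):", round(coherence, 3))
print("max |diag(cov(Wx)) - 1|          :", round(np.abs(np.diag(cov) - 1).max(), 3))
print("mean |off-diag of cov(Wx)|       :", round(np.abs(off).mean(), 3))
```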

Normalization Propagation (Arpit et al., 2016). A traditional vs. a NormProp ReLU layer. Traditional: x -> a_i = W_i^T x + β_i -> h = ReLU(a) -> o = h. NormProp: x -> â_i = γ_i (W_i^T x)/||W_i||_2 + β_i -> h = ReLU(â) -> o_i = (h_i - sqrt(1/(2π))) / sqrt((1/2)(1 - 1/π)). Here γ_i and β_i are learnable scale and bias parameters. The data can be normalized globally or batch-wise; with global data normalization the train and test normalizations are identical. Applicable with batch size 1.
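
A minimal sketch of this NormProp ReLU layer, using the post-ReLU mean sqrt(1/(2π)) and standard deviation sqrt((1/2)(1 - 1/π)) of a standard normal pre-activation; the input is assumed to be already normalized, and all sizes are illustrative.

```python
import numpy as np

C2 = np.sqrt(1.0 / (2.0 * np.pi))        # E[ReLU(Y)]   for Y ~ N(0, 1)
C1 = np.sqrt(0.5 * (1.0 - 1.0 / np.pi))  # std(ReLU(Y)) for Y ~ N(0, 1)

def normprop_relu_layer(x, W, gamma, beta):
    """NormProp ReLU layer: normalize by the weight norm, then re-center and
    re-scale the post-ReLU output with the fixed Gaussian constants above."""
    a_hat = gamma * (x @ W.T) / np.linalg.norm(W, axis=1) + beta
    h = np.maximum(a_hat, 0.0)
    return (h - C2) / C1

rng = np.random.default_rng(0)
x = rng.normal(size=(10000, 50))                      # assumed normalized input data
W = rng.normal(size=(20, 50))
o = normprop_relu_layer(x, W, gamma=np.ones(20), beta=np.zeros(20))
print(np.round(o.mean(), 3), np.round(o.std(), 3))    # roughly 0 and 1
```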

Normalization Propagation (Arpit et al., 2016). How to extend to convolutional layers? â_i = γ_i (W_i * x)/||W_i||_F + β_i for the i-th feature map. Back-propagate through the normalization by the weight scale; otherwise normalizing the representations changes the prediction.

Normalization Propagation (Arpit et al., 2016). Analysis: keeping the singular values of J = ∂o/∂x close to 1 prevents gradient problems (Saxe et al., 2014); for a ReLU layer, E_x[JJ^T] ≈ 1.47 I ⇒ the singular values of J are ≈ 1.2. Extension to other activation functions σ: o_i = (1/c_1)[σ(γ_i (W_i x)/||W_i||_F + β_i) - c_2], with c_1 = sqrt(var(σ(Y))), c_2 = E[σ(Y)], where Y has a standard normal distribution. NormProp has the same parameter-scaling and regularization effects as BN.
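
The constants c_1 and c_2 can be estimated for any activation by Monte Carlo over a standard normal input. The sketch below (sample count arbitrary) reproduces the closed-form ReLU values used earlier.

```python
import numpy as np

def normprop_constants(sigma, n=1_000_000, seed=0):
    """Estimate c1 = sqrt(var(sigma(Y))) and c2 = E[sigma(Y)] for Y ~ N(0, 1)."""
    y = np.random.default_rng(seed).normal(size=n)
    s = sigma(y)
    return np.sqrt(s.var()), s.mean()

c1, c2 = normprop_constants(lambda y: np.maximum(y, 0.0))        # ReLU
print(round(c1, 4), round(float(np.sqrt(0.5 * (1 - 1 / np.pi))), 4))  # MC vs closed form
print(round(c2, 4), round(float(np.sqrt(1 / (2 * np.pi))), 4))
```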

Normalization Propagation (Arpit et al., 2016). [Figure: validation accuracy vs. epochs, NormProp vs. BN convergence with batch size 50.] Test error (%) on CIFAR-10 with data augmentation: NormProp 7.47, Batch Normalization 7.25, NIN + ALP units 7.51, NIN 8.81, DSN 7.97, Maxout 9.38.

Weight Normalization (Salimans and Kingma, 2016). Key idea: decouple the scale and direction of the weights. Initialize the parameters so that the pre-activations have zero mean and unit variance, and back-propagate through the normalization.

Weight Normalization (Salimans and Kingma, 2016). Consider any hidden unit's pre-activation a_i. The weight-normalized pre-activation is WeightNorm(a_i) = γ_i (W_i^T x)/||W_i||_2 + β_i. Data-dependent initialization: with μ_i = E_B[W_i^T x/||W_i||_2] and σ_i = sqrt(var_B(W_i^T x/||W_i||_2)) computed on one mini-batch, set γ_i = 1/σ_i and β_i = -μ_i/σ_i. Equivalently, the pre-activations get normalized using one mini-batch initially; after that, the optimization has no explicit mean and variance normalization.
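
A sketch of this data-dependent initialization (the function name and shapes are my own): one mini-batch determines γ and β so that the initial pre-activations are standardized.

```python
import numpy as np

def weightnorm_init(x_batch, W, eps=1e-8):
    """Return (gamma, beta) so that gamma * (W_i^T x)/||W_i|| + beta has zero mean
    and unit variance on this mini-batch, as in the data-dependent init above."""
    t = x_batch @ W.T / np.linalg.norm(W, axis=1)   # W_i^T x / ||W_i||_2, shape (B, units)
    mu, std = t.mean(axis=0), t.std(axis=0) + eps
    return 1.0 / std, -mu / std                     # gamma, beta

rng = np.random.default_rng(0)
x_batch = rng.normal(loc=2.0, size=(128, 30))
W = rng.normal(size=(10, 30))
gamma, beta = weightnorm_init(x_batch, W)

a_hat = gamma * (x_batch @ W.T) / np.linalg.norm(W, axis=1) + beta
print(np.round(a_hat.mean(axis=0), 3))   # ~0 per unit on the init batch
print(np.round(a_hat.std(axis=0), 3))    # ~1 per unit on the init batch
```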

Weight Normalization (Salimans and Kingma, 2016). A traditional vs. a weight-normalized ReLU layer. Traditional: x -> a_i = W_i^T x + β_i -> h = ReLU(a) -> o = h. WeightNorm: x -> â_i = γ_i (W_i^T x)/||W_i||_2 + β_i -> h = ReLU(â) -> o = h. Here γ_i and β_i are learnable scale and bias parameters, the train and test normalizations are identical, and the method is applicable with batch size 1.

Weight Normalization (Salimans and Kingma, 2016). Effect of backpropagating through the normalization. Let θ = γ w/||w||_2 be the effective weight. Under SGD, Δw = -η ∂L/∂w = -η (γ/||w^t||_2)(I - θ^t θ^{tT}/||θ^t||_2²) ∂L/∂θ^t, i.e. the update is the projection of the gradient w.r.t. θ onto the subspace orthogonal to w^t. Let c = ||Δw||/||w^t||; since Δw is orthogonal to w^t, ||w^{t+1}|| = sqrt(1 + c²) ||w^t||. A large c ⇒ a larger ||w^{t+1}|| ⇒ a smaller gradient w.r.t. w; a small c ⇒ ||w^{t+1}|| ≈ ||w^t|| ⇒ the gradient scale w.r.t. w is unchanged. The above analysis applies to NormProp as well!
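
A small numerical check of this argument (the upstream gradient and step size are arbitrary): the chain rule through θ = γ w/||w||_2 projects the gradient onto the subspace orthogonal to w, so the SGD step is orthogonal to w^t and the norm grows exactly by the sqrt(1 + c²) factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w = rng.normal(size=d)            # raw (unnormalized) weight
gamma = 1.5
dL_dtheta = rng.normal(size=d)    # some upstream gradient w.r.t. theta = gamma * w/||w||

w_norm = np.linalg.norm(w)
theta = gamma * w / w_norm
proj = np.eye(d) - np.outer(theta, theta) / np.dot(theta, theta)
dL_dw = (gamma / w_norm) * proj @ dL_dtheta        # chain rule through the normalization

eta = 0.1
delta_w = -eta * dL_dw
c = np.linalg.norm(delta_w) / w_norm

print("grad dot w (orthogonality):", round(float(np.dot(dL_dw, w)), 10))
print("||w + delta_w||           :", round(float(np.linalg.norm(w + delta_w)), 4))
print("sqrt(1 + c^2) * ||w||     :", round(float(np.sqrt(1 + c**2) * w_norm), 4))
```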

Figures taken from Salimans, Tim, and Diederik P. Kingma. "Weight normalization: A simple reparameterization to accelerate training of deep neural networks." Advances in Neural Information Processing Systems (2016).

Layer Normalization (Ba et al., 2016). Key idea: compute the statistics for each sample separately. Trick: compute the mean and standard deviation across the units of a layer instead of across the mini-batch. Aimed at recurrent neural nets: RNNs have a variable number of temporal steps, and there can be more steps at test time than at training time, so BN is not applicable.

Layer Normalization (Ba et al., 2016). Consider a temporal hidden layer's pre-activation a^t = W_hh h^{t-1} + W_xh x^t. The layer-normalized pre-activation is LN(a^t) = (γ/σ^t) ⊙ (a^t - μ^t) + β, where μ^t = (1/H) Σ_{i=1}^H a_i^t and σ^t = sqrt((1/H) Σ_{i=1}^H (a_i^t - μ^t)²), with H the number of hidden units.
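
A minimal sketch of layer normalization applied to one recurrent step, following the formula above; the weights, sizes, and tanh nonlinearity here are illustrative choices.

```python
import numpy as np

def layer_norm(a, gamma, beta, eps=1e-5):
    """Layer normalization: statistics are taken over the H units of one sample,
    so nothing depends on the mini-batch (or on the time step, for an RNN)."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return gamma * (a - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
H = 6
W_hh, W_xh = rng.normal(size=(H, H)), rng.normal(size=(H, 4))
h_prev, x_t = rng.normal(size=H), rng.normal(size=4)

a_t = W_hh @ h_prev + W_xh @ x_t                     # pre-activation of one RNN step
h_t = np.tanh(layer_norm(a_t, gamma=np.ones(H), beta=np.zeros(H)))
print(np.round(h_t, 3))
```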

Layer Normalization (Ba et al., 2016). Effects of layer normalization: invariance to weight scaling and translation, and to data rescaling and translation; the norm of the weights controls the effective learning rate.

Figures taken from Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint (2016).

Conclusion. Removing internal covariate shift in DNNs leads to faster convergence. More generally, even good initialization alone plays an important role (Mishkin and Matas, 2016). Making representations invariant to the scale and shift of the parameters seems to boost convergence. Does SGD plus better optimization imply better generalization? And what about bad local minima? (Zhang et al., 2016)
