Normalization Techniques


Normalization Techniques. Devansh Arpit.

Table of Contents: 1. Introduction; 2. Motivation; 3. Batch Normalization; 4. Normalization Propagation; 5. Weight Normalization; 6. Layer Normalization; 7. Conclusion.

Introduction. Task: train deep networks using gradient descent based methods. Goal: improve optimization for faster convergence.

Motivation 1/2: Covariate Shift (Shimodaira, 2000). Transfer a mapping from domain 1 to domain 2: X -> P(Y|X) -> Y. [Figure: P(X) differs between Domain 1 and Domain 2, while P(Y|X) is identical for both.] We learn the best approximation P_θ(Y|X) ≈ P(Y|X). Maximum likelihood focuses on the high-density regions of X, so the low-density regions of X become artifacts in P_θ(Y|X). [Figure: the fitted P_θ(Y|X) for Domain 1 and for Domain 2.]

Motivation 1/2: Covariate Shift (Shimodaira, 2000), simulation. Design: P(X) is different across the two domains but P(Y|X) is identical. Sample x ~ P(X) separately from the 2 domains, then generate y for each x using P(Y|X). [Figure: the sampled points for each domain lie along the part of the curve contained in the red box.]

Motivation 1/2: Covariate Shift (Shimodaira, 2000), simulation continued. Learn P_θ(Y|X) separately for each domain using its sampled (x, y) pairs. [Figure: the learned P_θ(Y|X) for both domains are shown by dotted curves.] The optimal P_θ(Y|X) are different ⇒ P_θ(Y|X) is not transferable between domains.

Motivation 1/2: Covariate Shift -> Internal Covariate Shift (Ioffe and Szegedy, 2015). A multi-layer end-to-end learning model is a chain X_0 -> P(X_1|X_0) -> X_1 -> P(X_2|X_1) -> X_2 -> P(X_3|X_2) -> X_3, and each layer learns the best approximation P_θ(X_{i+1}|X_i) ≈ P(X_{i+1}|X_i). During SGD updates the hidden-layer distribution P(X_i) keeps shifting ⇒ the target P_θ(X_{i+1}|X_i) keeps shifting ⇒ learning P_θ(X_{i+1}|X_i) with SGD is slow.

Motivation 2/2: Optimization Perspective (LeCun, 1998). What happens if the input samples are not zero-mean? Let h_i = σ(a_i), where a_i = w_i^T x + b_i. Consider the SGD weight update equation: w_i^{t+1} = w_i^t - η ∂L/∂w_i = w_i^t - η (∂L/∂a_i)(∂a_i/∂w_i) = w_i^t - η δ x, with δ = ∂L/∂a_i. If all entries of x are positive ⇒ all weight components get updated with the same sign! Any bias in x introduces a bias in the weight updates, which is bad: learning slows down because the path to the optimal w becomes longer.
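
As a quick numerical check of this point, here is a minimal NumPy sketch (the sigmoid unit, data, and targets are invented for illustration): with strictly positive inputs, every per-sample gradient δ·x has the same sign in all coordinates, so no single update can move two weight components in opposite directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# All-positive inputs (e.g., raw pixel intensities in [0, 1]).
x = rng.uniform(0.0, 1.0, size=(1000, 5))
w, b = rng.normal(size=5), 0.0
y = rng.integers(0, 2, size=1000)            # arbitrary binary targets

a = x @ w + b                                # pre-activations
h = 1.0 / (1.0 + np.exp(-a))                 # sigmoid unit
delta = h - y                                # dL/da for the logistic loss

grad_per_sample = delta[:, None] * x         # dL/dw = delta * x for each sample
same_sign = np.all(np.sign(grad_per_sample) == np.sign(delta)[:, None], axis=1)
print("fraction of samples whose gradient has one sign across all weights:",
      same_sign.mean())                      # 1.0 whenever every x_j > 0
```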

Motivation 2/2: Optimization Perspective (LeCun, 1998). What happens if the input samples are not zero-mean and unit-variance? A more concrete special case: consider the loss L(W) = (1/2) E_{x,y}[||y - Wx||²] with update rule W_i^{t+1} = W_i^t - η ∂L/∂W_i. Using a Taylor expansion around a minimum W_i*: L(W_i) = L(W_i*) + (1/2)(W_i - W_i*)^T (∂²L/∂W_i²)(W_i - W_i*). Taking the derivative on both sides gives ∂L/∂W_i = (∂²L/∂W_i²)(W_i - W_i*), where ∂²L/∂W_i² = E_x[xx^T]. Thus the update rule becomes W_i^{t+1} = W_i^t - η E_x[xx^T](W_i^t - W_i*). [Figure: data distribution and loss curvature; the step is dominated by the largest eigenvector of E_x[xx^T], with eigenvalues λ_max ≥ ... ≥ λ_min.] A large mean and unequal variances ⇒ λ_max >> λ_min. Each dimension of W_i^t then needs a different learning rate, yet the single rate η is capped by 1/λ_max (λ_max being the largest eigenvalue of E_x[xx^T]).
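
The effect of a large mean and unequal variances on the Hessian E_x[xx^T] can be seen directly. The sketch below (with made-up means and scales) compares the eigenvalue spread of raw, centered, and standardized inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs with a large mean and unequal per-dimension variances.
x = rng.normal(loc=5.0, scale=[0.5, 1.0, 3.0], size=(10000, 3))

def hessian_eigs(data):
    """Eigenvalues of E[x x^T], the Hessian of the quadratic loss above."""
    H = data.T @ data / len(data)
    return np.sort(np.linalg.eigvalsh(H))

raw = hessian_eigs(x)
centered = hessian_eigs(x - x.mean(axis=0))
standardized = hessian_eigs((x - x.mean(axis=0)) / x.std(axis=0))

for name, eigs in [("raw", raw), ("centered", centered), ("standardized", standardized)]:
    print(f"{name:13s} lambda_min={eigs[0]:8.2f} lambda_max={eigs[-1]:8.2f} "
          f"condition={eigs[-1] / eigs[0]:8.1f}")
```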

Motivation, take-home message: at least have zero mean (avoid a single large eigenvalue); equal variance across dimensions is better (though not very useful if the mean ≠ 0); uncorrelated (spherical) representations are best (but expensive!). And maintain these properties at all hidden layers during training!
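
The three levels of this message map onto standard preprocessing steps. Below is a small NumPy sketch (function names and the epsilon are my own choices): per-dimension standardization is cheap, while ZCA whitening gives the uncorrelated, spherical representation at cubic cost in the dimension.

```python
import numpy as np

def standardize(x, eps=1e-5):
    """Zero mean and unit variance per dimension (cheap, no decorrelation)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def zca_whiten(x, eps=1e-5):
    """Zero mean, identity covariance (uncorrelated / spherical, but O(d^3))."""
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / len(xc)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return xc @ W

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 4)) @ rng.normal(size=(4, 4)) + 3.0   # correlated, biased data
print(np.round(np.cov(zca_whiten(x), rowvar=False), 2))          # ~ identity matrix
```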

Batch Normalization (Ioffe and Szegedy, 2015). To avoid covariate shift we would like to normalize the entire distribution at every layer at every iteration, which is simply infeasible. Alleviate it by normalizing only the first two moments? That needs whitening over the entire dataset at every iteration, still very expensive. Key idea: compute the statistics over the mini-batch, and compute them unit-wise (note the mean is still optimal per batch). Finally something we can actually do :)

Batch Normalization (Ioffe and Szegedy, 2015). Consider any hidden unit's pre-activation a_i. The normalized pre-activation under BN is BN(a_i) = (a_i - E_B[a_i]) / sqrt(var_B(a_i)), where the statistics are taken over the mini-batch B. [Figure: the data covariance after BN is pretty close to the unit sphere; the approximation is worst when the principal components are maximally away from the axes.]

Batch Normalization (Ioffe and Szegedy, 2015). A traditional vs. a batch-normalized (BN) ReLU layer. Traditional: x -> a_i = W_i^T x + β_i -> h = ReLU(a) -> o = h. BN: x -> â_i = γ_i · W_i^T(x - E_B[x]) / sqrt(var_B(W_i^T x)) + β_i -> h = ReLU(â) -> o = h. Here γ_i and β_i are learnable scale and bias parameters. Notice the bias β_i sits outside the normalization (why?). Why not apply BN post-activation? (hint: Gaussianity). At test time, use running-average estimates of the mean and standard deviation.
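
A minimal NumPy sketch of such a BN ReLU layer is below. It follows the formula on this slide, but the layer sizes, initialization, momentum, and epsilon are illustrative choices rather than values from the slides, and no backward pass is shown.

```python
import numpy as np

class BNReLULayer:
    """Batch-normalized ReLU layer: h = ReLU(gamma * (Wx - mu_B)/sigma_B + beta)."""

    def __init__(self, d_in, d_out, momentum=0.9, eps=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_in, d_out))
        self.gamma = np.ones(d_out)      # learnable scale
        self.beta = np.zeros(d_out)      # learnable bias (outside the normalization)
        self.run_mean = np.zeros(d_out)  # running statistics, used at test time
        self.run_var = np.ones(d_out)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        a = x @ self.W                                   # pre-activations, no bias term
        if training:
            mu, var = a.mean(axis=0), a.var(axis=0)      # per-unit mini-batch statistics
            self.run_mean = self.momentum * self.run_mean + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:
            mu, var = self.run_mean, self.run_var        # running averages at test time
        a_hat = self.gamma * (a - mu) / np.sqrt(var + self.eps) + self.beta
        return np.maximum(a_hat, 0.0)

layer = BNReLULayer(8, 4)
x = np.random.default_rng(1).normal(size=(32, 8))
print(layer.forward(x, training=True).shape)             # (32, 4)
```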

Batch Normalization (Ioffe and Szegedy, 2015). How to extend BN to convolutional layers? Simply treat each depth (channel) vector at every spatial location as a separate sample. Also back-propagate through the normalization; otherwise normalizing the representations changes the prediction.
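
For completeness, a small sketch of the convolutional case (assuming an NCHW layout, which is my assumption and not stated in the slides): statistics are computed per channel over the batch and spatial axes.

```python
import numpy as np

def batchnorm_conv(feat, gamma, beta, eps=1e-5):
    """BN for conv features of shape (N, C, H, W): every spatial position of every
    sample is treated as one sample, so statistics are per channel over (N, H, W)."""
    mu = feat.mean(axis=(0, 2, 3), keepdims=True)
    var = feat.var(axis=(0, 2, 3), keepdims=True)
    feat_hat = (feat - mu) / np.sqrt(var + eps)
    return gamma[None, :, None, None] * feat_hat + beta[None, :, None, None]

feat = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=(16, 8, 5, 5))
out = batchnorm_conv(feat, gamma=np.ones(8), beta=np.zeros(8))
print(np.round(out.mean(axis=(0, 2, 3)), 3))   # ~0 per channel
print(np.round(out.std(axis=(0, 2, 3)), 3))    # ~1 per channel
```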

Batch Normalization (Ioffe and Szegedy, 2015). Effects of batch normalization. Parameter scaling: representations become scale invariant, BN(cWx) = BN(Wx), and gradients become inversely proportional to the parameter scale, ∂BN(cWx)/∂(cW) = (1/c) ∂BN(Wx)/∂W, which allows large learning rates. Regularization: each sample gets a different representation depending on its mini-batch. An awesome trick, but also puzzling: what if two different samples in two different batches get the same representation after BN?
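
The scale-invariance claim is easy to verify numerically; a tiny sketch with arbitrary shapes and an arbitrary constant c:

```python
import numpy as np

def bn(a, eps=1e-8):
    # Per-unit mini-batch normalization of the pre-activations.
    return (a - a.mean(axis=0)) / (a.std(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))
W = rng.normal(size=(10, 3))
c = 100.0

print(np.allclose(bn(x @ W), bn(x @ (c * W)), atol=1e-5))   # True: BN(cWx) == BN(Wx)
```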

Figures taken from Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." ICML (2015).

Normalization Propagation (Arpit et al., 2016). Key idea: instead of recomputing statistics at every layer, exploit the normalization of the input data by propagating it to the hidden layers. Trick: if x has zero mean and identity covariance, then Wx also has (approximately) zero mean and identity covariance, provided ||W_i||_2 = 1 and W is incoherent. Coherence = max_{W_i, W_j, i ≠ j} |W_i^T W_j| / (||W_i||_2 ||W_j||_2); incoherence means the pairwise angles between the weight vectors are large. [Figure: a data covariance close to the unit sphere stays close to the unit sphere after multiplying by an incoherent W with unit-norm rows.] Assumption: the pre-activations Wx are approximately Gaussian, so σ(W_i^T x) has a fixed mean and variance.
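
A quick numerical sanity check of this trick (dimensions and sample counts are arbitrary): random Gaussian rows rescaled to unit norm are typically fairly incoherent, and the resulting Wx stays close to zero mean and identity covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200

x = rng.normal(size=(20000, d))                       # zero mean, identity covariance
W = rng.normal(size=(d, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)         # unit-norm rows

G = W @ W.T
coherence = np.max(np.abs(G - np.diag(np.diag(G))))   # max |<W_i, W_j>| for i != j
y = x @ W.T                                           # pre-activations Wx

cov = np.cov(y, rowvar=False)
off = cov - np.diag(np.diag(cov))
print("coherence (max off-diag of W W^T):", round(coherence, 3))
print("max |diag(cov(Wx)) - 1|          :", round(np.abs(np.diag(cov) - 1).max(), 3))
print("mean |off-diag of cov(Wx)|       :", round(np.abs(off).mean(), 3))
```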

Normalization Propagation (Arpit et al., 2016). A traditional vs. a NormProp ReLU layer. Traditional: x -> a_i = W_i^T x + β_i -> h = ReLU(a) -> o = h. NormProp: x -> â_i = γ_i (W_i^T x)/||W_i||_2 + β_i -> h = ReLU(â) -> o_i = (h_i - sqrt(1/(2π))) / sqrt((1/2)(1 - 1/π)). Here γ_i and β_i are learnable scale and bias parameters. The data can be normalized globally or batch-wise; with global data normalization the train and test normalizations are identical. Applicable with batch size 1.
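
A minimal sketch of this NormProp ReLU layer, using the post-ReLU mean sqrt(1/(2π)) and standard deviation sqrt((1/2)(1 - 1/π)) of a standard normal pre-activation; the input is assumed to be already normalized, and all sizes are illustrative.

```python
import numpy as np

C2 = np.sqrt(1.0 / (2.0 * np.pi))        # E[ReLU(Y)]   for Y ~ N(0, 1)
C1 = np.sqrt(0.5 * (1.0 - 1.0 / np.pi))  # std(ReLU(Y)) for Y ~ N(0, 1)

def normprop_relu_layer(x, W, gamma, beta):
    """NormProp ReLU layer: normalize by the weight norm, then re-center and
    re-scale the post-ReLU output with the fixed Gaussian constants above."""
    a_hat = gamma * (x @ W.T) / np.linalg.norm(W, axis=1) + beta
    h = np.maximum(a_hat, 0.0)
    return (h - C2) / C1

rng = np.random.default_rng(0)
x = rng.normal(size=(10000, 50))                      # assumed normalized input data
W = rng.normal(size=(20, 50))
o = normprop_relu_layer(x, W, gamma=np.ones(20), beta=np.zeros(20))
print(np.round(o.mean(), 3), np.round(o.std(), 3))    # roughly 0 and 1
```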

Normalization Propagation (Arpit et al., 2016). How to extend to convolutional layers? â_i = γ_i (W_i * x)/||W_i||_F + β_i for the i-th feature map. Back-propagate through the normalization by the weight scale; otherwise normalizing the representations changes the prediction.

Normalization Propagation (Arpit et al., 2016). Analysis: keeping the singular values of J = ∂o/∂x close to 1 prevents gradient problems (Saxe et al., 2014); for a ReLU layer, E_x[JJ^T] ≈ 1.47 I ⇒ the singular values of J are ≈ 1.2. Extension to other activation functions σ: o_i = (1/c_1)[σ(γ_i (W_i x)/||W_i||_F + β_i) - c_2], with c_1 = sqrt(var(σ(Y))), c_2 = E[σ(Y)], where Y has a standard normal distribution. NormProp has the same parameter-scaling and regularization effects as BN.
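
The constants c_1 and c_2 can be estimated for any activation by Monte Carlo over a standard normal input. The sketch below (sample count arbitrary) reproduces the closed-form ReLU values used earlier.

```python
import numpy as np

def normprop_constants(sigma, n=1_000_000, seed=0):
    """Estimate c1 = sqrt(var(sigma(Y))) and c2 = E[sigma(Y)] for Y ~ N(0, 1)."""
    y = np.random.default_rng(seed).normal(size=n)
    s = sigma(y)
    return np.sqrt(s.var()), s.mean()

c1, c2 = normprop_constants(lambda y: np.maximum(y, 0.0))        # ReLU
print(round(c1, 4), round(float(np.sqrt(0.5 * (1 - 1 / np.pi))), 4))  # MC vs closed form
print(round(c2, 4), round(float(np.sqrt(1 / (2 * np.pi))), 4))
```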

Normalization Propagation (Arpit et al., 2016). [Figure: validation accuracy vs. epochs, NormProp vs. BN convergence with batch size 50.] Test error (%) on CIFAR-10 with data augmentation: NormProp 7.47, Batch Normalization 7.25, NIN + ALP units 7.51, NIN 8.81, DSN 7.97, Maxout 9.38.

Weight Normalization (Salimans and Kingma, 2016). Key idea: decouple the scale and direction of the weights. Initialize the parameters so that the pre-activations have zero mean and unit variance, and back-propagate through the normalization.

Weight Normalization (Salimans and Kingma, 2016). Consider any hidden unit's pre-activation a_i. The weight-normalized pre-activation is WeightNorm(a_i) = γ_i (W_i^T x)/||W_i||_2 + β_i. Data-dependent initialization: with μ_i = E_B[W_i^T x/||W_i||_2] and σ_i = sqrt(var_B(W_i^T x/||W_i||_2)) computed on one mini-batch, set γ_i = 1/σ_i and β_i = -μ_i/σ_i. Equivalently, the pre-activations get normalized using one mini-batch initially; after that, the optimization has no explicit mean and variance normalization.
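
A sketch of this data-dependent initialization (the function name and shapes are my own): one mini-batch determines γ and β so that the initial pre-activations are standardized.

```python
import numpy as np

def weightnorm_init(x_batch, W, eps=1e-8):
    """Return (gamma, beta) so that gamma * (W_i^T x)/||W_i|| + beta has zero mean
    and unit variance on this mini-batch, as in the data-dependent init above."""
    t = x_batch @ W.T / np.linalg.norm(W, axis=1)   # W_i^T x / ||W_i||_2, shape (B, units)
    mu, std = t.mean(axis=0), t.std(axis=0) + eps
    return 1.0 / std, -mu / std                     # gamma, beta

rng = np.random.default_rng(0)
x_batch = rng.normal(loc=2.0, size=(128, 30))
W = rng.normal(size=(10, 30))
gamma, beta = weightnorm_init(x_batch, W)

a_hat = gamma * (x_batch @ W.T) / np.linalg.norm(W, axis=1) + beta
print(np.round(a_hat.mean(axis=0), 3))   # ~0 per unit on the init batch
print(np.round(a_hat.std(axis=0), 3))    # ~1 per unit on the init batch
```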

Weight Normalization (Salimans and Kingma, 2016). A traditional vs. a weight-normalized ReLU layer. Traditional: x -> a_i = W_i^T x + β_i -> h = ReLU(a) -> o = h. WeightNorm: x -> â_i = γ_i (W_i^T x)/||W_i||_2 + β_i -> h = ReLU(â) -> o = h. Here γ_i and β_i are learnable scale and bias parameters, the train and test normalizations are identical, and the method is applicable with batch size 1.

Weight Normalization (Salimans and Kingma, 2016). Effect of backpropagating through the normalization. Let θ = γ w/||w||_2 be the effective weight. Under SGD, Δw = -η ∂L/∂w = -η (γ/||w^t||_2)(I - θ^t θ^{tT}/||θ^t||_2²) ∂L/∂θ^t, i.e. the update is the projection of the gradient w.r.t. θ onto the subspace orthogonal to w^t. Let c = ||Δw||/||w^t||; since Δw is orthogonal to w^t, ||w^{t+1}|| = sqrt(1 + c²) ||w^t||. A large c ⇒ a larger ||w^{t+1}|| ⇒ a smaller gradient w.r.t. w; a small c ⇒ ||w^{t+1}|| ≈ ||w^t|| ⇒ the gradient scale w.r.t. w is unchanged. The above analysis applies to NormProp as well!
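
A small numerical check of this argument (the upstream gradient and step size are arbitrary): the chain rule through θ = γ w/||w||_2 projects the gradient onto the subspace orthogonal to w, so the SGD step is orthogonal to w^t and the norm grows exactly by the sqrt(1 + c²) factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w = rng.normal(size=d)            # raw (unnormalized) weight
gamma = 1.5
dL_dtheta = rng.normal(size=d)    # some upstream gradient w.r.t. theta = gamma * w/||w||

w_norm = np.linalg.norm(w)
theta = gamma * w / w_norm
proj = np.eye(d) - np.outer(theta, theta) / np.dot(theta, theta)
dL_dw = (gamma / w_norm) * proj @ dL_dtheta        # chain rule through the normalization

eta = 0.1
delta_w = -eta * dL_dw
c = np.linalg.norm(delta_w) / w_norm

print("grad dot w (orthogonality):", round(float(np.dot(dL_dw, w)), 10))
print("||w + delta_w||           :", round(float(np.linalg.norm(w + delta_w)), 4))
print("sqrt(1 + c^2) * ||w||     :", round(float(np.sqrt(1 + c**2) * w_norm), 4))
```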

Figures taken from Salimans, Tim, and Diederik P. Kingma. "Weight normalization: A simple reparameterization to accelerate training of deep neural networks." Advances in Neural Information Processing Systems (2016).

Layer Normalization (Ba et al., 2016). Key idea: compute the statistics for each sample separately. Trick: compute the mean and standard deviation across the units of a layer instead of across the mini-batch. Aimed at recurrent neural nets: RNNs have a variable number of temporal steps, and there can be more steps at test time than at training time, so BN is not applicable.

Layer Normalization (Ba et al., 2016). Consider a temporal hidden layer's pre-activation a^t = W_hh h^{t-1} + W_xh x^t. The layer-normalized pre-activation is LN(a^t) = (γ/σ^t) ⊙ (a^t - μ^t) + β, where μ^t = (1/H) Σ_{i=1}^H a_i^t and σ^t = sqrt((1/H) Σ_{i=1}^H (a_i^t - μ^t)²), with H the number of hidden units.
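
A minimal sketch of layer normalization applied to one recurrent step, following the formula above; the weights, sizes, and tanh nonlinearity here are illustrative choices.

```python
import numpy as np

def layer_norm(a, gamma, beta, eps=1e-5):
    """Layer normalization: statistics are taken over the H units of one sample,
    so nothing depends on the mini-batch (or on the time step, for an RNN)."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return gamma * (a - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
H = 6
W_hh, W_xh = rng.normal(size=(H, H)), rng.normal(size=(H, 4))
h_prev, x_t = rng.normal(size=H), rng.normal(size=4)

a_t = W_hh @ h_prev + W_xh @ x_t                     # pre-activation of one RNN step
h_t = np.tanh(layer_norm(a_t, gamma=np.ones(H), beta=np.zeros(H)))
print(np.round(h_t, 3))
```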

Layer Normalization (Ba et al., 2016). Effects of layer normalization: invariance to weight scaling and translation, and to data rescaling and translation; the norm of the weights controls the effective learning rate.

Figures taken from Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint (2016).

Conclusion. Removing internal covariate shift in DNNs leads to faster convergence. More generally, even good initialization alone plays an important role (Mishkin and Matas, 2016). Making representations invariant to the scale and shift of the parameters seems to boost convergence. Does SGD plus better optimization imply better generalization? And what about bad local minima? (Zhang et al., 2016)
