Lecture 6 Optimization for Deep Neural Networks


1 Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017

2-8 Things we will look at today
Stochastic Gradient Descent
Momentum Method and the Nesterov Variant
Adaptive Learning Rate Methods (AdaGrad, RMSProp, Adam)
Batch Normalization
Initialization Heuristics
Polyak Averaging
On slides but for self study: Newton and Quasi-Newton Methods (BFGS, L-BFGS, Conjugate Gradient)

9-11 Optimization
We've seen backpropagation as a method for computing gradients
Assignment: was about implementing SGD in conjunction with backprop
Let's now look at a family of first order methods

12 Batch Gradient Descent
Algorithm 1 Batch Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial Parameter θ
1: while stopping criteria not met do
2:   Compute gradient estimate over N examples:
3:   ĝ ← (1/N) Σ_i ∇_θ L(f(x^(i); θ), y^(i))
4:   Apply Update: θ ← θ − ε ĝ
5: end while
Positive: Gradient estimates are stable
Negative: Need to compute gradients over the entire training set for one update
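As a rough NumPy sketch of one such update (grad_L stands for an assumed per-example gradient function, not something defined on the slides):

    import numpy as np

    def batch_gradient_step(theta, X, y, grad_L, eps=0.1):
        # Average the per-example gradients over the entire training set of N examples,
        # then take one step against that average.
        g_hat = np.mean([grad_L(theta, x_i, y_i) for x_i, y_i in zip(X, y)], axis=0)
        return theta - eps * g_hat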

13-18 Gradient Descent (figure-only slides illustrating gradient descent steps on an error surface)

19-21 Stochastic Gradient Descent
Algorithm 2 Stochastic Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial Parameter θ
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:   ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
5:   Apply Update: θ ← θ − ε ĝ
6: end while
ε_k is the learning rate at step k
Sufficient conditions to guarantee convergence: Σ_{k=1}^∞ ε_k = ∞ and Σ_{k=1}^∞ ε_k² < ∞
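A minimal runnable sketch of this loop, using a toy linear least-squares problem that is purely illustrative (not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                   # toy inputs
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0])     # toy targets
    theta = np.zeros(5)                              # initial parameter
    eps = 0.01                                       # learning rate

    for k in range(20000):
        i = rng.integers(len(X))                     # sample one example (x_i, y_i)
        g_hat = 2.0 * (X[i] @ theta - y[i]) * X[i]   # gradient of the squared loss
        theta = theta - eps * g_hat                  # apply update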

22-24 Learning Rate Schedule
In practice the learning rate is decayed linearly till iteration τ: ε_k = (1 − α) ε_0 + α ε_τ with α = k/τ
τ is usually set to the number of iterations needed for a large number of passes through the data
ε_τ should roughly be set to 1% of ε_0
How to set ε_0?
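In code the schedule might look like the following sketch (the values of ε_0, ε_τ and τ below are placeholders, not recommendations from the slides):

    def learning_rate(k, eps0=0.1, eps_tau=0.001, tau=10000):
        # Linear decay from eps0 to eps_tau over tau iterations; constant afterwards.
        if k >= tau:
            return eps_tau
        alpha = k / tau
        return (1 - alpha) * eps0 + alpha * eps_tau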

25-29 Minibatching
Potential Problem: Gradient estimates can be very noisy
Obvious Solution: Use larger mini-batches
Advantage: Computation time per update does not depend on the number of training examples N
This allows convergence on extremely large datasets
See: "Large Scale Learning with Stochastic Gradient Descent" by Léon Bottou
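A sketch of the minibatch version of the gradient estimate (batch_size and grad_L are illustrative assumptions):

    import numpy as np

    def minibatch_gradient(theta, X, y, grad_L, batch_size=64, rng=None):
        # Average per-example gradients over a randomly sampled minibatch;
        # the cost per update depends on batch_size, not on the dataset size N.
        rng = rng or np.random.default_rng()
        idx = rng.choice(len(X), size=batch_size, replace=False)
        return np.mean([grad_L(theta, X[i], y[i]) for i in idx], axis=0)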

30-38 Stochastic Gradient Descent (figure-only slides illustrating the noisier SGD trajectory)

39-40 So far..
Batch Gradient Descent: ĝ ← (1/N) Σ_i ∇_θ L(f(x^(i); θ), y^(i)), then θ ← θ − ε ĝ
SGD: ĝ ← ∇_θ L(f(x^(i); θ), y^(i)), then θ ← θ − ε ĝ

41-44 Momentum
The momentum method is a technique for accelerating learning with SGD
In particular, SGD suffers in the following scenarios:
Error surface has high curvature
We get small but consistent gradients
The gradients are very noisy

45 Momentum Gradient Descent would move quickly down the walls, but very slowly through the valley floor

46-49 Momentum
How do we try and solve this problem?
Introduce a new variable v, the velocity
We think of v as the direction and speed by which the parameters move as the learning dynamics progresses
The velocity is an exponentially decaying moving average of the negative gradients:
v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i)), with α ∈ [0, 1)
Update rule: θ ← θ + v

50-53 Momentum
Let's look at the velocity term: v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i))
The velocity accumulates the previous gradients
What is the role of α? If α is larger than ε, the current update is more affected by the previous gradients
Usually α is set high, e.g. 0.8 or 0.9

54-56 Momentum (figure slides: the gradient step, the momentum step, and the resulting actual step)

57-60 Momentum: Step Sizes
In SGD, the step size was the norm of the gradient scaled by the learning rate, ε‖g‖. Why?
While using momentum, the step size also depends on the norm and alignment of a sequence of gradients
For example, if at each step we observed g, the step size would be (exercise!): ε‖g‖ / (1 − α)
If α = 0.9, this multiplies the maximum speed by 10 relative to the current gradient direction
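Where the 1/(1 − α) factor comes from: if the same gradient g is observed at every step, the velocity approaches a fixed point v* satisfying v* = α v* − ε g, so v* = −ε g / (1 − α) and the terminal step size is ‖v*‖ = ε‖g‖ / (1 − α); with α = 0.9 this is 10 times the plain SGD step.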

61 Momentum Illustration of how momentum traverses such an error surface better than plain gradient descent

62 SGD with Momentum
Algorithm 3 Stochastic Gradient Descent with Momentum
Require: Learning rate ε_k
Require: Momentum Parameter α
Require: Initial Parameter θ
Require: Initial Velocity v
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:   ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
5:   Compute the velocity update:
6:   v ← α v − ε ĝ
7:   Apply Update: θ ← θ + v
8: end while
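One step of this loop as a sketch (the hyperparameter values are illustrative defaults, not prescriptions from the slides):

    def momentum_update(theta, v, g_hat, alpha=0.9, eps=0.01):
        # v accumulates an exponentially decaying average of the negative gradients;
        # the parameters then move by the velocity.
        v = alpha * v - eps * g_hat
        theta = theta + v
        return theta, v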

63-65 Nesterov Momentum
Another approach: First take a step in the direction of the accumulated gradient, then calculate the gradient and make a correction
(figure slides: accumulated gradient, correction, new accumulated gradient)

66-69 Nesterov Momentum (figure slides: the next step)

70-72 Let's Write it out..
Recall the velocity term in the Momentum method: v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i))
Nesterov Momentum: v ← α v − ε ∇_θ L(f(x^(i); θ + αv), y^(i))
Update: θ ← θ + v

73 SGD with Nesterov Momentum
Algorithm 4 SGD with Nesterov Momentum
Require: Learning rate ε
Require: Momentum Parameter α
Require: Initial Parameter θ
Require: Initial Velocity v
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Apply interim update: θ̃ ← θ + α v
4:   Compute gradient estimate at the interim point:
5:   ĝ ← ∇_θ̃ L(f(x^(i); θ̃), y^(i))
6:   Compute the velocity update: v ← α v − ε ĝ
7:   Apply Update: θ ← θ + v
8: end while
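A sketch of one Nesterov step; grad_fn stands for whatever computes the stochastic gradient at a given parameter value (an assumption, not part of the slides):

    def nesterov_update(theta, v, grad_fn, alpha=0.9, eps=0.01):
        # Evaluate the gradient at the look-ahead point theta + alpha*v,
        # then fold it into the velocity and step.
        g_hat = grad_fn(theta + alpha * v)
        v = alpha * v - eps * g_hat
        theta = theta + v
        return theta, v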

74 Adaptive Learning Rate Methods

75-77 Motivation
Till now we assign the same learning rate to all features
If the features vary in importance and frequency, why is this a good idea?
It's probably not!

78-79 Motivation (figure slides: a nice case where all features are equally important, and a harder one)

80-82 AdaGrad
Idea: Downscale a model parameter by the square-root of the sum of squares of all its historical gradient values
Parameters with large partial derivatives of the loss have their learning rates rapidly declined
Some interesting theoretical properties

83 AdaGrad
Algorithm 5 AdaGrad
Require: Global Learning rate ε, Initial Parameter θ, small constant δ
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   Accumulate: r ← r + ĝ ⊙ ĝ
5:   Compute update: Δθ ← −(ε / (δ + √r)) ⊙ ĝ
6:   Apply Update: θ ← θ + Δθ
7: end while
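One AdaGrad step as a sketch (the value of δ here is an illustrative choice):

    import numpy as np

    def adagrad_update(theta, r, g_hat, eps=0.01, delta=1e-7):
        # r accumulates the element-wise squared gradients over the whole history;
        # each coordinate's effective learning rate shrinks as its history grows.
        r = r + g_hat * g_hat
        theta = theta - (eps / (delta + np.sqrt(r))) * g_hat
        return theta, r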

84-88 RMSProp
AdaGrad is good when the objective is convex.
AdaGrad might shrink the learning rate too aggressively; we want to keep the history in mind
We can adapt it to perform better in the non-convex setting by accumulating an exponentially decaying average of the squared gradient
This is an idea that we use again and again in Neural Networks
Currently has about 500 citations on Google Scholar, but was proposed in a slide in Geoffrey Hinton's Coursera course

89 RMSProp
Algorithm 6 RMSProp
Require: Global Learning rate ε, decay parameter ρ, small constant δ
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   Accumulate: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ
5:   Compute update: Δθ ← −(ε / √(δ + r)) ⊙ ĝ
6:   Apply Update: θ ← θ + Δθ
7: end while
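The same update with the decaying accumulator, as a sketch (hyperparameter values are illustrative):

    import numpy as np

    def rmsprop_update(theta, r, g_hat, eps=0.001, rho=0.9, delta=1e-6):
        # r is an exponentially decaying average of the element-wise squared gradients,
        # so very old gradients are eventually forgotten.
        r = rho * r + (1 - rho) * g_hat * g_hat
        theta = theta - (eps / np.sqrt(delta + r)) * g_hat
        return theta, r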

90 RMSProp with Nesterov
Algorithm 7 RMSProp with Nesterov
Require: Global Learning rate ε, decay parameter ρ, small constant δ, momentum parameter α, initial velocity v
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Apply interim update: θ̃ ← θ + α v
4:   Compute gradient estimate: ĝ ← ∇_θ̃ L(f(x^(i); θ̃), y^(i))
5:   Accumulate: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ
6:   Compute Velocity: v ← α v − (ε / √r) ⊙ ĝ
7:   Apply Update: θ ← θ + v
8: end while

91-93 Adam
We could have used RMSProp with momentum
Use of momentum with rescaling is not well motivated
Adam is like RMSProp with momentum, but with bias correction terms for the first and second moments

94 Adam: ADAptive Moments
Algorithm 8 Adam
Require: Learning rate ε, decay rates ρ_1 (set to 0.9), ρ_2 (set to 0.9), Initial Parameter θ, small constant δ
Initialize moment variables s = 0 and r = 0, time step t = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   t ← t + 1
5:   Update: s ← ρ_1 s + (1 − ρ_1) ĝ
6:   Update: r ← ρ_2 r + (1 − ρ_2) ĝ ⊙ ĝ
7:   Correct Biases: ŝ ← s / (1 − ρ_1^t), r̂ ← r / (1 − ρ_2^t)
8:   Compute Update: Δθ ← −ε ŝ / (√r̂ + δ)
9:   Apply Update: θ ← θ + Δθ
10: end while
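A sketch of one Adam step; the defaults below (including ρ_2 = 0.999) follow the Adam paper's suggestions rather than the values printed on the slide:

    import numpy as np

    def adam_update(theta, s, r, g_hat, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
        # s and r are decaying estimates of the first and second moments of the gradient;
        # the bias correction undoes their initialization at zero (t counts from 1).
        s = rho1 * s + (1 - rho1) * g_hat
        r = rho2 * r + (1 - rho2) * g_hat * g_hat
        s_hat = s / (1 - rho1 ** t)
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
        return theta, s, r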

95 All your GRADs are belong to us!
SGD: θ ← θ − ε ĝ
Momentum: v ← α v − ε ĝ, then θ ← θ + v
Nesterov: v ← α v − ε ∇_θ L(f(x^(i); θ + αv), y^(i)), then θ ← θ + v
AdaGrad: r ← r + ĝ ⊙ ĝ, then Δθ ← −(ε / (δ + √r)) ⊙ ĝ, then θ ← θ + Δθ
RMSProp: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ, then Δθ ← −(ε / √(δ + r)) ⊙ ĝ, then θ ← θ + Δθ
Adam: ŝ ← s / (1 − ρ_1^t), r̂ ← r / (1 − ρ_2^t), then Δθ ← −ε ŝ / (√r̂ + δ), then θ ← θ + Δθ

96 Batch Normalization

97 A Difficulty in Training Deep Neural Networks
A deep model involves composition of several functions, e.g.
ŷ = W_4^T tanh(W_3^T tanh(W_2^T tanh(W_1^T x + b_1) + b_2) + b_3)
(figure: the corresponding network with inputs x_1, x_2, x_3, x_4 and output ŷ)

98-102 A Difficulty in Training Deep Neural Networks
We have a recipe to compute gradients (Backpropagation), and update every parameter (we saw half a dozen methods)
Implicit Assumption: Other layers don't change, i.e. the other functions are fixed
In Practice: We update all layers simultaneously
This can give rise to unexpected difficulties
Let's look at two illustrations

103-105 Intuition
Consider a second order approximation of our cost function (which is a function composition) around the current point θ^(0):
J(θ) ≈ J(θ^(0)) + (θ − θ^(0))^T g + ½ (θ − θ^(0))^T H (θ − θ^(0))
g is the gradient and H the Hessian at θ^(0)
If ε is the learning rate, the new point is θ = θ^(0) − ε g

106-110 Intuition
Plugging our new point θ = θ^(0) − ε g into the approximation:
J(θ^(0) − ε g) ≈ J(θ^(0)) − ε g^T g + ½ ε² g^T H g
There are three terms here:
Value of the function before the update
Improvement using the gradient (i.e. first order information)
Correction factor that accounts for the curvature of the function

111-116 Intuition
J(θ^(0) − ε g) ≈ J(θ^(0)) − ε g^T g + ½ ε² g^T H g
Observations:
g^T H g too large: the step can start moving the cost upwards
g^T H g = 0: J will decrease even for large ε
Optimal step size: with zero curvature, larger ε only helps; taking curvature into account (minimize the quadratic in ε), ε* = g^T g / (g^T H g)
Conclusion: Just neglecting second order effects can cause problems (remedy: second order methods). What about higher order effects?

117-119 Higher Order Effects: Toy Model
(figure: a chain of single-unit layers x → h_1 → h_2 → ... → h_l → ŷ with weights w_1, w_2, ..., w_l)
Just one node per layer, no non-linearity
ŷ is linear in x but non-linear in the w_i

120-124 Higher Order Effects: Toy Model
Suppose δ = 1, so we want to decrease our output ŷ
Usual strategy: Using backprop, find g = ∇_w (ŷ − y)²
Update the weights: w := w − ε g
The first order Taylor approximation (in the previous slide) says the cost will reduce by ε g^T g
If we need to reduce the cost by 0.1, then the learning rate should be 0.1 / (g^T g)

125-129 Higher Order Effects: Toy Model
The new ŷ will however be: ŷ = x (w_1 − ε g_1)(w_2 − ε g_2) ... (w_l − ε g_l)
This contains terms like ε³ g_1 g_2 g_3 w_4 w_5 ... w_l
If the weights w_4, w_5, ..., w_l are small, the term is negligible. But if they are large, it can explode
Conclusion: Higher order terms make it very hard to choose the right learning rate
Second order methods are already expensive; nth order methods are hopeless. Solution?

130-133 Batch Normalization
A method to reparameterize a deep network to reduce the co-ordination of updates across layers
Can be applied to the input layer, or any hidden layer
Let H be a design matrix holding the activations of a layer for the m examples in the mini-batch:
H = [ h_11  h_12  h_13 ...  h_1k ]
    [ h_21  h_22  h_23 ...  h_2k ]
    [  ...                       ]
    [ h_m1  h_m2  h_m3 ...  h_mk ]

134-136 Batch Normalization
Each row of H represents all the activations in the layer for one example
Idea: Replace H by H′ = (H − μ) / σ
μ is the mean of each unit and σ the standard deviation

137-139 Batch Normalization
μ is a vector with μ_j the mean of column j
σ is a vector with σ_j the standard deviation of column j
H_{i,j} is normalized by subtracting μ_j and dividing by σ_j

140-141 Batch Normalization
During training we have: μ = (1/m) Σ_i H_{i,:}
σ = √( δ + (1/m) Σ_i (H − μ)_i² )
We then operate on H′ as before, i.e. we backpropagate through the normalized activations

142-145 Why is this good?
The update will never act to only increase the mean and standard deviation of any activation
Previous approaches added penalties to the cost, or acted per layer, to encourage units to have standardized outputs
Batch normalization makes the reparameterization easier
At test time: use running averages of μ and σ collected during training to evaluate a new input x

146-149 An Innovation
Standardizing the output of a unit can limit the expressive power of the neural network
Solution: Instead of replacing H by H′, replace it with γH′ + β
γ and β are also learned by backpropagation
Normalizing the mean and standard deviation was the goal of batch normalization, so why add γ and β back in?
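A sketch of the training-time forward pass for one layer's minibatch activations H (shape m x k); the value of δ is an illustrative choice:

    import numpy as np

    def batchnorm_forward(H, gamma, beta, delta=1e-5):
        # Normalize each unit (column) over the minibatch, then rescale and shift
        # with the learned per-unit parameters gamma and beta.
        mu = H.mean(axis=0)
        sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))
        H_norm = (H - mu) / sigma
        return gamma * H_norm + beta

At test time the minibatch statistics above would be replaced by the running averages of μ and σ collected during training.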

150 Initialization Strategies

151-157 Initialization
In convex problems, with a good ε, convergence is guaranteed no matter what the initialization
In the non-convex regime initialization is much more important
Some parameter initializations can be unstable and not converge
Neural networks are not well enough understood to have principled, mathematically nice initialization strategies
What is known: Initialization should break symmetry (quiz!)
What is known: The scale of the weights is important
Most initialization strategies are based on intuitions and heuristics

158-161 Some Heuristics
For a fully connected layer with m inputs and n outputs, sample: W_ij ~ U(−1/√m, 1/√m)
Xavier Initialization: Sample W_ij ~ U(−√(6/(m+n)), √(6/(m+n)))
Xavier initialization is derived by treating the network as a chain of matrix multiplications with no nonlinearities
Works well in practice!
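Both heuristics as sketches (using NumPy's default generator is an incidental choice):

    import numpy as np

    def uniform_fan_in(m, n, rng=None):
        # W_ij ~ U(-1/sqrt(m), 1/sqrt(m)) for a layer with m inputs and n outputs.
        rng = rng or np.random.default_rng()
        limit = 1.0 / np.sqrt(m)
        return rng.uniform(-limit, limit, size=(m, n))

    def xavier_uniform(m, n, rng=None):
        # W_ij ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n))).
        rng = rng or np.random.default_rng()
        limit = np.sqrt(6.0 / (m + n))
        return rng.uniform(-limit, limit, size=(m, n))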

162-166 More Heuristics
Saxe et al. 2013 recommend initializing to random orthogonal matrices, with a carefully chosen gain g that accounts for non-linearities
If g could be divined, it could solve the vanishing and exploding gradients problem (more later)
The idea of choosing g and initializing the weights accordingly is that we want the norm of the activations to increase, and strong gradients to be passed back
Martens 2010 suggested a sparse initialization: each unit receives only k non-zero weights
Motivation: It is a bad idea for all initial weights to have the same standard deviation 1/√m

167-171 Polyak Averaging: Motivation
Consider gradient descent with a high step size ε
(figure slides: the gradient points alternately towards the right and the left as the iterates oscillate across the valley)

172-175 A Solution: Polyak Averaging
Suppose in t iterations you have parameters θ^(1), θ^(2), ..., θ^(t)
Polyak Averaging suggests setting θ̂^(t) = (1/t) Σ_i θ^(i)
Has strong convergence guarantees in convex settings
Is this a good idea in non-convex problems?

176-178 Simple Modification
In non-convex surfaces the parameter space can differ greatly in different regions, so a plain average is not useful
Typical to consider an exponentially decaying average instead: θ̂^(t) = α θ̂^(t−1) + (1 − α) θ^(t), with α ∈ [0, 1]
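As a sketch, the running average is one line per step; the value of α is illustrative:

    def polyak_ema_update(theta_avg, theta, alpha=0.999):
        # Exponentially decaying average of the iterates, kept alongside the
        # optimizer's own theta and typically used for evaluation.
        return alpha * theta_avg + (1 - alpha) * theta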

179 Next time Convolutional Neural Networks


CSC2541 Lecture 5 Natural Gradient CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient 1 / 12 Motivation Two classes of optimization procedures used throughout ML (stochastic) gradient descent,

More information

DeepLearning on FPGAs

DeepLearning on FPGAs DeepLearning on FPGAs Introduction to Deep Learning Sebastian Buschäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 21, 2017 1 Recap Computer Science Approach Technical

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Neural Networks: Backpropagation

Neural Networks: Backpropagation Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

More information

Advanced Training Techniques. Prajit Ramachandran

Advanced Training Techniques. Prajit Ramachandran Advanced Training Techniques Prajit Ramachandran Outline Optimization Regularization Initialization Optimization Optimization Outline Gradient Descent Momentum RMSProp Adam Distributed SGD Gradient Noise

More information

Don t Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017

Don t Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017 Don t Decay the Learning Rate, Increase the Batch Size Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017 slsmith@ Google Brain Three related questions: What properties control generalization?

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

More Optimization. Optimization Methods. Methods

More Optimization. Optimization Methods. Methods More More Optimization Optimization Methods Methods Yann YannLeCun LeCun Courant CourantInstitute Institute http://yann.lecun.com http://yann.lecun.com (almost) (almost) everything everything you've you've

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Improving the Convergence of Back-Propogation Learning with Second Order Methods the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible

More information

Adam: A Method for Stochastic Optimization

Adam: A Method for Stochastic Optimization Adam: A Method for Stochastic Optimization Diederik P. Kingma, Jimmy Ba Presented by Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations

More information

Lecture 22. r i+1 = b Ax i+1 = b A(x i + α i r i ) =(b Ax i ) α i Ar i = r i α i Ar i

Lecture 22. r i+1 = b Ax i+1 = b A(x i + α i r i ) =(b Ax i ) α i Ar i = r i α i Ar i 8.409 An Algorithmist s oolkit December, 009 Lecturer: Jonathan Kelner Lecture Last time Last time, we reduced solving sparse systems of linear equations Ax = b where A is symmetric and positive definite

More information

More Tips for Training Neural Network. Hung-yi Lee

More Tips for Training Neural Network. Hung-yi Lee More Tips for Training Neural Network Hung-yi ee Outline Activation Function Cost Function Data Preprocessing Training Generalization Review: Training Neural Network Neural network: f ; θ : input (vector)

More information

More on Neural Networks

More on Neural Networks More on Neural Networks Yujia Yan Fall 2018 Outline Linear Regression y = Wx + b (1) Linear Regression y = Wx + b (1) Polynomial Regression y = Wφ(x) + b (2) where φ(x) gives the polynomial basis, e.g.,

More information

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information