Lecture 6 Optimization for Deep Neural Networks


1 Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017

2-8 Things we will look at today
Stochastic Gradient Descent
Momentum Method and the Nesterov Variant
Adaptive Learning Rate Methods (AdaGrad, RMSProp, Adam)
Batch Normalization
Initialization Heuristics
Polyak Averaging
On slides but for self study: Newton and Quasi-Newton Methods (BFGS, L-BFGS, Conjugate Gradient)

9-11 Optimization
We've seen backpropagation as a method for computing gradients
Assignment: was about implementing SGD in conjunction with backprop
Let's now look at a family of first order methods

12 Batch Gradient Descent
Algorithm 1 Batch Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial Parameter θ
1: while stopping criteria not met do
2:   Compute gradient estimate over N examples:
3:   ĝ ← (1/N) Σ_i ∇_θ L(f(x^(i); θ), y^(i))
4:   Apply Update: θ ← θ − ε ĝ
5: end while
Positive: Gradient estimates are stable
Negative: Need to compute gradients over the entire training set for one update
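As a rough NumPy sketch of one such update (grad_L stands for an assumed per-example gradient function, not something defined on the slides):

    import numpy as np

    def batch_gradient_step(theta, X, y, grad_L, eps=0.1):
        # Average the per-example gradients over the entire training set of N examples,
        # then take one step against that average.
        g_hat = np.mean([grad_L(theta, x_i, y_i) for x_i, y_i in zip(X, y)], axis=0)
        return theta - eps * g_hat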

13-18 Gradient Descent (figure-only slides illustrating gradient descent steps on an error surface)

19-21 Stochastic Gradient Descent
Algorithm 2 Stochastic Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial Parameter θ
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:   ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
5:   Apply Update: θ ← θ − ε ĝ
6: end while
ε_k is the learning rate at step k
Sufficient conditions to guarantee convergence: Σ_{k=1}^∞ ε_k = ∞ and Σ_{k=1}^∞ ε_k² < ∞
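A minimal runnable sketch of this loop, using a toy linear least-squares problem that is purely illustrative (not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                   # toy inputs
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0])     # toy targets
    theta = np.zeros(5)                              # initial parameter
    eps = 0.01                                       # learning rate

    for k in range(20000):
        i = rng.integers(len(X))                     # sample one example (x_i, y_i)
        g_hat = 2.0 * (X[i] @ theta - y[i]) * X[i]   # gradient of the squared loss
        theta = theta - eps * g_hat                  # apply update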

22-24 Learning Rate Schedule
In practice the learning rate is decayed linearly till iteration τ: ε_k = (1 − α) ε_0 + α ε_τ with α = k/τ
τ is usually set to the number of iterations needed for a large number of passes through the data
ε_τ should roughly be set to 1% of ε_0
How to set ε_0?
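In code the schedule might look like the following sketch (the values of ε_0, ε_τ and τ below are placeholders, not recommendations from the slides):

    def learning_rate(k, eps0=0.1, eps_tau=0.001, tau=10000):
        # Linear decay from eps0 to eps_tau over tau iterations; constant afterwards.
        if k >= tau:
            return eps_tau
        alpha = k / tau
        return (1 - alpha) * eps0 + alpha * eps_tau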

25-29 Minibatching
Potential Problem: Gradient estimates can be very noisy
Obvious Solution: Use larger mini-batches
Advantage: Computation time per update does not depend on the number of training examples N
This allows convergence on extremely large datasets
See: "Large Scale Learning with Stochastic Gradient Descent" by Léon Bottou
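A sketch of the minibatch version of the gradient estimate (batch_size and grad_L are illustrative assumptions):

    import numpy as np

    def minibatch_gradient(theta, X, y, grad_L, batch_size=64, rng=None):
        # Average per-example gradients over a randomly sampled minibatch;
        # the cost per update depends on batch_size, not on the dataset size N.
        rng = rng or np.random.default_rng()
        idx = rng.choice(len(X), size=batch_size, replace=False)
        return np.mean([grad_L(theta, X[i], y[i]) for i in idx], axis=0)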

30-38 Stochastic Gradient Descent (figure-only slides illustrating the noisier SGD trajectory)

39-40 So far..
Batch Gradient Descent: ĝ ← (1/N) Σ_i ∇_θ L(f(x^(i); θ), y^(i)), then θ ← θ − ε ĝ
SGD: ĝ ← ∇_θ L(f(x^(i); θ), y^(i)), then θ ← θ − ε ĝ

41-44 Momentum
The momentum method is a technique for accelerating learning with SGD
In particular, SGD suffers in the following scenarios:
Error surface has high curvature
We get small but consistent gradients
The gradients are very noisy

45 Momentum Gradient Descent would move quickly down the walls, but very slowly through the valley floor

46-49 Momentum
How do we try and solve this problem?
Introduce a new variable v, the velocity
We think of v as the direction and speed by which the parameters move as the learning dynamics progresses
The velocity is an exponentially decaying moving average of the negative gradients:
v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i)), with α ∈ [0, 1)
Update rule: θ ← θ + v

50-53 Momentum
Let's look at the velocity term: v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i))
The velocity accumulates the previous gradients
What is the role of α? If α is larger than ε, the current update is more affected by the previous gradients
Usually α is set high, e.g. 0.8 or 0.9

54-56 Momentum (figure slides: the gradient step, the momentum step, and the resulting actual step)

57-60 Momentum: Step Sizes
In SGD, the step size was the norm of the gradient scaled by the learning rate, ε‖g‖. Why?
While using momentum, the step size also depends on the norm and alignment of a sequence of gradients
For example, if at each step we observed g, the step size would be (exercise!): ε‖g‖ / (1 − α)
If α = 0.9, this multiplies the maximum speed by 10 relative to the current gradient direction
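Where the 1/(1 − α) factor comes from: if the same gradient g is observed at every step, the velocity approaches a fixed point v* satisfying v* = α v* − ε g, so v* = −ε g / (1 − α) and the terminal step size is ‖v*‖ = ε‖g‖ / (1 − α); with α = 0.9 this is 10 times the plain SGD step.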

61 Momentum Illustration of how momentum traverses such an error surface better than plain gradient descent

62 SGD with Momentum
Algorithm 3 Stochastic Gradient Descent with Momentum
Require: Learning rate ε_k
Require: Momentum Parameter α
Require: Initial Parameter θ
Require: Initial Velocity v
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:   ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
5:   Compute the velocity update:
6:   v ← α v − ε ĝ
7:   Apply Update: θ ← θ + v
8: end while
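One step of this loop as a sketch (the hyperparameter values are illustrative defaults, not prescriptions from the slides):

    def momentum_update(theta, v, g_hat, alpha=0.9, eps=0.01):
        # v accumulates an exponentially decaying average of the negative gradients;
        # the parameters then move by the velocity.
        v = alpha * v - eps * g_hat
        theta = theta + v
        return theta, v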

63-65 Nesterov Momentum
Another approach: First take a step in the direction of the accumulated gradient, then calculate the gradient and make a correction
(figure slides: accumulated gradient, correction, new accumulated gradient)

66-69 Nesterov Momentum (figure slides: the next step)

70-72 Let's Write it out..
Recall the velocity term in the Momentum method: v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i))
Nesterov Momentum: v ← α v − ε ∇_θ L(f(x^(i); θ + αv), y^(i))
Update: θ ← θ + v

73 SGD with Nesterov Momentum
Algorithm 4 SGD with Nesterov Momentum
Require: Learning rate ε
Require: Momentum Parameter α
Require: Initial Parameter θ
Require: Initial Velocity v
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Apply interim update: θ̃ ← θ + α v
4:   Compute gradient estimate at the interim point:
5:   ĝ ← ∇_θ̃ L(f(x^(i); θ̃), y^(i))
6:   Compute the velocity update: v ← α v − ε ĝ
7:   Apply Update: θ ← θ + v
8: end while
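A sketch of one Nesterov step; grad_fn stands for whatever computes the stochastic gradient at a given parameter value (an assumption, not part of the slides):

    def nesterov_update(theta, v, grad_fn, alpha=0.9, eps=0.01):
        # Evaluate the gradient at the look-ahead point theta + alpha*v,
        # then fold it into the velocity and step.
        g_hat = grad_fn(theta + alpha * v)
        v = alpha * v - eps * g_hat
        theta = theta + v
        return theta, v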

74 Adaptive Learning Rate Methods

75-77 Motivation
Till now we assign the same learning rate to all features
If the features vary in importance and frequency, why is this a good idea?
It's probably not!

78-79 Motivation (figure slides: a nice case where all features are equally important, and a harder one)

80-82 AdaGrad
Idea: Downscale a model parameter by the square-root of the sum of squares of all its historical gradient values
Parameters with large partial derivatives of the loss have their learning rates rapidly declined
Some interesting theoretical properties

83 AdaGrad
Algorithm 5 AdaGrad
Require: Global Learning rate ε, Initial Parameter θ, small constant δ
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   Accumulate: r ← r + ĝ ⊙ ĝ
5:   Compute update: Δθ ← −(ε / (δ + √r)) ⊙ ĝ
6:   Apply Update: θ ← θ + Δθ
7: end while
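One AdaGrad step as a sketch (the value of δ here is an illustrative choice):

    import numpy as np

    def adagrad_update(theta, r, g_hat, eps=0.01, delta=1e-7):
        # r accumulates the element-wise squared gradients over the whole history;
        # each coordinate's effective learning rate shrinks as its history grows.
        r = r + g_hat * g_hat
        theta = theta - (eps / (delta + np.sqrt(r))) * g_hat
        return theta, r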

84-88 RMSProp
AdaGrad is good when the objective is convex.
AdaGrad might shrink the learning rate too aggressively; we want to keep the history in mind
We can adapt it to perform better in the non-convex setting by accumulating an exponentially decaying average of the squared gradient
This is an idea that we use again and again in Neural Networks
Currently has about 500 citations on Google Scholar, but was proposed in a slide in Geoffrey Hinton's Coursera course

89 RMSProp
Algorithm 6 RMSProp
Require: Global Learning rate ε, decay parameter ρ, small constant δ
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   Accumulate: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ
5:   Compute update: Δθ ← −(ε / √(δ + r)) ⊙ ĝ
6:   Apply Update: θ ← θ + Δθ
7: end while
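The same update with the decaying accumulator, as a sketch (hyperparameter values are illustrative):

    import numpy as np

    def rmsprop_update(theta, r, g_hat, eps=0.001, rho=0.9, delta=1e-6):
        # r is an exponentially decaying average of the element-wise squared gradients,
        # so very old gradients are eventually forgotten.
        r = rho * r + (1 - rho) * g_hat * g_hat
        theta = theta - (eps / np.sqrt(delta + r)) * g_hat
        return theta, r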

90 RMSProp with Nesterov
Algorithm 7 RMSProp with Nesterov
Require: Global Learning rate ε, decay parameter ρ, small constant δ, momentum parameter α, initial velocity v
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Apply interim update: θ̃ ← θ + α v
4:   Compute gradient estimate: ĝ ← ∇_θ̃ L(f(x^(i); θ̃), y^(i))
5:   Accumulate: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ
6:   Compute Velocity: v ← α v − (ε / √r) ⊙ ĝ
7:   Apply Update: θ ← θ + v
8: end while

91-93 Adam
We could have used RMSProp with momentum
Use of momentum with rescaling is not well motivated
Adam is like RMSProp with momentum, but with bias correction terms for the first and second moments

94 Adam: ADAptive Moments
Algorithm 8 Adam
Require: Learning rate ε, decay rates ρ_1 (set to 0.9), ρ_2 (set to 0.9), Initial Parameter θ, small constant δ
Initialize moment variables s = 0 and r = 0, time step t = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   t ← t + 1
5:   Update: s ← ρ_1 s + (1 − ρ_1) ĝ
6:   Update: r ← ρ_2 r + (1 − ρ_2) ĝ ⊙ ĝ
7:   Correct Biases: ŝ ← s / (1 − ρ_1^t), r̂ ← r / (1 − ρ_2^t)
8:   Compute Update: Δθ ← −ε ŝ / (√r̂ + δ)
9:   Apply Update: θ ← θ + Δθ
10: end while
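A sketch of one Adam step; the defaults below (including ρ_2 = 0.999) follow the Adam paper's suggestions rather than the values printed on the slide:

    import numpy as np

    def adam_update(theta, s, r, g_hat, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
        # s and r are decaying estimates of the first and second moments of the gradient;
        # the bias correction undoes their initialization at zero (t counts from 1).
        s = rho1 * s + (1 - rho1) * g_hat
        r = rho2 * r + (1 - rho2) * g_hat * g_hat
        s_hat = s / (1 - rho1 ** t)
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
        return theta, s, r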

95 All your GRADs are belong to us!
SGD: θ ← θ − ε ĝ
Momentum: v ← α v − ε ĝ, then θ ← θ + v
Nesterov: v ← α v − ε ∇_θ L(f(x^(i); θ + αv), y^(i)), then θ ← θ + v
AdaGrad: r ← r + ĝ ⊙ ĝ, then Δθ ← −(ε / (δ + √r)) ⊙ ĝ, then θ ← θ + Δθ
RMSProp: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ, then Δθ ← −(ε / √(δ + r)) ⊙ ĝ, then θ ← θ + Δθ
Adam: ŝ ← s / (1 − ρ_1^t), r̂ ← r / (1 − ρ_2^t), then Δθ ← −ε ŝ / (√r̂ + δ), then θ ← θ + Δθ

96 Batch Normalization

97 A Difficulty in Training Deep Neural Networks
A deep model involves composition of several functions, e.g.
ŷ = W_4^T tanh(W_3^T tanh(W_2^T tanh(W_1^T x + b_1) + b_2) + b_3)
(figure: the corresponding network with inputs x_1, x_2, x_3, x_4 and output ŷ)

98-102 A Difficulty in Training Deep Neural Networks
We have a recipe to compute gradients (Backpropagation), and update every parameter (we saw half a dozen methods)
Implicit Assumption: Other layers don't change, i.e. the other functions are fixed
In Practice: We update all layers simultaneously
This can give rise to unexpected difficulties
Let's look at two illustrations

103-105 Intuition
Consider a second order approximation of our cost function (which is a function composition) around the current point θ^(0):
J(θ) ≈ J(θ^(0)) + (θ − θ^(0))^T g + ½ (θ − θ^(0))^T H (θ − θ^(0))
g is the gradient and H the Hessian at θ^(0)
If ε is the learning rate, the new point is θ = θ^(0) − ε g

106-110 Intuition
Plugging our new point θ = θ^(0) − ε g into the approximation:
J(θ^(0) − ε g) ≈ J(θ^(0)) − ε g^T g + ½ ε² g^T H g
There are three terms here:
Value of the function before the update
Improvement using the gradient (i.e. first order information)
Correction factor that accounts for the curvature of the function

111-116 Intuition
J(θ^(0) − ε g) ≈ J(θ^(0)) − ε g^T g + ½ ε² g^T H g
Observations:
g^T H g too large: the step can start moving the cost upwards
g^T H g = 0: J will decrease even for large ε
Optimal step size: with zero curvature, larger ε only helps; taking curvature into account (minimize the quadratic in ε), ε* = g^T g / (g^T H g)
Conclusion: Just neglecting second order effects can cause problems (remedy: second order methods). What about higher order effects?

117-119 Higher Order Effects: Toy Model
(figure: a chain of single-unit layers x → h_1 → h_2 → ... → h_l → ŷ with weights w_1, w_2, ..., w_l)
Just one node per layer, no non-linearity
ŷ is linear in x but non-linear in the w_i

120-124 Higher Order Effects: Toy Model
Suppose δ = 1, so we want to decrease our output ŷ
Usual strategy: Using backprop, find g = ∇_w (ŷ − y)²
Update the weights: w := w − ε g
The first order Taylor approximation (in the previous slide) says the cost will reduce by ε g^T g
If we need to reduce the cost by 0.1, then the learning rate should be 0.1 / (g^T g)

125-129 Higher Order Effects: Toy Model
The new ŷ will however be: ŷ = x (w_1 − ε g_1)(w_2 − ε g_2) ... (w_l − ε g_l)
This contains terms like ε³ g_1 g_2 g_3 w_4 w_5 ... w_l
If the weights w_4, w_5, ..., w_l are small, the term is negligible. But if they are large, it can explode
Conclusion: Higher order terms make it very hard to choose the right learning rate
Second order methods are already expensive; nth order methods are hopeless. Solution?

130-133 Batch Normalization
A method to reparameterize a deep network to reduce the co-ordination of updates across layers
Can be applied to the input layer, or any hidden layer
Let H be a design matrix holding the activations of a layer for the m examples in the mini-batch:
H = [ h_11  h_12  h_13 ...  h_1k ]
    [ h_21  h_22  h_23 ...  h_2k ]
    [  ...                       ]
    [ h_m1  h_m2  h_m3 ...  h_mk ]

134-136 Batch Normalization
Each row of H represents all the activations in the layer for one example
Idea: Replace H by H′ = (H − μ) / σ
μ is the mean of each unit and σ the standard deviation

137-139 Batch Normalization
μ is a vector with μ_j the mean of column j
σ is a vector with σ_j the standard deviation of column j
H_{i,j} is normalized by subtracting μ_j and dividing by σ_j

140-141 Batch Normalization
During training we have: μ = (1/m) Σ_i H_{i,:}
σ = √( δ + (1/m) Σ_i (H − μ)_i² )
We then operate on H′ as before, i.e. we backpropagate through the normalized activations

142-145 Why is this good?
The update will never act to only increase the mean and standard deviation of any activation
Previous approaches added penalties to the cost, or acted per layer, to encourage units to have standardized outputs
Batch normalization makes the reparameterization easier
At test time: use running averages of μ and σ collected during training to evaluate a new input x

146-149 An Innovation
Standardizing the output of a unit can limit the expressive power of the neural network
Solution: Instead of replacing H by H′, replace it with γH′ + β
γ and β are also learned by backpropagation
Normalizing the mean and standard deviation was the goal of batch normalization, so why add γ and β back in?
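A sketch of the training-time forward pass for one layer's minibatch activations H (shape m x k); the value of δ is an illustrative choice:

    import numpy as np

    def batchnorm_forward(H, gamma, beta, delta=1e-5):
        # Normalize each unit (column) over the minibatch, then rescale and shift
        # with the learned per-unit parameters gamma and beta.
        mu = H.mean(axis=0)
        sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))
        H_norm = (H - mu) / sigma
        return gamma * H_norm + beta

At test time the minibatch statistics above would be replaced by the running averages of μ and σ collected during training.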

150 Initialization Strategies

151-157 Initialization
In convex problems, with a good ε, convergence is guaranteed no matter what the initialization
In the non-convex regime initialization is much more important
Some parameter initializations can be unstable and not converge
Neural networks are not well enough understood to have principled, mathematically nice initialization strategies
What is known: Initialization should break symmetry (quiz!)
What is known: The scale of the weights is important
Most initialization strategies are based on intuitions and heuristics

158-161 Some Heuristics
For a fully connected layer with m inputs and n outputs, sample: W_ij ~ U(−1/√m, 1/√m)
Xavier Initialization: Sample W_ij ~ U(−√(6/(m+n)), √(6/(m+n)))
Xavier initialization is derived by treating the network as a chain of matrix multiplications with no nonlinearities
Works well in practice!
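Both heuristics as sketches (using NumPy's default generator is an incidental choice):

    import numpy as np

    def uniform_fan_in(m, n, rng=None):
        # W_ij ~ U(-1/sqrt(m), 1/sqrt(m)) for a layer with m inputs and n outputs.
        rng = rng or np.random.default_rng()
        limit = 1.0 / np.sqrt(m)
        return rng.uniform(-limit, limit, size=(m, n))

    def xavier_uniform(m, n, rng=None):
        # W_ij ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n))).
        rng = rng or np.random.default_rng()
        limit = np.sqrt(6.0 / (m + n))
        return rng.uniform(-limit, limit, size=(m, n))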

162-166 More Heuristics
Saxe et al. 2013 recommend initializing to random orthogonal matrices, with a carefully chosen gain g that accounts for non-linearities
If g could be divined, it could solve the vanishing and exploding gradients problem (more later)
The idea of choosing g and initializing the weights accordingly is that we want the norm of the activations to increase, and strong gradients to be passed back
Martens 2010 suggested a sparse initialization: each unit receives only k non-zero weights
Motivation: It is a bad idea for all initial weights to have the same standard deviation 1/√m

167-171 Polyak Averaging: Motivation
Consider gradient descent with a high step size ε
(figure slides: the gradient points alternately towards the right and the left as the iterates oscillate across the valley)

172-175 A Solution: Polyak Averaging
Suppose in t iterations you have parameters θ^(1), θ^(2), ..., θ^(t)
Polyak Averaging suggests setting θ̂^(t) = (1/t) Σ_i θ^(i)
Has strong convergence guarantees in convex settings
Is this a good idea in non-convex problems?

176-178 Simple Modification
In non-convex surfaces the parameter space can differ greatly in different regions, so a plain average is not useful
Typical to consider an exponentially decaying average instead: θ̂^(t) = α θ̂^(t−1) + (1 − α) θ^(t), with α ∈ [0, 1]
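As a sketch, the running average is one line per step; the value of α is illustrative:

    def polyak_ema_update(theta_avg, theta, alpha=0.999):
        # Exponentially decaying average of the iterates, kept alongside the
        # optimizer's own theta and typically used for evaluation.
        return alpha * theta_avg + (1 - alpha) * theta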

179 Next time Convolutional Neural Networks


CSC2541 Lecture 5 Natural Gradient CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient 1 / 12 Motivation Two classes of optimization procedures used throughout ML (stochastic) gradient descent,

More information

DeepLearning on FPGAs

DeepLearning on FPGAs DeepLearning on FPGAs Introduction to Deep Learning Sebastian Buschäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 21, 2017 1 Recap Computer Science Approach Technical

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Neural Networks: Backpropagation

Neural Networks: Backpropagation Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

More information

Advanced Training Techniques. Prajit Ramachandran

Advanced Training Techniques. Prajit Ramachandran Advanced Training Techniques Prajit Ramachandran Outline Optimization Regularization Initialization Optimization Optimization Outline Gradient Descent Momentum RMSProp Adam Distributed SGD Gradient Noise

More information

Don t Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017

Don t Decay the Learning Rate, Increase the Batch Size. Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017 Don t Decay the Learning Rate, Increase the Batch Size Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le December 9 th 2017 slsmith@ Google Brain Three related questions: What properties control generalization?

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

More Optimization. Optimization Methods. Methods

More Optimization. Optimization Methods. Methods More More Optimization Optimization Methods Methods Yann YannLeCun LeCun Courant CourantInstitute Institute http://yann.lecun.com http://yann.lecun.com (almost) (almost) everything everything you've you've

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Improving the Convergence of Back-Propogation Learning with Second Order Methods the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible

More information

Adam: A Method for Stochastic Optimization

Adam: A Method for Stochastic Optimization Adam: A Method for Stochastic Optimization Diederik P. Kingma, Jimmy Ba Presented by Content Background Supervised ML theory and the importance of optimum finding Gradient descent and its variants Limitations

More information

Lecture 22. r i+1 = b Ax i+1 = b A(x i + α i r i ) =(b Ax i ) α i Ar i = r i α i Ar i

Lecture 22. r i+1 = b Ax i+1 = b A(x i + α i r i ) =(b Ax i ) α i Ar i = r i α i Ar i 8.409 An Algorithmist s oolkit December, 009 Lecturer: Jonathan Kelner Lecture Last time Last time, we reduced solving sparse systems of linear equations Ax = b where A is symmetric and positive definite

More information

More Tips for Training Neural Network. Hung-yi Lee

More Tips for Training Neural Network. Hung-yi Lee More Tips for Training Neural Network Hung-yi ee Outline Activation Function Cost Function Data Preprocessing Training Generalization Review: Training Neural Network Neural network: f ; θ : input (vector)

More information

More on Neural Networks

More on Neural Networks More on Neural Networks Yujia Yan Fall 2018 Outline Linear Regression y = Wx + b (1) Linear Regression y = Wx + b (1) Polynomial Regression y = Wφ(x) + b (2) where φ(x) gives the polynomial basis, e.g.,

More information

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information