Lecture 6 Optimization for Deep Neural Networks
1 Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017
2-8 Things we will look at today: Stochastic Gradient Descent; Momentum Method and the Nesterov Variant; Adaptive Learning Rate Methods (AdaGrad, RMSProp, Adam); Batch Normalization; Initialization Heuristics; Polyak Averaging. On slides but for self-study: Newton and Quasi-Newton Methods (BFGS, L-BFGS, Conjugate Gradient)
9-11 Optimization: We've seen backpropagation as a method for computing gradients. The assignment was about implementing SGD in conjunction with backprop. Let's see a family of first-order methods.
12 Batch Gradient Descent
Algorithm 1: Batch Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial parameter θ
1: while stopping criteria not met do
2:   Compute gradient estimate over N examples:
3:     ĝ ← (1/N) Σ_i ∇_θ L(f(x^(i); θ), y^(i))
4:   Apply update: θ ← θ − ε ĝ
5: end while
Positive: Gradient estimates are stable
Negative: Need to compute gradients over the entire training set for one update
13-18 Gradient Descent (figures: successive gradient descent steps on an error surface)
19-21 Stochastic Gradient Descent
Algorithm 2: Stochastic Gradient Descent at Iteration k
Require: Learning rate ε_k
Require: Initial parameter θ
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:     ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
5:   Apply update: θ ← θ − ε ĝ
6: end while
ε_k is the learning rate at step k
Sufficient condition to guarantee convergence: Σ_{k=1}^∞ ε_k = ∞ and Σ_{k=1}^∞ ε_k² < ∞
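A minimal sketch of the SGD loop above, written in NumPy for a least-squares loss; the data, loss, and hyperparameter values are illustrative assumptions, not taken from the course code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 examples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                            # initial parameter theta
epsilon = 0.01                             # learning rate

for step in range(5000):
    i = rng.integers(len(X))               # sample one example (x^(i), y^(i))
    g_hat = (X[i] @ w - y[i]) * X[i]       # gradient of 0.5 * (x.w - y)^2 w.r.t. w
    w -= epsilon * g_hat                   # apply update: theta <- theta - epsilon * g_hat
```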
22-24 Learning Rate Schedule
In practice the learning rate is decayed linearly until iteration τ:
ε_k = (1 − α) ε_0 + α ε_τ, with α = k/τ
τ is usually set to the number of iterations needed for a large number of passes through the data
ε_τ should roughly be set to 1% of ε_0
How to set ε_0?
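A small helper implementing this linear decay; the function name, the 1% floor for ε_τ, and holding ε_τ constant after iteration τ are my assumptions based on the heuristics above.

```python
def lr_linear_decay(k, eps0, tau):
    """Linearly decay from eps0 at k=0 to eps_tau = 0.01 * eps0 at k=tau, then hold constant."""
    eps_tau = 0.01 * eps0
    alpha = min(k / tau, 1.0)
    return (1.0 - alpha) * eps0 + alpha * eps_tau
```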
25-29 Minibatching
Potential problem: Gradient estimates can be very noisy
Obvious solution: Use larger mini-batches
Advantage: Computation time per update does not depend on the number of training examples N
This allows convergence on extremely large datasets
See: "Large Scale Learning with Stochastic Gradient Descent" by Léon Bottou
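A sketch of a minibatch gradient for the same least-squares setup as the SGD example above (pass in arrays like X, y, w, and rng from that sketch); sampling without replacement and the batch size are illustrative choices.

```python
def minibatch_grad(X, y, w, batch_size, rng):
    """Average per-example least-squares gradients over a sampled minibatch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size  # cost per update depends on batch_size, not on N
```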
30-38 Stochastic Gradient Descent (figures: noisy SGD iterates on the same error surface)
39-40 So far..
Batch Gradient Descent: ĝ ← (1/N) Σ_i ∇_θ L(f(x^(i); θ), y^(i)); θ ← θ − ε ĝ
SGD: ĝ ← ∇_θ L(f(x^(i); θ), y^(i)); θ ← θ − ε ĝ
41-44 Momentum
The Momentum method is a method to accelerate learning using SGD
In particular, SGD suffers in the following scenarios: the error surface has high curvature; we get small but consistent gradients; the gradients are very noisy
45 Momentum Gradient Descent would move quickly down the walls, but very slowly through the valley floor
46-49 Momentum
How do we try and solve this problem?
Introduce a new variable v, the velocity
We think of v as the direction and speed by which the parameters move as the learning dynamics progresses
The velocity is an exponentially decaying moving average of the negative gradients:
v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i)),  α ∈ [0, 1)
Update rule: θ ← θ + v
50-53 Momentum
Let's look at the velocity term: v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i))
The velocity accumulates the previous gradients
What is the role of α? If α is larger than ε, the current update is more affected by the previous gradients
Values for α are usually set high: 0.8, 0.9
54-56 Momentum (figure: the momentum step and the gradient step combine into the actual step)
57-60 Momentum: Step Sizes
In SGD, the step size was the norm of the gradient scaled by the learning rate, ε‖g‖. Why?
While using momentum, the step size will also depend on the norm and alignment of a sequence of gradients
For example, if at each step we observed g, the step size would be (exercise!): ε‖g‖ / (1 − α)
If α = 0.9, this multiplies the maximum speed by 10 relative to the current gradient direction
61 Momentum: illustration of how momentum traverses such an error surface better than Gradient Descent
62 SGD with Momentum
Algorithm 3: Stochastic Gradient Descent with Momentum
Require: Learning rate ε_k
Require: Momentum parameter α
Require: Initial parameter θ
Require: Initial velocity v
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate:
4:     ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
5:   Compute the velocity update:
6:     v ← α v − ε ĝ
7:   Apply update: θ ← θ + v
8: end while
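A sketch of one SGD-with-momentum step matching the velocity and parameter updates above; the function signature and default values are illustrative, not from the lecture. Inside the SGD loop above one would call `w, v = momentum_update(w, v, g_hat)`, with v initialized to zeros.

```python
def momentum_update(w, v, g_hat, epsilon=0.01, alpha=0.9):
    """One SGD-with-momentum step: v <- alpha*v - epsilon*g_hat, then theta <- theta + v."""
    v = alpha * v - epsilon * g_hat
    return w + v, v
```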
63-65 Nesterov Momentum
Another approach: first take a step in the direction of the accumulated gradient, then calculate the gradient and make a correction
(figure: accumulated gradient, correction, new accumulated gradient)
66-69 Nesterov Momentum (figures: the next step under Nesterov momentum)
70-72 Let's Write it out..
Recall the velocity term in the Momentum method: v ← α v − ε ∇_θ L(f(x^(i); θ), y^(i))
Nesterov Momentum: v ← α v − ε ∇_θ L(f(x^(i); θ + α v), y^(i))
Update: θ ← θ + v
73 SGD with Nesterov Momentum
Algorithm 4: SGD with Nesterov Momentum
Require: Learning rate ε
Require: Momentum parameter α
Require: Initial parameter θ
Require: Initial velocity v
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Apply interim (look-ahead) update: θ̃ ← θ + α v
4:   Compute gradient estimate:
5:     ĝ ← ∇_θ̃ L(f(x^(i); θ̃), y^(i))
6:   Compute the velocity update: v ← α v − ε ĝ
7:   Apply update: θ ← θ + v
8: end while
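A sketch of one Nesterov step; `grad_fn` is an assumed callable returning the gradient of the per-example loss at whatever parameters it is given, and the defaults are illustrative.

```python
def nesterov_update(w, v, grad_fn, epsilon=0.01, alpha=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead point theta + alpha*v."""
    g_hat = grad_fn(w + alpha * v)   # gradient at the interim (look-ahead) parameters
    v = alpha * v - epsilon * g_hat
    return w + v, v
```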
74 Adaptive Learning Rate Methods
75-77 Motivation
Till now we assign the same learning rate to all features
If the features vary in importance and frequency, why is this a good idea?
It's probably not!
78 Motivation Nice (all features are equally important)
79 Motivation Harder!
80-82 AdaGrad
Idea: downscale each parameter's update by the square root of the sum of squares of all of its historical gradient values
Parameters with large partial derivatives of the loss have their learning rates rapidly declined
Some interesting theoretical properties
83 AdaGrad
Algorithm 5: AdaGrad
Require: Global learning rate ε, initial parameter θ, small constant δ
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   Accumulate: r ← r + ĝ ⊙ ĝ
5:   Compute update: Δθ ← −(ε / (δ + √r)) ⊙ ĝ
6:   Apply update: θ ← θ + Δθ
7: end while
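A sketch of one AdaGrad step following the accumulation and update rules above; the ε and δ defaults are illustrative assumptions.

```python
import numpy as np

def adagrad_update(w, r, g_hat, epsilon=0.01, delta=1e-7):
    """One AdaGrad step: per-parameter rates shrink with the accumulated squared gradients."""
    r = r + g_hat * g_hat                                   # accumulate r <- r + g (.) g
    return w - epsilon * g_hat / (delta + np.sqrt(r)), r    # elementwise scaled update
```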
84-88 RMSProp
AdaGrad is good when the objective is convex
AdaGrad might shrink the learning rate too aggressively, since it keeps the entire gradient history in mind
We can adapt it to perform better in non-convex settings by accumulating an exponentially decaying average of the squared gradient
This is an idea that we use again and again in Neural Networks
Currently has about 500 citations on Google Scholar, but was proposed in a slide in Geoffrey Hinton's Coursera course
89 RMSProp
Algorithm 6: RMSProp
Require: Global learning rate ε, decay parameter ρ, small constant δ
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   Accumulate: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ
5:   Compute update: Δθ ← −(ε / √(δ + r)) ⊙ ĝ
6:   Apply update: θ ← θ + Δθ
7: end while
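A sketch of one RMSProp step with the exponentially decaying accumulator; the default values are illustrative assumptions.

```python
import numpy as np

def rmsprop_update(w, r, g_hat, epsilon=0.001, rho=0.9, delta=1e-6):
    """One RMSProp step: exponentially decaying average of squared gradients."""
    r = rho * r + (1.0 - rho) * g_hat * g_hat
    return w - epsilon * g_hat / np.sqrt(delta + r), r
```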
90 RMSProp with Nesterov
Algorithm 7: RMSProp with Nesterov Momentum
Require: Global learning rate ε, decay parameter ρ, small constant δ, momentum parameter α, initial velocity v
Initialize r = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Apply interim update: θ̃ ← θ + α v
4:   Compute gradient estimate: ĝ ← ∇_θ̃ L(f(x^(i); θ̃), y^(i))
5:   Accumulate: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ
6:   Compute velocity: v ← α v − (ε / √r) ⊙ ĝ
7:   Apply update: θ ← θ + v
8: end while
91-93 Adam
We could have used RMSProp with momentum
Use of momentum with rescaling is not well motivated
Adam is like RMSProp with momentum, but with bias correction terms for the first and second moments
94 Adam: ADAptive Moments
Algorithm 8: Adam
Require: Learning rate ε, decay rates ρ_1 (set to 0.9) and ρ_2 (set to 0.999), initial parameter θ, small constant δ
Initialize moment variables s = 0 and r = 0, time step t = 0
1: while stopping criteria not met do
2:   Sample example (x^(i), y^(i)) from training set
3:   Compute gradient estimate: ĝ ← ∇_θ L(f(x^(i); θ), y^(i))
4:   t ← t + 1
5:   Update biased first moment: s ← ρ_1 s + (1 − ρ_1) ĝ
6:   Update biased second moment: r ← ρ_2 r + (1 − ρ_2) ĝ ⊙ ĝ
7:   Correct biases: ŝ ← s / (1 − ρ_1^t),  r̂ ← r / (1 − ρ_2^t)
8:   Compute update: Δθ ← −ε ŝ / (√r̂ + δ)
9:   Apply update: θ ← θ + Δθ
10: end while
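A sketch of one Adam step with bias-corrected moments. The defaults shown (ε = 0.001, ρ_2 = 0.999, δ = 1e-8) are the commonly cited ones from the Adam paper, used here because the slide's values did not transcribe cleanly; treat them as assumptions.

```python
import numpy as np

def adam_update(w, s, r, t, g_hat, epsilon=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step with bias-corrected first and second moment estimates."""
    t = t + 1
    s = rho1 * s + (1.0 - rho1) * g_hat                # biased first moment
    r = rho2 * r + (1.0 - rho2) * g_hat * g_hat        # biased second moment
    s_hat = s / (1.0 - rho1 ** t)                      # bias corrections
    r_hat = r / (1.0 - rho2 ** t)
    return w - epsilon * s_hat / (np.sqrt(r_hat) + delta), s, r, t
```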
95 All your GRADs are belong to us!
SGD: θ ← θ − ε ĝ
Momentum: v ← α v − ε ĝ, then θ ← θ + v
Nesterov: v ← α v − ε ∇_θ L(f(x^(i); θ + α v), y^(i)), then θ ← θ + v
AdaGrad: r ← r + ĝ ⊙ ĝ, then Δθ ← −(ε / (δ + √r)) ⊙ ĝ, then θ ← θ + Δθ
RMSProp: r ← ρ r + (1 − ρ) ĝ ⊙ ĝ, then Δθ ← −(ε / √(δ + r)) ⊙ ĝ, then θ ← θ + Δθ
Adam: ŝ ← s / (1 − ρ_1^t), r̂ ← r / (1 − ρ_2^t), then Δθ ← −ε ŝ / (√r̂ + δ), then θ ← θ + Δθ
96 Batch Normalization
97 A Difficulty in Training Deep Neural Networks
A deep model involves a composition of several functions, e.g.
ŷ = W_4^T tanh(W_3^T tanh(W_2^T tanh(W_1^T x + b_1) + b_2) + b_3)
(figure: a small network mapping inputs x_1, x_2, x_3, x_4 to output ŷ)
98-102 A Difficulty in Training Deep Neural Networks
We have a recipe to compute gradients (backpropagation) and update every parameter (we saw half a dozen methods)
Implicit assumption: other layers don't change, i.e. the other functions are fixed
In practice: we update all layers simultaneously
This can give rise to unexpected difficulties
Let's look at two illustrations
103-105 Intuition
Consider a second-order approximation of our cost function (which is a function composition) around the current point θ^(0):
J(θ) ≈ J(θ^(0)) + (θ − θ^(0))^T g + (1/2)(θ − θ^(0))^T H (θ − θ^(0))
g is the gradient and H the Hessian at θ^(0)
If ε is the learning rate, the new point is θ = θ^(0) − ε g
106-110 Intuition
Plugging our new point θ = θ^(0) − ε g into the approximation:
J(θ^(0) − ε g) ≈ J(θ^(0)) − ε g^T g + (1/2) ε² g^T H g
There are three terms here: the value of the function before the update; the improvement using the gradient (i.e. first-order information); a correction factor that accounts for the curvature of the function
111-116 Intuition
J(θ^(0) − ε g) ≈ J(θ^(0)) − ε g^T g + (1/2) ε² g^T H g
Observations:
If g^T H g is too large, the gradient step can start moving the cost upwards
If g^T H g = 0, J will decrease even for large ε
Optimal step size: ε* = g^T g / (g^T H g) when the curvature term g^T H g is positive (for zero curvature, any ε decreases J)
Conclusion: just neglecting second-order effects can cause problems (remedy: second-order methods). What about higher-order effects?
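A tiny numerical check of the optimal step size on a quadratic cost J(θ) = ½ θᵀHθ; the Hessian and starting point are made up for illustration.

```python
import numpy as np

H = np.array([[5.0, 0.0], [0.0, 0.5]])      # an ill-conditioned Hessian (made up)
theta0 = np.array([1.0, 1.0])
g = H @ theta0                               # gradient of J(theta) = 0.5 * theta.T @ H @ theta at theta0

def J(theta):
    return 0.5 * theta @ H @ theta

eps_star = (g @ g) / (g @ H @ g)             # optimal step size along -g
# stepping with eps_star decreases J at least as much as nearby step sizes in the same direction
assert J(theta0 - eps_star * g) <= J(theta0 - 0.5 * eps_star * g)
assert J(theta0 - eps_star * g) <= J(theta0 - 1.5 * eps_star * g)
```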
117-119 Higher Order Effects: Toy Model
(figure: a chain network x → h_1 → h_2 → … → h_l → ŷ with one weight w_i per layer)
Just one node per layer, no non-linearity
ŷ is linear in x but non-linear in the w_i
120-124 Higher Order Effects: Toy Model
Suppose δ = 1, so we want to decrease our output ŷ
Usual strategy: using backprop, find g = ∇_w (ŷ − y)²
Update weights: w := w − ε g
The first-order Taylor approximation (previous slide) says the cost will reduce by ε g^T g
If we need to reduce the cost by 0.1, then the learning rate should be ε = 0.1 / (g^T g)
125-129 Higher Order Effects: Toy Model
The new ŷ will however be: ŷ = x (w_1 − ε g_1)(w_2 − ε g_2) … (w_l − ε g_l)
This expansion contains terms like ε³ g_1 g_2 g_3 w_4 w_5 … w_l
If the weights w_4, w_5, …, w_l are small, such a term is negligible; but if they are large, it can explode
Conclusion: higher-order terms make it very hard to choose the right learning rate
Second-order methods are already expensive; nth-order methods are hopeless. Solution?
130-136 Batch Normalization
A method to reparameterize a deep network to reduce the co-ordination of updates across layers
Can be applied to the input layer, or to any hidden layer
Let H be a design matrix holding the activations of a layer for the m examples in the mini-batch:
H = [h_11 h_12 … h_1k; h_21 h_22 … h_2k; … ; h_m1 h_m2 … h_mk]
Each row holds all the activations of the layer for one example
Idea: replace H by H′ = (H − µ) / σ
µ is the mean of each unit and σ the standard deviation
137-139 Batch Normalization
µ is a vector with µ_j the mean of column j
σ is a vector with σ_j the standard deviation of column j
H_{i,j} is normalized by subtracting µ_j and dividing by σ_j
140-141 Batch Normalization
During training we have:
µ = (1/m) Σ_i H_{i,:}
σ = √(δ + (1/m) Σ_i (H − µ)_{i,:}²)
We then operate on H′ as before ⇒ we backpropagate through the normalized activations
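A sketch of the training-time normalization described above, computing per-unit (per-column) statistics over the minibatch; the δ default is an assumption.

```python
import numpy as np

def batchnorm_forward(H, delta=1e-5):
    """Normalize activations per unit (column) over the minibatch: H' = (H - mu) / sigma."""
    mu = H.mean(axis=0)                                    # per-unit means over the m examples
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # per-unit standard deviations
    return (H - mu) / sigma, mu, sigma
```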
142-145 Why is this good?
The update will never act to only increase the mean and standard deviation of any activation
Previous approaches added penalties to the cost or per layer to encourage units to have standardized outputs
Batch normalization makes the reparameterization easier
At test time: use running averages of µ and σ collected during training to evaluate a new input x
146-149 An Innovation
Standardizing the output of a unit can limit the expressive power of the neural network
Solution: instead of replacing H by H′, replace it with γH′ + β
γ and β are also learned by backpropagation
Normalizing for mean and standard deviation was the goal of batch normalization, so why add γ and β again?
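A one-line sketch of the learned scale-and-shift applied after the normalization above; γ and β would be parameters updated by backpropagation, and the function name is mine.

```python
def batchnorm_scale_shift(H_prime, gamma, beta):
    """Replace H' with gamma * H' + beta; gamma and beta are learned by backpropagation."""
    return gamma * H_prime + beta
```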
150 Initialization Strategies
151-157 In convex problems with a good ε, convergence is guaranteed no matter what the initialization
In the non-convex regime initialization is much more important
Some parameter initializations can be unstable and not converge
Neural networks are not well understood enough to have principled, mathematically nice initialization strategies
What is known: initialization should break symmetry (quiz!)
What is known: the scale of the weights is important
Most initialization strategies are based on intuitions and heuristics
158-161 Some Heuristics
For a fully connected layer with m inputs and n outputs, sample: W_ij ~ U(−1/√m, 1/√m)
Xavier initialization: sample W_ij ~ U(−√6/√(m+n), √6/√(m+n))
Xavier initialization is derived by considering a network that consists of matrix multiplications with no nonlinearities
Works well in practice!
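A sketch of Xavier (Glorot) uniform initialization for an m-input, n-output fully connected layer; the returned shape convention and the function name are illustrative choices.

```python
import numpy as np

def xavier_uniform(m, n, rng=None):
    """Sample W_ij ~ U(-sqrt(6/(m+n)), +sqrt(6/(m+n))) for a layer with m inputs and n outputs."""
    rng = rng if rng is not None else np.random.default_rng()
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))   # (m, n) shape convention is illustrative
```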
162-166 More Heuristics
Saxe et al. 2013 recommend initializing to random orthogonal matrices, with a carefully chosen gain g that accounts for non-linearities
If g could be divined, it could solve the vanishing and exploding gradients problem (more later)
The idea of choosing g and initializing the weights accordingly is that we want the norm of the activations to increase, and to pass back strong gradients
Martens 2010 suggested an initialization that was sparse: each unit could only receive k non-zero weights
Motivation: it is a bad idea to have all initial weights share the same standard deviation 1/√m
167-171 Polyak Averaging: Motivation
Consider gradient descent with a high step size ε
(figures: the iterates oscillate across the valley, with the gradient pointing alternately towards the right and towards the left)
172-175 A Solution: Polyak Averaging
Suppose in t iterations you have parameters θ^(1), θ^(2), …, θ^(t)
Polyak averaging suggests setting θ̂^(t) = (1/t) Σ_i θ^(i)
Has strong convergence guarantees in convex settings
Is this a good idea in non-convex problems?
176-178 Simple Modification
In non-convex surfaces the parameter space can differ greatly in different regions, so averaging over all iterates is not useful
Typical to consider the exponentially decaying average instead:
θ̂^(t) = α θ̂^(t−1) + (1 − α) θ^(t), with α ∈ [0, 1]
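A sketch of the exponentially decaying parameter average; α here is a smoothing constant close to 1, and its value is an illustrative assumption.

```python
def polyak_ema(theta_avg, theta, alpha=0.999):
    """Exponentially decaying average of iterates: theta_avg <- alpha*theta_avg + (1-alpha)*theta."""
    return alpha * theta_avg + (1.0 - alpha) * theta
```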
179 Next time Convolutional Neural Networks