Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16
Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, and Michael Nielsen. (Machine Learning: Chenhao Tan, Boulder; slide 1 of 58.)
Logistics. Prelim 1 grades and solutions are available. Homework 3 is due next week. Feedback from the second homework: include more examples, spend more time explaining the equations in class, and make questions bold. CLEAR open house right after class (free food)!
Overview: Forward propagation recap. Back propagation: chain rule, back propagation, full algorithm. Practical issues of back propagation: unstable gradients, weight initialization, alternative regularization, batch size.
Outline (current section: Forward propagation recap).
Forward propagation algorithm. How do we make predictions based on a multi-layer neural network? Store the biases for layer l in b^l and the weight matrix in W^l. (Figure: a network with inputs x_1, ..., x_d, parameters (W^1, b^1), ..., (W^4, b^4), and output o.)
Forward propagation algorithm. Suppose your network has L layers. To make a prediction for an instance x:
1: Initialize a^0 = x
2: for l = 1 to L do
3:   z^l = W^l a^{l-1} + b^l
4:   a^l = g(z^l)   // g represents the nonlinear activation
5: end for
6: The prediction ŷ is simply a^L
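The loop above can be sketched in NumPy; the layer sizes and the tanh activation below are illustrative assumptions, not choices made in the slides:

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """Forward propagation: a^0 = x; z^l = W^l a^{l-1} + b^l; a^l = g(z^l)."""
    a = x
    zs, acts = [], [a]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = g(z)
        zs.append(z)
        acts.append(a)
    return zs, acts  # the prediction is acts[-1]

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]  # illustrative architecture
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
zs, acts = forward(rng.normal(size=4), weights, biases)
print(acts[-1].shape)  # the prediction a^L
```

Storing every z^l and a^l here is deliberate: back propagation (below) will need them.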
Neural networks in a nutshell. Training data S_train = {(x, y)}. Network architecture (model): ŷ = f_w(x) with parameters W^l, b^l, l = 1, ..., L. Loss function (objective function): L(y, ŷ). How do we learn the parameters? Stochastic gradient descent: W^l ← W^l − η ∂L(y, ŷ)/∂W^l.
Reminder of gradient descent. (Figure.)
Challenge: how do we compute derivatives of the loss function with respect to the weights and biases? Solution: back propagation.
Outline (current section: Back propagation, covering the chain rule, back propagation, and the full algorithm).
The Chain Rule. The chain rule allows us to take derivatives of nested functions. Univariate chain rule: d/dx f(g(x)) = f'(g(x)) g'(x) = (df/dg)(dg/dx). Example: d/dx [1/(1 + exp(−x))] = (−1)(−exp(−x))/(1 + exp(−x))^2 = exp(−x)/(1 + exp(−x))^2, i.e. dσ(x)/dx = σ(x)(1 − σ(x)).
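A quick numerical sanity check of this sigmoid-derivative identity, using a central finite difference (the test point x = 0.7 is an arbitrary choice):

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)   # finite-difference slope
closed_form = sigma(x) * (1 - sigma(x))             # sigma'(x) = sigma(x)(1 - sigma(x))
print(abs(numeric - closed_form) < 1e-8)  # True
```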
The Chain Rule. Multivariate chain rule: consider f(x, y, z) with intermediate variables x(r, s), y(r, s), z(r, s), which in turn depend on r(u, v) and s(u, v). The derivatives of f with respect to its direct inputs are just ∂f/∂x, ∂f/∂y, ∂f/∂z.
The Chain Rule. What is the derivative of f with respect to r? ∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r) + (∂f/∂z)(∂z/∂r).
The Chain Rule. What is the derivative of f with respect to s? ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s).
Example: let f = xyz, x = r, y = rs, and z = s. Find ∂f/∂s:
∂f/∂s = yz · 0 + xz · r + xy · 1 = rs² · 0 + rs · r + r²s · 1 = 2r²s.
Sanity check: f(r, s) = r · rs · s = r²s², so ∂f/∂s = 2r²s.
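The same sanity check can be done numerically; the sketch below compares a finite-difference slope against the chain-rule answer 2r²s at an arbitrary test point:

```python
def f(r, s):
    x, y, z = r, r * s, s   # the substitutions from the example
    return x * y * z        # f = xyz = r**2 * s**2

r0, s0, h = 1.5, -0.8, 1e-6
numeric = (f(r0, s0 + h) - f(r0, s0 - h)) / (2 * h)  # finite-difference df/ds
closed_form = 2 * r0**2 * s0                          # 2 r^2 s from the chain rule
print(abs(numeric - closed_form) < 1e-6)  # True
```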
The Chain Rule. What is the derivative of f with respect to u? ∂f/∂u = (∂f/∂r)(∂r/∂u) + (∂f/∂s)(∂s/∂u).
Crux: if you know the derivative of the objective with respect to an intermediate value in the chain, you can eliminate everything in between. This is the cornerstone of the back propagation algorithm.
Back Propagation. (Figure: a network with inputs x_1, ..., x_d, parameters (W^1, b^1), ..., (W^4, b^4), and output o_1.)
Back Propagation. For the derivation, we'll consider a simplified network: a^0 → W^1 → z^1 → a^1 → W^2 → z^2 → a^2 → L(y, a^2). We want to use back propagation to compute the partial derivatives of L with respect to the weights and biases, ∂L/∂w^l_ij for l = 1, 2, where w^l_ij is the weight from node j in layer l−1 to node i in layer l.
We need to choose an intermediate term that lives on the nodes and that we can easily differentiate with respect to. We could choose the a's, but we'll choose the z's because the math is easier. Define the derivative with respect to the z's by δ: δ^l_j = ∂L/∂z^l_j. Note that δ^l has the same size as z^l and a^l.
Let's compute δ^L for the output layer L: δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j)(da^L_j/dz^L_j).
Back Propagation. We know that a^L_j = g(z^L_j), so da^L_j/dz^L_j = g'(z^L_j), and thus δ^L_j = (∂L/∂a^L_j) g'(z^L_j). Note: the first term is the j-th entry of the gradient ∇_{a^L}L.
We can combine all of these into a vector operation: δ^L = ∇_{a^L}L ⊙ g'(z^L), where g'(z^L) is the derivative of the activation function applied elementwise to z^L, and ⊙ indicates elementwise multiplication of vectors.
Notice that computing δ^L requires knowing the activations. This means that before we can compute derivatives for SGD through back propagation, we must first run forward propagation through the network.
Back Propagation. Example: suppose we're in a regression setting and choose a sigmoid activation function: L = (1/2) Σ_j (y_j − a^L_j)² and a^L_j = σ(z^L_j). Then ∂L/∂a^L_j = (a^L_j − y_j) and da^L_j/dz^L_j = σ'(z^L_j) = σ(z^L_j)(1 − σ(z^L_j)), so δ^L = (a^L − y) ⊙ σ(z^L) ⊙ (1 − σ(z^L)).
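A sketch of this output-layer δ in NumPy (the pre-activations and targets below are illustrative values, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_L = np.array([0.5, -1.0, 2.0])        # pre-activations on the output layer
a_L = sigmoid(z_L)                       # activations a^L = sigma(z^L)
y = np.array([1.0, 0.0, 1.0])            # targets
delta_L = (a_L - y) * a_L * (1 - a_L)    # delta^L = (a^L - y) * sigma'(z^L)
print(delta_L.shape)  # (3,), the same size as z^L
```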
Back Propagation. OK, great! Now we can (easily-ish) compute the δ's for the output layer. But really we're after partials with respect to the weights and biases. Question: looking at the network diagram, what do you notice?
Back Propagation. We want the derivative of L with respect to the weights and biases. Every weight connected to node j in layer L depends on the single quantity δ^L_j.
Back Propagation. So we have ∂L/∂w^L_jk = (∂L/∂z^L_j)(∂z^L_j/∂w^L_jk) = δ^L_j (∂z^L_j/∂w^L_jk). We need to compute ∂z^L_j/∂w^L_jk. Recall z^L = W^L a^{L−1} + b^L; the j-th entry is z^L_j = Σ_i w^L_ji a^{L−1}_i + b^L_j. Taking the derivative with respect to w^L_jk gives ∂z^L_j/∂w^L_jk = a^{L−1}_k, so
∂L/∂w^L_jk = a^{L−1}_k δ^L_j.
This is an easy expression for the derivative with respect to every weight leading into layer L.
Back Propagation. Let's make the notation a little more practical. For the simplified network,
W^2 = [w^2_11 w^2_12 w^2_13; w^2_21 w^2_22 w^2_23]
∂L/∂W^2 = [∂L/∂w^2_11 ∂L/∂w^2_12 ∂L/∂w^2_13; ∂L/∂w^2_21 ∂L/∂w^2_22 ∂L/∂w^2_23] = [δ^2_1 a^1_1, δ^2_1 a^1_2, δ^2_1 a^1_3; δ^2_2 a^1_1, δ^2_2 a^1_2, δ^2_2 a^1_3].
We can write this as an outer product of δ^2 and a^1: ∂L/∂W^2 = δ^2 (a^1)^T. (Exercise for yourself: ∂L/∂b^2.)
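The outer-product form is easy to check numerically; the sketch below uses the 2-by-3 shape from the slide with illustrative values:

```python
import numpy as np

delta2 = np.array([0.2, -0.5])        # delta^2, one entry per node in layer 2
a1 = np.array([1.0, 0.5, -1.5])       # activations a^1 from the previous layer
grad_W2 = np.outer(delta2, a1)        # dL/dW^2 = delta^2 (a^1)^T
print(grad_W2.shape)  # (2, 3), the same shape as W^2
assert grad_W2[1, 2] == delta2[1] * a1[2]  # entry (j, k) is delta^2_j * a^1_k
```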
Intermediate summary. For a given training example x, perform forward propagation to get z^l and a^l on each layer. Then, to get the partial derivatives for layer L:
1. Compute δ^L = ∇_{a^L}L ⊙ g'(z^L)
2. Compute ∂L/∂W^L = δ^L (a^{L−1})^T and ∂L/∂b^L = δ^L
OK, that wasn't so bad! We found very simple expressions for the derivatives with respect to the weights in the last layer. Problem: how do we do the other layers? Since the formulas were so nice once we knew the adjacent δ^l, it sure would be nice if we could easily compute the δ^l's on earlier layers.
Back Propagation. But the relationship between L and z^1 is really complicated because of the multiple passes through the activation functions. It is OK! Back propagation comes to the rescue! Notice that δ^1 depends on δ^2.
Back Propagation. By the multivariate chain rule,
δ^{l−1}_k = ∂L/∂z^{l−1}_k = Σ_j (∂L/∂z^l_j)(∂z^l_j/∂z^{l−1}_k) = Σ_j δ^l_j (∂z^l_j/∂z^{l−1}_k).
For our simplified network, for example,
δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2).
Back Propagation. δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2). Recall that z^2 = W^2 a^1 + b^2, so z^2_i = w^2_i1 a^1_1 + w^2_i2 a^1_2 + w^2_i3 a^1_3 + b^2_i. Taking the derivative, ∂z^2_i/∂z^1_2 = w^2_i2 g'(z^1_2), and plugging in gives
δ^1_2 = δ^2_1 w^2_12 g'(z^1_2) + δ^2_2 w^2_22 g'(z^1_2).
Back Propagation. If we do this for each of the 3 δ^1_i's, something nice happens (exercise for yourself: work out δ^1_1 and δ^1_3):
δ^1_1 = δ^2_1 w^2_11 g'(z^1_1) + δ^2_2 w^2_21 g'(z^1_1)
δ^1_2 = δ^2_1 w^2_12 g'(z^1_2) + δ^2_2 w^2_22 g'(z^1_2)
δ^1_3 = δ^2_1 w^2_13 g'(z^1_3) + δ^2_2 w^2_23 g'(z^1_3)
Notice that each row of the system gets multiplied by g'(z^1_i), so let's factor those out.
Back Propagation. Factoring out g'(z^1_i):
δ^1_1 = (δ^2_1 w^2_11 + δ^2_2 w^2_21) g'(z^1_1)
δ^1_2 = (δ^2_1 w^2_12 + δ^2_2 w^2_22) g'(z^1_2)
δ^1_3 = (δ^2_1 w^2_13 + δ^2_2 w^2_23) g'(z^1_3)
Remember δ^2 = [δ^2_1; δ^2_2] and W^2 = [w^2_11 w^2_12 w^2_13; w^2_21 w^2_22 w^2_23]. Do you see δ^2 and W^2 lurking anywhere in the above system? Note (W^2)^T = [w^2_11 w^2_21; w^2_12 w^2_22; w^2_13 w^2_23], so the system is exactly
δ^1 = (W^2)^T δ^2 ⊙ g'(z^1).
Back Propagation. OK, great! We can easily compute δ^1 from δ^2. Then we can compute derivatives of L with respect to the weights W^1 and biases b^1 exactly the way we did for W^2 and b^2:
1. Compute δ^1 = (W^2)^T δ^2 ⊙ g'(z^1)
2. Compute ∂L/∂W^1 = δ^1 (a^0)^T and ∂L/∂b^1 = δ^1
We've worked this out for a simple network with one hidden layer, but nothing we've done assumed anything about the number of layers, so we can apply the same procedure recursively with any number of layers.
Back Propagation: the full algorithm.
δ^L = ∇_{a^L}L ⊙ g'(z^L)                  # compute δ's on the output layer
For l = L, ..., 1:
  ∂L/∂W^l = δ^l (a^{l−1})^T               # compute weight derivatives
  ∂L/∂b^l = δ^l                           # compute bias derivatives
  δ^{l−1} = (W^l)^T δ^l ⊙ g'(z^{l−1})     # back-propagate δ's to the previous layer
(After this, we are ready to do an SGD update on the weights and biases.)
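The full algorithm can be sketched in NumPy for the squared-error, sigmoid-activation setting used in the example earlier; the architecture and data below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Forward pass, then back-propagate deltas; returns dL/dW^l and dL/db^l."""
    # Forward propagation, storing z^l and a^l for every layer
    a, zs, acts = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)
    # delta^L = (a^L - y) * sigma'(z^L) for L = (1/2)||y - a^L||^2
    delta = (acts[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, acts[l]))  # dL/dW^l = delta^l (a^{l-1})^T
        grads_b.insert(0, delta)                     # dL/db^l = delta^l
        if l > 0:  # delta^{l-1} = (W^l)^T delta^l * sigma'(z^{l-1})
            delta = (weights[l].T @ delta) * sigmoid(zs[l-1]) * (1 - sigmoid(zs[l-1]))
    return grads_W, grads_b

rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]
x = rng.normal(size=3)
y = np.array([0.0, 1.0])
grads_W, grads_b = backprop(x, y, weights, biases)
print([g.shape for g in grads_W])  # [(4, 3), (2, 4)], same shapes as the weights
```

A standard way to gain confidence in an implementation like this is to compare one entry of the returned gradient against a finite-difference estimate of the loss.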
Reminder of gradient descent. (Figure.)
Training a Feed-Forward Neural Network. Given an initial guess for the weights and biases, loop over each training example in random order:
1. Forward propagate to get the activations on each layer
2. Back propagate to get the derivatives
3. Update the weights and biases via stochastic gradient descent
4. Repeat
Outline (current section: Practical issues of back propagation, covering unstable gradients, weight initialization, alternative regularization, and batch size).
Back Propagation. In practice, many remaining questions may arise. Recall the full algorithm:
δ^L = ∇_{a^L}L ⊙ g'(z^L)                  # compute δ's on the output layer
For l = L, ..., 1:
  ∂L/∂W^l = δ^l (a^{l−1})^T               # compute weight derivatives
  ∂L/∂b^l = δ^l                           # compute bias derivatives
  δ^{l−1} = (W^l)^T δ^l ⊙ g'(z^{l−1})     # back-propagate δ's to the previous layer
Unstable gradients. Consider a deep chain x → h^1 → h^2 → ... → h^L → o. Then
∂L/∂b^1 = g'(z^1) w^2 g'(z^2) w^3 ... g'(z^L) ∂L/∂a^L,
a long product of weights and activation derivatives.
Unstable gradients. (Figure: the sigmoid function and its derivative.)
Vanishing gradients. If we use Gaussian initialization for the weights, w_j ~ N(0, 1), then typically |w_j| < 1; since σ'(z) ≤ 1/4, we get |w_j| σ'(z_j) < 1/4, and ∂L/∂b^1 decays to zero exponentially in the number of layers.
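A sketch of the decay: multiplying many per-layer factors, each bounded by |w| times the sigmoid-derivative maximum of 1/4, drives the gradient toward zero (the depth and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 30
grad = 1.0
for _ in range(depth):
    w = rng.normal()          # w ~ N(0, 1)
    sigma_prime_max = 0.25    # the maximum of sigma'(z)
    grad *= w * sigma_prime_max  # each layer contributes a factor w * sigma'(z)
print(abs(grad))  # vanishingly small after 30 layers
```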
Vanishing gradients. (Figure: the ReLU function and its derivative.)
Exploding gradients. If instead |w_j| = 100, then |w_j| σ'(z_j) can be some k > 1, and the product grows exponentially with depth.
Training a Feed-Forward Neural Network. In practice, more questions arise, for example:
1. How do we initialize the weights and biases?
2. How do we regularize?
3. Can I batch this?
Non-convexity. (Figure: a non-convex loss surface.)
Weight initialization. Old idea: W = 0. What happens? There is no source of asymmetry: every neuron looks the same and leads to a slow start. Tracing through the back propagation algorithm, identical weights give every neuron in a layer the same output and the same update, so the symmetry is never broken.
Weight initialization. First idea: small random numbers, W ~ N(0, 0.01).
Weight initialization. For z = Σ_i w_i x_i with independent zero-mean terms, Var(z) = Var(Σ_i w_i x_i) = n Var(w_i) Var(x_i), so the variance of a neuron's input grows with its fan-in n.
Weight initialization. Xavier initialization [Glorot and Bengio, 2010]: W ~ N(0, 2/(n_in + n_out)). He initialization [He et al., 2015]: W ~ N(0, 2/n_in). This is an active research area, and the next great idea may come from you!
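A sketch of these two schemes in NumPy (the layer sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 300, 100

# Xavier/Glorot: Var(W) = 2 / (n_in + n_out)
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

# He: Var(W) = 2 / n_in, designed for ReLU units
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

print(W_xavier.std(), W_he.std())  # close to sqrt(2/400) and sqrt(2/300)
```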
Dropout layer. "Randomly set some neurons to zero in the forward pass" [Srivastava et al., 2014]. Dropout forces the network to have a redundant representation. Another interpretation: dropout is training a large ensemble of models.
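One common way to implement this is "inverted" dropout, which rescales the surviving activations at training time so that nothing needs to change at test time. A sketch, with the keep probability as an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p_keep=0.5, train=True):
    """Inverted dropout: zero each unit with prob 1 - p_keep, rescale by 1/p_keep."""
    if not train:
        return a  # at test time the layer is the identity
    mask = rng.random(a.shape) < p_keep
    return a * mask / p_keep

a = np.ones(10_000)
dropped = dropout(a)
print(dropped.mean())  # close to 1.0: rescaling preserves the expected activation
```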
Batch size. We have so far learned gradient descent, which uses all of the training data to compute gradients. Alternatively, we can use a single instance to compute gradients, as in stochastic gradient descent. In general, we can use a batch size parameter and compute the gradients from a few instances, anywhere from N (the entire training data) down to 1 (a single instance). More common values: 16, 32, 64, 128.
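A sketch of how minibatches are typically drawn each epoch: shuffle the instances, then slice (the dataset size and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # 1000 training instances, 5 features
batch_size = 32

perm = rng.permutation(len(X))   # visit instances in random order
batches = [X[perm[i:i + batch_size]] for i in range(0, len(X), batch_size)]
print(len(batches), batches[0].shape)  # 32 batches; all but the last have shape (32, 5)
```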
Wrap up. Back propagation allows us to compute the gradients of the parameters, but watch out for unstable gradients!
δ^L = ∇_{a^L}L ⊙ g'(z^L)                  # compute δ's on the output layer
For l = L, ..., 1:
  ∂L/∂W^l = δ^l (a^{l−1})^T               # compute weight derivatives
  ∂L/∂b^l = δ^l                           # compute bias derivatives
  δ^{l−1} = (W^l)^T δ^l ⊙ g'(z^{l−1})     # back-propagate δ's to the previous layer
References
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy, May 2010. PMLR.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.
More informationDeep Neural Networks (1) Hidden layers; Back-propagation
Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 2 October 2018 http://www.inf.ed.ac.u/teaching/courses/mlp/ MLP Lecture 3 / 2 October 2018 Deep
More informationarxiv: v1 [cs.lg] 11 May 2015
Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study Jakub M. Tomczak JAKUB.TOMCZAK@PWR.EDU.PL Wrocław University of Technology, wybrzeże Wyspiańskiego 7, 5-37,
More informationSolutions. Part I Logistic regression backpropagation with a single training example
Solutions Part I Logistic regression backpropagation with a single training example In this part, you are using the Stochastic Gradient Optimizer to train your Logistic Regression. Consequently, the gradients
More informationDeep Learning (CNNs)
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline
More informationDEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationNeural Networks: Backpropagation
Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationTraining Neural Networks Practical Issues
Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationCSC242: Intro to AI. Lecture 21
CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages
More informationDeep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated
Deep Feedforward Networks Lecture slides for Chapter 6 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last updated 2016-10-04 Roadmap Example: Learning XOR Gradient-Based Learning Hidden Units
More informationDeep Feedforward Networks. Seung-Hoon Na Chonbuk National University
Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections
More informationConvolutional Neural Networks II. Slides from Dr. Vlad Morariu
Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationNeural Nets Supervised learning
6.034 Artificial Intelligence Big idea: Learning as acquiring a function on feature vectors Background Nearest Neighbors Identification Trees Neural Nets Neural Nets Supervised learning y s(z) w w 0 w
More informationIntroduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis
Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.
More informationComputing Neural Network Gradients
Computing Neural Network Gradients Kevin Clark 1 Introduction The purpose of these notes is to demonstrate how to quickly compute neural network gradients in a completely vectorized way. It is complementary
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationCS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes
CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationDeep Learning Lab Course 2017 (Deep Learning Practical)
Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of
More informationCSC321 Lecture 15: Exploding and Vanishing Gradients
CSC321 Lecture 15: Exploding and Vanishing Gradients Roger Grosse Roger Grosse CSC321 Lecture 15: Exploding and Vanishing Gradients 1 / 23 Overview Yesterday, we saw how to compute the gradient descent
More informationJakub Hajic Artificial Intelligence Seminar I
Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network
More informationTopics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)
Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -
More informationFeedforward Neural Networks. Michael Collins, Columbia University
Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation
More informationCSCI567 Machine Learning (Fall 2018)
CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due
More informationIntroduction to Deep Neural Networks
Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic
More informationIntroduction to Neural Networks
Introduction to Neural Networks Philipp Koehn 3 October 207 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models
More information8-1: Backpropagation Prof. J.C. Kao, UCLA. Backpropagation. Chain rule for the derivatives Backpropagation graphs Examples
8-1: Backpropagation Prof. J.C. Kao, UCLA Backpropagation Chain rule for the derivatives Backpropagation graphs Examples 8-2: Backpropagation Prof. J.C. Kao, UCLA Motivation for backpropagation To do gradient
More informationNeural Networks and the Back-propagation Algorithm
Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely
More informationCSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!
CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! November 18, 2015 THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can t even say see ya
More informationBackpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1 Neural Networks Outline
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationCSC321 Lecture 6: Backpropagation
CSC321 Lecture 6: Backpropagation Roger Grosse Roger Grosse CSC321 Lecture 6: Backpropagation 1 / 21 Overview We ve seen that multilayer neural networks are powerful. But how can we actually learn them?
More information<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)
Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation
More informationNeural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann
Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable
More informationNeural Networks Learning the network: Backprop , Fall 2018 Lecture 4
Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:
More informationDeepLearning on FPGAs
DeepLearning on FPGAs Introduction to Deep Learning Sebastian Buschäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 21, 2017 1 Recap Computer Science Approach Technical
More informationNeural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35
Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How
More informationRapid Introduction to Machine Learning/ Deep Learning
Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/59 Lecture 4a Feedforward neural network October 30, 2015 2/59 Table of contents 1 1. Objectives of Lecture
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationMachine Learning
Machine Learning 10-315 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 03/29/2019 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter
More informationDeep Learning & Artificial Intelligence WS 2018/2019
Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target
More informationDeep Learning II: Momentum & Adaptive Step Size
Deep Learning II: Momentum & Adaptive Step Size CS 760: Machine Learning Spring 2018 Mark Craven and David Page www.biostat.wisc.edu/~craven/cs760 1 Goals for the Lecture You should understand the following
More informationOnline Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?
Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification
More informationCS60010: Deep Learning
CS60010: Deep Learning Sudeshna Sarkar Spring 2018 16 Jan 2018 FFN Goal: Approximate some unknown ideal function f : X! Y Ideal classifier: y = f*(x) with x and category y Feedforward Network: Define parametric
More informationLecture 15: Exploding and Vanishing Gradients
Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train
More informationCh.6 Deep Feedforward Networks (2/3)
Ch.6 Deep Feedforward Networks (2/3) 16. 10. 17. (Mon.) System Software Lab., Dept. of Mechanical & Information Eng. Woonggy Kim 1 Contents 6.3. Hidden Units 6.3.1. Rectified Linear Units and Their Generalizations
More informationAdvanced Machine Learning
Advanced Machine Learning Lecture 4: Deep Learning Essentials Pierre Geurts, Gilles Louppe, Louis Wehenkel 1 / 52 Outline Goal: explain and motivate the basic constructs of neural networks. From linear
More informationStanford Machine Learning - Week V
Stanford Machine Learning - Week V Eric N Johnson August 13, 2016 1 Neural Networks: Learning What learning algorithm is used by a neural network to produce parameters for a model? Suppose we have a neural
More informationCSC321 Lecture 4: Learning a Classifier
CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 31 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron
More informationNeural Network Language Modeling
Neural Network Language Modeling Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Marek Rei, Philipp Koehn and Noah Smith Course Project Sign up your course project In-class presentation
More informationThe connection of dropout and Bayesian statistics
The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University
More informationStatistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks
Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered
More informationLecture 35: Optimization and Neural Nets
Lecture 35: Optimization and Neural Nets CS 4670/5670 Sean Bell DeepDream [Google, Inceptionism: Going Deeper into Neural Networks, blog 2015] Aside: CNN vs ConvNet Note: There are many papers that use
More informationClassification: Logistic Regression from Data
Classification: Logistic Regression from Data Machine Learning: Jordan Boyd-Graber University of Colorado Boulder LECTURE 3 Slides adapted from Emily Fox Machine Learning: Jordan Boyd-Graber Boulder Classification:
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationMachine Learning (CSE 446): Neural Networks
Machine Learning (CSE 446): Neural Networks Noah Smith c 2017 University of Washington nasmith@cs.washington.edu November 6, 2017 1 / 22 Admin No Wednesday office hours for Noah; no lecture Friday. 2 /
More informationLecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) Sanjeev Arora Elad Hazan Recap: Structure of a deep
More information1 What a Neural Network Computes
Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists
More informationLecture 2: Modular Learning
Lecture 2: Modular Learning Deep Learning @ UvA MODULAR LEARNING - PAGE 1 Announcement o C. Maddison is coming to the Deep Vision Seminars on the 21 st of Sep Author of concrete distribution, co-author
More informationCSC321 Lecture 4: Learning a Classifier
CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 28 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron
More informationWeek 5: Logistic Regression & Neural Networks
Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and
More informationIntroduction to Neural Networks
Introduction to Neural Networks Philipp Koehn 4 April 205 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models can
More informationNeural Network Tutorial & Application in Nuclear Physics. Weiguang Jiang ( 蒋炜光 ) UTK / ORNL
Neural Network Tutorial & Application in Nuclear Physics Weiguang Jiang ( 蒋炜光 ) UTK / ORNL Machine Learning Logistic Regression Gaussian Processes Neural Network Support vector machine Random Forest Genetic
More informationError Functions & Linear Regression (1)
Error Functions & Linear Regression (1) John Kelleher & Brian Mac Namee Machine Learning @ DIT Overview 1 Introduction Overview 2 Univariate Linear Regression Linear Regression Analytical Solution Gradient
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationDeep Learning for NLP
Deep Learning for NLP CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) Machine Learning and NLP NER WordNet Usually machine learning
More informationNeural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture
Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network architecture Geoffrey Hinton with Nitish Srivastava Kevin Swersky Feed-forward neural networks These are
More informationCSE 591: Introduction to Deep Learning in Visual Computing. - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan
CSE 591: Introduction to Deep Learning in Visual Computing - Parag S. Chandakkar - Instructors: Dr. Baoxin Li and Ragav Venkatesan Overview Background Why another network structure? Vanishing and exploding
More informationNeural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav
Neural Networks 30.11.2015 Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav 1 Talk Outline Perceptron Combining neurons to a network Neural network, processing input to an output Learning Cost
More informationMore on Neural Networks
More on Neural Networks Yujia Yan Fall 2018 Outline Linear Regression y = Wx + b (1) Linear Regression y = Wx + b (1) Polynomial Regression y = Wφ(x) + b (2) where φ(x) gives the polynomial basis, e.g.,
More information11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1)
11/3/15 Machine Learning and NLP Deep Learning for NLP Usually machine learning works well because of human-designed representations and input features CS224N WordNet SRL Parser Machine learning becomes
More informationStatistical NLP for the Web
Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks
More informationANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS
ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS Anonymous authors Paper under double-blind review ABSTRACT We conduct mathematical analysis on the effect of batch normalization (BN)
More information