Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Size: px

Start display at page:

Download "Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16"

Tracy Small
5 years ago
Views:

1 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson Machine Learning: Chenhao Tan Boulder 1 of 58

2 Logistics Prelim 1 grades and solutions are available Homework 3 is due next week Feedback from the second homework: Include more examples and spend more time on explaining the equations in class Make questions bold CLEAR open house right after the class (free food)! Machine Learning: Chenhao Tan Boulder 2 of 58

3 Overview Forward propagation recap Back propagation Chain rule Back propagation Full algorithm Practical issues of back propagation Unstable gradients Weight Initialization Alternative regularization Batch size Machine Learning: Chenhao Tan Boulder 3 of 58

4 Forward propagation recap Outline Forward propagation recap Back propagation Chain rule Back propagation Full algorithm Practical issues of back propagation Unstable gradients Weight Initialization Alternative regularization Batch size Machine Learning: Chenhao Tan Boulder 4 of 58

5 Forward propagation recap Forward propagation algorithm How do we make predictions based on a multi-layer neural network? Store the biases for layer l in b l, weight matrix in W l W 1, b 1 W 2, b 2 W 3, b 3 W 4, b 4 x 1 x 2 o x d Machine Learning: Chenhao Tan Boulder 5 of 58

6 Forward propagation recap Forward propagation algorithm Suppose your network has L layers Make prediction for an instance x 1: Initialize a 0 = x 2: for l = 1 to L do 3: z l = W l a l 1 + b l 4: a l = g(z l ) // g represents the nonlinear activation 5: end for 6: The prediction ŷ is simply a L Machine Learning: Chenhao Tan Boulder 6 of 58

7 Forward propagation recap Neural networks in a nutshell Training data S train = {(x, y)} Network architecture (model) Loss function (objective function) How do we learn the parameters? ŷ = f w (x) W l, b l, l = 1,..., L L (y, ŷ) Machine Learning: Chenhao Tan Boulder 7 of 58

8 Forward propagation recap Neural networks in a nutshell Training data S train = {(x, y)} Network architecture (model) Loss function (objective function) How do we learn the parameters? Stochastic gradient descent, ŷ = f w (x) W l, b l, l = 1,..., L L (y, ŷ) W l W l L (y, ŷ) η W l Machine Learning: Chenhao Tan Boulder 7 of 58

9 Forward propagation recap Reminder of gradient descent Machine Learning: Chenhao Tan Boulder 8 of 58

10 Forward propagation recap Challenge Challenge: How do we compute derivatives of the loss function with respect to weights and biases? Solution: Back propagation Machine Learning: Chenhao Tan Boulder 9 of 58

11 Back propagation Outline Forward propagation recap Back propagation Chain rule Back propagation Full algorithm Practical issues of back propagation Unstable gradients Weight Initialization Alternative regularization Batch size Machine Learning: Chenhao Tan Boulder 10 of 58

12 Back propagation Chain rule The Chain Rule The chain rule allows us to take derivatives of nested functions. Machine Learning: Chenhao Tan Boulder 11 of 58

13 Back propagation Chain rule The Chain Rule The chain rule allows us to take derivatives of nested functions. Univariate chain rule: d dx f (g(x)) = f (g(x)) g (x) = df dg dg dx Machine Learning: Chenhao Tan Boulder 11 of 58

14 Back propagation Chain rule The Chain Rule The chain rule allows us to take derivatives of nested functions. Univariate chain rule: Example: d dx 1 1+exp( x) d dx f (g(x)) = f (g(x)) g (x) = df dg dg dx Machine Learning: Chenhao Tan Boulder 11 of 58

15 Back propagation Chain rule The Chain Rule The chain rule allows us to take derivatives of nested functions. Univariate chain rule: d dx f (g(x)) = f (g(x)) g (x) = df dg dg dx Example: d 1 dx 1+exp( x) = 1 exp( x) 1 (1+exp( x)) 2 dσ(x) = σ(x)(1 σ(x)) dx Machine Learning: Chenhao Tan Boulder 11 of 58

16 Back propagation Chain rule The Chain Rule Multivariate chain rule: x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Machine Learning: Chenhao Tan Boulder 12 of 58

17 Back propagation Chain rule The Chain Rule Multivariate chain rule: x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Derivative of L with respect to x: Similarly, f y, f z f x Machine Learning: Chenhao Tan Boulder 12 of 58

18 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to r? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Machine Learning: Chenhao Tan Boulder 13 of 58

19 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to r? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) f r = f x x r + f y y r + f z z r Machine Learning: Chenhao Tan Boulder 13 of 58

20 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to s? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Machine Learning: Chenhao Tan Boulder 14 of 58

21 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to s? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) f s = f x x s + f y y s + f z z s Machine Learning: Chenhao Tan Boulder 14 of 58

22 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to s? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Example: Let f = xyz, x = r, y = rs, and z = s. Find f / s f s = f x x s + f y y s + f z z s Machine Learning: Chenhao Tan Boulder 15 of 58

23 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to s? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Example: Let f = xyz, x = r, y = rs, and z = s. Find f / s f = yz 0 + xz r + xy 1 s Machine Learning: Chenhao Tan Boulder 15 of 58

24 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to s? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Example: Let f = xyz, x = r, y = rs, and z = s. Find f / s f s = rs2 0 + rs r + r 2 s 1 Machine Learning: Chenhao Tan Boulder 15 of 58

25 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to s? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Example: Let f = xyz, x = r, y = rs, and z = s. Find f / s f s = 2r2 s Machine Learning: Chenhao Tan Boulder 15 of 58

26 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to s? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Example: Let f = xyz, x = r, y = rs, and z = s. Find f / s f (r, s) = r rs s = r 2 s 2 f s = 2r2 s Machine Learning: Chenhao Tan Boulder 16 of 58

27 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to u? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Machine Learning: Chenhao Tan Boulder 17 of 58

28 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to u? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) f u = f r r u + f s s u Machine Learning: Chenhao Tan Boulder 17 of 58

29 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to u? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Crux: If you know the derivative of objective w.r.t. intermediate value in the chain, can eliminate everything in between. Machine Learning: Chenhao Tan Boulder 18 of 58

30 Back propagation Chain rule The Chain Rule What is the derivative of f with respect to u? x(r, s) u v r(u, v) s(u, v) y(r, s) z(r, s) f (x, y, z) Crux: If you know the derivative of objective w.r.t. intermediate value in the chain, can eliminate everything in between. This is the cornerstone of the back propagation algorithm. Machine Learning: Chenhao Tan Boulder 18 of 58

31 Back propagation Back propagation Back Propagation W 1, b 1 W 2, b 2 W 3, b 3 W 4, b 4 x 1 x 2... o 1... x d Machine Learning: Chenhao Tan Boulder 19 of 58

32 Back propagation Back propagation Back Propagation For the derivation, we ll consider a simplified network a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) We want to use back propagation to compute partial derivative of L w.r.t. the weights and biases L, for l = 1, 2 w l ij w 2 ij is the weight from node j in layer l 1 to node i in layer l. Machine Learning: Chenhao Tan Boulder 20 of 58

33 Back propagation Back propagation Back Propagation For the derivation, we ll consider a simplified network a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) We need to choose an intermediate term that lives on the nodes, that we can easily compute derivative with respect to. Could choose a s, but we ll choose z s because math is easier. Machine Learning: Chenhao Tan Boulder 20 of 58

34 Back propagation Back propagation Back Propagation For the derivation, we ll consider a simplified network a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) Define the derivative w.r.t. the z s by δ: δ l j = L z l j Note that δ l has the same size as z l and a l. Machine Learning: Chenhao Tan Boulder 20 of 58

35 Back propagation Back propagation Back Propagation For the derivation, we ll consider a simplified network a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) Let s compute δ L for output layer L: δ L j = L z L j = L a L j da L j dz L j Machine Learning: Chenhao Tan Boulder 20 of 58

36 Back propagation Back propagation Back Propagation δ L j = L z L j = L a L j da L j dz L j We know that a L j = g(z L j ), so dal j dz L j = g (z L j ) δ L j = L a L j g (z L j ) Note: The first term is j th entry of gradient of L. Machine Learning: Chenhao Tan Boulder 21 of 58

37 Back propagation Back propagation Back Propagation δ L j = L a L j g (z L j ) We can combine all of these into a vector operation δ L = L a L g (z L ) Where g (z L ) is the activation function applied elementwise to z L. The symbol indicates element-wise multiplication of vectors. Machine Learning: Chenhao Tan Boulder 22 of 58

38 Back propagation Back propagation Back Propagation δ L j = L a L j g (z L j ) We can combine all of these into a vector operation δ L = L a L g (z L ) Where g (z L ) is the activation function applied elementwise to z L. The symbol indicates element-wise multiplication of vectors. Notice that computing δ L requires knowing activations. This means that before we can compute derivatives for SGD through back propagation, we first run forward propagation through the network. Machine Learning: Chenhao Tan Boulder 22 of 58

39 Back propagation Back propagation Back Propagation Example: Suppose we re in regression setting and choose a sigmoid activation function: L = 1 (y j a L j ) 2 and a L j = σ(z) 2 j L a L j = (a L j y j ), da L j dz L j = σ (z L j ) = σ(z L j )(1 σ(z L j )) So δ L = (a L y) σ(z L ) (1 σ(z L )) Machine Learning: Chenhao Tan Boulder 23 of 58

40 Back propagation Back propagation Back Propagation OK Great! Now we can easily-ish compute the δ s for the output layer. But really we re after partials w.r.t. to weights and biases. a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) Machine Learning: Chenhao Tan Boulder 24 of 58

41 Back propagation Back propagation Back Propagation OK Great! Now we can easily-ish compute the δ s for the output layer. But really we re after partials w.r.t. to weights and biases. a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) Question: What do you notice? Machine Learning: Chenhao Tan Boulder 24 of 58

42 Back propagation Back propagation Back Propagation We want to find derivative L w.r.t. to weights and biases a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) Every weight connected to a node in layer L depends on a single δ L j Machine Learning: Chenhao Tan Boulder 25 of 58

43 Back propagation Back propagation Back Propagation L (y, a 2 ) a 0 W 1 z 1 a 1 W 2 z2 a 2 So we have L w L jk = L z L j z L j w L jk = δ L j z L j w L jk Need to compute zl j w L. Recall z L = W L a L 1 + b L jk j th entry in vector z L j = i w L jia L 1 i + b L j Machine Learning: Chenhao Tan Boulder 26 of 58

44 Back propagation Back propagation Back Propagation L (y, a 2 ) a 0 W 1 z 1 a 1 W 2 z2 a 2 So we have L w L jk = L z L j Taking derivative w.r.t. w L jk gives z L j w L jk = δ L j z L j w L jk zl j w L jk = a L 1 k L w L jk = a L 1 k δj L Machine Learning: Chenhao Tan Boulder 26 of 58

45 Back propagation Back propagation Back Propagation L (y, a 2 ) a 0 W 1 z 1 a 1 W 2 z2 a 2 So we have L w L jk = a L 1 k δj L Easy expression for derivative w.r.t. every weight leading into layer L. Machine Learning: Chenhao Tan Boulder 27 of 58

46 Back propagation Back propagation Back Propagation Let s make the notation a little more practical. [ W 2 w 2 = 11 w 2 12 w 2 ] 13 w 2 21 w 2 22 w 2 23 Machine Learning: Chenhao Tan Boulder 28 of 58

47 Back propagation Back propagation Back Propagation Let s make the notation a little more practical. [ W 2 w 2 = 11 w 2 12 w 2 ] 13 w 2 21 w 2 22 w 2 23 [ L ] L L [ L W 2 = w 2 11 w 2 12 w 2 13 δ 2 = 1 a 1 1 δ1 2a1 2 δ1 2 ] a1 3 δ2 2a1 1 δ2 2a1 2 δ2 2a1 3 L w 2 21 L w 2 22 L w 2 23 Now we can write this as an outer-product of δ 2 and a 1, (Exercise for yourself, L b 2 ) L W 2 = δ2 (a 1 ) T Machine Learning: Chenhao Tan Boulder 28 of 58

48 Back propagation Back propagation Intermediate summary For a giving training example x, perform forward propagation to get z l and a l on each layer. Then to get the partial derivatives for W 2 or W L : 1. Compute δ L = L a L j g (z L ) 2. Compute L = δ L (a L 1 ) T and L = δ L W L b L OK, that wasn t so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer! Machine Learning: Chenhao Tan Boulder 29 of 58

49 Back propagation Back propagation Intermediate summary For a giving training example x, perform forward propagation to get z l and a l on each layer. Then to get the partial derivatives for W 2 or W L : 1. Compute δ L = L a L j g (z L ) 2. Compute L = δ L (a L 1 ) T and L = δ L W L b L OK, that wasn t so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer! Problem: How do we do the other layers? Machine Learning: Chenhao Tan Boulder 29 of 58

50 Back propagation Back propagation Intermediate summary For a giving training example x, perform forward propagation to get z l and a l on each layer. Then to get the partial derivatives for W 2 or W L : 1. Compute δ L = L a L j g (z L ) 2. Compute L = δ L (a L 1 ) T and L = δ L W L b L OK, that wasn t so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer! Problem: How do we do the other layers? Since the formulas were so nice once we knew the adjacent δ l, it sure would be nice if we could easily compute the δ l s on earlier layers. Machine Learning: Chenhao Tan Boulder 29 of 58

51 Back propagation Back propagation Back Propagation But the relationship between L and z 1 is really complicated because of multiple passes through the activation functions. Machine Learning: Chenhao Tan Boulder 30 of 58

52 Back propagation Back propagation Back Propagation But the relationship between L and z 1 is really complicated because of multiple passes through the activation functions. It is OK! Back propagation comes to rescue! Machine Learning: Chenhao Tan Boulder 30 of 58

53 Back propagation Back propagation Back Propagation But the relationship between L and z 1 is really complicated because of multiple passes through the activation functions. It is OK! Back propagation comes to rescue! Notice that δ 1 depends on δ 2. a 0 W 1 z 1 a 1 W 2 z2 a 2 L (y, a 2 ) Machine Learning: Chenhao Tan Boulder 30 of 58

54 Back propagation Back propagation Back Propagation Notice that δ 1 depends on δ 2. L (y, a 2 ) a 0 W 1 z 1 a 1 W 2 z2 a 2 By multivariate chain rule, L z l 1 k = j L z l j z l j z l 1 k Machine Learning: Chenhao Tan Boulder 31 of 58

55 Back propagation Back propagation Back Propagation Notice that δ 1 depends on δ 2. L (y, a 2 ) a 0 W 1 z 1 a 1 W 2 z2 a 2 By multivariate chain rule, δ l 1 k = L z l 1 k = j L z l j z l j z l 1 k = j δj l z l j z l 1 k Machine Learning: Chenhao Tan Boulder 31 of 58

56 Back propagation Back propagation Back Propagation Notice that δ 1 depends on δ 2. L (y, a 2 ) a 0 W 1 z 1 a 1 W 2 z2 a 2 By multivariate chain rule, δ 1 2 = L z 1 2 = δ1 2 z 2 1 z 1 + δ 2 z z 1 2 Machine Learning: Chenhao Tan Boulder 31 of 58

57 Back propagation Back propagation Back Propagation δ 1 2 = L z 1 2 = δ1 2 z 2 1 z 1 + δ 2 z z 1 2 Machine Learning: Chenhao Tan Boulder 32 of 58

58 Back propagation Back propagation Back Propagation δ 1 2 = L z 1 2 = δ1 2 z 2 1 z 1 + δ 2 z z 1 2 Recall that z 2 = W 2 a 1 + b 2, it follows that Taking the derivative z2 i z 1 2 z 2 i = w 2 i1a w 2 i2a w 2 i3a b 2 i = w 2 i2 g (z 1 2 ), and plugging in gives δ 1 2 = L z 1 2 = δ 2 1w 2 12g (z 1 2) + δ 2 2w 2 22g (z 1 2) Machine Learning: Chenhao Tan Boulder 32 of 58

59 Back propagation Back propagation Back Propagation If we do this for each of the 3 δi 1 s, something nice happens: (Exercise for yourself: work out δ1 1 and δ1 3 for yourself) δ 1 1 = δ 2 1w 2 11g (z 1 1) + δ 2 2w 2 21g (z 1 1) δ 1 2 = δ 2 1w 2 12g (z 1 2) + δ 2 2w 2 22g (z 1 2) δ 1 3 = δ 2 1w 2 13g (z 1 3) + δ 2 2w 2 23g (z 1 3) Machine Learning: Chenhao Tan Boulder 33 of 58

60 Back propagation Back propagation Back Propagation If we do this for each of the 3 δi 1 s, something nice happens: (Exercise for yourself: work out δ1 1 and δ1 3 for yourself) δ 1 1 = δ 2 1w 2 11g (z 1 1) + δ 2 2w 2 21g (z 1 1) δ 1 2 = δ 2 1w 2 12g (z 1 2) + δ 2 2w 2 22g (z 1 2) δ 1 3 = δ 2 1w 2 13g (z 1 3) + δ 2 2w 2 23g (z 1 3) Notice that each row of the system gets multiplied by g (z 1 i ), so let s factor those out. Machine Learning: Chenhao Tan Boulder 33 of 58

61 Back propagation Back propagation Back Propagation If we do this for each of the 3 δi 2 s, something nice happens: δ 1 1 = (δ 2 1w δ 2 2w 2 21) g (z 1 1) δ 1 2 = (δ 2 1w δ 2 2w 2 22) g (z 1 2) δ 1 3 = (δ 2 1w δ 2 2w 2 23) g (z 1 3) Machine Learning: Chenhao Tan Boulder 34 of 58

62 Back propagation Back propagation Back Propagation If we do this for each of the 3 δi 2 s, something nice happens: δ 1 1 = (δ 2 1w δ 2 2w 2 21) g (z 1 1) δ 1 2 = (δ 2 1w δ 2 2w 2 22) g (z 1 2) δ3 1 = (δ1w δ2w ) g (z 1 3) [ ] [ Remember δ 2 δ 2 = 1 δ2 2, W 2 w 2 = 11 w 2 12 w 2 ] 13 w 2 21 w 2 22 w 2 23 Do you see δ 2 and W 2 lurking anywhere in the above system? Machine Learning: Chenhao Tan Boulder 34 of 58

63 Back propagation Back propagation Back Propagation If we do this for each of the 3 δi 2 s, something nice happens: Does this help? w 2 (W 2 ) T 11 w 2 [ ] 21 = w 2 12 w 2 22, δ 2 δ 2 = 1 w 2 13 w 2 δ δ 1 1 = (δ 2 1w δ 2 2w 2 21) g (z 1 1) δ 2 2 = (δ 2 1w δ 2 2w 2 22) g (z 1 2) δ 2 3 = (δ 2 1w δ 2 2w 2 23) g (z 1 3) Machine Learning: Chenhao Tan Boulder 35 of 58

64 Back propagation Back propagation Back Propagation If we do this for each of the 3 δi 2 s, something nice happens: δ 1 1 = (δ 2 1w δ 2 2w 2 21) g (z 1 1) δ 1 2 = (δ 2 1w δ 2 2w 2 22) g (z 1 2) δ 1 3 = (δ 2 1w δ 2 2w 2 23) g (z 1 3) δ 1 = (W 2 ) T δ 2 g (z 1 ) Machine Learning: Chenhao Tan Boulder 36 of 58

65 Back propagation Back propagation Back Propagation OK Great! We can easily compute δ 1 from δ 2 Then we can compute derivatives of L w.r.t. weights W 1 and biases b 1 exactly the way we did for W 2 and biases b 2 1. Compute δ 1 = (W 2 ) T δ 2 g (z 1 ) 2. Compute L W 1 = δ 1 (a 0 ) T and L b 1 = δ 1 Machine Learning: Chenhao Tan Boulder 37 of 58

66 Back propagation Back propagation Back Propagation OK Great! We can easily compute δ 1 from δ 2 Then we can compute derivatives of L w.r.t. weights W 1 and biases b 1 exactly the way we did for W 2 and biases b 2 1. Compute δ 1 = (W 2 ) T δ 2 g (z 1 ) 2. Compute L W 1 = δ 1 (a 0 ) T and L b 1 = δ 1 We ve worked this out for a simple network with one hidden layer. Nothing we ve done assumed anything about the number of layers, so we can apply the same procedure recursively with any number of layers. Machine Learning: Chenhao Tan Boulder 37 of 58

67 Back propagation Full algorithm Back Propagation δ L = L g (z L ) # Compute δ s on output layer a L j For l = L,..., 1 L = δ l (a l 1 ) T # Compute weight derivatives W l L b l = δl # Compute bias derivatives δ l 1 = ( W l) T δ l g (z l 1 ) # Back prop δ s to previous layer (After this, ready to do a SGD update on weights/biases) Machine Learning: Chenhao Tan Boulder 38 of 58

68 Back propagation Full algorithm Reminder of gradient descent Machine Learning: Chenhao Tan Boulder 39 of 58

69 Back propagation Full algorithm Training a Feed-Forward Neural Network Given initial guess for weights and biases. Loop over each training example in random order: 1. Forward propagate to get activations on each layer 2. Back propagate to get derivatives 3. Update weights and biases via stochastic gradient descent 4. Repeat Machine Learning: Chenhao Tan Boulder 40 of 58

70 Practical issues of back propagation Outline Forward propagation recap Back propagation Chain rule Back propagation Full algorithm Practical issues of back propagation Unstable gradients Weight Initialization Alternative regularization Batch size Machine Learning: Chenhao Tan Boulder 41 of 58

71 Practical issues of back propagation Back Propagation In practice, many remaining questions may arise. δ L = L g (z L ) # Compute δ s on output layer a L j For l = L,..., 1 L = δ l (a l 1 ) T # Compute weight derivatives W l L b l = δl # Compute bias derivatives δ l 1 = ( W l) T δ l g (z l 1 ) # Back prop δ s to previous layer Machine Learning: Chenhao Tan Boulder 42 of 58

72 Practical issues of back propagation Unstable gradients Unstable gradients x h 1 h 2... h L o Machine Learning: Chenhao Tan Boulder 43 of 58

73 Practical issues of back propagation Unstable gradients Unstable gradients x h 1 h 2... h L o L b 1 = g (z 1 ) w 2 g (z 2 ) w 3... g (z L ) L a L Machine Learning: Chenhao Tan Boulder 43 of 58

74 Practical issues of back propagation Unstable gradients Unstable gradients sigmoid derivative Machine Learning: Chenhao Tan Boulder 44 of 58

75 Practical issues of back propagation Unstable gradients Vanishing gradients If we use Gaussian initialization for weights, w j N (0, 1), w j < 1 w j σ (z j ) < 1 4 L decay to zero exponentially b1 Machine Learning: Chenhao Tan Boulder 45 of 58

76 Practical issues of back propagation Unstable gradients Vanishing gradients ReLu ReLu derivative Machine Learning: Chenhao Tan Boulder 46 of 58

77 Practical issues of back propagation Unstable gradients Exploding gradients If w j = 100, w j σ (z j ) k > 1 Machine Learning: Chenhao Tan Boulder 47 of 58

78 Practical issues of back propagation Unstable gradients Training a Feed-Forward Neural Network In practice, many remaining questions may arise, more examples: 1. How do we initialize weights and biases? 2. How do we regularize? 3. Can I batch this? Machine Learning: Chenhao Tan Boulder 48 of 58

79 Practical issues of back propagation Weight Initialization Non-convexity Machine Learning: Chenhao Tan Boulder 49 of 58

80 Practical issues of back propagation Weight Initialization Weight initialization Old idea: W = 0, what happens? Machine Learning: Chenhao Tan Boulder 50 of 58

81 Practical issues of back propagation Weight Initialization Weight initialization Old idea: W = 0, what happens? There is no source of asymmetry. (Every neuron looks the same and leads to a slow start.) Machine Learning: Chenhao Tan Boulder 50 of 58

82 Practical issues of back propagation Weight Initialization Weight initialization Old idea: W = 0, what happens? There is no source of asymmetry. (Every neuron looks the same and leads to a slow start.) δ L = a LL g (z L ) # Compute δ s on output layer For l = L,..., 1 L = δ l (a l 1 ) T # Compute weight derivatives W l L b l = δl # Compute bias derivatives δ l 1 = ( W l) T δ l g (z l 1 ) # Back prop δ s to previous layer Machine Learning: Chenhao Tan Boulder 50 of 58

83 Practical issues of back propagation Weight Initialization Weight initialization First idea: small random numbers, W N (0, 0.01) Machine Learning: Chenhao Tan Boulder 51 of 58

84 Practical issues of back propagation Weight Initialization Weight initialization Var(z) = Var( i w i x i ) = nvar(w i )Var(x i ) Machine Learning: Chenhao Tan Boulder 52 of 58

85 Practical issues of back propagation Weight Initialization Weight initialization Xavier initialization [Glorot and Bengio, 2010] W N (0, 2 n in + n out ) Machine Learning: Chenhao Tan Boulder 53 of 58

86 Practical issues of back propagation Weight Initialization Weight initialization Xavier initialization [Glorot and Bengio, 2010] He initialization [He et al., 2015] W N (0, W N (0, 2 n in + n out ) 2 n in ) Machine Learning: Chenhao Tan Boulder 53 of 58

87 Practical issues of back propagation Weight Initialization Weight initialization Xavier initialization [Glorot and Bengio, 2010] He initialization [He et al., 2015] W N (0, W N (0, 2 n in + n out ) 2 n in ) This is an actively research area and next great idea may come from you! Machine Learning: Chenhao Tan Boulder 53 of 58

88 Practical issues of back propagation Alternative regularization Dropout layer "randomly set some neurons to zero in the forward pass" [Srivastava et al., 2014] Machine Learning: Chenhao Tan Boulder 54 of 58

89 Practical issues of back propagation Alternative regularization Dropout layer Forces the network to have a redundant representation. Machine Learning: Chenhao Tan Boulder 55 of 58

90 Practical issues of back propagation Alternative regularization Dropout layer Another interpretation: Dropout is training a large ensemble of models. Machine Learning: Chenhao Tan Boulder 55 of 58

91 Practical issues of back propagation Batch size Batch size We have so far learned gradient descent which uses all training data to compute gradients. Machine Learning: Chenhao Tan Boulder 56 of 58

92 Practical issues of back propagation Batch size Batch size We have so far learned gradient descent which uses all training data to compute gradients. Alternatively, we use a single instance to compute gradients in stochastic gradient descent. Machine Learning: Chenhao Tan Boulder 56 of 58

93 Practical issues of back propagation Batch size Batch size We have so far learned gradient descent which uses all training data to compute gradients. Alternatively, we use a single instance to compute gradients in stochastic gradient descent. In general, we can use a parameter batch size to compute the gradients from a few instances. N (the entire training data) 1 (a single instance) More common values: 16, 32, 64, 128 Machine Learning: Chenhao Tan Boulder 56 of 58

94 Practical issues of back propagation Batch size Wrap up Back propagation allows for computing the gradients of the parameters and watch out for unstable gradients! δ L = L g (z L ) # Compute δ s on output layer a L j For l = L,..., 1 L = δ l (a l 1 ) T # Compute weight derivatives W l L b l = δl # Compute bias derivatives δ l 1 = ( W l) T δ l g (z l 1 ) # Back prop δ s to previous layer Machine Learning: Chenhao Tan Boulder 57 of 58

95 Practical issues of back propagation Batch size References (1) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages , Chia Laguna Resort, Sardinia, Italy, May PMLR. URL Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In The IEEE International Conference on Computer Vision (ICCV), December Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: , URL Machine Learning: Chenhao Tan Boulder 58 of 58

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses