Machine Learning: Chenhao Tan, University of Colorado Boulder. LECTURE 16. Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielsen.

Logistics. Prelim 1 grades and solutions are available. Homework 3 is due next week. Feedback from the second homework: include more examples and spend more time explaining the equations in class; make questions bold. CLEAR open house right after the class (free food)!

Overview. Forward propagation recap. Back propagation: chain rule, back propagation, full algorithm. Practical issues of back propagation: unstable gradients, weight initialization, alternative regularization, batch size.

Outline: Forward propagation recap.

Forward propagation algorithm. How do we make predictions based on a multi-layer neural network? Store the biases for layer l in b^l and the weight matrix in W^l. [Figure: a fully connected network with inputs x_1, x_2, ..., x_d, parameters (W^1, b^1) through (W^4, b^4), and output o_1.]

Forward propagation algorithm. Suppose your network has L layers. Make a prediction for an instance x:
1: Initialize a^0 = x
2: for l = 1 to L do
3:   z^l = W^l a^{l-1} + b^l
4:   a^l = g(z^l)   // g represents the nonlinear activation
5: end for
6: The prediction ŷ is simply a^L
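
A minimal numpy sketch of this loop (the function and variable names are illustrative, and tanh is only a placeholder for the activation g, which the slides leave unspecified):

import numpy as np

def forward(x, weights, biases, g=np.tanh):
    """Run forward propagation and return all z^l and a^l.
    weights[l-1] is W^l, biases[l-1] is b^l; g is the activation."""
    a = x
    zs, activations = [], [a]
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^l = W^l a^{l-1} + b^l
        a = g(z)               # a^l = g(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations      # the prediction yhat is activations[-1]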

Neural networks in a nutshell. Training data S_train = {(x, y)}. Network architecture (model): ŷ = f_w(x), with parameters W^l, b^l for l = 1, ..., L. Loss function (objective function): L(y, ŷ). How do we learn the parameters? Stochastic gradient descent: W^l ← W^l - η ∂L(y, ŷ)/∂W^l.

Reminder of gradient descent.

Challenge: How do we compute derivatives of the loss function with respect to the weights and biases? Solution: back propagation.

Outline: Back propagation (chain rule, back propagation, full algorithm).

The Chain Rule. The chain rule allows us to take derivatives of nested functions. Univariate chain rule: d/dx f(g(x)) = f'(g(x)) g'(x) = (df/dg)(dg/dx). Example: d/dx [1/(1 + exp(-x))] = exp(-x)/(1 + exp(-x))^2, i.e., dσ(x)/dx = σ(x)(1 - σ(x)).
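
A quick numerical sanity check of this identity (a sketch; the test point and step size are arbitrary choices, not from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 1.3, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))                      # σ(x)(1 - σ(x))
print(numeric, analytic)  # the two values agree to several decimal places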

The Chain Rule. Multivariate chain rule. [Computation graph: u and v feed r(u, v) and s(u, v); r and s feed x(r, s), y(r, s), and z(r, s); these feed f(x, y, z).] The derivative of f with respect to x is ∂f/∂x; similarly we have ∂f/∂y and ∂f/∂z.

The Chain Rule. What is the derivative of f with respect to r? ∂f/∂r = (∂f/∂x)(∂x/∂r) + (∂f/∂y)(∂y/∂r) + (∂f/∂z)(∂z/∂r). What is the derivative of f with respect to s? ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s).

The Chain Rule. Example: let f = xyz with x = r, y = rs, and z = s. Find ∂f/∂s. Applying the rule: ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) + (∂f/∂z)(∂z/∂s) = yz · 0 + xz · r + xy · 1 = rs^2 · 0 + rs · r + r^2 s · 1 = 2r^2 s. Check directly: f(r, s) = r · rs · s = r^2 s^2, so ∂f/∂s = 2r^2 s.
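
The same result can be checked numerically (a sketch; the test point and finite-difference step are illustrative choices):

import numpy as np

def f(r, s):
    x, y, z = r, r * s, s   # x = r, y = rs, z = s
    return x * y * z        # f = xyz = r^2 s^2

r, s, eps = 2.0, 3.0, 1e-6
numeric = (f(r, s + eps) - f(r, s - eps)) / (2 * eps)  # ∂f/∂s by central difference
analytic = 2 * r**2 * s                                # 2 r^2 s from the chain rule
print(numeric, analytic)   # both are 24.0 (up to rounding)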

The Chain Rule. What is the derivative of f with respect to u? ∂f/∂u = (∂f/∂r)(∂r/∂u) + (∂f/∂s)(∂s/∂u). Crux: if you know the derivative of the objective w.r.t. an intermediate value in the chain, you can eliminate everything in between. This is the cornerstone of the back propagation algorithm.

Back Propagation. [Figure: a fully connected network with inputs x_1, x_2, ..., x_d, parameters (W^1, b^1) through (W^4, b^4), and output o_1.]

Back Propagation. For the derivation, we'll consider a simplified network: a^0 → W^1 → z^1 → a^1 → W^2 → z^2 → a^2 → L(y, a^2). We want to use back propagation to compute the partial derivatives of L w.r.t. the weights and biases, ∂L/∂w^l_ij for l = 1, 2, where w^l_ij is the weight from node j in layer l-1 to node i in layer l. We need to choose an intermediate term that lives on the nodes and that we can easily compute derivatives with respect to. We could choose the a's, but we'll choose the z's because the math is easier. Define the derivative w.r.t. the z's by δ: δ^l_j = ∂L/∂z^l_j. Note that δ^l has the same size as z^l and a^l. Let's compute δ^L for the output layer L: δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j)(da^L_j/dz^L_j).

δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j)(da^L_j/dz^L_j). We know that a^L_j = g(z^L_j), so da^L_j/dz^L_j = g'(z^L_j) and therefore δ^L_j = (∂L/∂a^L_j) g'(z^L_j). Note: the first term is the j-th entry of the gradient of L.

δ^L_j = (∂L/∂a^L_j) g'(z^L_j). We can combine all of these into a vector operation: δ^L = (∂L/∂a^L) ⊙ g'(z^L), where g'(z^L) is the activation derivative applied elementwise to z^L and the symbol ⊙ indicates element-wise multiplication of vectors. Notice that computing δ^L requires knowing the activations. This means that before we can compute derivatives for SGD through back propagation, we first run forward propagation through the network.

Example: suppose we're in a regression setting and choose a sigmoid activation function: L = (1/2) Σ_j (y_j - a^L_j)^2 and a^L_j = σ(z^L_j). Then ∂L/∂a^L_j = (a^L_j - y_j) and da^L_j/dz^L_j = σ'(z^L_j) = σ(z^L_j)(1 - σ(z^L_j)). So δ^L = (a^L - y) ⊙ σ(z^L) ⊙ (1 - σ(z^L)).
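
A small sketch of the output-layer delta for this squared-error / sigmoid example in numpy (the concrete numbers are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

zL = np.array([0.2, -1.0, 0.5])   # output pre-activations z^L
y  = np.array([0.0,  1.0, 1.0])   # targets

aL = sigmoid(zL)                       # a^L = σ(z^L)
deltaL = (aL - y) * aL * (1.0 - aL)    # δ^L = (a^L - y) ⊙ σ'(z^L)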

OK, great! Now we can (easily-ish) compute the δ's for the output layer. But really we're after the partials w.r.t. the weights and biases. Question: what do you notice?

We want to find the derivative of L w.r.t. the weights and biases. Every weight connected to a node in layer L depends on a single δ^L_j.

So we have ∂L/∂w^L_jk = (∂L/∂z^L_j)(∂z^L_j/∂w^L_jk) = δ^L_j ∂z^L_j/∂w^L_jk. We need to compute ∂z^L_j/∂w^L_jk. Recall z^L = W^L a^{L-1} + b^L; its j-th entry is z^L_j = Σ_i w^L_ji a^{L-1}_i + b^L_j. Taking the derivative w.r.t. w^L_jk gives ∂z^L_j/∂w^L_jk = a^{L-1}_k, so ∂L/∂w^L_jk = a^{L-1}_k δ^L_j. This is an easy expression for the derivative w.r.t. every weight leading into layer L.

Let's make the notation a little more practical. W^2 = [[w^2_11, w^2_12, w^2_13], [w^2_21, w^2_22, w^2_23]]. Then ∂L/∂W^2 = [[∂L/∂w^2_11, ∂L/∂w^2_12, ∂L/∂w^2_13], [∂L/∂w^2_21, ∂L/∂w^2_22, ∂L/∂w^2_23]] = [[δ^2_1 a^1_1, δ^2_1 a^1_2, δ^2_1 a^1_3], [δ^2_2 a^1_1, δ^2_2 a^1_2, δ^2_2 a^1_3]]. Now we can write this as an outer product of δ^2 and a^1: ∂L/∂W^2 = δ^2 (a^1)^T. (Exercise for yourself: ∂L/∂b^2.)
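
A sketch of this outer-product form in numpy, matching the 2-unit output and 3-unit hidden layer of the example (the concrete values are made up):

import numpy as np

delta2 = np.array([0.1, -0.3])        # δ^2, one entry per node in layer 2
a1     = np.array([0.5, 0.2, 0.9])    # a^1, activations of layer 1

dL_dW2 = np.outer(delta2, a1)   # ∂L/∂W^2 = δ^2 (a^1)^T, shape (2, 3)
dL_db2 = delta2                 # ∂L/∂b^2 = δ^2 (the exercise on the slide)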

Intermediate summary. For a given training example x, perform forward propagation to get z^l and a^l on each layer. Then, to get the partial derivatives for W^L (here W^2): 1. Compute δ^L = ∂L/∂a^L ⊙ g'(z^L). 2. Compute ∂L/∂W^L = δ^L (a^{L-1})^T and ∂L/∂b^L = δ^L. OK, that wasn't so bad! We found very simple expressions for the derivatives with respect to the weights in the last hidden layer! Problem: how do we do the other layers? Since the formulas were so nice once we knew the adjacent δ^l, it sure would be nice if we could easily compute the δ^l's on earlier layers.

But the relationship between L and z^1 is really complicated because of multiple passes through the activation functions. It is OK! Back propagation comes to the rescue! Notice that δ^1 depends on δ^2.

Notice that δ^1 depends on δ^2. By the multivariate chain rule, δ^{l-1}_k = ∂L/∂z^{l-1}_k = Σ_j (∂L/∂z^l_j)(∂z^l_j/∂z^{l-1}_k) = Σ_j δ^l_j (∂z^l_j/∂z^{l-1}_k). For our small network, for example, δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2).

δ^1_2 = ∂L/∂z^1_2 = δ^2_1 (∂z^2_1/∂z^1_2) + δ^2_2 (∂z^2_2/∂z^1_2). Recall that z^2 = W^2 a^1 + b^2, so z^2_i = w^2_i1 a^1_1 + w^2_i2 a^1_2 + w^2_i3 a^1_3 + b^2_i. Taking the derivative, ∂z^2_i/∂z^1_2 = w^2_i2 g'(z^1_2), and plugging in gives δ^1_2 = δ^2_1 w^2_12 g'(z^1_2) + δ^2_2 w^2_22 g'(z^1_2).

If we do this for each of the 3 δ^1_i's, something nice happens (exercise: work out δ^1_1 and δ^1_3 for yourself): δ^1_1 = δ^2_1 w^2_11 g'(z^1_1) + δ^2_2 w^2_21 g'(z^1_1), δ^1_2 = δ^2_1 w^2_12 g'(z^1_2) + δ^2_2 w^2_22 g'(z^1_2), δ^1_3 = δ^2_1 w^2_13 g'(z^1_3) + δ^2_2 w^2_23 g'(z^1_3). Notice that each row of the system gets multiplied by g'(z^1_i), so let's factor those out.

Factoring those out: δ^1_1 = (δ^2_1 w^2_11 + δ^2_2 w^2_21) g'(z^1_1), δ^1_2 = (δ^2_1 w^2_12 + δ^2_2 w^2_22) g'(z^1_2), δ^1_3 = (δ^2_1 w^2_13 + δ^2_2 w^2_23) g'(z^1_3). Remember that δ^2 = [δ^2_1, δ^2_2]^T and W^2 = [[w^2_11, w^2_12, w^2_13], [w^2_21, w^2_22, w^2_23]]. Do you see δ^2 and W^2 lurking anywhere in the above system? Does this help? (W^2)^T = [[w^2_11, w^2_21], [w^2_12, w^2_22], [w^2_13, w^2_23]].

In matrix form, the whole system becomes δ^1 = (W^2)^T δ^2 ⊙ g'(z^1).

OK, great! We can easily compute δ^1 from δ^2. Then we can compute derivatives of L w.r.t. the weights W^1 and biases b^1 exactly the way we did for W^2 and b^2: 1. Compute δ^1 = (W^2)^T δ^2 ⊙ g'(z^1). 2. Compute ∂L/∂W^1 = δ^1 (a^0)^T and ∂L/∂b^1 = δ^1. We've worked this out for a simple network with one hidden layer, but nothing we've done assumed anything about the number of layers, so we can apply the same procedure recursively with any number of layers.

Back Propagation (full algorithm):
δ^L = ∂L/∂a^L ⊙ g'(z^L)              # compute δ's on the output layer
for l = L, ..., 1:
  ∂L/∂W^l = δ^l (a^{l-1})^T           # compute weight derivatives
  ∂L/∂b^l = δ^l                       # compute bias derivatives
  δ^{l-1} = (W^l)^T δ^l ⊙ g'(z^{l-1}) # back prop δ's to the previous layer
(After this, we are ready to do an SGD update on the weights/biases.)
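
A compact numpy sketch of this algorithm for one possible instantiation, squared-error loss with sigmoid units as in the regression example above (the function and variable names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Return dL/dW^l and dL/db^l for L = 1/2 ||y - a^L||^2 with sigmoid units."""
    # forward pass: store all z^l and a^l
    a, zs, acts = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)

    # output layer: δ^L = ∂L/∂a^L ⊙ g'(z^L)
    delta = (acts[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))

    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):     # l = L, ..., 1 (0-indexed here)
        grads_W[l] = np.outer(delta, acts[l])     # ∂L/∂W^l = δ^l (a^{l-1})^T
        grads_b[l] = delta                        # ∂L/∂b^l = δ^l
        if l > 0:                                 # back prop δ to the previous layer
            delta = (weights[l].T @ delta) * sigmoid(zs[l-1]) * (1 - sigmoid(zs[l-1]))
    return grads_W, grads_b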

Reminder of gradient descent.

Training a Feed-Forward Neural Network. Given an initial guess for the weights and biases, loop over each training example in random order: 1. Forward propagate to get the activations on each layer. 2. Back propagate to get the derivatives. 3. Update the weights and biases via stochastic gradient descent. 4. Repeat.
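
A sketch of this loop built on the backprop helper above (the learning rate, epoch count, and data container are illustrative assumptions, not values from the slides):

import numpy as np

def train(data, weights, biases, eta=0.1, epochs=10):
    """data is a list of (x, y) pairs; updates weights/biases in place via SGD."""
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(len(data))       # random order over training examples
        for i in order:
            x, y = data[i]
            grads_W, grads_b = backprop(x, y, weights, biases)  # forward + backward
            for l in range(len(weights)):        # SGD step: W^l ← W^l - η ∂L/∂W^l
                weights[l] -= eta * grads_W[l]
                biases[l]  -= eta * grads_b[l]
    return weights, biases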

Outline: Practical issues of back propagation (unstable gradients, weight initialization, alternative regularization, batch size).

Back Propagation. In practice, many remaining questions may arise (recall the full back propagation algorithm above).

Unstable gradients. Consider a deep chain x → h^1 → h^2 → ... → h^L → o. Then ∂L/∂b^1 = g'(z^1) · w^2 · g'(z^2) · w^3 · ... · g'(z^L) · ∂L/∂a^L.

Unstable gradients. [Figure: plot of the sigmoid derivative σ'(x); it peaks at 0.25 at x = 0 and approaches 0 as |x| grows.]

Vanishing gradients. If we use Gaussian initialization for the weights, w_j ~ N(0, 1), then usually |w_j| < 1 and |w_j σ'(z_j)| < 1/4, so ∂L/∂b^1 decays to zero exponentially with depth.
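
A tiny numerical illustration of this decay (a sketch; the depth and pre-activations are made up, and it simply multiplies the per-layer factors w_j σ'(z_j) from the expression above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
L = 30                                   # depth of the chain
w = rng.standard_normal(L)               # w_j ~ N(0, 1)
z = rng.standard_normal(L)               # pre-activations (illustrative)

factors = w * sigmoid(z) * (1 - sigmoid(z))   # each |w_j σ'(z_j)| is usually < 1/4
print(np.abs(np.prod(factors)))               # vanishingly small for L = 30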

Vanishing gradients: ReLU. [Figure: plots of the ReLU function and its derivative; the derivative is 1 for positive inputs and 0 for negative inputs.]

Exploding gradients. If w_j = 100, then |w_j σ'(z_j)| can be some k > 1, and the product grows exponentially with depth.

Training a Feed-Forward Neural Network. In practice, many remaining questions may arise; more examples: 1. How do we initialize the weights and biases? 2. How do we regularize? 3. Can I batch this?

Non-convexity.

Weight initialization. Old idea: W = 0. What happens? There is no source of asymmetry: every neuron looks the same, and this leads to a slow start. Looking at the back propagation algorithm, with W^l = 0 the backpropagated δ^{l-1} = (W^l)^T δ^l ⊙ g'(z^{l-1}) is zero, so every layer before the last receives no gradient at first.

Weight initialization. First idea: small random numbers, W ~ N(0, 0.01).

Weight initialization. Var(z) = Var(Σ_i w_i x_i) = n Var(w_i) Var(x_i) (assuming independent, zero-mean w_i and x_i).

Weight initialization. Xavier initialization [Glorot and Bengio, 2010]: W ~ N(0, 2/(n_in + n_out)). He initialization [He et al., 2015]: W ~ N(0, 2/n_in). This is an active research area, and the next great idea may come from you!
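
A sketch of these two schemes in numpy (the function names, fan-in/fan-out arguments, and example layer size are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier/Glorot: W ~ N(0, 2 / (n_in + n_out)); note normal() takes the std."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    """He: W ~ N(0, 2 / n_in), commonly paired with ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W1 = he_init(784, 128)   # e.g., a 784 → 128 layer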

Dropout layer: "randomly set some neurons to zero in the forward pass" [Srivastava et al., 2014].
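
A minimal sketch of a dropout mask applied during training, using the common inverted-dropout rescaling (an assumption beyond the slide; the keep probability and example array are also illustrative):

import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p_keep=0.8, train=True):
    """Randomly zero out neurons in the forward pass (inverted dropout)."""
    if not train:
        return a                                  # no dropout at test time
    mask = rng.random(a.shape) < p_keep           # keep each unit with prob p_keep
    return a * mask / p_keep                      # rescale so expectations match

a1 = np.array([0.5, 1.2, 0.0, 2.0])
print(dropout(a1))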

Dropout layer. It forces the network to have a redundant representation. Another interpretation: dropout is training a large ensemble of models.

Batch size. We have so far learned gradient descent, which uses all the training data to compute gradients. Alternatively, we can use a single instance to compute gradients, as in stochastic gradient descent. In general, we can use a batch-size parameter and compute the gradients from a few instances: N (the entire training data), 1 (a single instance), or more common values such as 16, 32, 64, 128.
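
A sketch of iterating over mini-batches of a chosen batch size (the arrays, their shapes, and the batch size are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))   # 1000 training instances, 20 features
Y = rng.standard_normal((1000, 1))

batch_size = 32
order = rng.permutation(len(X))       # shuffle once per epoch
for start in range(0, len(X), batch_size):
    idx = order[start:start + batch_size]
    x_batch, y_batch = X[idx], Y[idx]
    # compute gradients on this mini-batch and take one SGD step here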

Wrap up. Back propagation allows us to compute the gradients of the parameters: δ^L = ∂L/∂a^L ⊙ g'(z^L), then for l = L, ..., 1 compute ∂L/∂W^l = δ^l (a^{l-1})^T, ∂L/∂b^l = δ^l, and δ^{l-1} = (W^l)^T δ^l ⊙ g'(z^{l-1}). Watch out for unstable gradients!

References. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy, 13-15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.