Lecture 13 Back-propagation

1 Lecture 13 Back-propagation 02 March 2016 Taylor B. Arnold Yale Statistics STAT 365/665

2 Notes
Problem set 4 is due this Friday. Problem set 5 is due a week from Monday (for those of you with a midterm crunch this week); I will post the questions by tomorrow morning.

3 Neural network review
Last time we established the idea of a sigmoid neuron, which takes a vector of numeric variables $x$ and emits a value as follows:
$$\sigma(x \cdot w + b) = \frac{1}{1 + e^{-(x \cdot w + b)}}$$
It is entirely defined by a vector of weights $w$ and bias term $b$, and functions exactly like logistic regression.
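As a minimal sketch of the sigmoid neuron described above, the following uses only the Python standard library. The function names and the example inputs, weights, and bias are illustrative assumptions, not part of the lecture.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_neuron(x, w, b):
    """Output of one sigmoid neuron: sigma(x . w + b)."""
    weighted_input = sum(x_i * w_i for x_i, w_i in zip(x, w)) + b
    return sigmoid(weighted_input)

# Hypothetical inputs, weights, and bias, chosen so that x . w + b = 0.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
output = sigmoid_neuron(x, w, b)  # sigma(0) = 0.5
```

A weighted input of zero sits exactly at the midpoint of the sigmoid, which is why this example emits 0.5.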

4 Neural network review, cont.
These single neurons can be strung together to construct a neural network. The input variables are written as special neurons on the left-hand side of the diagram:
[Figure: neural network diagram]

5 Stochastic gradient descent
We started talking about how to learn neural networks via a variant of gradient descent, called stochastic gradient descent. The only detail left to figure out is exactly how to calculate the gradient of the cost function in an efficient way.

6 Idea behind back-propagation
Before starting down the path of explaining the math behind back-propagation, I want to explain what the big idea behind the method is and why it makes sense as a general approach.

7 Idea behind back-propagation, cont.

8 Some notation, cont.
We need a way of referring unambiguously to the weights and biases in a multi-layer neural network. To start, define
$$w^l_{jk}$$
as the weight of node $k$ in layer $l-1$ as applied by node $j$ in layer $l$.

9 Some notation, cont.
Now let
$$b^l_j$$
be the bias of node $j$ of layer $l$.

10 Some notation, cont.
We now define two additional quantities. The weighted input emitted from node $j$ in layer $l$:
$$z^l_j$$
The term "input" refers to the fact that this is the input to the activation function.

11 Some notation, cont.
And the activation
$$a^l_j$$
derived from applying the function $\sigma$ (the sigmoid, or logistic, function for us so far) to the weighted input $z^l_j$.

12 Some notation, cont.
That's a lot to take in all at once. Here is the cheat-sheet version that shows how all of these quantities fit together:
$$a^l_j = \sigma(z^l_j) = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right)$$
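The cheat-sheet relation can be sketched in plain Python as a single-layer forward computation. The nested-list layout `W[j][k]`, the function name `layer_forward`, and the example numbers are assumptions made for illustration only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def layer_forward(W, b, a_prev):
    """Return (z, a) for one layer.

    W[j][k] is the weight that node j of this layer applies to node k of
    the previous layer, b[j] is the bias of node j, and a_prev[k] is the
    activation of node k in the previous layer.
    """
    z = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
         for j in range(len(b))]
    a = [sigmoid(z_j) for z_j in z]
    return z, a

# Hypothetical layer with two inputs and two nodes.
W = [[1.0, -1.0],
     [0.5, 0.5]]
b = [0.0, -1.0]
a_prev = [1.0, 1.0]
z, a = layer_forward(W, b, a_prev)  # z = [0.0, 0.0], a = [0.5, 0.5]
```

Returning both `z` and `a` is deliberate: the later derivations need the weighted inputs as well as the activations.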

13 Some notation, cont.
It will also be beneficial to define one more quantity:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$
called the error functions. These error functions will help in keeping the number of computations we need to make as small as possible.

14 Cost function again
In what follows we assume that we are dealing with one individual data point, so the cost is simply the cost of that one training point (which we called $C_i$ in Monday's notes). If the neural network has $L$ total layers, notice that in general the cost will be a function only of the activations from layer $L$ of the neural network:
$$C(w, b) = f(a^L_1, a^L_2, \ldots)$$
In our case, this can be explicitly written as:
$$C(w, b) = \sum_k (y_k - a^L_k)^2$$
Note: this sum is over the dimension of the output, not the number of samples! If we only have a univariate output, this will just be a single value.
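For concreteness, the single-sample quadratic cost can be written as a short Python function. The name `quadratic_cost` and the example vectors are hypothetical.

```python
def quadratic_cost(y, a_out):
    """C = sum_k (y_k - a^L_k)^2 for one training point.

    The sum runs over output dimensions, not over samples.
    """
    return sum((y_k - a_k) ** 2 for y_k, a_k in zip(y, a_out))

# Hypothetical two-dimensional target and output-layer activations.
cost = quadratic_cost([1.0, 0.0], [0.5, 0.5])  # 0.25 + 0.25 = 0.5
```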

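15 Feed-forward step
We want to calculate how much the cost changes with respect to the errors on the outer layer. This ends up being a fairly straightforward application of the chain rule:
$$\delta^L_j = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$$
The terms with $k \neq j$ drop out because $a^L_k$ depends only on $z^L_k$.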

16 Feed-forward step, cont.
We can explicitly calculate the first partial derivative given the cost function. Here, for example, it is given as:
$$\frac{\partial C}{\partial a^L_j} = \frac{\partial}{\partial a^L_j} \sum_k (y_k - a^L_k)^2 = 2 (a^L_j - y_j)$$
The function $\sigma'$ is also easily determined by differentiation of the sigmoid function:
$$\sigma'(z) = \sigma(z) (1 - \sigma(z))$$
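One way to sanity-check the output-layer formula is to compare it with a central finite difference of the cost. The sketch below assumes the quadratic cost from the earlier slide; the helper names, the test point, and the step size `eps` are illustrative choices.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_prime(t):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(t)
    return s * (1.0 - s)

def output_error(z_out, y):
    """delta^L_j = 2 (a^L_j - y_j) * sigma'(z^L_j) for the quadratic cost."""
    return [2.0 * (sigmoid(z_j) - y_j) * sigmoid_prime(z_j)
            for z_j, y_j in zip(z_out, y)]

def cost_from_z(z_out, y):
    """Quadratic cost written as a function of the output weighted inputs."""
    return sum((y_j - sigmoid(z_j)) ** 2 for z_j, y_j in zip(z_out, y))

# Hypothetical weighted inputs and targets for a two-node output layer.
z_out = [0.3, -1.2]
y = [1.0, 0.0]
delta = output_error(z_out, y)

# Central finite-difference approximation of dC/dz^L_j for comparison.
eps = 1e-6
numeric = []
for j in range(len(z_out)):
    z_plus = list(z_out)
    z_minus = list(z_out)
    z_plus[j] += eps
    z_minus[j] -= eps
    numeric.append((cost_from_z(z_plus, y) - cost_from_z(z_minus, y)) / (2 * eps))
```

The entries of `delta` and `numeric` should agree to several decimal places; a mismatch usually points to a sign or index slip in the analytic formula.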

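17 Back-propagation step
We now want to relate the errors $\delta^l_j$ to the errors $\delta^{l+1}_k$. This allows us to then work backwards from the feed-forward step. Notice that:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$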

18 Back-propagation step
To calculate the remaining derivatives, notice that we have a formula relating one set of weighted inputs to the next set:
$$z^{l+1}_k = \sum_i w^{l+1}_{ki} a^l_i + b^{l+1}_k = \sum_i w^{l+1}_{ki} \sigma(z^l_i) + b^{l+1}_k$$

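19 Back-propagation step, cont.
Using this relationship, we can now take partial derivatives:
$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = \frac{\partial}{\partial z^l_j} \left( \sum_i w^{l+1}_{ki} \sigma(z^l_i) + b^{l+1}_k \right) = w^{l+1}_{kj} \sigma'(z^l_j)$$
Which, when plugged into our original equations, yields:
$$\delta^l_j = \sum_k \delta^{l+1}_k w^{l+1}_{kj} \sigma'(z^l_j)$$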
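The recursion for the errors can be sketched directly from the last formula. The function name `backpropagate_error`, the `W_next[k][j]` layout, and the numbers below are assumptions chosen so the result is easy to verify by hand.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_prime(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def backpropagate_error(W_next, delta_next, z):
    """delta^l_j = sum_k delta^{l+1}_k * w^{l+1}_{kj} * sigma'(z^l_j).

    W_next[k][j] is the weight that node k in layer l+1 applies to node j
    in layer l; delta_next holds the errors of layer l+1; z holds the
    weighted inputs of layer l.
    """
    return [sum(delta_next[k] * W_next[k][j] for k in range(len(delta_next)))
            * sigmoid_prime(z[j])
            for j in range(len(z))]

# Hypothetical values: two nodes in layer l, two nodes in layer l+1.
W_next = [[1.0, 2.0],
          [3.0, 4.0]]
delta_next = [0.5, -0.5]
z = [0.0, 0.0]
delta = backpropagate_error(W_next, delta_next, z)  # [-0.25, -0.25]
```

Here each inner sum equals -1.0 and sigma'(0) = 0.25, which gives -0.25 for both nodes.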

20 Relating errors to weights and bias
We now just need to relate the errors $\delta^l_j$ to the gradient with respect to the weights and biases directly. We have, from the linearity of the bias term, a very simple relationship between $\delta^l_j$ and the gradient of the bias terms:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \frac{\partial C}{\partial b^l_j} \frac{\partial b^l_j}{\partial z^l_j} = \frac{\partial C}{\partial b^l_j}$$

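21 Relating errors to weights and bias
For the weights, a very similar argument gives:
$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \frac{\partial C}{\partial w^l_{jk}} \frac{\partial w^l_{jk}}{\partial z^l_j} = \frac{\partial C}{\partial w^l_{jk}} \frac{1}{a^{l-1}_k}$$
Which can be re-written as:
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$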
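Given the errors of a layer and the activations of the layer before it, both gradients follow from the two relations above. This is a small sketch; `layer_gradients` and its example inputs are hypothetical.

```python
def layer_gradients(delta, a_prev):
    """Gradients for one layer from its errors and the previous activations.

    grad_b[j]    = dC/db^l_j    = delta^l_j
    grad_W[j][k] = dC/dw^l_{jk} = a^{l-1}_k * delta^l_j
    """
    grad_b = list(delta)
    grad_W = [[a_k * d_j for a_k in a_prev] for d_j in delta]
    return grad_W, grad_b

# Hypothetical errors for two nodes and activations of three previous nodes.
grad_W, grad_b = layer_gradients([0.5, -1.0], [1.0, 2.0, 3.0])
# grad_W = [[0.5, 1.0, 1.5], [-1.0, -2.0, -3.0]], grad_b = [0.5, -1.0]
```

The weight gradient is the outer product of the error vector with the previous layer's activations, so it has the same shape as the weight matrix.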

22 Putting these all together
We now have four equations that describe the mechanics of back-propagation. I will write them here a bit more compactly for reference:
$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$
$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$
Where dropping an index indicates that I am doing math on the entire vector or matrix of values. The symbol $\odot$ (Hadamard product) refers to element-wise multiplication, a common operation in programming but less so in abstract mathematics.

23 The back-propagation algorithm
The back-propagation algorithm as a whole is then just:
1. Select an element $i$ from the current minibatch $M$ and calculate the weighted inputs $z$ and activations $a$ for every layer using a forward pass through the network
2. Now, use these values to calculate the errors $\delta$ for each layer, starting at the output layer $L$ and working backwards, using back-propagation
3. Calculate $\nabla_w C_i$ and $\nabla_b C_i$ using the last two equations
4. Repeat for all samples in the minibatch, to get:
$$\frac{\sum_{i \in M} \nabla_w C_i}{|M|} \approx \nabla_w C, \qquad \frac{\sum_{i \in M} \nabla_b C_i}{|M|} \approx \nabla_b C$$
5. Update the weights and biases by (the estimates of) $-\eta \nabla_w C$ and $-\eta \nabla_b C$, respectively
6. Repeat over every mini-batch, and for the desired number of epochs
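The six steps above can be sketched end to end in plain Python. This is an illustrative sketch rather than the lecture's own code: the layer sizes, the toy OR data set, the learning rate, the epoch count, and all function names are assumptions, and it uses the quadratic cost from the earlier slides.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(weights, biases, x):
    """Step 1: weighted inputs zs and activations for every layer."""
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations

def backprop(weights, biases, x, y):
    """Steps 2-3: gradients of C = sum_k (y_k - a^L_k)^2 for one sample."""
    _, acts = forward(weights, biases, x)
    # Output-layer error, using sigma'(z) = a * (1 - a) with a = sigma(z).
    delta = [2.0 * (a - y_k) * a * (1.0 - a) for a, y_k in zip(acts[-1], y)]
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads_b[l] = delta
        grads_W[l] = [[d_j * a_k for a_k in acts[l]] for d_j in delta]
        if l > 0:
            # Propagate the error one layer back through weights[l].
            delta = [sum(weights[l][k][j] * delta[k] for k in range(len(delta)))
                     * acts[l][j] * (1.0 - acts[l][j])
                     for j in range(len(acts[l]))]
    return grads_W, grads_b

def sgd_step(weights, biases, minibatch, eta):
    """Steps 4-5: average the per-sample gradients, then update in place."""
    sum_W = [[[0.0] * len(row) for row in W] for W in weights]
    sum_b = [[0.0] * len(b) for b in biases]
    for x, y in minibatch:
        g_W, g_b = backprop(weights, biases, x, y)
        for l in range(len(weights)):
            for j in range(len(biases[l])):
                sum_b[l][j] += g_b[l][j]
                for k in range(len(weights[l][j])):
                    sum_W[l][j][k] += g_W[l][j][k]
    m = len(minibatch)
    for l in range(len(weights)):
        for j in range(len(biases[l])):
            biases[l][j] -= eta * sum_b[l][j] / m
            for k in range(len(weights[l][j])):
                weights[l][j][k] -= eta * sum_W[l][j][k] / m

# Hypothetical setup: a 2-3-1 network trained on the logical OR function.
random.seed(0)
sizes = [2, 3, 1]
weights = [[[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

# Step 6: loop over mini-batches of size 2 for a fixed number of epochs.
for epoch in range(200):
    random.shuffle(data)
    for start in range(0, len(data), 2):
        sgd_step(weights, biases, data[start:start + 2], eta=0.5)
```

Nested lists keep the indices aligned with the slide notation (`weights[l][j][k]` plays the role of $w^l_{jk}$), at the cost of speed; a vectorized version would follow the compact equations with the Hadamard product instead.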
