Neural Networks: Backpropagation
Machine Learning, Fall 2017
Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others.
Neural Networks
What is a neural network?
Predicting with a neural network
Training neural networks
Practical concerns

This lecture
What is a neural network?
Predicting with a neural network
Training neural networks: Backpropagation
Practical concerns
Training a neural network
Given:
A network architecture (layout of neurons, their connectivity and activations)
A dataset of labeled examples $S = \{(x_i, y_i)\}$
The goal: learn the weights of the neural network.
Remember: for a fixed architecture, a neural network is a function parameterized by its weights.
Prediction: $\hat{y} = NN(x, w)$
Recall: Learning as loss minimization
We have a classifier NN that is completely defined by its weights.
Learn the weights by minimizing a loss $L$:
$\min_w \sum_i L(NN(x_i, w), y_i)$
Perhaps with a regularizer.
So far, we saw that this strategy worked for:
1. Logistic regression
2. Support vector machines
3. Perceptron
4. LMS regression
Each of these minimizes a different loss function, and all of them are linear models. The same idea applies to non-linear models too!
Back to our running example
Given an input x, how is the output predicted?
Output: $y = w^2_{01} + w^2_{11} z_1 + w^2_{21} z_2$
where
$z_1 = \sigma(w_{01} + w_{11} x_1 + w_{21} x_2)$
$z_2 = \sigma(w_{02} + w_{12} x_1 + w_{22} x_2)$
Suppose the true label for this example is a number $y^*$.
We can write the square loss for this example as:
$L = \frac{1}{2}(y - y^*)^2$
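To make the running example concrete, here is a minimal sketch of the forward pass and the squared loss in Python/NumPy. The specific weights and inputs are made-up numbers for illustration; only the structure (two sigmoid hidden units feeding a linear output) comes from the slides.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, w_hidden, w_out):
    """Forward pass for the running example: two sigmoid hidden units, linear output.

    w_hidden[j] = (w_0j, w_1j, w_2j) are the weights into hidden unit z_{j+1};
    w_out = (w2_01, w2_11, w2_21) are the output-layer weights.
    """
    x1, x2 = x
    z1 = sigmoid(w_hidden[0][0] + w_hidden[0][1] * x1 + w_hidden[0][2] * x2)
    z2 = sigmoid(w_hidden[1][0] + w_hidden[1][1] * x1 + w_hidden[1][2] * x2)
    y = w_out[0] + w_out[1] * z1 + w_out[2] * z2
    return y, (z1, z2)

def square_loss(y, y_star):
    # L = 1/2 (y - y*)^2
    return 0.5 * (y - y_star) ** 2

# Hypothetical numbers, just to run the example end to end.
x = (1.0, -2.0)
w_hidden = [(0.1, 0.3, -0.2), (-0.4, 0.2, 0.5)]
w_out = (0.05, 0.6, -0.7)
y, _ = forward(x, w_hidden, w_out)
print(square_loss(y, y_star=1.0))
```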
Learning as loss minimization
We have a classifier NN that is completely defined by its weights.
Learn the weights by minimizing a loss $L$:
$\min_w \sum_i L(NN(x_i, w), y_i)$
Perhaps with a regularizer.
How do we solve the optimization problem?
Stochastic gradient descent
$\min_w \sum_i L(NN(x_i, w), y_i)$
Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^d$:
1. Initialize parameters $w$
   (The objective is not convex; initialization can be important.)
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example $(x_i, y_i) \in S$:
      Treat this example as the entire dataset.
      Compute the gradient of the loss: $\nabla L(NN(x_i, w), y_i)$
      Update: $w \leftarrow w - \gamma_t \nabla L(NN(x_i, w), y_i)$
      ($\gamma_t$: learning rate; many tweaks possible)
3. Return $w$
Have we solved everything?
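Below is a minimal sketch of this SGD loop in Python. The gradient function is left abstract (it is exactly what backpropagation will give us later), and the epoch count and fixed learning rate are placeholder choices, not values from the slides.

```python
import random

def sgd(train_set, init_w, grad_fn, num_epochs=10, lr=0.1):
    """Stochastic gradient descent over single examples.

    train_set: list of (x, y) pairs
    grad_fn(w, x, y): returns the gradient of the per-example loss at w
    """
    w = init_w
    for epoch in range(num_epochs):
        random.shuffle(train_set)            # step 2.1: shuffle the training set
        for x, y in train_set:               # step 2.2: one example at a time
            g = grad_fn(w, x, y)             # gradient of L(NN(x, w), y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]   # w <- w - gamma * grad
    return w
```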
The derivative of the loss function?
If the neural network is a differentiable function, we can find the gradient, or maybe its sub-gradient. This is decided by the activation functions and the loss function.
It was easy for SVMs and logistic regression: only one layer.
But how do we find the (sub-)gradient of a more complex function $L(NN(x_i, w), y_i)$? For example, a recent paper used a ~150-layer neural network for image classification!
We need an efficient algorithm: Backpropagation.
Checkpoint: Where are we?
If we have a neural network (structure, activations and weights), we can make a prediction for an input.
If we have the true label of the input, then we can define the loss for that example.
If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD.
Questions?
Reminder: Chain rule for derivatives (slides courtesy Richard Socher)
If $z$ is a function of $y$, and $y$ is a function of $x$, then $z$ is a function of $x$ as well. How do we find $\frac{dz}{dx}$? By the chain rule: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$.
If $z$ = a function of $y_1$ + a function of $y_2$, and the $y_i$'s are functions of $x$, then $z$ is a function of $x$ as well, and $\frac{dz}{dx} = \frac{\partial z}{\partial y_1}\frac{dy_1}{dx} + \frac{\partial z}{\partial y_2}\frac{dy_2}{dx}$.
More generally, if $z$ is a sum of functions of the $y_i$'s, and the $y_i$'s are functions of $x$, then $z$ is a function of $x$ as well, and $\frac{dz}{dx} = \sum_i \frac{\partial z}{\partial y_i}\frac{dy_i}{dx}$.
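As a quick sanity check on the last case, here is a small Python snippet (the functions are my own illustrative choices, not from the slides) that compares the chain-rule derivative with a finite-difference estimate.

```python
import math

# z = f1(y1) + f2(y2), with y1 = x**2 and y2 = sin(x)
def z_of_x(x):
    y1, y2 = x**2, math.sin(x)
    return math.exp(y1) + 3.0 * y2

def dz_dx(x):
    # chain rule: dz/dx = (dz/dy1)(dy1/dx) + (dz/dy2)(dy2/dx)
    y1 = x**2
    return math.exp(y1) * (2 * x) + 3.0 * math.cos(x)

x, eps = 0.7, 1e-6
numeric = (z_of_x(x + eps) - z_of_x(x - eps)) / (2 * eps)
print(dz_dx(x), numeric)   # the two values should agree closely
```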
Backpropagation
$L = \frac{1}{2}(y - y^*)^2$
Output: $y = w^2_{01} + w^2_{11} z_1 + w^2_{21} z_2$
$z_1 = \sigma(w_{01} + w_{11} x_1 + w_{21} x_2)$
$z_2 = \sigma(w_{02} + w_{12} x_1 + w_{22} x_2)$
We want to compute $\frac{\partial L}{\partial w}$ for every weight, in both the output layer and the hidden layer.
Backpropagation: applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up).
Backpropagation example: Output layer
$L = \frac{1}{2}(y - y^*)^2$, with $y = w^2_{01} + w^2_{11} z_1 + w^2_{21} z_2$
$\frac{\partial L}{\partial w^2_{01}} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial w^2_{01}}$
$\frac{\partial L}{\partial y} = y - y^*$
$\frac{\partial y}{\partial w^2_{01}} = 1$
Similarly, for $w^2_{11}$:
$\frac{\partial L}{\partial w^2_{11}} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial w^2_{11}}$
$\frac{\partial L}{\partial y} = y - y^*$
$\frac{\partial y}{\partial w^2_{11}} = z_1$
We have already computed the partial derivative $\frac{\partial L}{\partial y}$ for the previous case. Cache it to speed things up!
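A minimal sketch of these output-layer gradients in code, continuing the earlier forward-pass example (the variable names are my own; the slides only give the math). Note how dL_dy is computed once and reused.

```python
def output_layer_grads(y, y_star, z1, z2):
    """Gradients of L = 1/2 (y - y*)^2 w.r.t. the output-layer weights."""
    dL_dy = y - y_star          # computed once, cached and reused below
    return {
        "w2_01": dL_dy * 1.0,   # dy/dw2_01 = 1
        "w2_11": dL_dy * z1,    # dy/dw2_11 = z1
        "w2_21": dL_dy * z2,    # dy/dw2_21 = z2 (same pattern as w2_11)
    }
```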
Backpropagation example: Hidden layer derivatives
$L = \frac{1}{2}(y - y^*)^2$, with $y = w^2_{01} + w^2_{11} z_1 + w^2_{21} z_2$
$z_1 = \sigma(w_{01} + w_{11} x_1 + w_{21} x_2)$, $z_2 = \sigma(w_{02} + w_{12} x_1 + w_{22} x_2)$
We want $\frac{\partial L}{\partial w_{22}}$, and by the chain rule $\frac{\partial L}{\partial w_{22}} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial w_{22}}$.
$\frac{\partial y}{\partial w_{22}} = \frac{\partial}{\partial w_{22}}\left(w^2_{01} + w^2_{11} z_1 + w^2_{21} z_2\right) = w^2_{11}\frac{\partial z_1}{\partial w_{22}} + w^2_{21}\frac{\partial z_2}{\partial w_{22}}$
The first term is zero: $z_1$ is not a function of $w_{22}$.
So $\frac{\partial y}{\partial w_{22}} = w^2_{21}\frac{\partial z_2}{\partial w_{22}}$
$z_2 = \sigma(w_{02} + w_{12} x_1 + w_{22} x_2)$. Call the argument of the sigmoid $s$, so that $z_2 = \sigma(s)$.
Then $\frac{\partial y}{\partial w_{22}} = w^2_{21}\frac{\partial z_2}{\partial s}\frac{\partial s}{\partial w_{22}}$
Putting it together (from the previous slide):
$\frac{\partial L}{\partial w_{22}} = \frac{\partial L}{\partial y}\cdot w^2_{21}\cdot\frac{\partial z_2}{\partial s}\cdot\frac{\partial s}{\partial w_{22}}$
Each of these partial derivatives is easy:
$\frac{\partial L}{\partial y} = y - y^*$
$\frac{\partial z_2}{\partial s} = z_2(1 - z_2)$   (Why? Because $z_2 = \sigma(s)$ is the logistic function we have already seen.)
$\frac{\partial s}{\partial w_{22}} = x_2$
More important: we have already computed many of these partial derivatives, because we are proceeding from top to bottom (i.e. backwards).
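Continuing the sketch, here is the hidden-layer gradient for $w_{22}$ in code, reusing the cached dL_dy from the output-layer step (again, the names and structure are my own illustration of the math on the slides).

```python
def hidden_layer_grad_w22(y, y_star, z2, x2, w2_21):
    """Gradient of the loss w.r.t. the hidden-layer weight w_22.

    Chain rule: dL/dw_22 = dL/dy * dy/dz2 * dz2/ds * ds/dw_22
    """
    dL_dy = y - y_star          # already computed for the output layer; cache it
    dy_dz2 = w2_21              # y is linear in z2
    dz2_ds = z2 * (1.0 - z2)    # derivative of the logistic function
    ds_dw22 = x2                # s = w_02 + w_12*x1 + w_22*x2
    return dL_dy * dy_dz2 * dz2_ds * ds_dw22
```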
The Backpropagation Algorithm
The same algorithm works for multiple layers: repeated application of the chain rule for partial derivatives.
1. First perform a forward pass from the inputs to the output.
2. Compute the loss.
3. From the loss, proceed backwards to compute partial derivatives using the chain rule.
4. Cache partial derivatives as you compute them; they will be used for the lower layers.
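The following is a compact NumPy sketch of the full algorithm for a multi-layer network with sigmoid hidden layers, a linear output, and squared loss. The layer sizes and random initialization in the usage example are arbitrary assumptions for illustration, not something specified in the slides.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop(x, y_star, weights, biases):
    """Forward pass, squared loss, then backward pass with cached activations.

    weights, biases: per-layer parameters; sigmoid hidden layers and a linear
    output layer of size 1 (weights[-1] has shape (1, n_last_hidden)).
    Loss: L = 1/2 (y - y*)^2. Returns the loss and per-layer gradients.
    """
    # Forward pass: cache every layer's activation for reuse in the backward pass.
    activations = [np.asarray(x, dtype=float)]
    for W, b in zip(weights[:-1], biases[:-1]):
        activations.append(sigmoid(W @ activations[-1] + b))
    y = weights[-1] @ activations[-1] + biases[-1]        # linear output, shape (1,)
    loss = 0.5 * ((y - y_star) ** 2).item()

    # Backward pass: start from dL/dy and apply the chain rule layer by layer.
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    delta = y - y_star                                    # dL/dy (output is linear)
    for l in range(len(weights) - 1, -1, -1):
        grads_W[l] = np.outer(delta, activations[l])      # dL/dW_l
        grads_b[l] = delta                                # dL/db_l
        if l > 0:
            z = activations[l]                            # output of the sigmoid layer below
            delta = (weights[l].T @ delta) * z * (1.0 - z)   # chain rule through sigma
    return loss, grads_W, grads_b

# Example usage with a two-hidden-unit network like the running example (made-up numbers):
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 2)), rng.normal(size=(1, 2))]
biases = [rng.normal(size=2), rng.normal(size=1)]
loss, gW, gb = backprop(x=[1.0, -2.0], y_star=1.0, weights=weights, biases=biases)
```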
Mechanizing learning
Backpropagation gives you the gradient that will be used for gradient descent.
SGD gives us a generic learning algorithm.
Backpropagation is a generic method for computing partial derivatives: a recursive algorithm that proceeds from the top of the network to the bottom.
Modern neural network libraries implement automatic differentiation using backpropagation. This allows easy exploration of network architectures: we don't have to keep deriving the gradients by hand each time.
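As an illustration of what such libraries do (PyTorch is used here only as an example of a library with automatic differentiation; it is not referenced in the slides, and the weight values are made up), the gradients for the running example can be obtained without deriving anything by hand:

```python
import torch

# Two-hidden-unit network from the running example, with made-up weight values.
x1, x2, y_star = 1.0, -2.0, 1.0
w1 = torch.tensor([[0.3, -0.2], [0.2, 0.5]], requires_grad=True)   # hidden-layer weights
b1 = torch.tensor([0.1, -0.4], requires_grad=True)                 # hidden-layer biases
w2 = torch.tensor([0.6, -0.7], requires_grad=True)                 # output-layer weights
b2 = torch.tensor(0.05, requires_grad=True)                        # output-layer bias

x = torch.tensor([x1, x2])
z = torch.sigmoid(w1 @ x + b1)          # hidden layer
y = w2 @ z + b2                         # linear output
loss = 0.5 * (y - y_star) ** 2

loss.backward()                         # backpropagation, done automatically
print(w1.grad, w2.grad)                 # dL/dw for each parameter
```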
Stochastic gradient descent
$\min_w \sum_i L(NN(x_i, w), y_i)$
Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^d$:
1. Initialize parameters $w$
   (The objective is not convex; initialization can be important.)
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example $(x_i, y_i) \in S$:
      Treat this example as the entire dataset.
      Compute the gradient of the loss $\nabla L(NN(x_i, w), y_i)$ using backpropagation
      Update: $w \leftarrow w - \gamma_t \nabla L(NN(x_i, w), y_i)$
      ($\gamma_t$: learning rate; many tweaks possible)
3. Return $w$