Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis




2 Overview of the Lecture
Why multilayer perceptrons?
Some applications of multilayer perceptrons.
Learning with multilayer perceptrons:
  Minimising error using gradient descent.
  How to update weights.
  The backpropagation learning algorithm.

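3 Recap: Perceptrons
[Figure: a single perceptron. A constant input 1 with weight $w_0$ and inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ feed the weighted sum $\sum_{i=0}^{n} w_i x_i$, which passes through a step function to give the output $y$, either 0 or 1.]
A key limitation of the single perceptron: it only works for linearly separable data.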
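As a reminder of how the single perceptron computes its output, here is a minimal sketch. The function name `perceptron_output`, the threshold at zero, and the AND weights are illustrative assumptions, not taken from the slides.

```python
def perceptron_output(weights, inputs):
    """Step-activated perceptron; weights[0] is the bias weight w_0."""
    # Prepend the constant bias input x_0 = 1.
    extended = [1.0] + list(inputs)
    total = sum(w * x for w, x in zip(weights, extended))
    # Assumed threshold: fire when the weighted sum is non-negative.
    return 1 if total >= 0 else 0


# Hypothetical weights that implement logical AND (a linearly separable problem).
and_weights = [-1.5, 1.0, 1.0]
and_outputs = [perceptron_output(and_weights, x)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

No single choice of weights does the same for XOR, which is the limitation that motivates adding hidden layers.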

4 Multilayer Perceptrons (MLPs)
[Figure: an MLP with an input layer, a hidden layer and an output layer.]
MLPs are feed-forward neural networks, organised in layers:
  One input layer of distribution points.
  One or more hidden layers of artificial neurons (nodes).
  One output layer of artificial neurons (nodes).
Each node in a layer is connected to every node in the next layer.
Each connection has a weight (remember that the weight can be zero).
MLPs are universal approximators!

5 MLPs: Universal approximators
Figure: Partitioning of the input space by linear threshold nodes in an MLP with two hidden layers and one output node in the output layer, with examples of separable decision regions.

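6 Activation Functions
[Figure: a neuron with inputs $x_1, \dots, x_n$ and weights $w_1, \dots, w_n$ feeding a sum $\Sigma$ whose value $h$ passes through an activation function to give the output $y$; plots of $y$ against $h$ for the step function and the sigmoid function.]
Step function: outputs 0 or 1.
Sigmoid function: outputs a real value between 0 and 1.
With sigmoid activation functions we would get smooth transitions instead of hard-lined decision boundaries (e.g. in the previous slide's example).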
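The difference between the two activation functions can be sketched in a few lines. The names `step` and `sigmoid` and the threshold of zero are assumptions made for this illustration.

```python
import math


def step(h, threshold=0.0):
    """Hard threshold: jumps from 0 to 1 at the (assumed) threshold."""
    return 1 if h >= threshold else 0


def sigmoid(h):
    """Smooth squashing function with values strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-h))


# Compare the two functions on inputs either side of the threshold.
table = [(h, step(h), sigmoid(h)) for h in (-2.0, -0.1, 0.1, 2.0)]
```

Near the threshold the step function flips abruptly, while the sigmoid changes only slightly, which is what gives the smooth decision boundaries.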

7 Sigmoids...

8 Applications of MLPs
MLPs have been applied to a very wide range of problems:
  Classification problems.
  Regression problems.
Examples:
  Patients' lengths of stay.
  Stock price forecasting.
  Image recognition.
  Car driving control.

9 Learning with MLPs
[Figure: an MLP with an input layer, a hidden layer and an output layer.]
As with single perceptrons, finding the right weights is very hard!
Solution technique: learning!
As with perceptrons, learning means adjusting the weights based on training examples.

10 Supervised Learning
General idea:
1. Send the MLP an input pattern, x, from the training set.
2. Get the output from the MLP, y.
3. Compare y with the right answer, or target t, to get the error quantity.
4. Use the error quantity to modify the weights, so next time y will be closer to t.
5. Repeat with another x from the training set.
Of course, when updating weights after seeing x, the network doesn't just change the way it deals with x, but other inputs too, including inputs it has not seen yet!
The ability to deal accurately with unseen inputs is called generalisation.

11 Learning and Error Minimisation
Recall the perceptron learning rule, which tries to minimise the difference between the actual and desired outputs:
$w_i = w_i + \alpha (t - y) x_i$
We define an error function to represent such a difference over a set of inputs.
Typical error function: mean squared error (MSE). For example, the mean squared error can be defined as:
$E(\vec{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t^p - o^p)^2$
where N is the number of patterns, $t^p$ is the target output for pattern p, and $o^p$ is the output obtained for pattern p.
The 2 in the denominator makes little difference, but is inserted to make life easier later on!
This tells us how good the neural network is. Learning aims to minimise the error $E(\vec{w})$ by adjusting the weights $\vec{w}$.
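A small sketch of the error function defined above, for a network with a single output. The function name `mean_squared_error` and the sample targets and outputs are made up for illustration.

```python
def mean_squared_error(targets, outputs):
    """E = 1 / (2N) * sum over patterns of (t^p - o^p)^2."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / (2 * n)


# Hypothetical targets and network outputs for N = 4 patterns.
targets = [1.0, 0.0, 1.0, 0.0]
outputs = [0.9, 0.2, 0.6, 0.1]
error = mean_squared_error(targets, outputs)
```

Here the squared differences are 0.01, 0.04, 0.16 and 0.01, so the error is 0.22 / 8 = 0.0275; a perfect network would score 0.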

12 Gradient Descent
One technique that can be used for minimising functions is gradient descent. Can we use this on our error function E?
We would like a learning rule that tells us how to update weights, like this:
$w_{ij} = w_{ij} + \Delta w_{ij}$
But what should $\Delta w_{ij}$ be?

13 Summary So Far...
We learnt what a multilayer perceptron is and why we might need one.
We know the sort of learning rule we would like, to update weights in order to minimise the error.
However, we don't yet know what this learning rule should be.
We need to learn a bit about gradients and derivatives to figure it out! After that, we can come back to the learning rule.

14 Gradient and Derivatives: The idea
The derivative is a measure of the rate of change of a function, as its input changes.
Consider a function $y = f(x)$. The derivative $\frac{dy}{dx}$ indicates how much y changes in response to changes in x.
If x and y are real numbers, and if the graph of y is plotted against x, the derivative measures the slope or gradient of the line at each point, i.e., it describes the steepness or incline.

15 Gradient and Derivatives: The idea
$\frac{dy}{dx} > 0$ implies that y increases as x increases. If we want to find the minimum y, we should reduce x.
$\frac{dy}{dx} < 0$ implies that y decreases as x increases. If we want to find the minimum y, we should increase x.
$\frac{dy}{dx} = 0$ implies that we are at a minimum or maximum or a plateau.
To get closer to the minimum:
$x_{new} = x_{old} - \eta \frac{dy}{dx}$
where $\eta$ indicates how much we would like to reduce or increase x and $\frac{dy}{dx}$ tells us the correct direction to go.
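The update $x_{new} = x_{old} - \eta \frac{dy}{dx}$ can be tried on a simple function. This is only a sketch: the function name `gradient_descent_1d`, the example function $y = (x - 3)^2$, the step size and the number of steps are all assumptions chosen for illustration.

```python
def gradient_descent_1d(dy_dx, x_start, eta=0.1, steps=100):
    """Repeatedly apply x_new = x_old - eta * dy/dx."""
    x = x_start
    for _ in range(steps):
        x = x - eta * dy_dx(x)
    return x


# Example function y = (x - 3)^2, whose derivative is dy/dx = 2 * (x - 3).
# Its minimum is at x = 3.
x_min = gradient_descent_1d(lambda x: 2.0 * (x - 3.0), x_start=0.0)
```

Starting from x = 0 the derivative is negative, so x is increased; each step shrinks the distance to the minimum by a constant factor for this particular function and step size.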

16 Gradient and Derivatives: The idea
OK... so we know how to use derivatives to adjust one input value. But we have several weights to adjust! We need to use partial derivatives.
A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.
Example: if $y = f(x_1, x_2)$, then we can have $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$.
In our learning rule case, if we can work out the partial derivatives, we can use this rule to update the weights:
Learning rule: $w_{ij} = w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$.
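One way to get a feel for partial derivatives is to estimate them numerically, nudging one variable while holding the others constant. The helper `partial_derivative` and the example function below are hypothetical; backpropagation itself uses exact derivatives rather than this estimate.

```python
def partial_derivative(f, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative of f
    with respect to point[index], with the other variables held constant."""
    up = list(point)
    down = list(point)
    up[index] += h
    down[index] -= h
    return (f(up) - f(down)) / (2 * h)


def f(x):
    # Example function y = x_1^2 + 3 * x_1 * x_2.
    return x[0] ** 2 + 3.0 * x[0] * x[1]


point = [2.0, 1.0]
dy_dx1 = partial_derivative(f, point, 0)  # analytically 2*x_1 + 3*x_2
dy_dx2 = partial_derivative(f, point, 1)  # analytically 3*x_1
```

At the point (2, 1) the analytic values are 7 and 6, and the numerical estimates should agree closely.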

17 Summary So Far...
We learnt what a multilayer perceptron is and why we might need them.
We know a learning rule for updating weights in order to minimise the error:
$w_{ij} = w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$.
$\Delta w_{ij}$ tells us in which direction and how much we should change each weight to roll down the slope (descend the gradient) of the error function E.
So, how do we calculate $\frac{\partial E}{\partial w_{ij}}$?

18 Using Gradient Descent to Minimise the Error
Recall the mean squared error function E, which we want to minimise:
$E(\vec{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t^p - o^p)^2$
If we use a sigmoid activation function f, then the output of neuron i for pattern p is
$o_i^p = f(u_i) = \frac{1}{1 + e^{-a u_i}}$
where a is a pre-defined constant, and $u_i$ is the result of the input function in neuron i:
$u_i = \sum_j w_{ij} x_{ij}$
For the pth pattern and the ith neuron, we use gradient descent on the error function:
$\Delta w_{ij} = -\eta \frac{\partial E^p}{\partial w_{ij}} = \eta (t_i^p - o_i^p) f'(u_i) x_{ij}$
where $f'(u_i) = \frac{df}{du_i}$ is the derivative of f with respect to $u_i$.
If f is the sigmoid function, $f'(u_i) = a f(u_i)(1 - f(u_i))$.

19 Using Gradient Descent to Minimise the Error
So, we can update weights after processing each pattern, using this rule:
$\Delta w_{ij} = \eta (t_i^p - o_i^p) f'(u_i) x_{ij}$
$\Delta w_{ij} = \eta \delta_i^p x_{ij}$
This is known as the generalised delta rule.
Note that we need to use the derivative of the activation function f. So, f must be differentiable!
The threshold activation function is not continuous, thus not differentiable!
The sigmoid activation function has a derivative which is easy to calculate.
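The generalised delta rule for one sigmoid neuron might be sketched as follows. The function names, the learning rate of 0.5, the choice a = 1, and the convention that `inputs[0]` is a constant bias input of 1 are all assumptions for this illustration.

```python
import math


def sigmoid(u, a=1.0):
    return 1.0 / (1.0 + math.exp(-a * u))


def delta_rule_step(weights, inputs, target, eta=0.5, a=1.0):
    """One update w_ij += eta * delta_i * x_ij for a single sigmoid neuron."""
    u = sum(w * x for w, x in zip(weights, inputs))
    o = sigmoid(u, a)
    # delta = f'(u) * (t - o), with f'(u) = a * f(u) * (1 - f(u)).
    delta = a * o * (1.0 - o) * (target - o)
    new_weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
    return new_weights, o


# Repeatedly present one pattern with target 1; inputs[0] is the bias input.
weights = [0.0, 0.0]
inputs = [1.0, 1.0]
outputs_seen = []
for _ in range(200):
    weights, o = delta_rule_step(weights, inputs, target=1.0)
    outputs_seen.append(o)
```

With zero initial weights the first output is 0.5, and each update moves the output a little closer to the target.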


20 Updating Output vs Hidden Neurons
We can update output neurons using the generalised delta rule:
$\Delta w_{ij} = \eta \delta_i^p x_{ij}$
$\delta_i^p = f'(u_i)(t_i^p - o_i^p)$
This $\delta_i^p$ is only good for the output neurons, since it relies on target outputs. But we don't have target outputs for the hidden nodes! What can we use instead?
$\delta_i^p = f'(u_i) \sum_k w_{ki} \delta_k^p$
where k ranges over the nodes in the next layer that hidden node i feeds into.
This rule propagates error back from output nodes to hidden nodes. In effect, it blames hidden nodes according to how much influence they had.
So, now we have rules for updating both output and hidden neurons!
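The two delta formulas can be compared side by side in a short sketch. The function names and the numbers are illustrative assumptions, and the sigmoid constant a is taken to be 1 by default.

```python
def output_delta(o, target, a=1.0):
    """delta for an output neuron: f'(u) * (t - o), using f'(u) = a * o * (1 - o)."""
    return a * o * (1.0 - o) * (target - o)


def hidden_delta(o_hidden, outgoing_weights, next_deltas, a=1.0):
    """delta for a hidden neuron i: f'(u_i) * sum_k w_ki * delta_k.
    outgoing_weights[k] is the weight from this hidden node to node k
    in the next layer, and next_deltas[k] is that node's delta."""
    blame = sum(w * d for w, d in zip(outgoing_weights, next_deltas))
    return a * o_hidden * (1.0 - o_hidden) * blame


# One output neuron with output 0.6 and target 1.0.
d_out = output_delta(o=0.6, target=1.0)
# A hidden neuron with output 0.5 connected to it by a weight of 2.0.
d_hid = hidden_delta(o_hidden=0.5, outgoing_weights=[2.0], next_deltas=[d_out])
```

Here the output delta is 0.6 × 0.4 × 0.4 = 0.096 and the hidden delta is 0.25 × 2.0 × 0.096 = 0.048. A hidden node whose outgoing weight is zero would receive no blame at all.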

21 Backpropagation
[Figure: an MLP with an input layer, a hidden layer and an output layer, showing forward propagation of the input signals and backpropagation of the error.]
Backpropagation updates an MLP's weights based on gradient descent of the error function.

22 Online/Stochastic Backpropagation
Initialize all weights to small random values.
repeat
    for each training example do
        Forward propagate the input features of the example to determine the MLP's outputs.
        Backpropagate the error to generate $\Delta w_{ij}$ for all weights $w_{ij}$.
        Update the weights using $\Delta w_{ij}$.
    end for
until stopping criteria reached.
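The online algorithm could be sketched in plain Python roughly as below. Everything here is an assumption made for illustration: the function names, the weight range of ±0.5, the sigmoid with a = 1, the fixed epoch budget as the stopping criterion, and the XOR training set.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def init_network(n_in, n_hidden, n_out, rng):
    """Each node stores [bias weight, w_1, ..., w_n], drawn from a small range."""
    def layer(n_nodes, n_inputs):
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_nodes)]
    return [layer(n_hidden, n_in), layer(n_out, n_hidden)]


def forward(network, x):
    """Return the activations of every layer, starting with the input itself."""
    activations = [list(x)]
    for layer in network:
        prev = activations[-1]
        activations.append([
            sigmoid(node[0] + sum(w * v for w, v in zip(node[1:], prev)))
            for node in layer
        ])
    return activations


def backward(network, activations, target):
    """Return one delta per node, layer by layer."""
    deltas = [None] * len(network)
    out = activations[-1]
    # Output layer: delta = f'(u) * (t - o).
    deltas[-1] = [o * (1 - o) * (t - o) for o, t in zip(out, target)]
    # Hidden layers: delta_i = f'(u_i) * sum_k w_ki * delta_k.
    for l in range(len(network) - 2, -1, -1):
        nxt = network[l + 1]
        deltas[l] = [
            o * (1 - o) * sum(nxt[k][i + 1] * deltas[l + 1][k]
                              for k in range(len(nxt)))
            for i, o in enumerate(activations[l + 1])
        ]
    return deltas


def update(network, activations, deltas, eta):
    """Apply w_ij += eta * delta_i * x_ij (the bias input is 1)."""
    for l, layer in enumerate(network):
        inputs = [1.0] + activations[l]
        for i, node in enumerate(layer):
            for j in range(len(node)):
                node[j] += eta * deltas[l][i] * inputs[j]


def train_online(network, data, eta=0.5, epochs=1000):
    for _ in range(epochs):  # assumed stopping criterion: fixed epoch budget
        for x, t in data:
            acts = forward(network, x)
            deltas = backward(network, acts, t)
            update(network, acts, deltas, eta)  # update after every example


def mse(network, data):
    total = sum((t[0] - forward(network, x)[-1][0]) ** 2 for x, t in data)
    return total / (2 * len(data))


xor_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
network = init_network(2, 3, 1, random.Random(0))
error_before = mse(network, xor_data)
train_online(network, xor_data, eta=0.5, epochs=2000)
error_after = mse(network, xor_data)
```

How far the error falls on XOR depends on the initial weights, the learning rate and the number of epochs; some runs may linger on a plateau for a long time before improving.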

23 Batch Backpropagation
Initialize all weights to small random values.
repeat
    for each training example do
        Forward propagate the input features of the example to determine the MLP's outputs.
        Backpropagate the error to generate $\Delta w_{ij}$ for all weights $w_{ij}$.
    end for
    Update the weights based on the accumulated values $\Delta w_{ij}$.
until stopping criteria reached.
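The key difference from the online version is that the weight changes are accumulated over all examples and applied once per pass. To keep the sketch short it uses a single sigmoid neuron, so only the output-layer delta appears; for a full MLP the hidden-layer changes would be accumulated in the same way. The function names, zero initial weights, the learning rate, the sigmoid with a = 1, and the OR training set are assumptions for illustration.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def batch_epoch(weights, data, eta):
    """One batch pass for a single sigmoid neuron; weights[0] is the bias weight."""
    accumulated = [0.0] * len(weights)  # running sum of the weight changes
    for x, t in data:
        inputs = [1.0] + list(x)
        o = sigmoid(sum(w * v for w, v in zip(weights, inputs)))
        delta = o * (1.0 - o) * (t - o)  # output-neuron delta
        for j, v in enumerate(inputs):
            accumulated[j] += eta * delta * v
    # The weights change only once, after every example has been seen.
    return [w + dw for w, dw in zip(weights, accumulated)]


def mse(weights, data):
    total = 0.0
    for x, t in data:
        o = sigmoid(sum(w * v for w, v in zip(weights, [1.0] + list(x))))
        total += (t - o) ** 2
    return total / (2 * len(data))


or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights = [0.0, 0.0, 0.0]
error_before = mse(weights, or_data)
for _ in range(500):  # assumed stopping criterion: fixed number of passes
    weights = batch_epoch(weights, or_data, eta=0.5)
error_after = mse(weights, or_data)
```

Because every example in a pass is evaluated with the same weights, the accumulated change follows the gradient of the total error over the training set, whereas the online version follows the gradient of one pattern's error at a time.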


24 Summary
We learnt what a multilayer perceptron is and why we might need them.
We have some intuition about using gradient descent on an error function.
We know a learning rule for updating weights in order to minimise the error: $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$
If we use the squared error, we get the generalised delta rule $\Delta w_{ij} = \eta \delta_i^p x_{ij}$.
We know how to calculate $\delta_i^p$ for both output and hidden layers.
We can use this rule to learn an MLP's weights using the backpropagation algorithm.

25 Further Reading
Gurney K. Introduction to Neural Networks. 3rd ed. Taylor and Francis.
