Multi-layer networks
Steve Renals, Machine Learning Practical, MLP Lecture 3, 7 October 2015

What Do Neural Networks Do?
What Do Single-Layer Neural Networks Do?
Single-layer network
A single-layer network with 1 output, 2 inputs (x_1, x_2) and a bias input x_0.
[Figure: diagram of inputs x_0, x_1, x_2 connected to a single output unit]
Geometric interpretation
Single-layer network, 1 output, 2 inputs + bias. The decision boundary is the line where \( y(\mathbf{w}; \mathbf{x}) = 0 \) in the \( (x_1, x_2) \) plane; the weight vector \( \mathbf{w} \) is normal to the boundary and the bias \( w_0 \) sets its offset from the origin (Bishop, sec 3.1).
[Figure: decision boundary \( y(\mathbf{w}; \mathbf{x}) = 0 \) in the \( (x_1, x_2) \) plane, with weight vector \( \mathbf{w} \) and bias \( w_0 \)]
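A minimal NumPy sketch of this geometry (not from the slides; the weight values are invented for illustration): the sign of \( y(\mathbf{w}; \mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + w_0 \) says which side of the linear boundary a point falls on.

```python
import numpy as np

# Illustrative weights for a single output unit with 2 inputs + bias.
w = np.array([1.0, -2.0])   # weight vector, normal to the decision boundary
w0 = 0.5                    # bias weight

def y(x):
    """Linear output y(w; x) = w . x + w0; the decision boundary is y = 0."""
    return np.dot(w, x) + w0

x = np.array([2.0, 1.5])
print(y(x))                 # > 0 on one side of the boundary, < 0 on the other
```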
Example data (three classes)
[Figure: scatter plot of two-dimensional example data from three classes]
Classification regions with single-layer network
[Figure: decision regions learned by a single-layer network for the three-class data]
Single-layer networks are limited to linear classification boundaries.
Single-layer network trained on MNIST digits
[Figure: 784 inputs (28x28 pixels) plus a bias, fully connected to 10 outputs (digits 0-9) through a 785x10 weight matrix]
The output weights define a template for each class.
Hinton diagrams
Visualise the weights for class k.
[Figure: Hinton diagram of the weights from 400 (20x20) inputs]
Hinton diagram for single-layer network trained on MNIST
[Figure: Hinton diagrams of the learned weights for digit classes 0, 1, 2, 3]
Weights for each class act as a discriminative template.
The inner product of the class weights and the input measures closeness to each template.
Classify to the closest template (maximum value output).
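A hedged sketch of this "closest template" view, assuming the trained weights are stored as a 784x10 matrix W with a length-10 bias vector b (the names and values below are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 10))   # one column of weights (a "template") per digit class
b = np.zeros(10)                 # one bias per class
x = rng.random(784)              # a flattened 28x28 input image

# Inner product of the input with each class template, plus the bias ...
scores = x @ W + b
# ... then classify to the closest template, i.e. the maximum-valued output.
predicted_class = int(np.argmax(scores))
print(predicted_class)
```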
Multi-Layer Networks
From templates to features
Good classification needs to cope with the variability of real data: scale, skew, rotation, translation, ...
This is very difficult to do with a single template per class.
We could have multiple templates per class... this will work, but we can do better:
use features rather than templates.
(Images from: Michael Nielsen, Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/chap1.html)
Incorporating features in neural network architecture
Layered processing: inputs - features - classification.
How to obtain the features? By learning!
[Figure: network with an input layer, a hidden feature layer, and output classes 0-9 plus bias]
Multi-layer network
[Figure: two-layer network with inputs x_i, pre-activations a^(1)_j and a^(2)_k, sigmoid hidden units h^(1)_j and softmax outputs y_k, connected by weights w^(1)_{ji} and w^(2)_{kj}]
\[ y_k = \mathrm{softmax}\Big( \sum_{r=0}^{H} w^{(2)}_{kr} h^{(1)}_r \Big) \quad ; \quad h^{(1)}_j = \mathrm{sigmoid}\Big( \sum_{s=0}^{d} w^{(1)}_{js} x_s \Big) \]
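A minimal NumPy sketch of this forward pass for one example. It assumes weight matrices W1 (d x H) and W2 (H x K) with explicit bias vectors rather than the x_0 = 1 / h_0 = 1 convention of the summations above; all names are mine, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))     # subtract the max for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Forward pass: inputs -> sigmoid hidden layer -> softmax outputs."""
    a1 = x @ W1 + b1              # hidden pre-activations a^(1)_j
    h1 = sigmoid(a1)              # hidden activations h^(1)_j
    a2 = h1 @ W2 + b2             # output pre-activations a^(2)_k
    y = softmax(a2)               # output class probabilities y_k
    return y, h1
```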
Multi-layer network for MNIST
(Image from: Michael Nielsen, Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/chap1.html)
Training MLPs: Credit assignment
Hidden units make training the weights more complicated, since the hidden units affect the error function indirectly via all the outputs.
The credit assignment problem: what is the error of a hidden unit? How important is input-to-hidden weight \( w^{(1)}_{ji} \) to output unit k?
Solution: gradient descent requires derivatives of the error with respect to each weight.
Algorithm: back-propagation of error (backprop).
Backprop gives a way to compute the derivatives. These derivatives are used by an optimisation algorithm (e.g. gradient descent) to train the weights, as in the sketch below.
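The gradient-descent update itself is simple once the derivatives are available; here is a minimal sketch with a fixed learning rate (the weight shapes, values and names are assumptions for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(784, 100))   # input-to-hidden weights
W2 = rng.normal(scale=0.1, size=(100, 10))    # hidden-to-output weights

# Suppose backprop has produced these derivatives of E^n (dummy values here).
grad_W1 = np.zeros_like(W1)
grad_W2 = np.zeros_like(W2)

eta = 0.1                                     # learning rate (illustrative value)
W1 -= eta * grad_W1                           # gradient-descent weight updates
W2 -= eta * grad_W2
```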
Training MLPs: Error function and required gradients
Cross-entropy error function:
\[ E^n = -\sum_{k=1}^{C} t^n_k \ln y^n_k \]
Required gradients: \( \dfrac{\partial E^n}{\partial w^{(2)}_{kj}} \) and \( \dfrac{\partial E^n}{\partial w^{(1)}_{ji}} \)
Gradient for hidden-to-output weights is similar to the single-layer network:
\[ \frac{\partial E^n}{\partial w^{(2)}_{kj}} = \left( \sum_{c=1}^{C} \frac{\partial E^n}{\partial y_c} \frac{\partial y_c}{\partial a^{(2)}_k} \right) \frac{\partial a^{(2)}_k}{\partial w^{(2)}_{kj}} = \underbrace{(y_k - t_k)}_{\delta^{(2)}_k} \, h^{(1)}_j \]
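In NumPy, for a single training case, this hidden-to-output gradient is one outer product. A sketch, assuming the forward pass has produced the outputs y and hidden activations h1, with a one-hot target t (the values are illustrative):

```python
import numpy as np

y  = np.array([0.1, 0.7, 0.2])        # softmax outputs y_k
t  = np.array([0.0, 1.0, 0.0])        # one-hot target t_k
h1 = np.array([0.3, 0.9, 0.5, 0.1])   # hidden activations h^(1)_j

delta2 = y - t                        # delta^(2)_k = y_k - t_k
grad_W2 = np.outer(h1, delta2)        # dE/dw^(2)_{kj} = delta^(2)_k h^(1)_j, shape (H, K)
```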
Training MLPs: Input-to-hidden weights
\[ \frac{\partial E^n}{\partial w^{(1)}_{ji}} = \underbrace{\frac{\partial E^n}{\partial a^{(1)}_j}}_{\delta^{(1)}_j} \, \underbrace{\frac{\partial a^{(1)}_j}{\partial w^{(1)}_{ji}}}_{x_i} \]
To compute \( \delta^{(1)}_j = \partial E^n / \partial a^{(1)}_j \), the error signal for hidden unit j, we must sum over all the output units' contributions to \( \delta^{(1)}_j \):
\[ \delta^{(1)}_j = \sum_{c=1}^{K} \frac{\partial E^n}{\partial a^{(2)}_c} \frac{\partial a^{(2)}_c}{\partial a^{(1)}_j} = \sum_{c=1}^{K} \delta^{(2)}_c \frac{\partial a^{(2)}_c}{\partial h^{(1)}_j} \frac{\partial h^{(1)}_j}{\partial a^{(1)}_j} = \left( \sum_{c=1}^{K} \delta^{(2)}_c w^{(2)}_{cj} \right) h^{(1)}_j \big(1 - h^{(1)}_j\big) \]
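Continuing the same single-example sketch, the hidden error signal and the input-to-hidden gradient can be written as follows (assuming W2 has shape H x K so that row j holds the weights \( w^{(2)}_{cj} \) leaving hidden unit j; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W2 = rng.normal(size=(4, 3))            # hidden-to-output weights, shape (H, K)
delta2 = np.array([0.1, -0.3, 0.2])     # output error signals delta^(2)_c
h1 = np.array([0.3, 0.9, 0.5, 0.1])     # hidden activations h^(1)_j
x  = np.array([1.0, 0.5])               # inputs x_i

# delta^(1)_j = (sum_c delta^(2)_c w^(2)_{cj}) * h^(1)_j * (1 - h^(1)_j)
delta1 = (W2 @ delta2) * h1 * (1.0 - h1)
# dE/dw^(1)_{ji} = delta^(1)_j x_i, stored with shape (d, H) to match x @ W1
grad_W1 = np.outer(x, delta1)
```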
Training MLPs: Gradients
\[ \frac{\partial E^n}{\partial w^{(2)}_{kj}} = \underbrace{(y_k - t_k)}_{\delta^{(2)}_k} \, h^{(1)}_j \]
\[ \frac{\partial E^n}{\partial w^{(1)}_{ji}} = \underbrace{\left( \sum_{c=1}^{K} \delta^{(2)}_c w^{(2)}_{cj} \right) h^{(1)}_j \big(1 - h^{(1)}_j\big)}_{\delta^{(1)}_j} \, x_i \]
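Not on the slides, but a common way to sanity-check these formulas is to compare the backprop gradients against a central finite-difference estimate of \( \partial E^n / \partial w \). A rough sketch, assuming a function error(W) that runs the forward pass with the current value of W and returns \( E^n \):

```python
import numpy as np

def check_grad(error, W, grad_W, eps=1e-6):
    """Compare an analytic gradient grad_W with finite differences of error(W)."""
    num = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    for _ in it:
        j = it.multi_index
        orig = W[j]
        W[j] = orig + eps
        e_plus = error(W)
        W[j] = orig - eps
        e_minus = error(W)
        W[j] = orig                       # restore the weight
        num[j] = (e_plus - e_minus) / (2 * eps)
    return np.max(np.abs(num - grad_W))   # should be tiny if the gradients agree
```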
Back-propagation of error: hidden unit error signal
[Figure: outputs y_1 ... y_K with error signals \( \delta^{(2)}_1, \dots, \delta^{(2)}_K \), connected by weights \( w^{(2)}_{1j}, \dots, w^{(2)}_{Kj} \) to hidden unit \( h_j \), which receives input x_i through weight \( w^{(1)}_{ji} \)]
\[ \delta^{(1)}_j = \Big( \sum_{\ell} \delta^{(2)}_\ell w^{(2)}_{\ell j} \Big) \, h_j (1 - h_j) \]
Back-propagation of error
The back-propagation of error algorithm is summarised as follows:
1. Apply an input vector from the training set, x, to the network and forward propagate to obtain the output vector y
2. Using the target vector t, compute the error E^n
3. Evaluate the error signals \( \delta^{(2)}_k \) for each output unit
4. Evaluate the error signals \( \delta^{(1)}_j \) for each hidden unit using back-propagation of error
5. Evaluate the derivatives for each training pattern, summing to obtain the overall derivatives
Back-propagation can be extended to multiple hidden layers, in each case computing the \( \delta^{(l)} \)s for the current layer as a weighted sum of the \( \delta^{(l+1)} \)s of the next layer. A sketch of a complete training step is given below.
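Putting the five steps together for the two-layer network, a minimal single-example training step might look like the following sketch (weight shapes, names and the in-place gradient-descent update are my assumptions, consistent with the earlier snippets):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def train_step(x, t, W1, b1, W2, b2, eta=0.1):
    # 1. Forward propagate to obtain the output vector y.
    h1 = sigmoid(x @ W1 + b1)
    y = softmax(h1 @ W2 + b2)
    # 2. Cross-entropy error for this training pattern.
    E = -np.sum(t * np.log(y))
    # 3. Output error signals: delta^(2)_k = y_k - t_k.
    delta2 = y - t
    # 4. Hidden error signals by back-propagation of error.
    delta1 = (W2 @ delta2) * h1 * (1.0 - h1)
    # 5. Derivatives, followed here by a gradient-descent weight update.
    W2 -= eta * np.outer(h1, delta2); b2 -= eta * delta2
    W1 -= eta * np.outer(x, delta1);  b1 -= eta * delta1
    return E
```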
Training with multiple hidden layers
[Figure: network with inputs x_i, two sigmoid hidden layers \( h^{(1)}_j \) and \( h^{(2)}_k \), and outputs y_1 ... y_K; error signals \( \delta^{(1)}, \delta^{(2)}, \delta^{(3)} \) are propagated backwards through the weights \( w^{(1)}, w^{(2)}, w^{(3)} \)]
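A sketch of how the recursion extends to any number of sigmoid hidden layers: each layer's error signal is a weighted sum of the next layer's error signals times the local sigmoid derivative. The list-of-matrices layout (Ws, hs) is my assumption, not notation from the slides:

```python
import numpy as np

def backprop_deltas(Ws, hs, delta_out):
    """Propagate error signals backwards through sigmoid hidden layers.

    Ws[l] maps layer l to layer l+1 (Ws[0] is input-to-hidden, Ws[-1] is
    last-hidden-to-output); hs lists the hidden activations from the forward
    pass, in order; delta_out is the output error signal (y - t for softmax
    with cross-entropy). Returns hidden-layer deltas, last hidden layer first.
    """
    deltas = []
    delta = delta_out
    for W, h in zip(reversed(Ws[1:]), reversed(hs)):
        # delta^(l) = (sum over next-layer units of delta^(l+1) * weights) * h (1 - h)
        delta = (W @ delta) * h * (1.0 - h)
        deltas.append(delta)
    return deltas
```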
Summary
Understanding what single-layer networks compute
How multi-layer networks allow feature computation
Training multi-layer networks using back-propagation of error
Reading:
Michael Nielsen, chapter 1 of Neural Networks and Deep Learning, http://neuralnetworksanddeeplearning.com/chap1.html
Chris Bishop, sections 3.1, 3.2 and chapter 4 of Neural Networks for Pattern Recognition