Aprendizagem Automática Multilayer Perceptron Ludwig Krippahl
Aprendizagem Automática Summary: Perceptron and linear discrimination; Multilayer Perceptron, nonlinear discrimination; Backpropagation and training the MLP; Regularization in MLP
Aprendizagem Automática Perceptron
Perceptron Inspired by the neuron. (Image: BruceBlaus, CC-BY, source Wikipedia)
Perceptron The action potential is "all or nothing". (Image: Chris 73, CC-BY, source Wikipedia)
Perceptron Model the neuron as a linear combination of the inputs followed by a non-linear function. (Image: Chrislb, CC-BY, source Wikipedia)
Perceptron
The original perceptron combined a linear combination of the inputs with a threshold function:

$$y = \sum_{j=1}^{d} w_j x_j + w_0 \qquad s(y) = \begin{cases} 1, & y > 0 \\ 0, & y \leq 0 \end{cases}$$

Note: we can include $w_0$ in $\mathbf{w}$ as before.

Training rule for the original perceptron:

$$w_i = w_i + \Delta w_i \qquad \Delta w_i = \eta\,(t - o)\,x_i$$

Adjust the weights slightly to correct misclassified examples; the adjustment is greater for examples with larger inputs. Problem: not differentiable. But there are alternatives.
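To make the rule concrete, here is a minimal sketch of the classic threshold perceptron trained on the OR function with the update $\Delta w_i = \eta (t - o) x_i$; the learning rate, number of epochs and data layout are illustrative choices, not from the slides.

```python
# Classic perceptron: threshold activation and the rule Δw_i = η (t - o) x_i.
# A minimal sketch; eta, epochs and the data layout are illustrative choices.
import numpy as np

def perceptron_train(X, t, eta=0.1, epochs=20):
    """Train a threshold perceptron; X has a leading column of 1s for w0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1.0 if w @ x > 0 else 0.0   # s(y): all-or-nothing output
            w += eta * (target - o) * x     # larger inputs get larger adjustments
    return w

# OR function: linearly separable, so the rule converges
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])
w = perceptron_train(X, t)
print(w, [1.0 if w @ x > 0 else 0.0 for x in X])
```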
Perceptron
To use neurons in multiple layers, we need an activation function with a continuous derivative (e.g. the sigmoid):

$$s(y) = \frac{1}{1 + e^{-y}} = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}$$
Perceptron Strictly speaking, the perceptron is the original, single-layer network model by Frank Rosenblatt (1957), with threshold activation. However, the term is often also used for neurons with continuous threshold functions, such as in the multilayer perceptron. In this course, we will not worry about the name...
Aprendizagem Automática One Neuron
Single Neuron Can classify linearly separable sets, such as the OR function
Single Neuron
Training a single neuron: minimize the quadratic error between class and output,

$$E = \frac{1}{2} \sum_{j=1}^{N} \left( t^j - s^j \right)^2$$

Like the perceptron, present each example and adjust the weights. For a single example $j$:

$$E^j = \frac{1}{2}\left( t^j - s^j \right)^2$$
Single Neuron
Training a single neuron. Gradient of the error w.r.t. $w$:

$$\frac{\partial E^j}{\partial w_i} = \frac{\partial E^j}{\partial s^j}\,\frac{\partial s^j}{\partial net^j}\,\frac{\partial net^j}{\partial w_i}$$

$$E^j = \frac{1}{2}\left( t^j - s^j \right)^2 \;\Rightarrow\; \frac{\partial E^j}{\partial s^j} = -\left( t^j - s^j \right)$$

$$s^j = \frac{1}{1 + e^{-net^j}} \;\Rightarrow\; \frac{\partial s^j}{\partial net^j} = s^j\left(1 - s^j\right)$$

$$net^j = w_0 + \sum_{i=1}^{M} w_i x_i \;\Rightarrow\; \frac{\partial net^j}{\partial w_i} = x_i^j$$

Update rule ($\eta$ generally between 0.1 and 0.5):

$$\Delta w_i^j = -\eta\,\frac{\partial E^j}{\partial w_i} = \eta\,\left( t^j - s^j \right) s^j\left(1 - s^j\right) x_i^j$$
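As a sketch of how this update is applied in practice, the code below trains a single sigmoid neuron on the OR function, presenting one example at a time; the learning rate, epoch count and initialization scale are illustrative assumptions.

```python
# Single sigmoid neuron trained with Δw_i = η (t - s) s (1 - s) x_i.
# Minimal sketch; eta, epochs and the initialization scale are illustrative.
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def sgd_step(w, x, t, eta=0.5):
    """One online update; x includes a leading 1 so w[0] acts as w0."""
    s = sigmoid(w @ x)                       # neuron output
    return w + eta * (t - s) * s * (1 - s) * x

# OR function, online training in random order
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=3)           # start close to zero
for epoch in range(2000):
    for i in rng.permutation(len(X)):
        w = sgd_step(w, X[i], t[i])
print(np.round(sigmoid(X @ w), 2))           # outputs approach the targets
```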
Single Neuron
Intuitive explanation:

$$\Delta w_i^j = -\eta\,\frac{\partial E^j}{\partial w_i} = \eta\,\left( t^j - s^j \right) s^j\left(1 - s^j\right) x_i^j$$

The update is proportional to the error $\left( t^j - s^j \right)$, to the sensitivity of the output $s^j\left(1 - s^j\right)$, and to the input $x_i^j$.
Single Neuron Stochastic Gradient Descent Online learning: one step per example, in random order
Single Neuron Stochastic Gradient Descent Batch training: accumulate $\Delta w_i^j$ over the batch, then update.
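For contrast with the online version, here is a minimal sketch of one batch update, in which the per-example $\Delta w$ are summed over the whole batch and the weights are changed once; the function name and learning rate are illustrative.

```python
# One batch update for a single sigmoid neuron: accumulate Δw over the batch,
# then apply it in a single step. Minimal sketch with an illustrative eta.
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def batch_step(w, X, t, eta=0.5):
    s = sigmoid(X @ w)                           # outputs for the whole batch
    delta_w = eta * ((t - s) * s * (1 - s)) @ X  # sum of per-example updates
    return w + delta_w
```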
Single Neuron OR: error per epoch, classifier.
Single Neuron Cannot classify non-linearly separable sets (e.g. the XOR function)
Single Neuron XOR: error per epoch, classifier.
Aprendizagem Automática Multilayer Perceptron
Multilayer Perceptron We can solve the XOR problem with a hidden layer
Multilayer Perceptron Fully connected, feed-forward. Input layer: $x_1$, $x_2$. Hidden layer: $H_1$, $H_2$. Output layer: $O_1$ (all neurons sigmoidal).
Multilayer Perceptron
Training a Multilayer Perceptron. Output neuron $n$ of layer $k$ receives input from neuron $m$ of layer $i$ through weight $w_{mkn}$. Same as the single neuron, but using the output of the previous layer instead of $x$:

$$\Delta w_{mkn}^j = -\eta\,\frac{\partial E^j}{\partial s_{kn}^j}\,\frac{\partial s_{kn}^j}{\partial net_{kn}^j}\,\frac{\partial net_{kn}^j}{\partial w_{mkn}} = \eta\,\left( t^j - s_{kn}^j \right) s_{kn}^j\left(1 - s_{kn}^j\right) s_{im}^j = \eta\,\delta_{kn}\,s_{im}^j$$

Compute $\delta$ for each neuron:

$$\delta_{kn} = \left( t^j - s_{kn}^j \right) s_{kn}^j\left(1 - s_{kn}^j\right)$$
Multilayer Perceptron
For a weight $w_{min}$ on hidden layer $i$, we must propagate the output error backwards from all the neurons ahead. Recall the gradient of the error w.r.t. a weight of an output neuron:

$$\frac{\partial E^j}{\partial s_{kn}^j}\,\frac{\partial s_{kn}^j}{\partial net_{kn}^j}\,\frac{\partial net_{kn}^j}{\partial w_{mkn}}$$

Propagate back the errors of all forward neurons (and compute $\delta_{in}$):

$$\Delta w_{min}^j = -\eta \sum_p \left( \frac{\partial E^j}{\partial s_{kp}^j}\,\frac{\partial s_{kp}^j}{\partial net_{kp}^j}\,\frac{\partial net_{kp}^j}{\partial s_{in}^j} \right) \frac{\partial s_{in}^j}{\partial net_{in}^j}\,\frac{\partial net_{in}^j}{\partial w_{min}} = \eta \left( \sum_p \delta_{kp}\,w_{nkp} \right) s_{in}^j\left(1 - s_{in}^j\right) x_m^j = \eta\,\delta_{in}\,x_m^j$$
Multilayer Perceptron
Intuitive explanation:

$$\Delta w_{min}^j = \eta \left( \sum_p \delta_{kp}\,w_{nkp} \right) s_{in}^j\left(1 - s_{in}^j\right) x_m^j = \eta\,\delta_{in}\,x_m^j$$

The error signal reaching a hidden neuron is the sum of the $\delta$ of the neurons it feeds, weighted by the connecting weights; the rest of the update is the same as for a single neuron.
Multilayer Perceptron
Backpropagation Algorithm
Propagate the input forward through all layers:

$$s(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}$$

For output neurons compute

$$\delta_k = s_k\left(1 - s_k\right)\left(t - s_k\right)$$

Backpropagate the errors to the previous layers to compute all

$$\delta_i = s_i\left(1 - s_i\right) \sum_p \delta_p\,w_{pk}$$

Note: $w_{pk}$ are the weights of the "front" neurons connecting to neuron $i$.
Update the weights ($x$ is the $s$ of the layer behind, for forward layers):

$$\Delta w_{ki} = \eta\,\delta_i\,x_{ki}$$
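Putting the three steps together, here is a minimal sketch of backpropagation for a 2-2-1 network trained on XOR with online updates; the layer sizes, learning rate, initialization scale and number of epochs are illustrative assumptions, not taken from the slides.

```python
# Backpropagation sketch: sigmoid units, squared error, online updates on XOR.
# Layer sizes, eta, initialization scale and epochs are illustrative choices.
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(1)
W_h = rng.normal(scale=0.5, size=(2, 3))   # hidden layer: 2 neurons, 2 inputs + bias
W_o = rng.normal(scale=0.5, size=(1, 3))   # output layer: 1 neuron, 2 hidden + bias
eta = 0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])         # XOR targets

for epoch in range(10000):
    for i in rng.permutation(len(X)):
        # 1. Propagate the input forward through all layers
        x = np.append(1.0, X[i])           # bias + inputs
        h = sigmoid(W_h @ x)               # hidden outputs
        hb = np.append(1.0, h)             # bias + hidden outputs
        o = sigmoid(W_o @ hb)              # network output

        # 2. Deltas: output neurons first, then backpropagate through W_o
        delta_o = o * (1 - o) * (t[i] - o)
        delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)

        # 3. Update weights: eta * delta times the input to each layer
        W_o += eta * np.outer(delta_o, hb)
        W_h += eta * np.outer(delta_h, x)

for x in X:                                 # the trained network solves XOR
    h = sigmoid(W_h @ np.append(1.0, x))
    print(x, np.round(sigmoid(W_o @ np.append(1.0, h)), 2))
```

Different random seeds may need a different number of epochs to converge, as noted below for MLP training in practice.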
Multilayer Perceptron XOR problem, one hidden layer
Multilayer Perceptron XOR: error per epoch, classifier.
Multilayer Perceptron Hidden layer maps inputs to a linearly separable representation. (Image shows output of hidden layer)
Multilayer Perceptron Autoassociator
We can use the "re-encoding" performed by the hidden layer to reduce the dimensionality of the data. Mitchell's binary encoder: 8 inputs, 3 hidden neurons, 8 output neurons; patterns 10000000 ... 00000001. Training: backpropagation, forcing the output to be the same as the input.
Multilayer Perceptron Autoassociator
Multilayer Perceptron MLP, in practice
Start with random values close to zero (the sigmoidal function saturates rapidly away from zero).
Since weight initialization and the order of the examples are random, expect different runs to converge at different epochs.
Standardize or normalize the inputs (keep the scaling factors for classification): since there is only one $\eta$ for all dimensions, all values should be on the same scale, and this also avoids saturating the sigmoidal function.
The logistic function outputs values in [0, 1], so use these as class values. For multiple classes, use multiple output neurons.
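A small sketch of standardizing the inputs while keeping the scaling factors, so that examples presented later for classification are transformed with the same factors; the function names are illustrative.

```python
# Standardize the training inputs and keep the factors for later use.
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training set."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def standardize(X, mean, std):
    """Apply the *training* factors to any data (assumes non-constant features)."""
    return (X - mean) / std
```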
Multilayer Perceptron MLP, in practice
Stochastic Gradient Descent: present the training examples in random order; one pass through all examples is one epoch. Evaluate the error, $(t - s)^2$, on the training set or by cross-validation. Repeat until convergence, or stop before overfitting (cross-validation).
Multilayer Perceptron Regularization with early stopping. E.g. 5-fold cross-validation, stop around 40 epochs.
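A minimal sketch of early stopping with a held-out validation set, reusing the single sigmoid neuron from before: train epoch by epoch and keep the weights from the epoch with the lowest validation error. The toy data, split and epoch limit are illustrative assumptions.

```python
# Early stopping: track the validation error per epoch, keep the best weights.
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(3)
X = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 2))])  # bias + 2 features
t = (X[:, 1] + X[:, 2] > 0).astype(float)                    # toy linear labels
X_tr, t_tr, X_va, t_va = X[:30], t[:30], X[30:], t[30:]      # train / validation

w = rng.normal(scale=0.01, size=3)
best_err, best_w = np.inf, w.copy()
for epoch in range(200):
    for i in rng.permutation(len(X_tr)):                     # one epoch of SGD
        s = sigmoid(w @ X_tr[i])
        w = w + 0.5 * (t_tr[i] - s) * s * (1 - s) * X_tr[i]
    err = np.mean((t_va - sigmoid(X_va @ w)) ** 2)           # validation error
    if err < best_err:                                       # keep the best epoch
        best_err, best_w = err, w.copy()
# best_w holds the weights at the epoch with the lowest validation error
```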
Multilayer Perceptron Regularization by weight decay: decrease the weight values by a small fraction.

$$\Delta w_j = -\eta\,\frac{\partial E}{\partial w_j} - \lambda\,w_j$$

Useless parameters will tend to zero.
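As a one-function sketch, the weight-decay update can be written as the usual gradient step plus a term that shrinks each weight by a small fraction; grad_e stands for $\partial E / \partial w_j$, and the names and default values are illustrative.

```python
# Weight decay: Δw = -η ∂E/∂w - λ w, so unused weights are pulled towards zero.
def weight_decay_step(w, grad_e, eta=0.5, lam=1e-4):
    """w and grad_e can be scalars or numpy arrays (elementwise update)."""
    return w - eta * grad_e - lam * w
```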
Aprendizagem Automática Summary
Multilayer Perceptron Summary
Perceptron, classical and with a continuous output function; stochastic gradient descent (squared error); one layer vs. several layers (multilayer perceptron); training with backpropagation; regularization: early stopping and weight decay.
Further reading: Mitchell, Chapter 4; Alpaydin, Chapter 11 (note: linear output); Marsland, Chapter 3; (Bishop, Chapter 5).