Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 2012 Engineering Part IIB: Module 4F10

Introduction In the next 2 lectures we ll look at Multi-Layer Perceptrons (MLPs) which are more powerful than Single-Layer models. It will be seen that MLPs can be trained as a discriminative model to yield class posteriors. MLPs are a type of Artificial Neural Network: the computation is performed using a set of (many) simple units with weighted connections between them. Also there are learning algorithms to set the values of the weights and the same basic structure (with different weight values) is able to perform many tasks. In this and the following lecture we will consider Overall structure of multi-layer perceptrons Decision boundaries that they can form Training Criteria Networks as posterior probability estimators Basic Error back-propagation training algorithm Improved training methods Engineering Part IIB: Module 4F10 1

Application of MLPs Application of multi-layer perceptron neural networks to vision problems Application of Multilayer Perceptron Network for Tagging Parts-of-Speech An application of the multilayer perceptron: Solar radiation maps in Spain Application of multilayer perceptron networks to laser diode nonlinearity determination for radio-over-fibre mobile communications A Neural Network Based System for Intrusion Detection and Classification of Attacks Water quality sensing using multi-layer percepton artificial neural networks Evaluation of a multilayer perceptron based model to forecast O3 and NO2 hourly levels Many applications in many areas! Recently much interest in various areas of speech recognition from Deep Neural Networks: MLPs with many hidden layers. Engineering Part IIB: Module 4F10 2

Multi-Layer Perceptron A multi-layer perceptron is needed to handle the XOR problem. More generally multi-layer perceptrons allow a neural network to perform arbitrary mappings. A 2-hidden layer neural network is shown. The aim is to map an input vector x into an output y(x). The layers may be described as: Inputs x 1 x 2 y (x) 1 y (x) 2 Outputs Input layer: accepts the data vector or pattern; x d First layer Second Output layer layer y (x) K Hidden layers: one or more layers. They accept the output from the previous layer, weight them, and pass through a, normally, non-linear activation function. Output layer: takes the output from the final hidden layer weights them, and possibly pass through an output non-linearity, to produce the target values. Engineering Part IIB: Module 4F10 3

Possible Decision Boundaries The decision boundaries that may be produced varies with network topology. Here only threshold (see the single layer perceptron) activation functions are used. There are three situations to consider Single layer: able to position a hyperplane in input space (the SLP). Two layers (one hidden layer): this is able to describe a decision boundary which surrounds a single convex region of the input space. (1) (2) (3) Three layers (two hidden layers): able to generate arbitrary decision boundaries Note: any decision boundary can be approximated arbitrarily closely by a two layer network with sigmoidal activation functions if enough hidden units. Engineering Part IIB: Module 4F10 4

Number of Hidden Units The number of hidden layers determines the decision boundaries that can be generated. Choosing the number of layers the following considerations are made. Multi-layer networks are harder to train than single layer networks. Two layer networks (one hidden) with sigmoidal activation functions can model any decision boundary. Two layer networks are commonly used in pattern recognition ( hidden layer with sigmoidal activation functions), although increasing interest in deep networks. How many units to have in each layer? Number of output units is often determined by number of output classes. The number of inputs is determined by the number of input dimensions The number of hidden units is a design issue. The problems are: too few, the network will not model complex decision boundaries; too many, the network will have poor generalisation. Engineering Part IIB: Module 4F10 5

Hidden Layer Perceptron The hidden and output layer perceptron is a generalisation of the SLP. The weighted input is passed to a general activation function, rather than a threshold. Consider a single perceptron. Assume that there are n units at the previous level (k 1). x x 1 2 1 w i1 w i2 w i0 Σ z i Activation function i y (x ) x j = y j (x (k 1) ) The output from the perceptron, y i (x) may be written as y i (x ) = φ(z i ) = φ(w ix +w i0 ) = φ( where φ() is the activation function. x n w in n j=1 w ij x j +w i0 ) Note: activation function is not necessarily non-linear. However, if linear activation functions are used throughout much of the power of the network is lost. Engineering Part IIB: Module 4F10 6

Activation Functions A variety of non-linear activation functions that may be used. General form: y i (x) = φ(w ix+w i ) = φ(z i ) and there are n units, perceptrons, for the current level. Heaviside (or step) function: y i (x) = { 0, zi < 0 1, z i 0 Used in threshold units, output is binary (gives a discriminant function). Sigmoid (or logistic regression) function: y i (x) = 1 1+exp( z i ) The output is continuous, 0 y i (x) 1. Engineering Part IIB: Module 4F10 7

Softmax (or normalised exponential or generalised logistic) function: y i (x) = exp(z i ) n j=1 exp(z j) The output is positive and the sum of all the outputs at the current level is 1, 0 y i (x) 1. Hyperbolic tan (or tanh) function: y i (x) = exp(z i) exp( z i ) exp(z i )+exp( z i ) The output is continuous, 1 y i (x) 1. Engineering Part IIB: Module 4F10 8

Activation Functions (cont) 1 0.8 Graph shows: step activation function (green) sigmoid activation function (red) tanh activation function (blue) 0.6 0.4 0.2 0 0.2 0.4 0.6 0.8 1 5 4 3 2 1 0 1 2 3 4 5 Softmax often used at output layer as sum-to-one constraint is enforced (for e.g. probability over output classes). Sigmoid can be used as an approximation to a 0/1 threshold Linear output units can be used for regression problems. Engineering Part IIB: Module 4F10 9

Notation Used Consider a multi-layer perceptron with: d-dimensional input data; L hidden layers (L+1 layer including the output layer); N units in the k th level; K-dimensional output. x is the input to the k th layer x is the extended input to the k th layer x = [ 1 x ] x x x 1 2 N (k 1) 1 w i1 w i2 w (k 1) in w i0 Σ z i Activation function y i W is the weight matrix of the k th layer. By definition this is a N N (k 1) matrix. Engineering Part IIB: Module 4F10 10

Notation (cont) W is the weight matrix including the bias weight of[ the k th layer. ] By definition this is a N (N (k 1) +1) matrix. W = w 0 W z is the N -dimensional vector defined as z = y is the output from the k th layer, so y j = φ(z j ) W x All hidden layer activation functions assumed same φ(). Initially also assume that output activation function is also φ(). These matrix notation feed forward equations may then used for an MLP with input x and output y(x). x (1) = x x = y (k 1) z = W x y = φ(z ) y(x) = y (L+1) where 1 k L+1. The target values for the training of the networks will be denoted as t(x) for training example x. Engineering Part IIB: Module 4F10 11

Training Criteria A variety of training criteria may be used. Assuming we have supervised training examples {{x 1,t 1 }...,{x n,t n }} Some standard examples are: Least squares error: one of the most common training criteria. E = 1 2 n y(x p ) t p ) 2 = 1 2 p=1 n p=1 K (y i (x p ) t pi ) 2 i=1 This may be derived from considering the targets as being corrupted by zero-mean Gaussian distributed noise. Engineering Part IIB: Module 4F10 12

Cross-Entropy for two classes: consider the case when t is binary (and softmax output). The expression is E = n (t p log(y(x p ))+(1 t p )log(1 y(x p ))) p=1 This expression goes to zero with the perfect mapping - compare logistic regression training. Cross-Entropy for multiple classes: the above expression becomes (again softmax output) E = n p=1 K t pi log(y i (x p )) i=1 The minimum value is now non-zero, it represents the entropy of the target values. Engineering Part IIB: Module 4F10 13

Posterior Probabilities Consider the multi-class classification training problem with d-dimensional feature vectors: x; K-dimensional output from network: y(x) & targets t. Want output of the network, y(x), to approximate the posterior distribution of the set of K classes. So y i (x) P(ω i x) Consider training a network with: { 1 if x ωi 1-out-ofK coding, i.e. t i = 0 if x ω i The network will act as a d-dimensional to K-dimensional mapping. Can we interpret the output of the network? A couple of assumptions will be made errors from each output independent (not true of soft-max) error is a function of the magnitude difference of output and target E = K i=1 f( y i(x) t i ) Engineering Part IIB: Module 4F10 14

Posterior Probabilities (cont) In the limit of infinite training data, E{E} = K f( y i (x) t i )P(t x)p(x)dx i=1 t As we are using the 1-out-of-K coding P(t x) = K i=1 where δ ij = This yields K δ(t i δ ij )P(ω j x) j=1 { 1, (i = j) 0, otherwise E{E} = K i=1 (f(1 y i (x))p(ω i x)+f(y i (x))(1 P(ω i x)))p(x)dx Engineering Part IIB: Module 4F10 15

Differentiating and equating to zero, for least squared, cross-entropy solutions are y i (x) = P(ω i x) - this is a discriminative model. Some limitations exist for this to be valid: an infinite amount of training data, or knowledge of correct joint distribution for p(x,t); the topology of the network is complex enough that final error is small; good training algorithm for the network - it finds global optimum Engineering Part IIB: Module 4F10 16

Compensating for Different Priors Bayes law is used for the posterior probability for generative models. P(ω j x) = p(x ω whereclasspriors, P(ω j ), andclassconditional j)p(ω j ) densities, p(x ω j ), are estimated separately. p(x) For some tasks two use different training data sets (for example for speech recognition, the language model and the acoustic model). How can this difference in priors from the training and the test conditions be built into the MLP framework where the posterior probability is directly calculated? (This is applicable to all discriminative models) Again using Bayes law p(x ω j ) P(ω j x) P(ω j ) Thus if posterior is divided by the training data prior a value proportional to the class-conditional probability can be obtained. The standard form of Bayes rule may now be used with the appropriate, different, prior. Engineering Part IIB: Module 4F10 17

Error Back Propagation So far the training of MLPs has not been discussed. Originally interest in MLPs was limited due to issues in training. These problems were partly addressed with the development of the error back propagation algorithm. Error back propagation algorithm is based on gradient descent. Hence x 1 y 1(x) the activation function must be differentiable: need to be able to compute the derivative of the error function with respect to the weights of all layers. Inputs x 2 x d Hidden layer y (x) 2 y K(x) Output layer Outputs All gradients are evaluated at the current model parameters. Engineering Part IIB: Module 4F10 18

Output Layer or SLP Want to minimise e.g. the square error between the target of the output, t p, and the current output value y(x p ). Assume training criterion is least squares and the activation function is a sigmoid function. The cost function may be written as E = 1 2 n (y(x p ) t p ) (y(x p ) t p )) = p=1 n p=1 E (p) To simplify notation, we will only consider a single observation x with associated target values t and current output from the network y(x). The error with this single observation is denoted E. How does the error change as y(x) changes? But we are not interested in y(x) E y(x) = y(x) t How do we find the effect of varying the weights? Engineering Part IIB: Module 4F10 19