Multilayer Perceptron = Feedforward Neural Network
History
Definition
Classification = feedforward operation
Learning = backpropagation = local optimization in the space of weights
Figures in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
History

The perceptron (Rosenblatt, 1956), with its simple learning algorithm, generated a lot of excitement.
Minsky & Papert (1969) showed that even the simple XOR function cannot be learned by a perceptron, which led to skepticism about their utility.
The problem was solved by networks of perceptrons (multilayer perceptrons, feedforward neural networks).
A three-layer perceptron network implementing the XOR function
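The XOR network on this slide can be sketched in code. This is a minimal illustration, not the book's figure: the weights below are hand-picked (one hidden unit acting as OR, one as AND), and a 0/1 step unit is used in place of the ±1 sign unit for readability.

```python
def step(net):
    """Threshold unit: fires (1) when net activation is non-negative."""
    return 1 if net >= 0 else 0

def xor_net(x1, x2):
    # hidden unit 1 computes OR, hidden unit 2 computes AND (hand-set weights)
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # output unit fires when OR is on but AND is off: exactly XOR
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))  # prints the XOR truth table
```

No single-layer perceptron can realize this function; the hidden layer is what makes it possible.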
Multilayer Perceptron, terminology

Net activation:
$net_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} \equiv w_j^t x$,
where the subscript $i$ indexes units in the input layer, $j$ in the hidden layer; $w_{ji}$ denotes the input-to-hidden layer weight at hidden unit $j$. A single bias unit is connected to each unit other than the input units.
Each hidden unit emits an output that is a nonlinear function of its activation, that is: $y_j = f(net_j)$.
Each output unit similarly computes its net activation based on the hidden unit signals as:
$net_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} \equiv w_k^t y$,
where the subscript $k$ indexes units in the output layer and $n_H$ denotes the number of hidden units.
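The two net activations above can be sketched as a vectorized forward pass. The shapes and the bias-in-column-0 convention are choices made for this sketch (matching the $i = 0$ / $j = 0$ bias terms in the sums), and tanh is one possible choice of $f$:

```python
import numpy as np

def forward(x, W_hidden, W_out, f=np.tanh):
    """One forward pass. Assumed shapes: W_hidden is (n_H, d+1),
    W_out is (c, n_H+1); column 0 of each holds the bias weights."""
    x_aug = np.concatenate(([1.0], x))   # prepend bias input x_0 = 1
    net_j = W_hidden @ x_aug             # hidden net activations net_j
    y = f(net_j)                         # hidden outputs y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))   # prepend bias unit y_0 = 1
    net_k = W_out @ y_aug                # output net activations net_k
    z = f(net_k)                         # network outputs z_k
    return z, y

rng = np.random.default_rng(0)
z, y = forward(rng.normal(size=3),            # d = 3 inputs
               rng.normal(size=(4, 4)),       # n_H = 4 hidden units
               rng.normal(size=(2, 5)))       # c = 2 outputs
print(z.shape, y.shape)
```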
The function $f(\cdot)$, the activation function, introduces nonlinearity at each unit. The first activation function considered was the sign function (as in the perceptron):
$f(net) = \operatorname{sgn}(net) = \begin{cases} 1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}$
Unfortunately, a network with such an $f$ is difficult to train, since the sign function is not differentiable.
Multilayer Perceptron = FF Neural net

Is a function of the following form:
$g_k(x) \equiv z_k = f\left( \sum_{j=1}^{n_H} w_{kj}\, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \quad k = 1, \dots, c \qquad (1)$
Popularized in 1986, backpropagation learning became possible for $f$ continuous and differentiable.
We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit.
We assume for now that all activation functions are identical.
The vector of outputs is denoted $z$: $z_k = f(net_k)$.
In the case of $c$ outputs (classes), we can view the network as computing $c$ discriminant functions $z_k = g_k(x)$ and classify the input $x$ according to the largest discriminant function $g_k(x)$, $k = 1, \dots, c$.
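The classification rule above is simply an argmax over the $c$ network outputs; a minimal sketch:

```python
import numpy as np

def classify(z):
    """Assign x to the class k with the largest discriminant g_k(x) = z_k."""
    return int(np.argmax(z))

print(classify(np.array([0.1, 0.7, 0.2])))  # -> 1 (second class wins)
```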
Expressive Power of multilayer Networks

Question: Can every decision boundary be implemented by a three-layer network described by equation (1)?
Answer: Yes (due to A. Kolmogorov). Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units $n_H$, proper nonlinearities, and weights:
$g(x) = \sum_{j=1}^{2n+1} \delta_j\left( \sum_{i=1}^{n} \beta_{ij}(x_i) \right), \quad x \in I^n \; (I = [0,1];\; n \ge 2)$
for properly chosen functions $\delta_j$ and $\beta_{ij}$.
Expressive Power of multilayer Networks

Each of the $2n+1$ hidden units $\delta_j$ takes as input a sum of $d$ nonlinear functions, one for each input feature $x_i$.
Each hidden unit emits a nonlinear function $\delta_j$ of its total input.
Unfortunately: Kolmogorov's theorem is not constructive; it tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition.
Remarkably: Kolmogorov's theorem proves that a continuous function in dimension $n$ can be represented as a function of a sufficient number of one-dimensional functions.
Modes of operation of MLP

The network has two modes of operation:
Classification = Feedforward operation
The feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to get outputs at the output units (no cycles!).
Learning
The supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output.
Learning: Backpropagation Cost function

Backpropagation is a gradient descent method. The cost function is, for a single training pattern:
$J(w) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \|t - z\|^2$
where $t_k$ is the $k$-th target (or desired) output, $z_k$ is the $k$-th computed output, $k = 1, \dots, c$, and $w$ represents all the weights of the network.
Class $k$ is represented as the vector $(0, 0, \dots, 1, \dots, 0)$, where the 1 is in the $k$-th position. For a pure output of the neural net, $z = (0, \dots, 1, \dots, 0)$, $J(w)$ is 0 (correct class) or 1 (incorrect class).
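The cost function is direct to compute; the example below also checks the 0-or-1 claim for pure outputs:

```python
import numpy as np

def J(t, z):
    """Quadratic cost for one pattern: J = 0.5 * ||t - z||^2."""
    return 0.5 * np.sum((t - z) ** 2)

t = np.array([0.0, 1.0, 0.0])            # class 2 encoded as a target vector
print(J(t, np.array([0.0, 1.0, 0.0])))   # pure, correct output -> 0.0
print(J(t, np.array([1.0, 0.0, 0.0])))   # pure, wrong output   -> 1.0
```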
Learning: Update

In gradient descent, the update of the weights is in the direction of the negative gradient:
$\Delta w = -\eta \frac{\partial J}{\partial w}$
The step size $\eta$ is found e.g. by line search (a search for a minimum in 1D).
$w^{t+1} = w^t + \Delta w$
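A minimal sketch of the update rule on a 1D toy cost, with a fixed step size instead of a line search (the function name and toy cost are illustrative, not from the slides):

```python
def grad_descent(dJ, w0, eta=0.1, steps=100):
    """Iterate w <- w + dw with dw = -eta * dJ/dw.
    A line search along -dJ/dw would pick eta per step; here it is fixed."""
    w = w0
    for _ in range(steps):
        w = w - eta * dJ(w)
    return w

# minimize J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_min = grad_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_min, 4))  # -> 3.0
```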
Learning: Calculating the gradient

Error on the hidden-to-output weights:
$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}}$
where the sensitivity of unit $k$ is defined as:
$\delta_k = -\frac{\partial J}{\partial net_k}$
and describes how the overall error changes with the unit's net activation $net_k$:
$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)$
Learning: Calculating the gradient

Since $net_k = w_k^t y$, therefore:
$\frac{\partial net_k}{\partial w_{kj}} = y_j$
Conclusion: the weight update (or learning rule) for the hidden-to-output weights is:
$\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k)\, f'(net_k)\, y_j$
Error on the input-to-hidden weights:
$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}$
Learning: Calculating the gradient

$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, f'(net_k)\, w_{kj}$
Similarly to the preceding case, we define the sensitivity for a hidden unit:
$\delta_j \equiv f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k$
The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights $w_{kj}$, all multiplied by $f'(net_j)$.
The learning rule for the input-to-hidden weights is:
$\Delta w_{ji} = \eta x_i \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(net_j)\, x_i$
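Both learning rules can be sketched as one update step for a single pattern. The tanh activation, the bias-in-column-0 weight layout, and the function name are choices made for this sketch:

```python
import numpy as np

def backprop_step(x, t, Wh, Wo, eta=0.1):
    """One backprop update for one pattern (tanh units; Wh is (n_H, d+1),
    Wo is (c, n_H+1), bias weights in column 0 -- assumed conventions).
    Returns updated copies of both weight matrices."""
    f = np.tanh
    fprime = lambda net: 1.0 - np.tanh(net) ** 2
    # forward pass
    x_aug = np.concatenate(([1.0], x))
    net_j = Wh @ x_aug
    y_aug = np.concatenate(([1.0], f(net_j)))
    net_k = Wo @ y_aug
    z = f(net_k)
    # output sensitivities: delta_k = (t_k - z_k) f'(net_k)
    delta_k = (t - z) * fprime(net_k)
    # hidden sensitivities: delta_j = f'(net_j) * sum_k w_kj delta_k
    # (Wo[:, 1:] drops the bias column, which does not feed back)
    delta_j = fprime(net_j) * (Wo[:, 1:].T @ delta_k)
    # learning rules: dw_kj = eta delta_k y_j,  dw_ji = eta delta_j x_i
    return Wh + eta * np.outer(delta_j, x_aug), Wo + eta * np.outer(delta_k, y_aug)
```

For a small enough step size, one such update reduces $J$ on the presented pattern.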
Learning: Calculating the gradient

Calculating the gradient for the full training set is a sum of gradients, since the cost function is additive:
$J = \sum_{p=1}^{n} J_p$
Stopping criterion, e.g.:
- the algorithm terminates when the change in the criterion function $J(w)$ is smaller than some preset value $\theta$;
- when the cross-validation error goes up.
Learning: Stochastic backpropagation

Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as shown in the box. The evaluation of $J$ is stochastic to speed up the process.

Begin  initialize $n_H$; $w$, criterion $\theta$, $\eta$, $m \leftarrow 0$
  do  $m \leftarrow m + 1$
      $x^m \leftarrow$ randomly chosen pattern
      $w_{ji} \leftarrow w_{ji} + \eta \delta_j x_i$;  $w_{kj} \leftarrow w_{kj} + \eta \delta_k y_j$
  until  $J(w) < \theta$
  return $w$
End
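The algorithm in the box can be sketched end to end on the XOR problem from the earlier slide. The hyperparameters ($n_H$, $\eta$, the step budget in place of the $\theta$ test, tanh units with targets in $\{-1, +1\}$) are illustrative guesses, and backprop can still stall in a local minimum:

```python
import numpy as np

def train_xor(n_H=4, eta=0.5, steps=5000, seed=0):
    """Stochastic backprop on XOR; returns the trained net's four outputs.
    Bias weights live in column 0 of each weight matrix (assumed layout)."""
    rng = np.random.default_rng(seed)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([-1., 1., 1., -1.])                  # XOR targets
    Wh = rng.normal(scale=0.5, size=(n_H, 3))         # input-to-hidden
    Wo = rng.normal(scale=0.5, size=(1, n_H + 1))     # hidden-to-output
    f = np.tanh
    fp = lambda n: 1 - np.tanh(n) ** 2
    for _ in range(steps):
        p = rng.integers(len(X))                      # randomly chosen pattern
        xa = np.concatenate(([1.], X[p]))
        net_j = Wh @ xa
        ya = np.concatenate(([1.], f(net_j)))
        net_k = Wo @ ya
        z = f(net_k)
        dk = (T[p] - z) * fp(net_k)                   # output sensitivity
        dj = fp(net_j) * (Wo[:, 1:].T @ dk)           # hidden sensitivity
        Wo += eta * np.outer(dk, ya)                  # w_kj <- w_kj + eta dk yj
        Wh += eta * np.outer(dj, xa)                  # w_ji <- w_ji + eta dj xi
    def predict(x):
        ya = np.concatenate(([1.], f(Wh @ np.concatenate(([1.], x)))))
        return float(f(Wo @ ya)[0])
    return [predict(x) for x in X]
```

A fixed step budget stands in for the `until J(w) < θ` test purely to keep the sketch short.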
Disadvantages of MLP

- learning by local optimization: the global minimum is not guaranteed
- time-consuming learning, with many nested loops: repeat for different numbers of hidden units; random restarts to deal with local minima; iterations of the backpropagation algorithm
- many parameters, not easy to set for an inexperienced user
- a cost function (quadratic loss) which is convenient, but is not the classification error
Advantages of MLP

- handles well problems with multiple classes
- handles naturally both classification and regression problems (estimating real values)
- after normalization, the output may be interpreted as an a posteriori probability: we know not only the class with the maximum response, but also its reliability
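The slides do not specify the normalization; one standard choice is the softmax function, sketched here as an illustration of reading the outputs as posteriors:

```python
import numpy as np

def softmax(net):
    """Map output activations to positive values summing to 1, readable as
    (approximate) posterior probabilities. Subtracting the max before exp
    is a standard numerical-stability trick and does not change the result."""
    e = np.exp(net - np.max(net))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
k = int(np.argmax(p))                # predicted class
print(k, round(float(p[k]), 3))      # -> 0 0.659: the class and its reliability
```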
Thank you