EC 6430 Pattern Recognition and Analysis
Monsoon 2011
Lecture Notes - 6

The Multi-Layer Perceptron

Single-layer networks are limited in the range of functions they can represent; multi-layer networks are capable of approximating any continuous function. A feed-forward network (one with NO feedback loops) ensures that the network outputs can be calculated as explicit functions of the inputs and the weights.

Feed-forward Network Functions

The idea is to construct networks having successive layers of processing units, with connections running from every unit in one layer to every unit in the next layer; no other connections are permitted. Figure 1 shows an example of a two-layer network. Units which are not treated as output units are called hidden units. There are D inputs (excluding the bias), M hidden units (again excluding the bias given at this layer) and K output units.
Figure 1: Two-layer network

The output of the j-th hidden unit is obtained by

    a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}        (1)

where w_{ji}^{(1)} denotes a weight in the first layer (superscript), going from the i-th input to the j-th hidden unit. We can set x_0 = 1 (as in single-layer networks) and rewrite the above equation as

    a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i        (2)

We can define an activation function for the outputs of the hidden units,
    z_j = g(a_j)        (3)

The outputs of the hidden units are transformed by a second layer of processing elements,

    b_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}        (4)

where k = 1, ..., K. The k-th output is now obtained by defining an activation function,

    y_k = g(b_k)        (5)

Heaviside, or step, activation function:

    g(a) = 0 when a < 0,    g(a) = 1 when a >= 0        (6)

Example: binary inputs with x_i = 0 or 1. The network output is also 0 or 1, so the network is essentially a Boolean function. We can implement any Boolean function, provided we have a large enough number of hidden units. For D inputs there are 2^D possible binary patterns. We take one hidden unit for each input pattern which has an output target of 1, so each pattern producing a 1 has a hidden unit associated with it, and each hidden unit responds to just the corresponding pattern. For that pattern, the weight from an input to the given hidden unit is set to +1 if the pattern has a 1 for that input, and set to -1 if the pattern has a 0 for that input.
The bias for the hidden unit is set to 1 - b, where b is the number of non-zero inputs for that pattern (the bias is pattern dependent). Each hidden unit is connected to the output with a weight +1, and the output bias is set to -1.

Assignment 7a: Solve the 2-bit XOR problem
1. Using a single-layer perceptron.
2. Using the two-layer network procedure described above.

Sigmoidal functions

The logistic sigmoid function is given by

    g(a) = \frac{1}{1 + \exp(-a)}        (7)

We also use a 'tanh' activation function given by

    g(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}        (8)

The two functions are related by a linear transformation (tanh(a) = 2/(1 + exp(-2a)) - 1), so the choice does not really affect the end result; 'tanh' typically gives faster convergence during training.

Back-propagation algorithm

A back-propagation network learns by example. You give the algorithm examples of what you want the network to do, and it changes the network's weights so that, when training is finished, the network gives the required output for a particular input. To train the network you need to give it examples of the output you want (the target) for a particular input. Consider the input-target set in figure 2.
Figure 2: Back-propagation input

This is given to a two-layer MLP as shown in figure 3.

Figure 3: Two-layer MLP

The network is first initialised by setting all its weights to small random numbers, say between -1 and +1. Next, an input pattern is applied and the output calculated (this is called the forward pass). Then calculate the error.
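As an illustration (my own sketch, not part of the original notes), the initialisation and forward pass can be written in NumPy. The layer sizes D, M, K and the convention of storing each layer's biases in column 0 of its weight matrix are placeholder choices:

```python
import numpy as np

def sigmoid(a):
    """Logistic activation, equation (7)."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Forward pass through a two-layer network, equations (1)-(5)."""
    x_t = np.concatenate(([1.0], x))   # x_0 = 1 absorbs the first-layer bias
    a = W1 @ x_t                       # hidden activations a_j, equation (2)
    z = sigmoid(a)                     # hidden outputs z_j, equation (3)
    z_t = np.concatenate(([1.0], z))   # z_0 = 1 absorbs the second-layer bias
    b = W2 @ z_t                       # output activations b_k, equation (4)
    return sigmoid(b)                  # network outputs y_k, equation (5)

rng = np.random.default_rng(0)
D, M, K = 2, 3, 1                      # inputs, hidden units, outputs (placeholders)
W1 = rng.uniform(-1.0, 1.0, size=(M, D + 1))   # small random weights in (-1, +1)
W2 = rng.uniform(-1.0, 1.0, size=(K, M + 1))

y = forward(np.array([0.0, 1.0]), W1, W2)
```

With logistic outputs the responses y_k lie strictly between 0 and 1; a step activation as in equation (6) could be substituted for `sigmoid` to recover the Boolean network discussed earlier.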
Using the standard sum-of-squares error function,

    E^n = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2        (9)

where y_k is the response of output unit k, and t_k is the corresponding target. Now consider the connection between a hidden-layer neuron and an output neuron. We define a variable delta given by

    \delta_k = \frac{\partial E^n}{\partial b_k} = g'(b_k) \frac{\partial E^n}{\partial y_k}        (10)

For the sum-of-squares function with linear output units, we end up with

    \delta_k = y_k - t_k        (11)

    \delta_k = g(b_k) (1 - g(b_k)) (y_k - t_k),  if  g(a) = \frac{1}{1 + \exp(-a)}        (12)

Now we have to evaluate \delta_j for the input layer - hidden layer connections,

    \delta_j = g'(a_j) \sum_k w_{kj}^{(2)} \delta_k        (13)

Note here that you are using the \delta_k values from the second layer to compute the \delta_j values of the first layer. For linear hidden units, we end up with

    \delta_j = \sum_{k=1}^{K} w_{kj}^{(2)} \delta_k        (14)

    \delta_j = g(a_j) (1 - g(a_j)) \sum_{k=1}^{K} w_{kj}^{(2)} \delta_k,  if  g(a) = \frac{1}{1 + \exp(-a)}        (15)
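Equations (12) and (15) can be sketched as follows. This is a minimal illustration of my own, assuming logistic units in both layers, with the second-layer biases stored in column 0 of W2:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backward_deltas(a, b, y, t, W2):
    """Delta values for a two-layer network with logistic units.

    a : hidden activations a_j      b : output activations b_k
    y : network outputs y_k         t : targets t_k
    W2: (K, M+1) second-layer weights, biases in column 0
    """
    # Output deltas, equation (12): delta_k = g(b_k)(1 - g(b_k))(y_k - t_k)
    delta_k = sigmoid(b) * (1.0 - sigmoid(b)) * (y - t)
    # Hidden deltas, equation (15): propagate delta_k back through W2,
    # dropping the bias column, which receives no delta
    delta_j = sigmoid(a) * (1.0 - sigmoid(a)) * (W2[:, 1:].T @ delta_k)
    return delta_k, delta_j
```

For example, with a = b = 0 the logistic derivative g(1 - g) equals 0.25, so a target of 0 against an output of 0.5 gives delta_k = 0.25 * 0.5 = 0.125.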
We now use the \delta values to compute the change in the weights. There are two options available here: batch mode, where you present all inputs together and accumulate the updates, and online mode, where you present inputs one by one and update the weights for each input.

For online mode we have

    \Delta w_{ji} = -\eta \delta_j x_i        (16)

    \Delta w_{kj} = -\eta \delta_k z_j        (17)

For batch mode,

    \Delta w_{ji} = -\eta \sum_n \delta_j^n x_i^n        (18)

    \Delta w_{kj} = -\eta \sum_n \delta_k^n z_j^n        (19)

where \eta remains the learning rate parameter.

Note: if g(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}, then g'(a) = 1 - g(a)^2.
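Combining the forward pass with equations (12), (15), (16) and (17), online-mode training on the 2-bit XOR data might look like the following sketch. The seed, hidden-layer size M = 4, learning rate eta = 0.5 and epoch count are arbitrary choices of mine, not values from the notes:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# 2-bit XOR inputs and targets
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(1)
M, eta = 4, 0.5                          # hidden units and learning rate
W1 = rng.uniform(-1, 1, size=(M, 3))     # (M, D+1); biases in column 0
W2 = rng.uniform(-1, 1, size=(1, M + 1)) # (K, M+1); biases in column 0

def total_error(W1, W2):
    """Sum-of-squares error over all patterns, equation (9)."""
    E = 0.0
    for x, t in zip(X, T):
        z = sigmoid(W1 @ np.concatenate(([1.0], x)))
        y = sigmoid(W2 @ np.concatenate(([1.0], z)))
        E += 0.5 * np.sum((y - t) ** 2)
    return E

E_before = total_error(W1, W2)
for epoch in range(5000):                # online mode: update after each pattern
    for x, t in zip(X, T):
        x_t = np.concatenate(([1.0], x))
        z = sigmoid(W1 @ x_t)
        z_t = np.concatenate(([1.0], z))
        y = sigmoid(W2 @ z_t)
        delta_k = y * (1 - y) * (y - t)                  # equation (12)
        delta_j = z * (1 - z) * (W2[:, 1:].T @ delta_k)  # equation (15)
        W2 -= eta * np.outer(delta_k, z_t)               # equation (17)
        W1 -= eta * np.outer(delta_j, x_t)               # equation (16)
E_after = total_error(W1, W2)
```

In batch mode, equations (18)-(19), the updates would instead be accumulated over all n patterns before being applied.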
Figure 4: Back-propagation example

How many hidden neurons?

Broadly, set the hidden-layer configuration using just two rules:
1. the number of hidden layers equals one; and
2. the number of neurons in that layer is the mean of the number of neurons in the input and output layers.

What is happening here? We are generating decision boundaries by a quadratic discriminant of the form

    y(x) = w_2 x^2 + w_1 x + w_0        (20)

Keep in mind that we can generalize this to higher dimensions and not just limit ourselves to quadratic equations. This then leads to higher-order processing units. Remember that the order of the polynomial was a problem, and a high-order polynomial could result in over-fitting.

Growing and pruning algorithms

A network that is initially small is considered, and new units and connections are allowed to be added during training. Such techniques are called growing algorithms. The alternative arrangement is to start with a relatively large network and gradually remove either connections or complete hidden units. These are the pruning algorithms.

Assignment 7b: Analysis of the MLP
1. Fit using an MLP, with a varying number of hidden units.
(a) Define three 2-dimensional Gaussians: G1, G2 and G3 (use the MATLAB function mvnrnd).
(b) G1 has \mu_1 = [0.25, 0.3] and \Sigma_1 = [0.2, 0.25; 0.25, 0.4].
(c) G2 has \mu_2 = [0.5, 0.6] and \Sigma_2 = [0.3, 0.1; 0.1, 0.4].
(d) G3 has \mu_3 = [0.7, 0.75] and \Sigma_3 = [0.3, 0.1; 0.1, 0.4].
(e) Generate 100 data points from each of the three.
(f) Use the first 80 data points from each as your training set and the remaining 20 from each as your test set.
(g) You now have 2-dimensional inputs with 3 classes. Train an MLP network with 3 hidden units.
(h) What is the training accuracy? What is the testing accuracy?
2. How does variation in the number of hidden units affect accuracy?
(a) Vary the number of hidden units to 2 and 4.
(b) How does the accuracy value change? Is it more, or less?
(c) Vary it to 10; how does the testing accuracy change?

Assignment 7c: MLP for classification with FLD
1. Application of the MLP for classification.
(a) Define three 4-dimensional Gaussians.
(b) Generate 100 data points for each class using appropriate means and an appropriate covariance matrix.
(c) Use FLD to reduce to 3 dimensions.
(d) Classify using a 2-layer MLP with 3 hidden units.
(e) What are the training and testing accuracies?
(f) How do the accuracy values compare with the Euclidean classification you did before?

End

References
[1] www.rgu.ac.uk/files/chapter3%20-%20bp.pdf