CE213 Artificial Intelligence
Lecture 14: Neural Networks, Part 2

Learning Rules:
- Hebb Rule
- Perceptron Rule
- Delta Rule
Neural Networks Using Linear Units

[Difficulty warning: equations!]
Learning in Neural Networks

Consider a linear separation unit (LSU), or MP neuron, first: inputs x1, x2, x3 with weights w1, w2, w3, threshold θ, and output y, where

y = 1 if Σ_n w_n x_n ≥ θ
y = 0 if Σ_n w_n x_n < θ

Given a labelled sample dataset, how do we determine the values of w_i and θ so that the LSU will classify or fit the sample data correctly?
1) Analytically (usually only when n ≤ 3)
2) By learning
Learning in Neural Networks (2)

The function of an MP neuron can be constructed by an appropriate choice of weights and threshold, i.e., the decision boundary w1 x1 + w2 x2 = θ. Could we instead train a neuron or neural network to achieve the desired weights by presenting it with a sequence of examples/samples?

What information is available from the samples and the MP neuron?
- The current inputs: x1, ..., xn
- The current weights: w1, ..., wn
- The actual output: y
- The desired output: z
Learning in Neural Networks (3)

A General Solution to Learning in Neural Networks:

w_i(t+1) = w_i(t) + Δ(x, w, y, z)

where w_i(t) is the weight of the i-th input at time t, and Δ is a function that determines the change in weight.

A learning procedure would simply be:

Initialise weight values;
REPEAT
  Present next example/sample;
  Adjust all weights using Δ;
UNTIL training complete;

(There could be different completion criteria.)

[Figure: trajectory in weight space from the initial weights to the learnt weights]
https://www.youtube.com/watch?v=vgwemzhplsa
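The procedure above can be sketched in code. This is an illustrative skeleton only (the function names, the fixed-epochs completion criterion, and the linear output are assumptions, not part of the slides); Δ is left as a pluggable function so the later rules can be dropped in:

```python
# Generic weight-update loop: w_i(t+1) = w_i(t) + Delta(x, w, y, z).
# "Training complete" is simplified here to a fixed number of epochs;
# other completion criteria are possible, as the slide notes.

def train(samples, delta, n_inputs, epochs=10, w_init=0.0):
    """samples: list of (x, z) pairs; delta: function (x, w, y, z) -> weight changes."""
    w = [w_init] * n_inputs                          # initialise weight values
    for _ in range(epochs):
        for x, z in samples:                         # present next example/sample
            y = sum(wi * xi for wi, xi in zip(w, x)) # current (linear) output
            dw = delta(x, w, y, z)
            w = [wi + dwi for wi, dwi in zip(w, dw)] # adjust all weights using Delta
    return w
```

For example, plugging in the delta rule from later in the lecture as the Δ function makes `train` converge towards weights that fit the samples.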
Learning in Neural Networks (4)

What about setting the threshold? Set the threshold to zero and add another input x0, with weight w0, where x0 is always 1. Then

y = 1 if Σ_{n=0} w_n x_n ≥ 0
y = 0 if Σ_{n=0} w_n x_n < 0

This is equivalent to setting θ = -w0.
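A quick illustrative check of this threshold-as-weight trick (the helper names are hypothetical): the two formulations give identical outputs when the augmented weight vector uses w0 = -θ.

```python
# An LSU with an explicit threshold theta...
def lsu_with_threshold(x, w, theta):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# ...and the same unit with threshold zero and a constant extra input.
def lsu_with_bias(x, w):
    # x and w are augmented: x = (1, x1, ..., xn), w = (w0, w1, ..., wn),
    # where w0 = -theta plays the role of the (now learnable) threshold.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
```

With w = (1, 1) and θ = 1.5 (an AND gate), `lsu_with_bias((1,) + x, (-1.5, 1, 1))` agrees with `lsu_with_threshold(x, (1, 1), 1.5)` on all four binary inputs.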
Learning in Neural Networks (5)

So the fixed input is essentially an adaptive threshold, because

y = 1 if Σ_n w_n x_n ≥ 0
y = 0 if Σ_n w_n x_n < 0

(-w0 is equivalent to the threshold θ.)

In this way, we can arrange for the threshold to be modified in the same way as the weights. Now let's focus on how to determine Δ(x, w, y, z).
The Hebb Rule

Hebb (1949) proposed that learning could take the form of strengthening the connections between any pair of neurons that were simultaneously active. This implies a learning rule of the form (x_i may be the output of another neuron)

w_i(t+1) = w_i(t) + α x_i y

where α is a constant that determines the rate of learning.

Later researchers modified this for training MP neurons. During learning, the output of the neuron (y) is forced to the required/desired value (z). Hence the learning rule becomes

w_i(t+1) = w_i(t) + α x_i z

This is now known as the Hebb Rule. Is this supervised learning?
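A one-line sketch of the supervised form of the update (the function name and default α = 0.1 are illustrative choices):

```python
# Hebb rule, supervised form: w_i(t+1) = w_i(t) + alpha * x_i * z.
# The neuron's output is clamped to the desired value z during learning.

def hebb_update(w, x, z, alpha=0.1):
    return [wi + alpha * xi * z for wi, xi in zip(w, x)]
```

Note that the update strengthens w_i whenever input x_i and target z are simultaneously active, exactly as Hebb proposed.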
The Perceptron Rule

The Perceptron Rule (late 1950s) introduced an important new idea: IF IT AIN'T BROKE, DON'T FIX IT. More specifically: only adjust the weights when the neuron would have given the wrong answer. Thus we have the notion of a rule only used to correct errors.

The Perceptron Weight Adjustment Rule:

w_i(t+1) = w_i(t) + α x_i z    only if y ≠ z

which is, of course, the Hebb rule restricted to error correction.
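A sketch of the rule as stated on this slide (function name and default α are illustrative). One observation, as an assumption worth flagging: with 0/1 targets this Hebb-restricted form only changes the weights on errors where z = 1, since α x_i z vanishes when z = 0; with bipolar targets z ∈ {-1, +1} the same formula corrects errors in both directions.

```python
# Perceptron rule: apply the clamped Hebb update only when the
# unit's actual answer y disagrees with the target z.

def perceptron_update(w, x, y, z, alpha=0.1):
    if y == z:                  # "if it ain't broke, don't fix it"
        return list(w)
    return [wi + alpha * xi * z for wi, xi in zip(w, x)]
```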
The Delta Rule

The delta rule generalises the idea of error correction introduced in the perceptron rule. Rather than having a rule that is only applied if an error occurs, this approach includes the error as part of the rule:

w_i(t+1) = w_i(t) + α x_i (z - y)

This is the delta rule (proposed in 1960, also known as the Widrow-Hoff rule).

Why is it called the delta rule? Because the difference or error (z - y) is often written as δ.
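The corresponding update, sketched in the same style as the previous rules (names and default α are illustrative):

```python
# Delta (Widrow-Hoff) rule: w_i(t+1) = w_i(t) + alpha * x_i * (z - y).
# The error delta = z - y scales the update, so large errors cause
# large corrections and zero error causes no change at all.

def delta_update(w, x, y, z, alpha=0.1):
    delta = z - y
    return [wi + alpha * xi * delta for wi, xi in zip(w, x)]
```

Unlike the perceptron rule, no explicit "only if wrong" condition is needed: when y = z the error term is zero and the weights are left alone automatically.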
The Delta Rule (2)

Function Approximation Using the Delta Rule

Let's consider the task of training a function approximator. If we take away the threshold we have a linear sum unit:

y = Σ_n w_n x_n

The inputs and outputs could be real numbers, not just 1s and 0s. Such a unit clearly computes a hyperplane in n-dimensional space.
The Delta Rule (3)

How might this unit behave if trained? Suppose the training data can be described by z = f(x1, ..., xn). Then the weights will adapt to values such that the unit provides a linear approximation to the function f.

Thus neural units and networks can be used for function approximation as well as for classification. The two are closely related: a function approximation is a surface in n-dimensional space, and such a surface can separate two classes of points in n-dimensional space.

For function approximation, it can be shown that the delta rule is an optimal strategy for changing the weights. (A strict proof is beyond the scope of CE213.)
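As a concrete sketch of this behaviour, a linear sum unit trained with the delta rule recovers the weights of a linear target function such as z = 3·x1 - 2·x2 (the target function, α, and epoch count here are illustrative choices, not from the slides):

```python
# Fit a linear sum unit y = sum_n w_n x_n to samples of a linear
# function using repeated delta-rule updates.

def fit_linear(samples, n, alpha=0.05, epochs=200):
    w = [0.0] * n
    for _ in range(epochs):
        for x, z in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))          # linear sum unit
            w = [wi + alpha * xi * (z - y) for wi, xi in zip(w, x)]
    return w
```

Trained on noiseless samples of z = 3·x1 - 2·x2, the learnt weights approach (3, -2), i.e. the unit's hyperplane coincides with the target function's.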
Limitations of the Delta Rule

Two major limitations:
- It can only learn linear functions, or linear separation for classification.
- It is equivalent to linear regression. Standard techniques for calculating regressions are orders of magnitude faster and avoid the need to choose an appropriate learning rate parameter α. (On the other hand, the delta rule can start training neurons as soon as data is available.)

Choosing the learning rate parameter α:
- If α is too small, learning will be very slow.
- If α is too large, the final weight values will cycle around the optimum values.
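The learning-rate trade-off can be seen on a one-weight toy problem (this example and its specific α values are illustrative): a single sample x = 1, z = 1, whose optimum weight is w = 1. Each delta-rule step gives w ← w + α(1 - w).

```python
# One-weight illustration of choosing alpha for the delta rule.
# Target: w = 1 (the single training sample is x = 1, z = 1).

def run(alpha, steps=20):
    w = 0.0
    for _ in range(steps):
        y = w * 1.0
        w += alpha * 1.0 * (1.0 - y)   # delta rule update
    return w
```

With a moderate α = 0.5 the weight reaches 1 almost exactly; with α = 0.05 it is still far from the optimum after 20 steps (slow learning); with α = 2 it cycles forever between 0 and 2 around the optimum, never settling.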
More Than One Output (Network)

A Single Layer Linear Neural Network (consisting of linear units) has inputs x1, x2, ..., xn and outputs y1, y2, ..., ym. The behaviour of this network is most conveniently expressed by

y = W x

where x is the input vector, y is the output vector, and W is the weight matrix (w_ij, i = 1, ..., m; j = 1, ..., n).
More Than One Output (2)

The weight of the j-th input to the i-th neuron is element [i, j] of the weight matrix W. So the delta rule for learning now takes the form

W(t+1) = W(t) + α δ x^T    where δ = (z - y)    (^T denotes transpose)

or, element by element,

w_ij(t+1) = w_ij(t) + α x_j (z_i - y_i)
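The matrix form of the update is a one-liner with an outer product (the function name and default α are illustrative):

```python
import numpy as np

# Matrix delta rule for a single-layer linear network:
# y = W x,   W(t+1) = W(t) + alpha * (z - y) x^T.

def delta_step(W, x, z, alpha=0.1):
    y = W @ x                            # outputs of all m units in parallel
    return W + alpha * np.outer(z - y, x)  # rank-1 update: element [i,j] gets alpha*x_j*(z_i - y_i)
```

Each row i of the outer product is x scaled by the i-th unit's error, which reproduces the element-by-element form w_ij(t+1) = w_ij(t) + α x_j (z_i - y_i).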
More Than One Output (3)

How does such a multiple output, single layer network behave? Clearly it can be viewed as m single neurons operating in parallel: there are no connections that could allow the behaviour of one neuron to affect another.

A single unit finds an approximation to z = f(x). The multiple output network simply extends that to the case where the function f has vector rather than scalar values.
Multiple Layer Network Using Linear Units

Let's try adding another layer of linear units (not linear separation units), so that the inputs x feed a hidden layer h, which in turn feeds the outputs y. This is known as a feed-forward neural network because signals flow in one direction (forward).
Multiple Layer Network Using Linear Units (2)

Now let us see what it does:

h = W1 x    (lower layer)
y = W2 h    (higher layer)

Therefore y = W2 (W1 x) = (W2 W1) x. But the product of two matrices is another matrix, so let W = W2 W1. Hence

y = W x

A multilayer linear neural network can be equivalently implemented by a single layer neural network.
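This collapse of two linear layers into one can be checked numerically (the helper names and layer sizes are illustrative):

```python
import numpy as np

# Two linear layers collapse to one: y = W2 (W1 x) = (W2 W1) x.

def two_layer(x, W1, W2):
    h = W1 @ x            # lower layer
    return W2 @ h         # higher layer

def one_layer(x, W1, W2):
    W = W2 @ W1           # single equivalent weight matrix
    return W @ x
```

For any input x and any weight matrices of compatible shapes, the two functions return the same output vector, which is exactly the slide's argument.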
Multiple Layer Network Using Linear Units (3)

Conclusion:
- There is no point building a multilayer feed-forward network if the units perform linear transformations of their inputs.
- Therefore, the only way to build multilayer neural networks that are more powerful than single layer networks is to use units (neurons) with non-linear output functions.
- But the delta rule is designed to find the best weights for linear units only. A new learning rule/algorithm is needed!
Summary

Learning in Neural Networks Using the Delta Rule
- The Hebb Rule
- The Perceptron Rule
- The Delta Rule
- Limitations of the Delta Rule
- Multilayer Networks Using Linear Units