Single layer NN

We consider the simplest architecture, consisting of just one neuron. Generalization to a single layer with more neurons (as illustrated below) is easy because:
- the output units are independent of each other;
- each weight affects only one of the outputs.

Neuron Model

The (McCulloch-Pitts) perceptron is a single-layer NN with a non-linear activation φ, the sign function:

    φ(v) = +1 if v ≥ 0
           −1 if v < 0

[Figure: a single neuron with inputs x_1, ..., x_n, weights w_1, ..., w_n, bias b, induced local field v, and output y = φ(v).]
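To make the neuron model concrete, here is a minimal sketch (in Python, with illustrative names of my own, not from the slides) of a single McCulloch-Pitts neuron computing y = φ(wᵀx + b) with the sign activation:

```python
import numpy as np

def perceptron_output(x, w, b):
    """Return +1 if w.x + b >= 0, else -1 (the sign activation phi)."""
    v = np.dot(w, x) + b          # induced local field v
    return 1 if v >= 0 else -1    # y = phi(v)

# Example: weights (1, -1), bias 0.5, input (2, 1) -> v = 1.5 -> output +1
print(perceptron_output(np.array([2.0, 1.0]), np.array([1.0, -1.0]), 0.5))
```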
Training

How can we train a perceptron for a classification task? We try to find suitable values for the weights in such a way that the training examples are correctly classified. Geometrically, we try to find a hyperplane that separates the examples of the two classes.

Geometric View

The equation

    Σ_{i=1}^{m} w_i x_i + w_0 = 0

describes a (hyper-)plane in the input space of real-valued m-dimensional vectors. The plane splits the input space into two regions, each of them corresponding to one class.

[Figure: in two dimensions, the decision boundary w_1 x_1 + w_2 x_2 + w_0 = 0 separates the decision region for C1 (w_1 x_1 + w_2 x_2 + w_0 ≥ 0) from the region for C2.]
Classification

The perceptron is used for binary classification. Given training examples of classes C1 and C2, train the perceptron in such a way that it classifies the training examples correctly:
- if the output of the perceptron is +1, the input is assigned to class C1;
- if the output is −1, the input is assigned to class C2.

The learning algorithm

A boundary ax + by + c = 0 can be rewritten as y = −(a/b)x − c/b.

Example: the line r: x − y + 1 = 0, i.e. y = x + 1, and the point P(4, 3):

    x(n) = [−1, 4, 3],  w(n) = [1, 1, −1]
    w(n+1) = w(n) + x(n) = [0, 5, 2]

so the new boundary is 5x + 2y = 0, i.e. y = −(5/2)x.

Training rule:
- w(n+1) = w(n)            if wᵀ(n)x(n) > 0 and x(n) belongs to class C1
- w(n+1) = w(n)            if wᵀ(n)x(n) < 0 and x(n) belongs to class C2
- w(n+1) = w(n) − η(n)x(n) if wᵀ(n)x(n) > 0 and x(n) belongs to class C2
- w(n+1) = w(n) + η(n)x(n) if wᵀ(n)x(n) < 0 and x(n) belongs to class C1
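As a hedged sketch of the training rule above (the function name and the class encoding are mine, not from the slides), the four cases can be written compactly:

```python
import numpy as np

def perceptron_update(w, x, cls, eta=1.0):
    """One application of the training rule to the augmented example x of class cls."""
    v = np.dot(w, x)
    if cls == "C1" and v > 0:       # correctly classified: no change
        return w
    if cls == "C2" and v < 0:       # correctly classified: no change
        return w
    if cls == "C2":                 # w(n).x(n) >= 0 but x(n) belongs to C2
        return w - eta * x
    return w + eta * x              # w(n).x(n) <= 0 but x(n) belongs to C1
```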
The learning algorithm

    n = 1; initialize w(n) randomly;
    while (there are misclassified training examples)
        select a misclassified augmented example (x(n), d(n));
        w(n+1) = w(n) + η d(n) x(n);
        n = n + 1;
    end-while

η = learning rate parameter (a real number).

Convergence theorem

Suppose the classes C1, C2 are linearly separable (that is, there exists a hyperplane that separates them). Then the perceptron algorithm applied to C1 ∪ C2 terminates successfully after a finite number of iterations.

Proof: Consider the set C containing the inputs of C1 ∪ C2, transformed by replacing x with −x for each x with class label −1. For simplicity assume w(1) = 0 and η = 1. Let x(1), ..., x(k) ∈ C be the sequence of inputs that have been used after k iterations, so that

    w(2) = w(1) + x(1)
    w(3) = w(2) + x(2)
    ...
    w(k+1) = w(k) + x(k)

and therefore w(k+1) = x(1) + ... + x(k).
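A runnable sketch of this loop, assuming augmented inputs (bias folded into the weights) and desired outputs d(n) ∈ {+1, −1}; the example selection and the iteration cap are assumptions of mine:

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_iter=1000):
    """X: (N, m) array of augmented examples; d: array of N labels in {+1, -1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])              # initialize w(n) randomly
    for _ in range(max_iter):
        # np.sign(0) == 0, so points on the boundary also count as misclassified here
        misclassified = [i for i in range(len(X)) if np.sign(X[i] @ w) != d[i]]
        if not misclassified:                     # no misclassified examples: done
            return w
        i = misclassified[0]                      # select a misclassified example
        w = w + eta * d[i] * X[i]                 # w(n+1) = w(n) + eta d(n) x(n)
    return w
```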
Convergence theorem (proof)

Since C1 and C2 are linearly separable, there exists w* such that w*ᵀx > 0 for every x ∈ C. Let α = min_{x ∈ C} w*ᵀx. Then

    w*ᵀw(k+1) = w*ᵀx(1) + ... + w*ᵀx(k) ≥ kα.

By the Cauchy-Schwarz inequality,

    ‖w*‖² ‖w(k+1)‖² ≥ [w*ᵀw(k+1)]² ≥ k²α²,

hence

    ‖w(k+1)‖² ≥ k²α² / ‖w*‖².      (A)

Convergence theorem (proof)

Now we consider another route. From w(k+1) = w(k) + x(k), taking squared Euclidean norms,

    ‖w(k+1)‖² = ‖w(k)‖² + ‖x(k)‖² + 2 wᵀ(k)x(k),

where 2 wᵀ(k)x(k) ≤ 0 because x(k) is misclassified, so

    ‖w(k+1)‖² ≤ ‖w(k)‖² + ‖x(k)‖².

Since ‖w(1)‖² = 0:

    ‖w(2)‖² ≤ ‖w(1)‖² + ‖x(1)‖²
    ‖w(3)‖² ≤ ‖w(2)‖² + ‖x(2)‖²
    ...
    ‖w(k+1)‖² ≤ Σ_{i=1}^{k} ‖x(i)‖².
Convergence theorem (proof)

Let β = max_{x(n) ∈ C} ‖x(n)‖². Then

    ‖w(k+1)‖² ≤ kβ.      (B)

For sufficiently large values of k, (B) comes into conflict with (A). Hence k cannot be greater than k_max, the value for which (A) and (B) are both satisfied with the equality sign:

    k_max² α² / ‖w*‖² = k_max β   ⇒   k_max = β ‖w*‖² / α².

The algorithm therefore terminates successfully in at most β‖w*‖²/α² iterations; beyond that point the weights no longer change, i.e. ‖w(k+1) − w(k)‖ = 0 and lim_{k→∞} w(k) = w(k_max).

Example

Consider the 2-dimensional training set C1 ∪ C2, with

    C1 = {(1, 1), (1, −1), (0, −1)} with class label +1
    C2 = {(−1, −1), (−1, 1), (0, 1)} with class label −1.

Train a perceptron on C1 ∪ C2.
A possible implementation

Consider the augmented training set C1 ∪ C2, with the first entry fixed to 1 (to deal with the bias as an extra weight):

    (1, 1, 1), (1, 1, −1), (1, 0, −1), (1, −1, −1), (1, −1, 1), (1, 0, 1)

Replace x with −x for all x ∈ C2 and use the following update rule:

    w(n+1) = w(n) + η x(n)   if wᵀ(n)x(n) ≤ 0
    w(n+1) = w(n)            otherwise

Epoch = the application of the update rule to each example of the training set. Terminate the execution of the learning algorithm if the weights do not change after one epoch.

Execution

The execution of the perceptron learning algorithm is illustrated below, epoch by epoch, with w(1) = (1, 0, 0), η = 1, and transformed inputs (1, 1, 1), (1, 1, −1), (1, 0, −1), (−1, 1, 1), (−1, 1, −1), (−1, 0, −1).

    Adjusted pattern   Weight applied w(n)   wᵀ(n)x(n)   Update?   New weight
    (1, 1, 1)          (1, 0, 0)              1           No        (1, 0, 0)
    (1, 1, −1)         (1, 0, 0)              1           No        (1, 0, 0)
    (1, 0, −1)         (1, 0, 0)              1           No        (1, 0, 0)
    (−1, 1, 1)         (1, 0, 0)             −1           Yes       (0, 1, 1)
    (−1, 1, −1)        (0, 1, 1)              0           Yes       (−1, 2, 0)
    (−1, 0, −1)        (−1, 2, 0)             1           No        (−1, 2, 0)

End of epoch 1.
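Before continuing the trace, here is a minimal sketch of this implementation on the example data (variable names are mine); with the same initialization and ordering it reproduces the trace, ending at the weight vector reported below:

```python
import numpy as np

C1 = [(1, 1), (1, -1), (0, -1)]      # class label +1
C2 = [(-1, -1), (-1, 1), (0, 1)]     # class label -1

# Augment with a first entry fixed to 1, then replace x with -x for the C2 examples.
patterns = [np.array([1, *p]) for p in C1] + [-np.array([1, *p]) for p in C2]

w, eta = np.array([1.0, 0.0, 0.0]), 1.0          # w(1) = (1, 0, 0), eta = 1
while True:
    changed = False
    for x in patterns:                           # one pass over the set = one epoch
        if w @ x <= 0:                           # misclassified (or on the boundary)
            w = w + eta * x
            changed = True
    if not changed:                              # weights unchanged for a whole epoch
        break

print(w)                                         # (0, 2, -1) for this data
```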
Execution

    Adjusted pattern   Weight applied w(n)   wᵀ(n)x(n)   Update?   New weight
    (1, 1, 1)          (−1, 2, 0)             1           No        (−1, 2, 0)
    (1, 1, −1)         (−1, 2, 0)             1           No        (−1, 2, 0)
    (1, 0, −1)         (−1, 2, 0)            −1           Yes       (0, 2, −1)
    (−1, 1, 1)         (0, 2, −1)             1           No        (0, 2, −1)
    (−1, 1, −1)        (0, 2, −1)             3           No        (0, 2, −1)
    (−1, 0, −1)        (0, 2, −1)             1           No        (0, 2, −1)

End of epoch 2.

At epoch 3 no weight changes (check!), so the execution of the algorithm stops. Final weight vector: (0, 2, −1); the decision hyperplane is 2x₁ − x₂ = 0.

Result

[Figure: the six training points in the (x₁, x₂) plane, with the C1 points marked + and the C2 points marked −, separated by the decision boundary 2x₁ − x₂ = 0.]
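As a quick check (mine, not from the slides), the final weights (0, 2, −1), i.e. the boundary 2x₁ − x₂ = 0, indeed separate the original, untransformed classes:

```python
import numpy as np

w = np.array([0, 2, -1])                         # final weights (w0, w1, w2)
C1 = [(1, 1), (1, -1), (0, -1)]                  # expected sign: +1
C2 = [(-1, -1), (-1, 1), (0, 1)]                 # expected sign: -1

for p in C1 + C2:
    x = np.array([1, *p])                        # augmented input (1, x1, x2)
    print(p, int(np.sign(w @ x)))                # +1 for C1 points, -1 for C2 points
```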
Limitations

The perceptron can only model linearly separable classes, like (those described by) the following Boolean functions: AND, OR. It cannot model the XOR!

Adaline: Adaptive Linear Element

When the two classes are not linearly separable, it may be desirable to obtain a linear separator that minimizes the mean squared error. Adaline (Adaptive Linear Element) uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm.

For the k-th example (x^(k), d^(k)):

    E^(k)(w) = 1/2 (d^(k) − y^(k))² = 1/2 (d^(k) − Σ_{j=0}^{m} x_j^(k) w_j)²

Total error:

    E_tot = Σ_{k=1}^{N} E^(k)
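A minimal sketch of these error quantities for a linear neuron on augmented inputs (function names are illustrative, not from the slides):

```python
import numpy as np

def example_error(w, x, d):
    """E^(k)(w) = 1/2 (d^(k) - y^(k))^2 with y^(k) = sum_j x_j^(k) w_j."""
    y = np.dot(w, x)
    return 0.5 * (d - y) ** 2

def total_error(w, X, D):
    """E_tot = sum of E^(k) over the N examples."""
    return sum(example_error(w, x, d) for x, d in zip(X, D))
```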
Adaline

The total error E_tot is the sum of the squared errors of all the examples. E_tot is a quadratic function of the weights, whose derivative exists everywhere. Incremental gradient descent may therefore be used to minimize E_tot: at each iteration the LMS algorithm selects an example and decreases the network error E^(k) of that example.

Incremental Gradient Descent

- Start from an arbitrary point in the weight space.
- The direction in which the error E^(k) of an example (as a function of the weights) decreases most rapidly is the opposite of the gradient of E^(k):

      ∇E^(k)(w) = (∂E^(k)/∂w_1, ..., ∂E^(k)/∂w_m)

- Take a small step (of size η) in that direction:

      w(k+1) = w(k) − η ∇E^(k)(w)
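A sketch of a single incremental gradient-descent step on E^(k) (names are mine; the analytic gradient −(d^(k) − y^(k)) x^(k) used here is the one derived on the next slide):

```python
import numpy as np

def gradient_step(w, x, d, eta=0.1):
    """One step of w(k+1) = w(k) - eta * grad E^(k)(w) for a linear neuron."""
    y = np.dot(w, x)                 # linear neuron output
    grad = -(d - y) * x              # gradient of E^(k) with respect to the weights
    return w - eta * grad            # small step opposite to the gradient
```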
Weights Update Rule

Computation of the gradient:

    ∂E_tot/∂w_j = Σ_{k=1}^{N} ∂E^(k)(w)/∂w_j

    ∂E^(k)(w)/∂w_j = ∂E^(k)(w)/∂y^(k) · ∂y^(k)/∂w_j = −(d^(k) − y^(k)) x_j^(k)

Delta rule for weight update:

    w(k+1) = w(k) + η (d^(k) − y^(k)) x^(k) = w(k) + η e^(k) x^(k)

LMS learning algorithm

    k = 1; initialize w(k) randomly;
    while (E_tot unsatisfactory and k < max_iterations)
        select an example (x^(k), d^(k));
        e^(k) = d^(k) − wᵀ(k) x^(k);
        w(k+1) = w(k) + η e^(k) x^(k);
        k = k + 1;
    end-while

η = learning rate parameter (real number).
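A runnable sketch of the LMS loop above; the slides only say "select an example", so the cyclic selection, the stopping threshold, and the names below are assumptions of mine:

```python
import numpy as np

def lms(X, D, eta=0.05, max_iterations=10_000, target_error=1e-3):
    """X: (N, m) array of augmented examples; D: array of N desired outputs."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])                  # initialize w(k) randomly
    for k in range(max_iterations):
        x, d = X[k % len(X)], D[k % len(X)]          # select an example (cyclically)
        e = d - w @ x                                # e^(k) = d^(k) - w(k)^T x(k)
        w = w + eta * e * x                          # w(k+1) = w(k) + eta e^(k) x(k)
        E_tot = 0.5 * np.sum((D - X @ w) ** 2)       # total error over all examples
        if E_tot < target_error:                     # E_tot satisfactory: stop
            break
    return w
```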
Comparison: Perceptron and Adaline

                          Perceptron                              Adaline
    Architecture          Single-layer                            Single-layer
    Neuron model          Non-linear                              Linear
    Learning algorithm    Minimize the number of misclassified    Minimize the total error (LMS)
                          examples
    Application           Linear classification                   Linear classification, regression