Plan for Today

Single-Layer Perceptron
- Accelerated Learning
- Online- vs. Batch-Learning

Multi-Layer Perceptron
- Model
- Backpropagation

Computational Intelligence, Winter Term
Prof. Dr. Günter Rudolph
Lehrstuhl für Algorithm Engineering (LS 11)
Fakultät für Informatik, TU Dortmund


Single-Layer Perceptron (SLP)

Acceleration of Perceptron Learning

Assumption: x ∈ {0,1}^n  ⇒  ‖x‖ ≥ 1 for all x ≠ (0, ..., 0).

If the classification is incorrect, then w'x < 0. Consequently, the size of the error is just δ = −w'x > 0.

⇒ w_{t+1} = w_t + (δ + ε) x for ε > 0 (small) corrects the error in a single step, since

  w_{t+1}' x = (w_t + (δ + ε) x)' x
             = w_t' x + (δ + ε) x'x
             = −δ + δ ‖x‖² + ε ‖x‖²
             = δ (‖x‖² − 1) + ε ‖x‖² > 0,

because ‖x‖² ≥ 1 makes the first term ≥ 0 and the second term is > 0.

Generalization:

Assumption: x ∈ R^n  ⇒  ‖x‖ > 0 for all x ≠ (0, ..., 0).

As before: w_{t+1} = w_t + (δ + ε) x for ε > 0 (small) and δ = −w_t' x > 0.

⇒ w_{t+1}' x = δ (‖x‖² − 1) + ε ‖x‖² < 0 is possible if ‖x‖² < 1!

Idea: Scaling of the data does not alter the classification task!

Let ℓ = min { ‖x‖ : x ∈ B } > 0 and set x̂ = x / ℓ  ⇒  set of scaled examples B̂
⇒ ‖x̂‖ ≥ 1  ⇒  w_{t+1}' x̂ > 0.
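A minimal sketch of this corrective step in Python (function name, ε, and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def corrective_step(w, x, eps=0.01):
    # Accelerated perceptron update: if x is misclassified (w·x < 0),
    # add (delta + eps)·x; for ||x||^2 >= 1 this makes w·x positive
    # in a single step, as in the derivation above.
    err = float(np.dot(w, x))
    if err < 0:
        delta = -err                  # size of the error, delta > 0
        w = w + (delta + eps) * x
    return w

w = np.array([-1.0, 0.5])             # initially misclassifies x
x = np.array([1.0, 1.0])              # binary pattern, ||x||^2 = 2 >= 1
w_new = corrective_step(w, x)
print(np.dot(w_new, x) > 0)           # error corrected in one step
```

After the update, w_new·x equals δ(‖x‖² − 1) + ε‖x‖² = 0.5·1 + 0.01·2 = 0.52 > 0, matching the formula.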
Single-Layer Perceptron (SLP)

There exist numerous variants of perceptron learning methods.

Theorem (Duda & Hart 1973):
If the rule for correcting the weights is

  w_{t+1} = w_t + γ_t x   (if w_t' x < 0)

with
1. γ_t ≥ 0 for all t ≥ 0,
2. Σ_{t=0}^∞ γ_t = ∞,
3. lim_{m→∞} ( Σ_{t=0}^m γ_t² ) / ( Σ_{t=0}^m γ_t )² = 0,
then w_t → w* for t → ∞ with x' w* > 0 for all x ∈ B.

e.g.: γ_t = γ > 0, or γ_t = γ / (t + 1) for γ > 0.

Online vs. Batch Learning

as yet: Online learning: update of the weights after each training pattern (if necessary).
now: Batch learning: update of the weights only after testing all training patterns.

Update rule (batch):

  w_{t+1} = w_t + γ Σ_{x ∈ B : w_t' x < 0} x    (γ > 0)

Vague assessment in the literature:
  advantage: usually faster
  disadvantage: needs more memory (just a single additional vector!)

Find weights by means of optimization

Let F(w) = { x ∈ B : w' x < 0 } be the set of patterns incorrectly classified by weight vector w.

Objective function:

  f(w) = Σ_{x ∈ F(w)} −w' x  →  min!

Optimum: f(w) = 0 iff F(w) is empty.

Possible approach: gradient method

  w_{t+1} = w_t − γ ∇f(w_t)    (γ > 0),

which converges to a local minimum (depending on w_0). The gradient

  ∇f(w) = ( ∂f/∂w_1, ..., ∂f/∂w_n )'

points in the direction of steepest ascent of the function f(·), hence the minus sign.

Caution: the indices i of the w_i here denote components of the vector w; they are not iteration counters!
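The batch update rule above can be sketched in Python (the toy data are made up; testing w'x ≤ 0 instead of < 0 is a small practical tweak so that learning can leave the all-zero start):

```python
import numpy as np

def batch_perceptron(B, gamma=1.0, max_iter=1000):
    # B: rows are sign-normalized training patterns; the goal is
    # a weight vector w with B @ w > 0 for every row.
    w = np.zeros(B.shape[1])
    for _ in range(max_iter):
        F = B[B @ w <= 0]                 # misclassified patterns F(w)
        if F.shape[0] == 0:
            break                          # all patterns classified correctly
        w = w + gamma * F.sum(axis=0)      # one batch update step
    return w

B = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]])  # separable toy set
w = batch_perceptron(B)
print(bool(np.all(B @ w > 0)))            # True
```

Note that the sum over all misclassified patterns is exactly one weight update, i.e. one pass over B per step, in contrast to the per-pattern updates of online learning.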
Single-Layer Perceptron (SLP)

Gradient method (continued)

For f(w) = Σ_{x ∈ F(w)} −w' x the gradient is

  ∇f(w) = − Σ_{x ∈ F(w)} x,

thus the gradient step

  w_{t+1} = w_t + γ Σ_{x ∈ F(w_t)} x

is exactly the batch update rule: gradient method = batch learning.

How difficult is it
(a) to find a separating hyperplane, provided it exists?
(b) to decide that there is no separating hyperplane?

Let B = P ∪ { −x : x ∈ N } (only positive examples), w_i ∈ R, θ ∈ R, |B| = m.

For every example x_i ∈ B the following should hold:

  x_{i1} w_1 + x_{i2} w_2 + ... + x_{in} w_n ≥ θ.

The trivial solution w_i = θ = 0 has to be excluded! Therefore additionally, with η ∈ R:

  x_{i1} w_1 + x_{i2} w_2 + ... + x_{in} w_n − θ − η ≥ 0.

Idea: maximize η. If η* > 0, then a solution has been found.

Matrix notation: linear programming problem

  f(z_1, z_2, ..., z_n, z_{n+1}, z_{n+2}) = z_{n+2}  →  max!
  s.t. Az ≥ 0,

where z gathers the weights, the threshold and η. If z_{n+2} = η > 0, then weights and threshold are given by z; otherwise a separating hyperplane does not exist! The LP can be solved, e.g., by Karmarkar's algorithm in polynomial time.

What can be achieved by adding a layer?

- Single-layer perceptron (SLP): a hyperplane separates the input space into the two subspaces N and P.
- Two-layer perceptron: arbitrary convex sets can be separated; the half-spaces of the 1st layer are connected by an AND gate in the 2nd layer.
- Three-layer perceptron: arbitrary sets can be separated (depending on the number of neurons): several convex sets are representable by the 2nd layer, and these convex sets can be combined by an OR gate in the 3rd layer.

⇒ more than 3 layers are not necessary!
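This LP can be sketched with SciPy's `linprog` (instead of Karmarkar's algorithm). The function name and toy patterns are illustrative; capping η at 1 is an added assumption that keeps the LP bounded, and the rows of B are assumed to be sign-normalized (and, if a threshold is wanted, augmented) patterns:

```python
import numpy as np
from scipy.optimize import linprog

def find_separating_weights(B):
    # Variables z = (w_1, ..., w_n, eta); maximize eta = minimize -eta
    # subject to B @ w - eta >= 0, i.e. -B @ w + eta <= 0.
    m, n = B.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-B, np.ones((m, 1))])
    b_ub = np.zeros(m)
    bounds = [(None, None)] * n + [(None, 1.0)]   # cap eta at 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    if res.success and res.x[-1] > 1e-9:          # eta* > 0: solution found
        return res.x[:n]
    return None                                    # no separating hyperplane

# separable toy patterns vs. the inseparable pair x and -x
print(find_separating_weights(np.array([[1.0, 1.0], [2.0, 0.5]])) is not None)
print(find_separating_weights(np.array([[1.0, 0.0], [-1.0, 0.0]])) is None)
```

For the inseparable pair the LP optimum is η* = 0 (only the trivial solution satisfies both constraints), so the function correctly reports that no separating hyperplane exists.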
XOR with 3 neurons in 2 steps
(y1 = 1 iff x1 + x2 ≥ 1, y2 = 1 iff x1 + x2 ≥ 2, z = 1 iff y1 − y2 ≥ 1):

  x1 x2 | y1 y2 | z
   0  0 |  0  0 | 0
   0  1 |  1  0 | 1
   1  0 |  1  0 | 1
   1  1 |  1  1 | 0

XOR with 3 neurons in 2 layers
(y1 = 1 iff x1 − x2 ≥ 1, y2 = 1 iff x2 − x1 ≥ 1, z = 1 iff y1 + y2 ≥ 1;
a convex set without an AND gate in the 2nd layer):

  x1 x2 | y1 y2 | z
   0  0 |  0  0 | 0
   0  1 |  0  1 | 1
   1  0 |  1  0 | 1
   1  1 |  0  0 | 0

XOR can be realized with only 2 neurons!
(y = 1 iff x1 + x2 ≥ 2, z = 1 iff x1 + x2 − 2y ≥ 1):

  x1 x2 | y | −2y | x1 + x2 − 2y | z
   0  0 | 0 |  0  |      0       | 0
   0  1 | 0 |  0  |      1       | 1
   1  0 | 0 |  0  |      1       | 1
   1  1 | 1 | −2  |      0       | 0

BUT: this is not a layered network (no MLP)!

Evidently: MLPs can be deployed for significantly more difficult problems than SLPs!
But: How can we adjust all these weights and thresholds?
Is there an efficient learning algorithm for MLPs?

History:
The unavailability of an efficient learning algorithm for MLPs was a brake on progress until Rumelhart, Hinton and Williams (1986) presented Backpropagation. It was actually proposed already by Werbos (1974), but remained unknown to ANN researchers (it was a PhD thesis).
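The 2-neuron construction is easy to verify in Python (the function name is illustrative):

```python
def xor_two_neurons(x1, x2):
    # Neuron y is an AND gate; the output neuron re-uses the raw
    # inputs with weight -2 on y, which is exactly why this is not
    # a layered network (no MLP).
    y = 1 if x1 + x2 >= 2 else 0           # y = x1 AND x2
    z = 1 if x1 + x2 - 2 * y >= 1 else 0   # output neuron
    return z

print([xor_two_neurons(a, b) for a in (0, 1) for b in (0, 1)])
# [0, 1, 1, 0]  -> the XOR truth table
```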
Quantification of the classification error of an MLP

Total Sum Squared Error (TSSE):

  TSSE = Σ_{(x, z*) ∈ B} Σ_k ( z_k(w; x) − z*_k )²

where z(w; x) is the output of the net for weights w and input x, and z* is the target output of the net for input x.

Total Mean Squared Error (TMSE):

  TMSE = TSSE / (m · K)

where m = # training patterns and K = # output neurons. Since m · K is constant, minimizing the TMSE leads to the same solution as minimizing the TSSE.

Learning algorithms for the Multi-Layer Perceptron (here: 2 layers)

Idea: minimize the error!  f(w_t, u_t) = TSSE → min!

Gradient method:
  u_{t+1} = u_t − γ ∇_u f(w_t, u_t)
  w_{t+1} = w_t − γ ∇_w f(w_t, u_t)

BUT: f(w, u) cannot be differentiated! Why? Because of the discontinuous activation function a(·) in the neurons (output 0 below the threshold θ, output 1 at or above it).

Idea: find a smooth activation function similar to the original one!

Good idea: sigmoid activation function (instead of the signum function), e.g.

  a(x) = 1 / (1 + e^{−(x − θ)}):

- monotone increasing
- differentiable
- non-linear
- output in [0, 1] instead of {0, 1}
- threshold θ integrated in the activation function
- values of the derivative directly determinable from the function values: a'(x) = a(x) (1 − a(x))

The 2-layer network:
- x_i: inputs (i = 1, ..., I)
- y_j: values after the first layer, y_j = h( Σ_i w_ij x_i )  (j = 1, ..., J)
- z_k: values after the second layer, z_k = a( Σ_j u_jk y_j )  (k = 1, ..., K)
with weights w_ij in the first and u_jk in the second layer.
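A small sketch of the sigmoid and its derivative-from-function-value property a'(x) = a(x)(1 − a(x)); the finite-difference comparison is just an illustration:

```python
import math

def sigmoid(x, theta=0.0):
    # Sigmoid activation with the threshold theta integrated.
    return 1.0 / (1.0 + math.exp(-(x - theta)))

def sigmoid_prime(x, theta=0.0):
    # Derivative obtained directly from the function value:
    # a'(x) = a(x) * (1 - a(x)).
    a = sigmoid(x, theta)
    return a * (1.0 - a)

# check against a central finite-difference approximation
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(sigmoid_prime(x) - numeric) < 1e-8)  # True
```

This property is what makes backpropagation cheap: the derivative at every neuron comes for free from the forward-pass values.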
Learning algorithms for the Multi-Layer Perceptron (here: 2 layers)

Output of neuron j after the 1st layer:

  y_j = h( Σ_{i=1}^{I} w_ij x_i )

Output of neuron k after the 2nd layer:

  z_k = a( Σ_{j=1}^{J} u_jk y_j )

Error of the net for input x and target output z* (z_k = output of the net, z*_k = target output for input x):

  e(x) = Σ_{k=1}^{K} ( z_k − z*_k )²

Total error for all training patterns (x, z*) ∈ B (TSSE):

  E(B) = Σ_{(x, z*) ∈ B} e(x)

Gradient of the total error:

  ∇E(B) = Σ_{(x, z*) ∈ B} ∇e(x),

the vector of the partial derivatives w.r.t. the weights u_jk and w_ij. It therefore suffices to differentiate e(x) for a single pattern: insert

  z_k = a( Σ_j u_jk y_j )  and  y_j = h( Σ_i w_ij x_i )

into e(x) and apply the chain rule of differential calculus (outer derivative times inner derivative).
Partial derivative w.r.t. u_jk:

  ∂e(x)/∂u_jk = 2 ( z_k − z*_k ) · a'( Σ_j u_jk y_j ) · y_j = δ_k y_j

with the error signal

  δ_k = 2 ( z_k − z*_k ) · a'( Σ_j u_jk y_j ).

Partial derivative w.r.t. w_ij (factors reordered):

  ∂e(x)/∂w_ij = Σ_k 2 ( z_k − z*_k ) · a'( Σ_j u_jk y_j ) · u_jk · h'( Σ_i w_ij x_i ) · x_i
              = [ Σ_k δ_k u_jk ] · h'( Σ_i w_ij x_i ) · x_i = δ_j x_i,

i.e. the error signal δ_j of the current layer is built from the error signals δ_k of the previous (subsequent, in forward direction) layer:

  δ_j = h'( Σ_i w_ij x_i ) · Σ_k δ_k u_jk.

Generalization (> 2 layers)

Let the neural network have L layers S_1, S_2, ..., S_L, let the neurons of all layers be numbered from 1 to N, and let all weights w_ij be gathered in a weight matrix W. Let o_j be the output of neuron j.

Error signal of a neuron j ∈ S_m (neuron j is in the m-th layer): the error signal of a neuron in an inner layer is determined by the error signals of all neurons of the subsequent layer and the weights of the associated connections.

⇒ First determine the error signals of the output neurons, use these error signals to calculate the error signals of the preceding layer, use those to calculate the error signals of the layer preceding that one, and so forth, until the first inner layer is reached.

⇒ Thus, the error is propagated backwards from the output layer to the first inner layer ⇒ backpropagation (of error).

Correction of a weight: Δw_ij = −γ δ_j o_i; in case of online learning, the correction is applied after each presented training pattern.
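The error signals δ_k and δ_j translate directly into one online training step for the 2-layer net. The following sketch assumes h = a = sigmoid (so h' = y(1 − y) and a' = z(1 − z)); shapes, names and data are made up for illustration:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def tsse(W, U, x, z_star):
    # Squared error of the 2-layer net for a single pattern.
    z = sigmoid(U.T @ sigmoid(W.T @ x))
    return float(np.sum((z - z_star) ** 2))

def backprop_step(W, U, x, z_star, gamma=0.1):
    y = sigmoid(W.T @ x)                      # first-layer outputs y_j
    z = sigmoid(U.T @ y)                      # second-layer outputs z_k
    delta_k = 2 * (z - z_star) * z * (1 - z)  # output error signals
    delta_j = (U @ delta_k) * y * (1 - y)     # hidden error signals
    U = U - gamma * np.outer(y, delta_k)      # de/du_jk = delta_k * y_j
    W = W - gamma * np.outer(x, delta_j)      # de/dw_ij = delta_j * x_i
    return W, U

rng = np.random.default_rng(1)
W, U = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))  # I=2, J=3, K=1
x, z_star = np.array([1.0, 0.5]), np.array([1.0])
e0 = tsse(W, U, x, z_star)
W, U = backprop_step(W, U, x, z_star)
print(tsse(W, U, x, z_star) < e0)             # one step reduces the error
```

The two `np.outer` lines are exactly the formulas ∂e/∂u_jk = δ_k y_j and ∂e/∂w_ij = δ_j x_i applied to all weights at once.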
⇒ Other optimization algorithms are deployable as well! In addition to backpropagation (gradient descent):

- Backpropagation with momentum: take into account also the previous change of the weights.

- QuickProp: assumption: the error function can be approximated locally by a quadratic function; the update rule uses the weights of the last two steps, t − 1 and t.

- Resilient Propagation (RPROP): exploits the sign of the partial derivatives: two times negative or two times positive ⇒ increase the step size! change of sign ⇒ reset the last step and decrease the step size! Typical values: factor for decreasing 0.5 / factor for increasing 1.2.

- Evolutionary algorithms: individual = weight matrix.

More about this later!
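The RPROP step-size adaptation for a single weight can be sketched as follows, using the slide's typical factors 0.5 and 1.2 (the step bounds are common but illustrative choices, not from the slides):

```python
def rprop_step_size(step, grad_prev, grad_now,
                    eta_minus=0.5, eta_plus=1.2,
                    step_min=1e-6, step_max=50.0):
    # Same sign of the partial derivative twice in a row -> increase
    # the step size; sign change -> decrease it; otherwise keep it.
    s = grad_prev * grad_now
    if s > 0:                                   # same sign twice
        step = min(step * eta_plus, step_max)
    elif s < 0:                                 # sign change
        step = max(step * eta_minus, step_min)
    return step

print(rprop_step_size(1.0, -0.3, -0.1))  # 1.2 (same sign: speed up)
print(rprop_step_size(1.0, -0.3, 0.2))   # 0.5 (sign change: slow down)
```

Only the sign of the derivative is used, never its magnitude, which makes RPROP robust against the vanishing gradients of saturated sigmoids.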