Feedforward Networks
Feedforward Networks: Gradient Descent Learning and Backpropagation

CPSC 533, Fall 2003

Christian Jacob, Dept. of Computer Science, University of Calgary
Adaptive "Programming" of ANNs through Learning

ANN Learning

A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.

Figure 1. Learning process in a parametric system (testing input/output examples, calculating network errors, changing network parameters).

In some learning algorithms, examples of the desired input-output mapping are presented to the network. A correction step is executed iteratively until the network learns to produce the desired response.
Learning Schemes

Unsupervised Learning

For a given input, the exact numerical output a network should produce is unknown. Since no "teacher" is available, the network must organize itself (e.g., in order to associate clusters with units). Examples: clustering with self-organizing feature maps, Kohonen networks.

Figure 2. Three clusters and a classifier network.

Supervised Learning

Some input vectors are collected and presented to the network. The output computed by the network is observed, and the deviation from the expected answer is measured. The weights are corrected (= learning algorithm) according to the magnitude of the error.

- Error-correction learning: The magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights. Examples: perceptron learning, backpropagation.

- Reinforcement learning: After each presentation of an input-output example, we only know whether the network produces the desired result or not. The weights are updated based on this Boolean decision (true or false). Example: learning how to ride a bike.
Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, which we have introduced as perceptron units. The linear network should learn the mappings (for $\mu = 1, \ldots, P$) between

- an input pattern $x^\mu = (x_1^\mu, \ldots, x_N^\mu)$ and

- an associated target pattern $T^\mu$.

Figure 3. Perceptron.
The output $O_i^\mu$ of cell $i$ for the input pattern $x^\mu$ is calculated as

$O_i^\mu = \sum_k w_{ki}\, x_k^\mu$   (1)

The goal of the learning procedure is that eventually the output $O_i^\mu$ for input pattern $x^\mu$ corresponds to the desired output $T_i^\mu$:

$O_i^\mu = \sum_k w_{ki}\, x_k^\mu \overset{!}{=} T_i^\mu$   (2)

Example: Letter Classification

Note: This letter classification will only work with non-linear (sigmoidal) processing units.
Explicit Solution (Linear Network)*

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

$w_{ik} = \frac{1}{P} \sum_{\mu \nu} T_i^\mu \, (Q^{-1})_{\mu \nu} \, x_k^\nu$   (3)

$Q_{\mu \nu} = \frac{1}{P} \sum_k x_k^\mu \, x_k^\nu$   (4)

Correlation Matrix

Here $Q_{\mu \nu}$ is a component of the correlation matrix $Q$ of the input patterns:

$Q = \frac{1}{P} \begin{pmatrix} \sum_k x_k^1 x_k^1 & \sum_k x_k^1 x_k^2 & \cdots & \sum_k x_k^1 x_k^P \\ \vdots & \vdots & \ddots & \vdots \\ \sum_k x_k^P x_k^1 & \sum_k x_k^P x_k^2 & \cdots & \sum_k x_k^P x_k^P \end{pmatrix}$   (5)

You can check that this is indeed a solution by verifying

$\sum_k w_{ik}\, x_k^\mu = T_i^\mu$.   (6)

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means: if there are $a_\mu$ (not all zero) such that for all $k = 1, \ldots, N$

$a_1 x_k^1 + a_2 x_k^2 + \cdots + a_P x_k^P = 0$,   (7)

then the outputs $O_i^\mu$ cannot be selected independently from each other, and the problem is NOT solvable.
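As a numerical illustration of Equations (3)-(6), here is a minimal Mathematica sketch (an addition to these notes; the patterns xs and targets ts are made-up values, and the patterns must be linearly independent):

(* Explicit pseudo-inverse solution, Equations (3)-(4) *)
xs = {{1., 0., 1.}, {0., 1., 1.}};      (* P = 2 input patterns, N = 3 inputs *)
ts = {{1.}, {0.}};                      (* targets for M = 1 output unit *)
p  = Length[xs];
q  = (1/p) xs.Transpose[xs];            (* correlation matrix Q, Equation (4) *)
w  = (1/p) Transpose[xs].Inverse[q].ts; (* weight matrix w[[k, i]], Equation (3) *)
xs.w == ts                              (* verifies Equation (6), up to machine precision *)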
Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with M output units. Starting from a random initial weight setting $w_0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(w)$:

$E(w) = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} (T_i^\mu - O_i^\mu)^2 = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} \Big( T_i^\mu - \sum_k w_{ki}\, x_k^\mu \Big)^2$   (8)

$E(w) \ge 0$ approaches zero as $w = \{w_{ki}\}$ satisfies Equation (2). This cost function is a quadratic function in weight space.
Paraboloid

Therefore, $E(w)$ is a paraboloid with a single global minimum.

<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

If the pattern vectors are linearly independent, i.e., a solution for Equation (2) exists, the minimum is at E = 0.
Graphical Illustration: Following the Gradient

Finding the Minimum: Following the Gradient

We can find the minimum of $E(w)$ in weight space by following the negative gradient

$-\nabla_w E(w) = -\frac{\partial E(w)}{\partial w}$   (9)

We can implement this gradient strategy as follows:

Changing a Weight

Each weight $w_{ki} \in w$ is changed by $\Delta w_{ki}$, proportionate to the E gradient at the current weight position (i.e., the current settings of all the weights):

$\Delta w_{ki} = -\eta\, \frac{\partial E(w)}{\partial w_{ki}}$   (10)
Steps Towards the Solution

$\Delta w_{ki} = -\eta\, \frac{\partial}{\partial w_{ki}} \left( \frac{1}{2} \sum_{\mu=1}^{P} \sum_{m=1}^{M} \Big( T_m^\mu - \sum_n w_{nm}\, x_n^\mu \Big)^2 \right)$

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{\mu=1}^{P} \frac{\partial}{\partial w_{ki}} \sum_{m=1}^{M} \Big( T_m^\mu - \sum_n w_{nm}\, x_n^\mu \Big)^2$   (11)

$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{\mu=1}^{P} 2\, \Big( T_i^\mu - \sum_n w_{ni}\, x_n^\mu \Big) (-x_k^\mu)$

Weight Adaptation Rule

$\Delta w_{ki} = \eta \sum_{\mu=1}^{P} (T_i^\mu - O_i^\mu)\, x_k^\mu$   (12)

The parameter $\eta$ is usually referred to as the learning rate. In this formula, the adaptation of the weights is accumulated over all patterns.

Delta, LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

$\Delta w_{ki} = \eta\, (T_i^\mu - O_i^\mu)\, x_k^\mu$   (13)

or

$\Delta w_{ki} = \eta\, \delta_i^\mu\, x_k^\mu$   (14)

with

$\delta_i^\mu = T_i^\mu - O_i^\mu$.   (15)

This learning rule has several names:
- Delta rule
- Adaline rule
- Widrow-Hoff rule
- LMS (least mean square) rule
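To make the rule concrete, here is a minimal Mathematica sketch of online delta-rule training for a single linear unit (an addition to these notes; the patterns, targets, learning rate, and epoch count are made-up choices):

(* Delta / LMS learning, Equations (13)-(15): w <- w + eta (T - O) x *)
xs  = {{1., 0.}, {0., 1.}, {1., 1.}};    (* input patterns x^mu *)
ts  = {0.5, 0.3, 0.8};                   (* target outputs T^mu *)
eta = 0.1;                               (* learning rate *)
w   = {0., 0.};                          (* initial weights *)
Do[
  MapThread[
    (o = w.#1;                           (* output O = Sum_k w_k x_k *)
     w = w + eta (#2 - o) #1) &,         (* update after each pattern *)
    {xs, ts}],
  {100}];                                (* 100 passes through the patterns *)
w                                        (* approaches {0.5, 0.3}, which satisfies all three patterns *)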
Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$. The input function is denoted by $h(x)$. The activation/output function $g(h(x))$ is assumed to be differentiable in $x$.

Remember: Rewriting the Error Function

The definition of the error function (Equation (8)) can simply be rewritten as follows:

$E(w) = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} (T_i^\mu - O_i^\mu)^2 = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} \Big( T_i^\mu - g\Big( \sum_k w_{ki}\, x_k^\mu \Big) \Big)^2$   (16)
Weight Gradients

Consequently, we can compute the $w_{ki}$ gradients:

$\frac{\partial E(w)}{\partial w_{ki}} = -\sum_{\mu=1}^{P} (T_i^\mu - g(h_i^\mu)) \cdot g'(h_i^\mu) \cdot x_k^\mu$   (17)

From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely

$\Delta w_{ki} = \eta\, \delta_i^\mu\, x_k^\mu$   (18)

where

$\delta_i^\mu = (T_i^\mu - O_i^\mu) \cdot g'(h_i^\mu)$.   (19)

Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions:

Hyperbolic tangent:

$g(x) = \tanh \beta x$, $\qquad g'(x) = \beta\, (1 - g^2(x))$   (20)
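The derivative identity in Equation (20) can be checked symbolically in Mathematica (this one-liner is an addition to these notes):

(* Verify g'(x) = b (1 - g(x)^2) for g(x) = Tanh[b x] *)
Simplify[D[Tanh[b x], x] == b (1 - Tanh[b x]^2)]
(* --> True *)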
Hyperbolic tangent plot:

Plot[Tanh[x], {x, -5, 5}];

Plot of the first derivative:

Plot[Tanh'[x], {x, -5, 5}];
Check for equality with $1 - \tanh^2 x$:

Plot[1 - Tanh[x]^2, {x, -5, 5}];

Influence of the $\beta$ parameter:

p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
Sigmoid:

$g(x) = \frac{1}{1 + e^{-2 \beta x}}$, $\qquad g'(x) = 2 \beta\, g(x)\, (1 - g(x))$   (21)

Sigmoid plot:

sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];

Plot of the first derivative:

D[sigmoid[x, b], x]

2 b E^(-2 x b) / (1 + E^(-2 x b))^2
Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];

Check for equality with $2 \cdot g \cdot (1 - g)$:

Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];

Influence of the $\beta$ parameter:

p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
δ Update Rule for Sigmoid Units

Using the sigmoidal activation function (with $2\beta = 1$, so that $g'(h) = g(h)\,(1 - g(h))$), the δ update rule takes the simple form

$\delta_i^\mu = O_i^\mu\, (1 - O_i^\mu)\, (T_i^\mu - O_i^\mu)$,   (22)

which is used in the weight update rule

$\Delta w_{ki} = \eta\, \delta_i^\mu\, x_k^\mu$.   (23)
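A minimal one-step sketch of Equations (22)-(23) for a single sigmoid unit, reusing the sigmoid function defined above with b = 1/2 (the numbers are made-up illustrations, not from the original notes):

(* One delta-rule step for a sigmoid output unit *)
x = {1., 0.5}; t = 0.9;            (* input pattern and target *)
w = {0.2, -0.1}; eta = 0.5;        (* initial weights, learning rate *)
o = sigmoid[w.x, 0.5];             (* forward pass: O = g(h), with 2 beta = 1 *)
d = o (1 - o) (t - o);             (* delta, Equation (22) *)
w = w + eta d x                    (* weight update, Equation (23) *)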
Learning in Multilayer Networks

Multilayer networks with nonlinear processing elements have a wider capability for solving classification tasks. Learning by error backpropagation is a common method to train multilayer networks.

Error Backpropagation

The backpropagation (BP) algorithm describes an update procedure for the set of weights $w$ in a feedforward multilayer network. The network has to learn input-output patterns $\{x_k^\mu, T_i^\mu\}$. The basis for BP learning is, again, a gradient descent technique similar to the one used for perceptron learning, as described above.

Notation

We use the following notation:

- $x_k^\mu$: value of input unit $k$ for training pattern $\mu$; $k = 1, \ldots, N$; $\mu = 1, \ldots, P$
- $H_j^\mu$: output of hidden unit $j$
- $O_i^\mu$: output of output unit $i$, $i = 1, \ldots, M$
- $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
- $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$

Propagating the input through the network

For pattern $\mu$ the hidden unit $j$ receives the input

$h_j^\mu = \sum_{k=1}^{N} w_{kj}\, x_k^\mu$   (24)

and generates the output
$H_j^\mu = g(h_j^\mu) = g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^\mu \Big)$.   (25)

These signals are propagated to the output cells, which receive the signals

$h_i^\mu = \sum_j W_{ji}\, H_j^\mu = \sum_j W_{ji}\, g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^\mu \Big)$   (26)

and generate the output

$O_i^\mu = g(h_i^\mu) = g\Big( \sum_j W_{ji}\, g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^\mu \Big) \Big)$.   (27)

Error function

We use the known quadratic function as our error function:

$E(w) = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} (T_i^\mu - O_i^\mu)^2$   (28)

Continuing the calculations, we get:

$E(w) = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} (T_i^\mu - g(h_i^\mu))^2 = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} \Big( T_i^\mu - g\Big( \sum_j W_{ji}\, g\Big( \sum_{k=1}^{N} w_{kj}\, x_k^\mu \Big) \Big) \Big)^2 = \frac{1}{2} \sum_{\mu=1}^{P} \sum_{i=1}^{M} \Big( T_i^\mu - g\Big( \sum_j W_{ji}\, H_j^\mu \Big) \Big)^2$   (29)
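The forward pass of Equations (24)-(28) can be written compactly in Mathematica. In this sketch (an addition to these notes), the weight matrices, pattern, and target are made-up values, and bigW stands in for the hidden-to-output weights $W_{ji}$:

(* Forward propagation through one hidden layer *)
g[z_] := Tanh[z];                    (* activation function g *)
w    = {{0.1, -0.2}, {0.3, 0.4}};    (* w[[k, j]]: input -> hidden *)
bigW = {{0.5}, {-0.6}};              (* bigW[[j, i]]: hidden -> output *)
x = {1., 0.5}; t = {0.9};            (* one input pattern and its target *)
hid = g[x.w];                        (* H_j = g(Sum_k w_kj x_k), Equations (24)-(25) *)
out = g[hid.bigW];                   (* O_i = g(Sum_j W_ji H_j), Equations (26)-(27) *)
err = (1/2) Total[(t - out)^2]       (* quadratic error, Equation (28) *)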
Updating the weights: hidden → output layer

For the connections from hidden to output cells we can use the delta weight update rule

$\Delta W_{ji} = -\eta\, \frac{\partial E}{\partial W_{ji}}$   (30)

$\Delta W_{ji} = \eta\, (T_i^\mu - O_i^\mu)\, g'(h_i^\mu)\, H_j^\mu = \eta\, \delta_i^\mu\, H_j^\mu$

with

$\delta_i^\mu = g'(h_i^\mu)\, (T_i^\mu - O_i^\mu)$.   (31)

Updating the weights: input → hidden layer

$\Delta w_{kj} = -\eta\, \frac{\partial E}{\partial w_{kj}} = -\eta\, \frac{\partial E}{\partial H_j^\mu} \cdot \frac{\partial H_j^\mu}{\partial w_{kj}}$   (32)

After a few more calculations we get the following weight update rule:

$\Delta w_{kj} = \eta\, \delta_j^\mu\, x_k^\mu$   (33)

with

$\delta_j^\mu = g'(h_j^\mu) \sum_i W_{ji}\, \delta_i^\mu$.   (34)
The Backpropagation Algorithm

For the BP algorithm we use the following notation:

- $V_i^m$: output of cell $i$ in layer $m$
- $V_i^0$: corresponds to $x_i$, the $i$-th input component
- $w_{ji}^m$: the connection from $V_j^{m-1}$ to $V_i^m$

Backpropagation Algorithm

Step 1: Initialize all weights with random values.

Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

$V_j^0 = x_j^\mu, \quad \forall j$   (35)

Step 3: Propagate the signals through all layers:

$V_i^m = g(h_i^m) = g\Big( \sum_j w_{ji}^m\, V_j^{m-1} \Big), \quad \forall i, \forall m$   (36)

Step 4: Calculate the δ's of the output layer $M$:

$\delta_i^M = g'(h_i^M)\, (T_i^\mu - V_i^M)$   (37)

Step 5: Calculate the δ's for the inner layers by error backpropagation:

$\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ij}^m\, \delta_j^m, \quad m = M, M-1, \ldots, 2$   (38)

Step 6: Adapt all connection weights:

$w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji}$ with $\Delta w_{ji} = \eta\, \delta_i^m\, V_j^{m-1}$   (39)

Step 7: Go back to Step 2 for the next training pattern.
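To make Steps 1-7 concrete, here is a minimal Mathematica sketch of one training iteration for a 2-2-1 network with tanh units (an addition to these notes; every weight, pattern, and parameter value is a made-up illustration):

(* One iteration of Steps 2-6 for a network with a single hidden layer *)
g[z_]  := Tanh[z];
gp[z_] := 1 - Tanh[z]^2;                    (* derivative g'(z) *)
eta  = 0.5;
w    = {{0.1, -0.2}, {0.3, 0.4}};           (* Step 1: input -> hidden weights w[[k, j]] *)
bigW = {{0.5}, {-0.6}};                     (* Step 1: hidden -> output weights bigW[[j, i]] *)
x = {1., 0.5}; t = {0.9};                   (* Step 2: select a pattern and its target *)
hHid = x.w;      hid = g[hHid];             (* Step 3: propagate to the hidden layer *)
hOut = hid.bigW; out = g[hOut];             (* Step 3: propagate to the output layer *)
dOut = gp[hOut] (t - out);                  (* Step 4: output-layer deltas, Equation (37) *)
dHid = gp[hHid] (bigW.dOut);                (* Step 5: backpropagated deltas, Equation (38) *)
bigW = bigW + eta Outer[Times, hid, dOut];  (* Step 6: adapt hidden -> output weights *)
w    = w + eta Outer[Times, x, dHid];       (* Step 6: adapt input -> hidden weights *)
(* Step 7: loop Steps 2-6 over all patterns, e.g. inside Do[...]. *)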
References

Freeman, J. A. Simulating Neural Networks with Mathematica. Addison-Wesley, Reading, MA, 1994.

Hertz, J., Krogh, A., and Palmer, R. G. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991.

Rojas, R. Neural Networks: A Systematic Introduction. Springer Verlag, Berlin, 1996.