Feedforward Networks. Gradient Descent Learning and Backpropagation. Christian Jacob. CPSC 533 Winter 2004


Christian Jacob, Dept. of Computer Science, University of Calgary

Adaptive "Programming" of ANNs through Learning

ANN Learning

A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.

Figure 1. Learning process in a parametric system: testing input/output examples, calculating network errors, changing network parameters.

In some learning algorithms, examples of the desired input-output mapping are presented to the network. A correction step is executed iteratively until the network learns to produce the desired response.

Learning Schemes

Unsupervised Learning

For a given input, the exact numerical output a network should produce is unknown. Since no "teacher" is available, the network must organize itself (e.g., in order to associate clusters with units).

Examples: Clustering with self-organizing feature maps, Kohonen networks.

Figure 2. Three clusters and a classifier network

Supervised Learning

Some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected (= learning algorithm) according to the magnitude of the error.

- Error-correction Learning: The magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights. Examples: Perceptron learning, backpropagation.
- Reinforcement Learning: After each presentation of an input-output example we only know whether the network produces the desired result or not. The weights are updated based on this Boolean decision (true or false).

Examples: Learning how to ride a bike.

Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, that is, neurons that can perform linear separations on input patterns (such as the perceptron). The linear network should learn mappings (for $\mu = 1, \ldots, P$ inputs) between

- an input pattern $x^\mu = (x_1^\mu, \ldots, x_N^\mu)$ and
- an associated target pattern $T^\mu$.

In the following example (from the perceptron demo) the input patterns are 20 points $x^1 = (x_1, y_1), \ldots, x^{20} = (x_{20}, y_{20})$ with target patterns $T^1 = \ldots = T^{20} = 0$, and 20 points $x^{21} = (x_{21}, y_{21}), \ldots, x^{40} = (x_{40}, y_{40})$ with target patterns $T^{21} = \ldots = T^{40} = 1$.

[Figure: classifier]


For the following calculations we assume simple network structures like these, which only have an input and an output layer (no hidden layers!):

Figure 3. Perceptron network structure

The actual output $O_i^\mu$ of cell $i$ for the input pattern $x^\mu$ is calculated as

$$O_i^\mu = \sum_k w_{ki} \, x_k^\mu \tag{1}$$

The goal of the learning procedure is that eventually the actual output $O_i^\mu$ for input pattern $x^\mu$ corresponds to the desired output $T_i^\mu$:

$$O_i^\mu \overset{!}{=} T_i^\mu = \sum_k w_{ki} \, x_k^\mu \tag{2}$$

Explicit Solution (Linear Network)*

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

$$w_{ki} = \frac{1}{P} \sum_{\mu, l} T_i^\mu \, (Q^{-1})_{\mu l} \, x_k^l \tag{3}$$

$$Q_{\mu l} = \frac{1}{P} \sum_k x_k^\mu \, x_k^l \tag{4}$$

Correlation Matrix

Here $Q_{\mu l}$ is a component of the correlation matrix $Q$ of the input patterns:

$$Q = \frac{1}{P} \sum_k \begin{pmatrix} x_k^1 x_k^1 & x_k^1 x_k^2 & \cdots & x_k^1 x_k^P \\ \vdots & \vdots & \ddots & \vdots \\ x_k^P x_k^1 & x_k^P x_k^2 & \cdots & x_k^P x_k^P \end{pmatrix} \tag{5}$$

You can check that this is indeed a solution by verifying

$$\sum_k w_{ki} \, x_k^\mu = T_i^\mu. \tag{6}$$

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means, if there are coefficients $a_\mu$, not all zero, such that for all $k = 1, \ldots, N$

$$a_1 x_k^1 + a_2 x_k^2 + \ldots + a_P x_k^P = 0, \tag{7}$$

then the outputs $O_i^\mu$ cannot be selected independently from each other, and the problem is NOT solvable.

Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with $M$ output units. Starting from a random initial weight setting $w_0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(w)$:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - \sum_k w_{ki} \, x_k^\mu \right)^2 \tag{8}$$

$E(w) \geq 0$ approaches zero as $w = \{w_{ki}\}$ approaches a solution of Equation (2). This cost function is a quadratic function in weight space.

Help... HELP! "Are we supposed to understand these math expressions?"

Paraboloid

Therefore, $E(w)$ is a paraboloid with a single global minimum.

    << RealTime3D`

    Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];
    ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

If the pattern vectors are linearly independent, i.e., a solution for Equation (2) exists, the minimum is at $E = 0$.

[Figure: graphical illustration of following the gradient]
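As a minimal sketch of Equations (3), (4), and (8), the following Python fragment builds the correlation matrix for two linearly independent patterns, computes the explicit weights, and evaluates the error. The pattern and target values and all function names are made up for illustration; the matrix inverse is written out by hand only because the example uses P = 2.

```python
# Two linearly independent input patterns (P = 2) with N = 2 components each,
# and a single output unit (M = 1). The numbers are illustrative assumptions.
patterns = [[1.0, 2.0], [3.0, 1.0]]   # x^mu
targets = [[1.0], [0.0]]              # T_i^mu

P, N, M = len(patterns), len(patterns[0]), len(targets[0])


def error(w):
    """Quadratic error of Equation (8); w[k][i] links input k to output i."""
    total = 0.0
    for x, t in zip(patterns, targets):
        for i in range(M):
            o = sum(w[k][i] * x[k] for k in range(N))   # Equation (1)
            total += (t[i] - o) ** 2
    return 0.5 * total


# Correlation matrix Q of Equation (4).
Q = [[sum(patterns[m][k] * patterns[l][k] for k in range(N)) / P
      for l in range(P)] for m in range(P)]

# Inverse of the 2 x 2 matrix Q (exists because the patterns are independent).
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
Q_inv = [[Q[1][1] / det, -Q[0][1] / det],
         [-Q[1][0] / det, Q[0][0] / det]]

# Explicit solution of Equation (3).
w_star = [[sum(targets[m][i] * Q_inv[m][l] * patterns[l][k]
               for m in range(P) for l in range(P)) / P
           for i in range(M)] for k in range(N)]

w_zero = [[0.0] * M for _ in range(N)]
print("E at zero weights:   ", error(w_zero))
print("E at explicit weights:", error(w_star))
```

For linearly dependent patterns the determinant would be zero and the division would fail, which mirrors the caveat attached to Equation (7).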

Finding the Minimum: Following the Gradient

We can find the minimum of $E(w)$ in weight space by following the negative gradient

$$-\nabla_w E(w) = -\frac{\partial E(w)}{\partial w} \tag{9}$$

We can implement this gradient strategy as follows:

Changing a Weight

Each weight $w_{ki} \in w$ is changed by $\Delta w_{ki}$ proportionate to the $E$ gradient at the current weight position (i.e., the current settings of all the weights):

$$\Delta w_{ki} = -\eta \, \frac{\partial E(w)}{\partial w_{ki}} \tag{10}$$

Steps Towards the Solution

$$\Delta w_{ki} = -\eta \, \frac{\partial}{\partial w_{ki}} \left( \frac{1}{2} \sum_{m=1}^{M} \sum_{\mu=1}^{P} \left( T_m^\mu - \sum_n w_{nm} \, x_n^\mu \right)^2 \right)$$

$$\Delta w_{ki} = -\eta \, \frac{1}{2} \sum_{\mu=1}^{P} \frac{\partial}{\partial w_{ki}} \left( \sum_{m=1}^{M} \left( T_m^\mu - \sum_n w_{nm} \, x_n^\mu \right)^2 \right) \tag{11}$$

$$\Delta w_{ki} = -\eta \, \frac{1}{2} \sum_{\mu=1}^{P} 2 \left( T_i^\mu - \sum_n w_{ni} \, x_n^\mu \right) \left( -x_k^\mu \right)$$

Weight Adaptation Rule

$$\Delta w_{ki} = \eta \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right) x_k^\mu \tag{12}$$

The parameter $\eta$ is usually referred to as the learning rate.
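The weight adaptation rule of Equation (12) can be sketched in a few lines of Python. The data set, the learning rate of 0.05, and the step count are assumptions chosen so that the iteration is stable for this particular example; they are not taken from the lecture.

```python
# Batch gradient descent for a linear network, following Equation (12).
patterns = [[1.0, 2.0], [3.0, 1.0]]   # illustrative input patterns x^mu
targets = [[1.0], [0.0]]              # illustrative targets T_i^mu
N, M = 2, 1
eta = 0.05                            # learning rate (assumed value)


def outputs(w, x):
    """Linear outputs O_i = sum_k w[k][i] * x[k], Equation (1)."""
    return [sum(w[k][i] * x[k] for k in range(N)) for i in range(M)]


def error(w):
    """Quadratic error of Equation (8)."""
    return 0.5 * sum((t[i] - outputs(w, x)[i]) ** 2
                     for x, t in zip(patterns, targets) for i in range(M))


def batch_step(w):
    """Accumulate the corrections over all patterns, then apply them once."""
    delta_w = [[0.0] * M for _ in range(N)]
    for x, t in zip(patterns, targets):
        o = outputs(w, x)
        for k in range(N):
            for i in range(M):
                delta_w[k][i] += eta * (t[i] - o[i]) * x[k]
    return [[w[k][i] + delta_w[k][i] for i in range(M)] for k in range(N)]


w = [[0.0] * M for _ in range(N)]
history = [error(w)]
for _ in range(500):
    w = batch_step(w)
    history.append(error(w))

print("initial error:", history[0])
print("final error:  ", history[-1])
print("final weights:", w)
```

A learning rate that is too large for the given patterns makes the same loop diverge, so the value of eta has to be matched to the data.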

In Equation (12), the adaptations of the weights are accumulated over all patterns.

Delta, LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

$$\Delta w_{ki} = \eta \left( T_i^\mu - O_i^\mu \right) x_k^\mu \tag{13}$$

or

$$\Delta w_{ki} = \eta \, \delta_i^\mu \, x_k^\mu \tag{14}$$

with

$$\delta_i^\mu = T_i^\mu - O_i^\mu. \tag{15}$$

This learning rule has several names:

- Delta rule
- Adaline rule
- Widrow-Hoff rule
- LMS (least mean square) rule.

Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique for the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$. The input function is denoted by $h(x)$. The activation/output function $g(h(x))$ is assumed to be differentiable in $x$.
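Before moving on to nonlinear cells, the delta rule of Equations (13) to (15) for linear cells can be sketched as follows. Unlike a batch update, the weights change after every single pattern. The data, the learning rate, and the number of epochs are illustrative assumptions.

```python
# Online (pattern-by-pattern) delta rule for a linear network.
patterns = [[1.0, 2.0], [3.0, 1.0]]   # illustrative input patterns x^mu
targets = [[1.0], [0.0]]              # illustrative targets T_i^mu
N, M = 2, 1
eta = 0.05                            # learning rate (assumed value)

w = [[0.0] * M for _ in range(N)]     # w[k][i]: input k -> output i

for epoch in range(200):
    for x, t in zip(patterns, targets):
        # Linear outputs for the current pattern, Equation (1).
        o = [sum(w[k][i] * x[k] for k in range(N)) for i in range(M)]
        # Deltas of Equation (15).
        delta = [t[i] - o[i] for i in range(M)]
        # Immediate weight update of Equation (14).
        for k in range(N):
            for i in range(M):
                w[k][i] += eta * delta[i] * x[k]

final_outputs = [[sum(w[k][i] * x[k] for k in range(N)) for i in range(M)]
                 for x in patterns]
print("weights:", w)
print("outputs:", final_outputs)
```

Because each update uses only one pattern, no accumulation buffer is needed, which is what makes this form simpler than the batch rule.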


Remember: Why Nonlinear Units?

General Decision Curves

Functions used to discriminate between regions of input space are called decision curves. A neural network must learn to identify these regions and to associate them with the correct classification response.

Figure 4. Non-linear separation of input space

Rewriting the Error Function

The definition of the error function (Equation (8)) can be simply rewritten as follows:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - g\!\left( \sum_k w_{ki} \, x_k^\mu \right) \right)^2 \tag{16}$$

Weight Gradients

Consequently, we can compute the $w_{ki}$ gradients:

$$\frac{\partial E(w)}{\partial w_{ki}} = -\sum_{\mu=1}^{P} \left( T_i^\mu - g(h_i^\mu) \right) \cdot g'(h_i^\mu) \cdot x_k^\mu \tag{17}$$

From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

$$\Delta w_{ki} = \eta \, \delta_i^\mu \, x_k^\mu \tag{18}$$

where

$$\delta_i^\mu = \left( T_i^\mu - O_i^\mu \right) \cdot g'(h_i^\mu) \tag{19}$$

Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions:

Hyperbolic Tangent:

$$g(x) = \tanh \beta x, \qquad g'(x) = \beta \left( 1 - g^2(x) \right) \tag{20}$$

Hyperbolic Tangent Plot:

    Plot[Tanh[x], {x, -5, 5}];

Plot of the first derivative:

    Plot[Tanh'[x], {x, -5, 5}];

Check for equality with $1 - \tanh^2 x$:

    Plot[1 - Tanh[x]^2, {x, -5, 5}];

Influence of the $\beta$ parameter (written b in the code):

    p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];

Sigmoid:

$$g(x) = \frac{1}{1 + e^{-2 \beta x}}, \qquad g'(x) = 2 \beta \, g(x) \left( 1 - g(x) \right) \tag{21}$$

Sigmoid Plot:

    sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
    Plot[sigmoid[x, 1], {x, -5, 5}];

Plot of the first derivative:

    D[sigmoid[x, b], x]
    (* output: 2 b E^(-2 b x) / (1 + E^(-2 b x))^2 *)
    Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];

Check for equality with $2 \cdot g \cdot (1 - g)$:

    Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];

Influence of the $\beta$ parameter (written b in the code):

    p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
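The derivative identities in Equations (20) and (21) can also be checked numerically instead of by plotting. The sketch below compares both closed forms with a central finite difference; the function names, the sample points, and the step size are assumptions made for this illustration. It also checks the relation sigmoid(x) = (1 + tanh(βx)) / 2, which follows directly from the two definitions.

```python
import math


def tanh_unit(x, beta):
    """g(x) = tanh(beta * x), Equation (20)."""
    return math.tanh(beta * x)


def tanh_unit_prime(x, beta):
    """Closed-form derivative beta * (1 - g^2), Equation (20)."""
    g = tanh_unit(x, beta)
    return beta * (1.0 - g * g)


def sigmoid(x, beta):
    """g(x) = 1 / (1 + exp(-2 * beta * x)), Equation (21)."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta * x))


def sigmoid_prime(x, beta):
    """Closed-form derivative 2 * beta * g * (1 - g), Equation (21)."""
    g = sigmoid(x, beta)
    return 2.0 * beta * g * (1.0 - g)


def central_difference(f, x, beta, h=1e-6):
    """Numerical derivative of f with respect to x."""
    return (f(x + h, beta) - f(x - h, beta)) / (2.0 * h)


betas = (0.5, 1.0, 2.0)
points = (-2.0, -0.5, 0.0, 0.5, 2.0)

gap_tanh = max(abs(tanh_unit_prime(x, b) - central_difference(tanh_unit, x, b))
               for b in betas for x in points)
gap_sigmoid = max(abs(sigmoid_prime(x, b) - central_difference(sigmoid, x, b))
                  for b in betas for x in points)
gap_relation = max(abs(sigmoid(x, b) - 0.5 * (1.0 + tanh_unit(x, b)))
                   for b in betas for x in points)

print(gap_tanh, gap_sigmoid, gap_relation)
```

All three gaps should be at the level of floating-point and finite-difference error, which is why both functions are convenient here: the derivative is available from the output value alone.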

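δ Update Rule for Sigmoid Units

Using the sigmoidal activation function, the $\delta$ update rule takes the simple form (for $2\beta = 1$; for other values of $\beta$ the right-hand side carries an additional factor $2\beta$, see Equation (21)):

$$\delta_i^\mu = O_i^\mu \left( 1 - O_i^\mu \right) \left( T_i^\mu - O_i^\mu \right), \tag{22}$$

which is used in the weight update rule:

$$\Delta w_{ki} = \eta \, \delta_i^\mu \, x_k^\mu \tag{23}$$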
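As a small illustration of Equations (22) and (23), the following sketch trains a single sigmoid unit on the Boolean OR function. The choice of task, the constant third input component that acts as a bias, the learning rate, and the number of epochs are all assumptions added for this example.

```python
import math


def g(h):
    """Standard logistic function, i.e. Equation (21) with 2 * beta = 1."""
    return 1.0 / (1.0 + math.exp(-h))


# Boolean OR. The third component is fixed to 1 and serves as a bias input
# (an assumption of this sketch, not part of the slides).
patterns = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
targets = [0.0, 1.0, 1.0, 1.0]
eta = 0.5                      # learning rate (assumed value)
w = [0.0, 0.0, 0.0]            # one weight per input component


def output(weights, x):
    return g(sum(wk * xk for wk, xk in zip(weights, x)))


for epoch in range(5000):
    for x, t in zip(patterns, targets):
        o = output(w, x)
        delta = o * (1.0 - o) * (t - o)                        # Equation (22)
        w = [wk + eta * delta * xk for wk, xk in zip(w, x)]    # Equation (23)

predictions = [output(w, x) for x in patterns]
print("weights:    ", w)
print("predictions:", predictions)
```

The outputs only approach 0 and 1 asymptotically, so a threshold of 0.5 is the natural way to read them as class labels.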

Learning in Multilayer Networks

Multilayer networks with nonlinear processing elements have a wider capability for solving classification tasks. Learning by error backpropagation is a common method to train multilayer networks.

Error Backpropagation

The backpropagation (BP) algorithm describes an update procedure for the set of weights $w$ in a feedforward multilayer network. The network has to learn input-output patterns $\{x_k^\mu, T_i^\mu\}$. The basis for BP learning is, again, a gradient descent technique similar to the one used for perceptron learning, as described above.

Notation

We use the following notation:

- $x_k^\mu$: value of input unit $k$ for training pattern $\mu$; $k = 1, \ldots, N$; $\mu = 1, \ldots, P$
- $H_j$: output of hidden unit $j$
- $O_i$: output of output unit $i$, $i = 1, \ldots, M$
- $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
- $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$

Propagating the input through the network

For pattern $\mu$ the hidden unit $j$ receives the input

$$h_j^\mu = \sum_{k=1}^{N} w_{kj} \, x_k^\mu \tag{24}$$

and generates the output

$$H_j^\mu = g(h_j^\mu) = g\!\left( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \right). \tag{25}$$

These signals are propagated to the output cells, which receive the signals

$$h_i^\mu = \sum_j W_{ji} \, H_j^\mu = \sum_j W_{ji} \, g\!\left( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \right) \tag{26}$$

and generate the output

$$O_i^\mu = g(h_i^\mu) = g\!\left( \sum_j W_{ji} \, g\!\left( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \right) \right) \tag{27}$$

Error function

We use the known quadratic function as our error function:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right)^2 \tag{28}$$

Continuing the calculations, we get:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - g(h_i^\mu) \right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - g\!\left( \sum_j W_{ji} \, g\!\left( \sum_{k=1}^{N} w_{kj} \, x_k^\mu \right) \right) \right)^2 \tag{29}$$

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - g\!\left( \sum_j W_{ji} \, H_j^\mu \right) \right)^2$$

Updating the weights: hidden-to-output layer

For the connections from hidden to output cells we can use the delta weight update rule:

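$$\Delta W_{ji} = -\eta \, \frac{\partial E}{\partial W_{ji}}$$

$$\Delta W_{ji} = \eta \left( T_i^\mu - O_i^\mu \right) g'(h_i^\mu) \, H_j^\mu$$

$$\Delta W_{ji} = \eta \, \delta_i^\mu \, H_j^\mu \tag{30}$$

with

$$\delta_i^\mu = g'(h_i^\mu) \left( T_i^\mu - O_i^\mu \right) \tag{31}$$

Updating the weights: input-to-hidden layer

$$\Delta w_{kj} = -\eta \, \frac{\partial E}{\partial w_{kj}}$$

$$\Delta w_{kj} = -\eta \left( \frac{\partial E}{\partial H_j^\mu} \cdot \frac{\partial H_j^\mu}{\partial w_{kj}} \right) \tag{32}$$

After a few more calculations we get the following weight update rule:

$$\Delta w_{kj} = \eta \, \delta_j^\mu \, x_k^\mu \tag{33}$$

with

$$\delta_j^\mu = g'(h_j^\mu) \sum_i W_{ji} \, \delta_i^\mu \tag{34}$$

The Backpropagation Algorithm

For the BP algorithm we use the following notation:

- $V_i^m$: output of cell $i$ in layer $m$
- $V_i^0$: corresponds to $x_i$, the $i$-th input component
- $w_{ji}^m$: the connection from $V_j^{m-1}$ to $V_i^m$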
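The two-layer update rules of Equations (30) to (34) can be checked against a finite-difference gradient of the error for a single pattern. The sketch below assumes g = tanh (so g' = 1 - g²), a 2-3-2 network without bias units, and arbitrary fixed weights; all of these choices, and all names, are illustrative assumptions.

```python
import math

N, J, M = 2, 3, 2                     # input, hidden, and output units
x = [0.5, -1.0]                       # one illustrative input pattern
T = [0.3, -0.2]                       # its illustrative target
# w[k][j]: input k -> hidden j;  W[j][i]: hidden j -> output i
w = [[0.1 * (k + 1) - 0.05 * j for j in range(J)] for k in range(N)]
W = [[0.2 - 0.1 * j + 0.15 * i for i in range(M)] for j in range(J)]


def forward(w, W):
    """Equations (24) to (27) with g = tanh."""
    H = [math.tanh(sum(w[k][j] * x[k] for k in range(N))) for j in range(J)]
    O = [math.tanh(sum(W[j][i] * H[j] for j in range(J))) for i in range(M)]
    return H, O


def error(w, W):
    """Equation (28) restricted to the single pattern above."""
    _, O = forward(w, W)
    return 0.5 * sum((T[i] - O[i]) ** 2 for i in range(M))


H, O = forward(w, W)
# Equation (31): output deltas, using g'(h) = 1 - tanh(h)^2 = 1 - O^2.
delta_out = [(1.0 - O[i] ** 2) * (T[i] - O[i]) for i in range(M)]
# Equation (34): hidden deltas, obtained by propagating the output deltas back.
delta_hid = [(1.0 - H[j] ** 2) * sum(W[j][i] * delta_out[i] for i in range(M))
             for j in range(J)]


def numeric_gradient(matrix, row, col, eps=1e-6):
    """Central-difference estimate of dE/d(matrix[row][col])."""
    original = matrix[row][col]
    matrix[row][col] = original + eps
    e_plus = error(w, W)
    matrix[row][col] = original - eps
    e_minus = error(w, W)
    matrix[row][col] = original
    return (e_plus - e_minus) / (2.0 * eps)


# Equations (30) and (33) state dE/dW_ji = -delta_i H_j and dE/dw_kj = -delta_j x_k.
gap_W = max(abs(-delta_out[i] * H[j] - numeric_gradient(W, j, i))
            for j in range(J) for i in range(M))
gap_w = max(abs(-delta_hid[j] * x[k] - numeric_gradient(w, k, j))
            for k in range(N) for j in range(J))
max_gap = max(gap_W, gap_w)
print("largest deviation from the numerical gradient:", max_gap)
```

A check of this kind is a common way to catch sign or index errors in a hand-derived gradient before using it for training.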

Figure 5. Propagating signals from the input to the output layer.

Figure 6. Backpropagating error deltas from the output to the input layer.
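The forward pass of Figure 5 and the backward pass of Figure 6 can be combined into a small training loop. The following sketch is one possible reading of the layer notation $V_i^m$, $w_{ji}^m$ under several assumptions that are not in the slides: a logistic activation, an extra constant input of 1 per layer as a bias, a 2-3-1 network trained on XOR, a fixed random seed, and hypothetical function names. Whether XOR is actually solved depends on the initial weights; the sketch only reports the error before and after training.

```python
import math
import random


def g(h):
    """Logistic activation (assumed); its derivative is g * (1 - g)."""
    return 1.0 / (1.0 + math.exp(-h))


def init_weights(sizes, rng):
    """weights[m][j][i]: from unit j of layer m to unit i of layer m + 1.
    Each layer gets one extra row for a constant bias input (assumption)."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(sizes[m + 1])]
             for _ in range(sizes[m] + 1)]
            for m in range(len(sizes) - 1)]


def forward(weights, x):
    """Propagate the signals through all layers; returns all layer outputs V."""
    V = [list(x)]
    for layer in weights:
        prev = V[-1] + [1.0]                      # append the bias input
        V.append([g(sum(layer[j][i] * prev[j] for j in range(len(prev))))
                  for i in range(len(layer[0]))])
    return V


def train_pattern(weights, x, target, eta):
    """One forward pass, one backward pass, one weight update."""
    V = forward(weights, x)
    # Deltas of the output layer.
    delta = [v * (1.0 - v) * (t - v) for v, t in zip(V[-1], target)]
    for m in range(len(weights) - 1, -1, -1):
        layer = weights[m]
        prev = V[m] + [1.0]
        if m > 0:
            # Deltas of the layer below, computed with the not-yet-updated weights.
            delta_below = [V[m][j] * (1.0 - V[m][j]) *
                           sum(layer[j][i] * delta[i] for i in range(len(delta)))
                           for j in range(len(V[m]))]
        # Adapt the weights of this layer.
        for j in range(len(prev)):
            for i in range(len(delta)):
                layer[j][i] += eta * delta[i] * prev[j]
        if m > 0:
            delta = delta_below


def total_error(weights, data):
    return 0.5 * sum((t - v) ** 2
                     for x, target in data
                     for t, v in zip(target, forward(weights, x)[-1]))


data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
rng = random.Random(0)
weights = init_weights([2, 3, 1], rng)

error_before = total_error(weights, data)
for epoch in range(2000):
    for x, target in data:
        train_pattern(weights, x, target, eta=0.5)
error_after = total_error(weights, data)

print("error before training:", error_before)
print("error after training: ", error_after)
```

Computing each layer's deltas before touching that layer's weights keeps the update equivalent to calculating all deltas first and adapting the weights afterwards.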

Backpropagation Algorithm

In the following steps, $M$ denotes the index of the output layer.

- Step 1: Initialize all weights with random values.
- Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

$$V_k^0 = x_k^\mu, \quad \forall k \tag{35}$$

- Step 3: Propagate the signals through all layers:

$$V_i^m = g(h_i^m) = g\!\left( \sum_j w_{ji}^m \, V_j^{m-1} \right), \quad \forall i, \forall m \tag{36}$$

- Step 4: Calculate the $\delta$'s of the output layer:

$$\delta_i^M = g'(h_i^M) \left( T_i^\mu - V_i^M \right) \tag{37}$$

- Step 5: Calculate the $\delta$'s for the inner layers by error backpropagation:

$$\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ij}^m \, \delta_j^m, \quad m = M, M-1, \ldots, 2 \tag{38}$$

- Step 6: Adapt all connection weights:

$$w_{ji}^{\text{new}} = w_{ji}^{\text{old}} + \Delta w_{ji}^m \quad \text{with} \quad \Delta w_{ji}^m = \eta \, \delta_i^m \, V_j^{m-1} \tag{39}$$

- Step 7: Go back to Step 2 for the next training pattern.

Examples

FeedForwardNets.nb

References

Freeman, J. A. Simulating Neural Networks with Mathematica. Addison-Wesley, Reading, MA, 1994.

Hertz, J., Krogh, A., and Palmer, R. G. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA, 1991.

Rojas, R. Neural Networks: A Systematic Introduction. Springer Verlag, Berlin, 1996.
