Gradient Descent Learning and Backpropagation


Artificial Neural Networks (Part 2)

Christian Jacob

Gradient Descent Learning and Backpropagation

CPSC 533, Winter 2001

Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, which we have introduced as perceptron units. The linear network should learn mappings (for $\mu = 1, \ldots, p$) between

- an input pattern $x^\mu = (x_1^\mu, \ldots, x_N^\mu)$ and
- an associated target pattern $T^\mu$.

Figure 1. Perceptron

The output $O_i^\mu$ of cell $i$ for the input pattern $x^\mu$ is calculated as

$$O_i^\mu = \sum_k w_{ik}\, x_k^\mu \tag{1}$$

The goal of the learning procedure is that, eventually, the output $O_i^\mu$ for input pattern $x^\mu$ corresponds to the desired output $T_i^\mu$:

$$O_i^\mu \overset{!}{=} T_i^\mu = \sum_k w_{ik}\, x_k^\mu \tag{2}$$

Explicit Solution (Linear Network)

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

$$w_{ik} = \frac{1}{N} \sum_{\mu,\lambda} T_i^\mu \,(Q^{-1})_{\mu\lambda}\, x_k^\lambda \tag{3}$$

with

$$Q_{\mu\lambda} = \frac{1}{N} \sum_k x_k^\mu\, x_k^\lambda \tag{4}$$

Correlation Matrix

Here $Q_{\mu\lambda}$ is a component of the correlation matrix $Q$ of the input patterns:

$$Q = \frac{1}{N} \begin{pmatrix} \sum_k x_k^1 x_k^1 & \cdots & \sum_k x_k^1 x_k^p \\ \vdots & \ddots & \vdots \\ \sum_k x_k^p x_k^1 & \cdots & \sum_k x_k^p x_k^p \end{pmatrix} \tag{5}$$

You can check that this is indeed a solution by verifying

$$\sum_k w_{ik}\, x_k^\mu = T_i^\mu \tag{6}$$

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means, if there are coefficients $a_\mu$, not all zero, such that for all $k = 1, \ldots, N$

$$a_1 x_k^1 + a_2 x_k^2 + \cdots + a_p x_k^p = 0, \tag{7}$$

then the outputs $O_i^\mu$ cannot be selected independently from each other, and the problem is NOT solvable.

Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with $M$ output units. Starting from a random initial weight setting $w_0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(w)$:


$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \left(T_i^\mu - O_i^\mu\right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \Big(T_i^\mu - \sum_k w_{ik}\, x_k^\mu\Big)^2 \tag{8}$$

$E(w) \ge 0$ approaches zero as $w = \{w_{ik}\}$ comes to satisfy Equation (2). This cost function is a quadratic function in weight space.

Paraboloid

Therefore, $E(w)$ is a paraboloid with a single global minimum.

    << RealTime3D`
    Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

    ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

If the pattern vectors are linearly independent, i.e., a solution for Equation (2) exists, the minimum is at $E = 0$.

Finding the Minimum: Following the Gradient

We can find the minimum of $E(w)$ in weight space by following the negative gradient

$$-\nabla_w E(w) = -\frac{\partial E(w)}{\partial w} \tag{9}$$

We can implement this gradient strategy as follows:

Changing a Weight

Each weight $w_{ik} \in w$ is changed by $\Delta w_{ik}$ in proportion to the gradient of $E$ at the current weight position (i.e., the current settings of all the weights):

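$$\Delta w_{ik} = -\eta\, \frac{\partial E(w)}{\partial w_{ik}} \tag{10}$$

Steps Towards the Solution

$$\begin{aligned}
\Delta w_{ik} &= -\eta\, \frac{\partial}{\partial w_{ik}} \left( \frac{1}{2} \sum_{j=1}^{M} \sum_{\mu=1}^{p} \Big(T_j^\mu - \sum_n w_{jn}\, x_n^\mu\Big)^2 \right) \\
&= -\eta\, \frac{1}{2} \sum_{\mu=1}^{p} \frac{\partial}{\partial w_{ik}} \left( \sum_{j=1}^{M} \Big(T_j^\mu - \sum_n w_{jn}\, x_n^\mu\Big)^2 \right) \\
&= -\eta\, \frac{1}{2} \sum_{\mu=1}^{p} 2 \Big(T_i^\mu - \sum_n w_{in}\, x_n^\mu\Big) \left(-x_k^\mu\right)
\end{aligned} \tag{11}$$

Weight Adaptation Rule

$$\Delta w_{ik} = \eta \sum_{\mu=1}^{p} \left(T_i^\mu - O_i^\mu\right) x_k^\mu \tag{12}$$

The parameter $\eta$ is usually referred to as the learning rate. In this formula, the adaptations of the weights are accumulated over all patterns.

Delta, LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

$$\Delta w_{ik} = \eta \left(T_i^\mu - O_i^\mu\right) x_k^\mu \tag{13}$$

or

$$\Delta w_{ik} = \eta\, \delta_i^\mu\, x_k^\mu \tag{14}$$

with

$$\delta_i^\mu = T_i^\mu - O_i^\mu. \tag{15}$$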
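The per-pattern update of Equations (13) to (15), together with the quadratic error of Equation (8), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the two toy patterns, the learning rate of 0.1, and the random seed are all hypothetical and do not come from the lecture.

```python
import random

def outputs(w, x):
    """Linear unit, Equation (1): O_i = sum_k w[i][k] * x[k]."""
    return [sum(w_ik * x_k for w_ik, x_k in zip(row, x)) for row in w]

def error(w, patterns):
    """Quadratic error, Equation (8), summed over output units and patterns."""
    return 0.5 * sum(
        (t_i - o_i) ** 2
        for x, t in patterns
        for t_i, o_i in zip(t, outputs(w, x))
    )

def delta_rule_epoch(w, patterns, eta):
    """One pass over all patterns, updating w in place after each pattern."""
    for x, t in patterns:
        o = outputs(w, x)
        for i, row in enumerate(w):
            delta_i = t[i] - o[i]                # Equation (15)
            for k in range(len(row)):
                row[k] += eta * delta_i * x[k]   # Equation (14)

# Hypothetical toy data: two linearly independent input patterns, one output.
patterns = [([1.0, 0.0], [1.0]), ([1.0, 1.0], [-1.0])]
rng = random.Random(0)
w = [[rng.uniform(-0.5, 0.5) for _ in range(2)]]

initial_error = error(w, patterns)
for _ in range(500):
    delta_rule_epoch(w, patterns, eta=0.1)
final_error = error(w, patterns)
```

Because the two toy patterns are linearly independent, an exact solution of Equation (2) exists (here w = (1, -2)), so the error can be driven towards zero. With linearly dependent patterns the same loop would only settle near a least-squares compromise.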

This learning rule has several names:

- Delta rule
- Adaline rule
- Widrow-Hoff rule
- LMS (least mean square) rule.

Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$. The input function is denoted by $h(x)$. The output function $g(h(x))$ is assumed to be differentiable in $x$.

Rewriting the Error Function

The definition of the error function (Equation (8)) can simply be rewritten as follows:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \left(T_i^\mu - O_i^\mu\right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \left(T_i^\mu - g\Big(\sum_k w_{ik}\, x_k^\mu\Big)\right)^2 \tag{16}$$

Weight Gradients

Consequently, we can compute the $w_{ik}$ gradients, writing $h_i^\mu = \sum_k w_{ik}\, x_k^\mu$ for the net input of cell $i$:

$$\frac{\partial E(w)}{\partial w_{ik}} = -\sum_{\mu=1}^{p} \left(T_i^\mu - g(h_i^\mu)\right) \cdot g'(h_i^\mu) \cdot x_k^\mu \tag{17}$$
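One way to gain confidence in the gradient formula, namely that the derivative of E with respect to w_ik equals minus the sum over patterns of (T_i - g(h_i)) g'(h_i) x_k, is to compare it with a central finite difference of the error function. The following sketch assumes g = tanh; the function names, the weights, and the two patterns are hypothetical values chosen only for the check.

```python
import math

def g(h):
    return math.tanh(h)

def g_prime(h):
    return 1.0 - math.tanh(h) ** 2

def net_input(w_row, x):
    """h_i = sum_k w_ik * x_k for one cell."""
    return sum(w_ik * x_k for w_ik, x_k in zip(w_row, x))

def error(w, patterns):
    """Quadratic error with nonlinear output g(h), as in Equation (16)."""
    return 0.5 * sum(
        (t[i] - g(net_input(w[i], x))) ** 2
        for x, t in patterns
        for i in range(len(w))
    )

def analytic_gradient(w, patterns, i, k):
    """dE/dw_ik = -sum_mu (T_i - g(h_i)) * g'(h_i) * x_k."""
    total = 0.0
    for x, t in patterns:
        h = net_input(w[i], x)
        total += (t[i] - g(h)) * g_prime(h) * x[k]
    return -total

def numeric_gradient(w, patterns, i, k, eps=1e-6):
    """Central difference approximation of dE/dw_ik; w is restored afterwards."""
    original = w[i][k]
    w[i][k] = original + eps
    e_plus = error(w, patterns)
    w[i][k] = original - eps
    e_minus = error(w, patterns)
    w[i][k] = original
    return (e_plus - e_minus) / (2.0 * eps)

# Hypothetical data: 3 inputs, 2 output cells, 2 patterns.
patterns = [([0.5, -1.0, 2.0], [1.0, -1.0]), ([1.5, 0.2, -0.7], [-1.0, 1.0])]
w = [[0.1, -0.3, 0.2], [0.4, 0.0, -0.5]]

gap = max(
    abs(analytic_gradient(w, patterns, i, k) - numeric_gradient(w, patterns, i, k))
    for i in range(2)
    for k in range(3)
)
```

A small value of `gap` indicates that the closed form and the numerical derivative agree for every weight, including the sign.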

From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ik}$ for $w_{ik}$ has the same form as in Equations (10), (13), and (14), namely:

$$\Delta w_{ik} = \eta\, \delta_i^\mu\, x_k^\mu \tag{18}$$

where

$$\delta_i^\mu = \left(T_i^\mu - O_i^\mu\right) \cdot g'(h_i^\mu) \tag{19}$$

Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions:

Hyperbolic Tangent:

$$g(x) = \tanh \beta x, \qquad g'(x) = \beta \left(1 - g^2(x)\right) \tag{20}$$

Hyperbolic Tangent Plot:

    Plot[Tanh[x], {x, -5, 5}];

Plot of the first derivative:

    Plot[Tanh'[x], {x, -5, 5}];

Check for equality with $1 - \tanh^2 x$:

    Plot[1 - Tanh[x]^2, {x, -5, 5}];

Influence of the $\beta$ parameter:

    p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
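The plots compare the derivative of the hyperbolic tangent with its closed form by eye. The same check can be done numerically; the sketch below compares beta * (1 - g(x)^2) for g(x) = tanh(beta x) against a central finite difference. The function names and the sampled values of x and beta are hypothetical choices for illustration.

```python
import math

def g(x, beta):
    """Hyperbolic tangent activation g(x) = tanh(beta * x)."""
    return math.tanh(beta * x)

def g_prime(x, beta):
    """Closed-form derivative beta * (1 - g(x)^2)."""
    return beta * (1.0 - g(x, beta) ** 2)

def numeric_derivative(f, x, eps=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Largest disagreement between closed form and finite difference
# over a few sample points and steepness values.
max_gap = max(
    abs(g_prime(x, beta) - numeric_derivative(lambda s: g(s, beta), x))
    for beta in (0.1, 0.5, 1.0, 5.0)
    for x in (-5.0, -1.0, 0.0, 0.3, 2.0, 5.0)
)
```

The practical benefit of the closed form is that the derivative needed for the delta terms can be computed from the cell's output alone, without re-evaluating the activation function.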

Sigmoid:

$$g(x) = \frac{1}{1 + e^{-2 \beta x}} \tag{21}$$

$$g'(x) = 2 \beta\, g(x) \left(1 - g(x)\right)$$

Sigmoid plot:

    sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
    Plot[sigmoid[x, 1], {x, -5, 5}];

Plot of the first derivative:

    D[sigmoid[x, b], x]

$$\frac{2\, b\, e^{-2 x b}}{\left(1 + e^{-2 x b}\right)^2}$$

    Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];

Check for equality with $2 \cdot g \cdot (1 - g)$:

    Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];

Influence of the $\beta$ parameter:

    p1[b_] := Plot[sigmoid[x, b], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    p2[b_] := Plot[D[sigmoid[x, b], x] // Evaluate, {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];
    Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];
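As with the hyperbolic tangent, the sigmoid identities can be checked numerically. The sketch below compares the short form 2 beta g (1 - g) with the explicit quotient 2 b e^(-2xb) / (1 + e^(-2xb))^2, and also confirms the standard relation g(x) = (1 + tanh(beta x)) / 2 between the two activation functions. The function names and sample points are hypothetical.

```python
import math

def sigmoid(x, beta):
    """Sigmoid activation g(x) = 1 / (1 + exp(-2 * beta * x))."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta * x))

def sigmoid_prime(x, beta):
    """Derivative expressed through the output: 2 * beta * g * (1 - g)."""
    g = sigmoid(x, beta)
    return 2.0 * beta * g * (1.0 - g)

def sigmoid_prime_explicit(x, beta):
    """Derivative written out as 2 b e^(-2xb) / (1 + e^(-2xb))^2."""
    e = math.exp(-2.0 * beta * x)
    return 2.0 * beta * e / (1.0 + e) ** 2

samples = [(x, beta) for beta in (0.1, 1.0, 5.0) for x in (-5.0, -0.5, 0.0, 1.0, 5.0)]

# Largest gap between the two derivative forms.
derivative_gap = max(
    abs(sigmoid_prime(x, beta) - sigmoid_prime_explicit(x, beta))
    for x, beta in samples
)

# Largest gap between the sigmoid and its tanh representation.
tanh_gap = max(
    abs(sigmoid(x, beta) - 0.5 * (1.0 + math.tanh(beta * x)))
    for x, beta in samples
)
```

The tanh relation explains why both activation functions lead to delta terms of the same shape: the sigmoid is a shifted and rescaled hyperbolic tangent with outputs in (0, 1) instead of (-1, 1).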

δ Update Rule for Sigmoid Units

Using the sigmoidal activation function, the $\delta$ update rule takes the simple form:

$$\delta_i^\mu = O_i^\mu \left(1 - O_i^\mu\right) \left(T_i^\mu - O_i^\mu\right). \tag{22}$$

(Equation (21) gives $g' = 2\beta\, g\,(1 - g)$; the constant factor $2\beta$ is omitted here, which amounts to choosing $\beta = 1/2$ or absorbing the factor into the learning rate $\eta$.)

Learning in Multilayer Networks

Multilayer networks with nonlinear processing elements have a wider capability for solving classification tasks. Learning by error backpropagation is a common method to train multilayer networks.

Error Backpropagation

The backpropagation (BP) algorithm describes an update procedure for the set of weights $w$ in a feedforward multilayer network. The network has to learn input-output patterns $\{x_k^\mu, T_i^\mu\}$. The basis for BP learning is, again, a gradient descent technique similar to the one used for perceptron learning, as described above.

Notation

We use the following notation:

- $x_k^\mu$: value of input unit $k$ for training pattern $\mu$; $k = 1, \ldots, N$; $\mu = 1, \ldots, p$
- $H_j$: output of hidden unit $j$

- $O_i$: output of output unit $i$, $i = 1, \ldots, M$
- $w_{kj}$: weight of the link from input unit $k$ to hidden unit $j$
- $W_{ji}$: weight of the link from hidden unit $j$ to output unit $i$

Propagating the input through the network

For pattern $\mu$ the hidden unit $j$ receives the input

$$h_j^\mu = \sum_{k=1}^{N} w_{kj}\, x_k^\mu \tag{23}$$

and generates the output

$$H_j^\mu = g(h_j^\mu) = g\Big(\sum_{k=1}^{N} w_{kj}\, x_k^\mu\Big). \tag{24}$$

These signals are propagated to the output cells, which receive the signals

$$h_i^\mu = \sum_j W_{ji}\, H_j^\mu = \sum_j W_{ji}\, g\Big(\sum_{k=1}^{N} w_{kj}\, x_k^\mu\Big) \tag{25}$$

and generate the output

$$O_i^\mu = g(h_i^\mu) = g\left(\sum_j W_{ji}\, g\Big(\sum_{k=1}^{N} w_{kj}\, x_k^\mu\Big)\right) \tag{26}$$

Error function

We use the known quadratic function as our error function:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \left(T_i^\mu - O_i^\mu\right)^2 \tag{27}$$

Continuing the calculations, we get:

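$$\begin{aligned}
E(w) &= \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \left(T_i^\mu - g(h_i^\mu)\right)^2 \\
&= \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \left(T_i^\mu - g\Big(\sum_j W_{ji}\, H_j^\mu\Big)\right)^2 \\
&= \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{p} \left(T_i^\mu - g\bigg(\sum_j W_{ji}\, g\Big(\sum_{k=1}^{N} w_{kj}\, x_k^\mu\Big)\bigg)\right)^2
\end{aligned} \tag{28}$$

Updating the weights: hidden → output layer

For the connections from hidden to output cells we can use the delta weight update rule:

$$\begin{aligned}
\Delta W_{ji} &= -\eta\, \frac{\partial E}{\partial W_{ji}} \\
&= \eta \sum_{\mu} \left(T_i^\mu - O_i^\mu\right) g'(h_i^\mu)\, H_j^\mu \\
&= \eta \sum_{\mu} \delta_i^\mu\, H_j^\mu
\end{aligned} \tag{29}$$

with

$$\delta_i^\mu = g'(h_i^\mu) \left(T_i^\mu - O_i^\mu\right) \tag{30}$$

Updating the weights: input → hidden layer

$$\begin{aligned}
\Delta w_{kj} &= -\eta\, \frac{\partial E}{\partial w_{kj}} \\
&= -\eta \sum_{\mu} \left( \frac{\partial E}{\partial H_j^\mu} \cdot \frac{\partial H_j^\mu}{\partial w_{kj}} \right)
\end{aligned} \tag{31}$$

After a few more calculations we get the following weight update rule: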

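$$\Delta w_{kj} = \eta \sum_{\mu} \delta_j^\mu\, x_k^\mu \tag{32}$$

with

$$\delta_j^\mu = g'(h_j^\mu) \sum_i W_{ji}\, \delta_i^\mu \tag{33}$$

The Backpropagation Algorithm

For the BP algorithm we use the following notation:

- $V_i^m$: output of cell $i$ in layer $m$
- $V_i^0$: corresponds to $x_i$, the $i$-th input component
- $w_{ij}^m$: the connection from $V_j^{m-1}$ to $V_i^m$

Backpropagation Algorithm

- Step 1: Initialize all weights with random values.
- Step 2: Select a pattern $x^\mu$ and attach it to the input layer ($m = 0$):

$$V_k^0 = x_k^\mu, \quad \forall k \tag{34}$$

- Step 3: Propagate the signals through all layers:

$$V_i^m = g(h_i^m) = g\Big(\sum_j w_{ij}^m\, V_j^{m-1}\Big), \quad \forall i, \forall m \tag{35}$$

- Step 4: Calculate the $\delta$'s of the output layer:

$$\delta_i^M = g'(h_i^M) \left(T_i^\mu - V_i^M\right) \tag{36}$$

- Step 5: Calculate the $\delta$'s for the inner layers by error backpropagation:

$$\delta_i^{m-1} = g'(h_i^{m-1}) \sum_j w_{ji}^m\, \delta_j^m, \quad m = M, M-1, \ldots, 2 \tag{37}$$

- Step 6: Adapt all connection weights:

$$w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \Delta w_{ij} \quad \text{with} \quad \Delta w_{ij}^m = \eta\, \delta_i^m\, V_j^{m-1} \tag{38}$$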

- Step 7: Go back to Step 2 for the next training pattern.

Examples

- TC Learning Task
- XOR
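The seven steps of the backpropagation algorithm can be sketched in plain Python and tried on the XOR task. Several choices below are assumptions made for illustration and are not taken from the lecture: the tanh activation with steepness 1, the ±1 encoding of inputs and targets, a constant bias input of 1 appended to every layer (without it, a network of odd activation functions cannot represent XOR in this encoding), the 2-3-1 layer sizes, the learning rate, the seed, and all function names. In the code, `weights[m][i][j]` is the connection from unit j of layer m to unit i of layer m + 1.

```python
import math
import random

def g(h):
    return math.tanh(h)

def g_prime_from_output(v):
    """tanh'(h) written through the output v = tanh(h): 1 - v^2."""
    return 1.0 - v * v

def init_weights(sizes, rng):
    """Step 1: random weights; each row has one extra entry for the bias input."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(weights, x):
    """Steps 2-3: attach x to layer 0 and propagate; returns all layer outputs."""
    layers = [list(x)]
    for w in weights:
        v_prev = layers[-1] + [1.0]  # constant bias input (assumption)
        layers.append([g(sum(w_ij * v_j for w_ij, v_j in zip(row, v_prev))) for row in w])
    return layers

def backprop_deltas(weights, layers, target):
    """Steps 4-5: deltas[m] belongs to layers[m + 1]; output layer is computed first."""
    deltas = [[g_prime_from_output(v) * (t - v) for v, t in zip(layers[-1], target)]]
    for m in range(len(weights) - 1, 0, -1):
        w_next, d_next = weights[m], deltas[0]
        deltas.insert(0, [
            g_prime_from_output(v_i)
            * sum(w_next[j][i] * d_next[j] for j in range(len(d_next)))
            for i, v_i in enumerate(layers[m])
        ])
    return deltas

def update(weights, layers, deltas, eta):
    """Step 6: w_ij += eta * delta_i * V_j, in place."""
    for m, w in enumerate(weights):
        v_prev = layers[m] + [1.0]
        for i, row in enumerate(w):
            for j in range(len(row)):
                row[j] += eta * deltas[m][i] * v_prev[j]

def total_error(weights, patterns):
    """Quadratic error summed over all patterns and output units."""
    return 0.5 * sum((t_i - o_i) ** 2
                     for x, t in patterns
                     for t_i, o_i in zip(t, forward(weights, x)[-1]))

# XOR in a +/-1 encoding (assumption).
xor_patterns = [([-1.0, -1.0], [-1.0]), ([-1.0, 1.0], [1.0]),
                ([1.0, -1.0], [1.0]), ([1.0, 1.0], [-1.0])]

rng = random.Random(1)
weights = init_weights([2, 3, 1], rng)
initial_error = total_error(weights, xor_patterns)
for _ in range(3000):
    for x, t in xor_patterns:          # Step 7: next training pattern
        layers = forward(weights, x)
        deltas = backprop_deltas(weights, layers, t)
        update(weights, layers, deltas, eta=0.1)
final_error = total_error(weights, xor_patterns)
```

Gradient descent of this kind usually, but not always, finds weights that solve XOR; depending on the initial weights it can also stall in a local minimum or on a plateau, so comparing `final_error` with `initial_error` is a more honest check than expecting a fixed result.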
