EEE: Linear Systems. Summary: Backpropagation

BACKPROPAGATION

The perceptron rule and the Widrow-Hoff learning rule were designed to train single-layer networks. They suffer from the same disadvantage: they can only solve linearly separable classification problems. This implies that some simple classification problems, such as XOR, cannot be solved with a single-layer network. The obvious solution is to use multi-layer networks. Backpropagation is a learning algorithm that can be used to train multiple-layer networks.

MULTILAYER PERCEPTRON

Multilayer networks are shown in figure 1. The network consists of three cascaded layers; the output of the first layer is the input of the second layer, and so on. Superscripts are used to indicate the layer. In general, the following notation is used to describe a network of M layers:

N - L^1 - L^2 - L^3 - ... - L^M    (1)

where N refers to the number of inputs and L^m to the number of neurons in layer m.

PATTERN CLASSIFICATION

The exclusive-OR (XOR) problem cannot be solved by a single-layer network. This example was used by Minsky and Papert (1969) to demonstrate the limitations of single-layer networks. To illustrate this, consider the input/output pairs of the XOR gate:

{x_1 = [0, 0]^T, t_1 = 0}    (2)
{x_2 = [0, 1]^T, t_2 = 1}    (3)
{x_3 = [1, 0]^T, t_3 = 1}    (4)
{x_4 = [1, 1]^T, t_4 = 0}    (5)

A representation of these pairs is shown in figure 2. A two-layer network can solve the problem easily. One possibility is to use a network whose first layer creates two linear decision boundaries and whose second layer combines the two boundaries, as shown in figure 2 (top).

Example. Let us take the input x_1 = 1, x_2 = 0. Using the weights and biases of the two-layer network (figure 2), the first layer gives

n^1_1 = w^1_{1,1}(1) + w^1_{1,2}(0) + b^1_1,   y^1_1 = f^1(n^1_1)    (6)
n^1_2 = w^1_{2,1}(1) + w^1_{2,2}(0) + b^1_2,   y^1_2 = f^1(n^1_2)    (7)

and the second layer combines the two outputs:

n^2 = w^2_{1,1} y^1_1 + w^2_{1,2} y^1_2 + b^2,   y^2 = f^2(n^2) = 1    (8)

which confirms the XOR operation.

FUNCTION APPROXIMATION

The other important application of neural networks is function approximation (particularly of nonlinear functions). Function approximation can be accomplished using multiple layers. Consider the 1-2-1 network shown in figure 1 (bottom). The network parameters are the first-layer weights and biases

W^1 = [w^1_{1,1}; w^1_{2,1}],   b^1 = [b^1_1; b^1_2]    (9)-(10)

and the second-layer weights and bias

W^2 = [w^2_{1,1}  w^2_{1,2}],   b^2    (11)-(12)

set to fixed numerical values. This network gives the nonlinear function shown in figure 3. By changing the weights and biases we can obtain other functions of similar complexity. It has been shown that a multilayer network with a sigmoid transfer function in the hidden layer and a sufficiently large number of hidden neurons can approximate any function of interest. The question now is how to derive a learning algorithm for multiple-layer networks. This is discussed in the next section.

BACKPROPAGATION

Backpropagation is a learning algorithm for multiple-layer networks. It uses an optimization technique called steepest (or gradient) descent. We continue to use the notation where the superscript indicates the layer. The equation describing the output of layer m+1 is

y^{m+1} = f^{m+1}(W^{m+1} y^m + b^{m+1})    (13)

for m = 0, 1, ..., M-1, where M is the total number of layers and y^0 = x is the network input.

In a similar way to the Widrow-Hoff algorithm, backpropagation uses the mean square error as the performance index. The algorithm is provided with pairs of training data

(x_1, t_1), (x_2, t_2), ..., (x_K, t_K)

where x_k is an input to the network and t_k is the target output corresponding to x_k. The algorithm adjusts the weights and biases to minimize the expectation E of the square error, that is

F(W) = E[e^2] = E[(t - y)^2]    (14)

where W in (14) refers to the augmented weight vector (weights and biases). The expectation of the square error is replaced by the squared error at iteration k as follows:

\hat{F}(W) = (t - y)^T (t - y)    (15)
           = e^T e    (16)
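As a concrete check of the two-layer XOR solution and of the layer recursion (13), the short sketch below propagates the four input patterns (2)-(5) through a 2-2-1 network with hard-limit neurons. The specific weights and biases are an illustrative assumption (one of many choices that creates two decision boundaries and then combines them); they are not necessarily the values used in figure 2.

```python
import numpy as np

def hardlim(n):
    # Hard-limit transfer function: 1 if n >= 0, else 0.
    return (n >= 0).astype(float)

def forward(x, layers):
    # Layer recursion of equation (13): y^{m+1} = f^{m+1}(W^{m+1} y^m + b^{m+1}), with y^0 = x.
    y = x
    for W, b, f in layers:
        y = f(W @ y + b)
    return y

# Assumed two-layer XOR network (2-2-1): the first layer builds an OR-like and a
# NAND-like decision boundary, the second layer ANDs the two boundary outputs.
layers = [
    (np.array([[1.0, 1.0], [-1.0, -1.0]]), np.array([-0.5, 1.5]), hardlim),
    (np.array([[1.0, 1.0]]),               np.array([-1.5]),      hardlim),
]

for x, t in [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]:
    y = forward(np.array(x, dtype=float), layers)
    print(x, "->", int(y[0]), "target", t)   # reproduces the XOR truth table
```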
Fig. 1. Multilayer networks. Top: general case; bottom: network for function approximation.

Fig. 2. XOR implementation.

Fig. 3. Function approximation.

The steepest descent algorithm consists of the following equations:

w^m_{i,j}(k+1) = w^m_{i,j}(k) - α ∂\hat{F}/∂w^m_{i,j}
b^m_i(k+1) = b^m_i(k) - α ∂\hat{F}/∂b^m_i    (17)

where α is the learning rate. The calculation of the partial derivatives was straightforward in the case of single-layer networks; it is not in the multiple-layer case. Because the error is only an indirect function of the weights in the hidden layers, the chain rule is needed to compute these derivatives.
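Before the chain-rule machinery is developed, the update rule (17) can already be tried out by estimating the partial derivatives numerically. The sketch below uses central finite differences on the squared error (15)-(16) of a tiny 1-2-1 network; this finite-difference approach is only a stand-in for illustration, not the method developed in these notes, and the network shape, transfer functions, training pair and learning rate are all assumptions chosen to make the example self-contained.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def squared_error(params, x, t):
    # F_hat of (15)-(16) for an assumed 1-2-1 network: logsig hidden layer, linear output.
    w1, b1, w2, b2 = params[0:2], params[2:4], params[4:6], params[6]
    y1 = logsig(w1 * x + b1)          # first layer (2 neurons, element-wise)
    y2 = float(w2 @ y1 + b2)          # second layer (linear)
    return (t - y2) ** 2

def numerical_gradient(f, params, eps=1e-6):
    # Central-difference estimate of dF_hat/dw for every weight and bias.
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

x, t = 1.0, 1.0                                  # one training pair (assumed)
params = np.array([-0.3, 0.4, -0.5, 0.1, 0.1, -0.2, 0.5])
alpha = 0.1                                      # learning rate of equation (17)

for k in range(3):
    f = lambda p: squared_error(p, x, t)
    params = params - alpha * numerical_gradient(f, params)   # update (17)
    print(f"iteration {k}: squared error = {f(params):.4f}")  # error decreases
```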
The chain rule can be summarized as follows: assume f is a function of x and x is a function of t; then the derivative of f with respect to t is

df/dt = (df/dx)(dx/dt)    (18)

Using the chain rule, the partial derivatives in (17) become

∂\hat{F}/∂w^m_{i,j} = (∂\hat{F}/∂n^m_i)(∂n^m_i/∂w^m_{i,j})    (19)
∂\hat{F}/∂b^m_i = (∂\hat{F}/∂n^m_i)(∂n^m_i/∂b^m_i)    (20)

For each layer, we know that the net input n^m_i is a weighted sum of the inputs plus a bias, that is

n^m_i = Σ_{j=1}^{L^{m-1}} w^m_{i,j} y^{m-1}_j + b^m_i    (21)

where L^{m-1} is the number of neurons in layer m-1. Therefore, it is possible to calculate the partial derivatives as follows:

∂n^m_i/∂w^m_{i,j} = y^{m-1}_j    (22)
∂n^m_i/∂b^m_i = 1    (23)

We define the sensitivity of \hat{F} as follows:

s^m_i ≡ ∂\hat{F}/∂n^m_i    (24)

In this case, we can write

∂\hat{F}/∂w^m_{i,j} = s^m_i y^{m-1}_j    (25)
∂\hat{F}/∂b^m_i = s^m_i    (26)

Finally, the gradient descent algorithm becomes

w^m_{i,j}(k+1) = w^m_{i,j}(k) - α s^m_i y^{m-1}_j    (27)
b^m_i(k+1) = b^m_i(k) - α s^m_i    (28)

or, in matrix form,

W^m(k+1) = W^m(k) - α s^m (y^{m-1})^T    (29)
b^m(k+1) = b^m(k) - α s^m    (30)

where the sensitivity vector is given by

s^m = ∂\hat{F}/∂n^m    (31)

To apply the weight and bias updates (29) and (30), we need the previous weights and biases as well as the sensitivities. In other words, the sensitivities must be calculated first in order to obtain the next values of the weights and biases.
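Given the sensitivities, the matrix-form update (29)-(30) is applied independently to each layer. Below is a minimal sketch of that update; the function name and the numbers are assumptions for illustration, and the computation of the sensitivities themselves is the subject of the next section.

```python
import numpy as np

def update_layer(W, b, s, y_prev, alpha):
    """Steepest-descent update of equations (29)-(30) for one layer.

    W      : (L_m x L_{m-1}) weight matrix of layer m
    b      : (L_m,) bias vector of layer m
    s      : (L_m,) sensitivity vector s^m = dF_hat/dn^m
    y_prev : (L_{m-1},) output y^{m-1} of the previous layer
    alpha  : learning rate
    """
    W_new = W - alpha * np.outer(s, y_prev)   # W^m(k+1) = W^m(k) - alpha * s^m (y^{m-1})^T
    b_new = b - alpha * s                     # b^m(k+1) = b^m(k) - alpha * s^m
    return W_new, b_new

# Illustrative numbers (assumed): a layer with 2 neurons and 1 input.
W = np.array([[0.3], [-0.2]])
b = np.array([0.1, 0.4])
s = np.array([0.05, -0.02])     # sensitivities, assumed already computed
y_prev = np.array([1.0])        # output of the previous layer (here: the network input)
print(update_layer(W, b, s, y_prev, alpha=0.1))
```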
BACKPROPAGATING THE SENSITIVITIES

Fig. 4. Illustration of the relationship between n^{m+1} and n^m.

The chain rule is used for this purpose: the sensitivity at layer m is computed from the sensitivity at layer m+1 (the relationship between n^m and n^{m+1} is illustrated in figure 4). The starting point is the Jacobian matrix

∂n^{m+1}/∂n^m = [ ∂n^{m+1}_1/∂n^m_1          ...  ∂n^{m+1}_1/∂n^m_{L^m}
                  ...                              ...
                  ∂n^{m+1}_{L^{m+1}}/∂n^m_1   ...  ∂n^{m+1}_{L^{m+1}}/∂n^m_{L^m} ]    (32)

We want to find an expression for each element of this matrix. Element (i, j) of the Jacobian is

∂n^{m+1}_i/∂n^m_j = ∂( Σ_{l=1}^{L^m} w^{m+1}_{i,l} y^m_l + b^{m+1}_i ) / ∂n^m_j    (33)

Note that, in general, it is possible to write

∂n^{m+1}_i/∂n^m_j = (∂n^{m+1}_i/∂y^m_j)(∂y^m_j/∂n^m_j)    (34)

Thus, we get for each element of the Jacobian matrix

∂n^{m+1}_i/∂n^m_j = w^{m+1}_{i,j} ∂y^m_j/∂n^m_j = w^{m+1}_{i,j} ∂f^m(n^m_j)/∂n^m_j    (35)-(36)
                  = w^{m+1}_{i,j} g^m(n^m_j)    (37)

where

g^m(n^m_j) ≡ ∂f^m(n^m_j)/∂n^m_j    (38)

Therefore, the Jacobian matrix can be written in matrix form as

∂n^{m+1}/∂n^m = W^{m+1} G^m(n^m)    (39)

with

G^m(n^m) = diag( g^m(n^m_1), g^m(n^m_2), g^m(n^m_3), ..., g^m(n^m_{L^m}) )    (40)

Now we can write a recurrence relationship for the sensitivities:

s^m = ∂\hat{F}/∂n^m = ( ∂n^{m+1}/∂n^m )^T ∂\hat{F}/∂n^{m+1}    (41)

thus

s^m = G^m(n^m) [W^{m+1}]^T ∂\hat{F}/∂n^{m+1}    (42)
    = G^m(n^m) [W^{m+1}]^T s^{m+1}    (43)

Equation (43) shows that the sensitivities are propagated backwards through the network, from the last layer to the first layer:

s^M → s^{M-1} → ... → s^2 → s^1    (44)

Let us take a look at the final layer:

s^M_i = ∂\hat{F}/∂n^M_i = ∂[ (t - y)^T (t - y) ]/∂n^M_i = ∂[ Σ_{j=1}^{L^M} (t_j - y_j)^2 ]/∂n^M_i = -2 (t_i - y_i) ∂y_i/∂n^M_i    (45)

For the last layer,

∂y_i/∂n^M_i = ∂f^M(n^M_i)/∂n^M_i = g^M(n^M_i)    (46)

therefore

s^M_i = -2 (t_i - y_i) g^M(n^M_i)    (47)

or, in matrix form,

s^M = -2 G^M(n^M)(t - y)    (48)

The algorithm consists of three steps:

1) Forward propagation of the input through the network:
   y^{m+1} = f^{m+1}(W^{m+1} y^m + b^{m+1}),   m = 0, 1, 2, ..., M-1    (49)
   with y^0 = x and y = y^M.    (50)

2) Backward propagation of the sensitivities, starting at the output layer:
   s^M = -2 G^M(n^M)(t - y^M)    (51)
   s^m = G^m(n^m)(W^{m+1})^T s^{m+1},   m = M-1, ..., 2, 1    (52)

3) Update of the weights and biases, for m = 1, 2, ..., M:
   W^m(k+1) = W^m(k) - α s^m (y^{m-1})^T    (53)
   b^m(k+1) = b^m(k) - α s^m    (54)

Example: Function approximation

We propose to use a 1-2-1 network to approximate the nonlinear function

g(x) = sin(πx/2)    (55)

in the interval x ∈ [-1, 1]. The backpropagation algorithm converges to a set of weights and biases W^1, b^1, W^2, b^2 for which the network output closely follows g(x); the actual function and the neural network approximation are compared in figure 5.

Fig. 5. Neural network approximation of the function given by (55).
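A complete training loop for this example can be assembled from the three steps (49)-(54). The sketch below is one such implementation under stated assumptions: the random initialisation, learning rate, number of passes and sampling of the interval are choices made only for the illustration, so the solution it reaches need not coincide with the one referred to above.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def target(x):
    # Target function of equation (55).
    return np.sin(np.pi * x / 2.0)

# Assumed 1-2-1 network: small random initial weights and biases.
W1 = rng.uniform(-0.5, 0.5, size=(2, 1))
b1 = rng.uniform(-0.5, 0.5, size=(2, 1))
W2 = rng.uniform(-0.5, 0.5, size=(1, 2))
b2 = rng.uniform(-0.5, 0.5, size=(1, 1))

alpha = 0.05                              # learning rate (assumed)
samples = np.linspace(-1.0, 1.0, 21)      # training points in [-1, 1]

for epoch in range(2000):
    for x in rng.permutation(samples):
        x0 = np.array([[x]])
        t = np.array([[target(x)]])

        # 1) Forward propagation, equations (49)-(50).
        n1 = W1 @ x0 + b1
        y1 = logsig(n1)
        n2 = W2 @ y1 + b2
        y2 = n2                            # linear output layer

        # 2) Backward propagation of the sensitivities, equations (51)-(52).
        s2 = -2.0 * (t - y2)               # G^2 = 1 for a linear layer
        G1 = np.diagflat(y1 * (1.0 - y1))  # log-sigmoid derivative on the diagonal
        s1 = G1 @ W2.T @ s2

        # 3) Update of the weights and biases, equations (53)-(54).
        W2 -= alpha * s2 @ y1.T
        b2 -= alpha * s2
        W1 -= alpha * s1 @ x0.T
        b1 -= alpha * s1

# Compare the trained network with the target function on a few points.
for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    y1 = logsig(W1 * x + b1)
    y2 = (W2 @ y1 + b2).item()
    print(f"x = {x:+.1f}   network = {y2:+.3f}   target = {target(x):+.3f}")
```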
Example

This example illustrates one iteration of the backpropagation algorithm. The network is the 1-2-1 network shown in figure 1 (bottom): layer 1 has a log-sigmoid transfer function and layer 2 has a linear transfer function. The iteration starts from an initial guess for the weights and biases, W^1(0), b^1(0), W^2(0), b^2(0), and from a training pair (x, t) with t = g(x).

Forward propagation

The input x is applied to the first layer:

n^1 = W^1 x + b^1    (56)
y^1 = f^1(n^1) = 1 / (1 + e^{-n^1})   (element by element)    (57)

The output of the second (linear) layer is

n^2 = W^2 y^1 + b^2    (58)
y^2 = f^2(n^2) = n^2    (59)

and the error is

e = t - y^2 = sin(πx/2) - y^2    (60)

Propagation of the sensitivities

The goal is to obtain the matrix G^m for each of the two layers. For layer 1 we have

g^1(n) = d/dn [ 1/(1 + e^{-n}) ] = e^{-n}/(1 + e^{-n})^2 = (1 - y)(y)    (61)

For the second layer we have

g^2(n) = 1    (62)

Now it is possible to write for the sensitivities, starting with the output layer,

s^2 = -2 G^2(n^2)(t - y^2) = -2e    (63)

The first-layer sensitivity is obtained by backpropagation:

s^1 = G^1(n^1)[W^2]^T s^2,   with G^1(n^1) = diag( (1 - y^1_1) y^1_1, (1 - y^1_2) y^1_2 )    (64)

Updating the weights and biases

The final step consists of updating the weights and biases with a small learning rate α:

W^2(1) = W^2(0) - α s^2 (y^1)^T    (65)
b^2(1) = b^2(0) - α s^2    (66)
W^1(1) = W^1(0) - α s^1 (x)^T    (67)
b^1(1) = b^1(0) - α s^1    (68)

Using these equations with the initial guess gives the new weights and biases; the difference from the initial guess is small because the learning rate is small.
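The only transfer-function derivative needed in this example is that of the log-sigmoid, used in (61) above. The short check below confirms numerically that g^1(n) = y(1 - y) with y = 1/(1 + e^{-n}), and assembles the corresponding diagonal matrix G^1(n^1); the particular values of n^1 are assumed for the illustration.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def g1(n):
    # Analytic derivative of the log-sigmoid, equation (61): g^1(n) = y (1 - y).
    y = logsig(n)
    return y * (1.0 - y)

# Finite-difference check of the identity at a few points.
eps = 1e-6
for n in (-2.0, -0.75, 0.0, 0.54, 2.0):
    numeric = (logsig(n + eps) - logsig(n - eps)) / (2 * eps)
    print(f"n = {n:+.2f}   y(1-y) = {g1(n):.6f}   finite difference = {numeric:.6f}")

# Diagonal matrix G^1(n^1) for an assumed first-layer net input n^1.
n1 = np.array([-0.75, 0.54])
print("G1(n1) =\n", np.diagflat(g1(n1)))
```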