Ch 12: Variations on Backpropagation. The basic backpropagation algorithm is too slow for most practical applications. It may take days or weeks of computer time. We demonstrate why the backpropagation algorithm is slow in converging. We saw that steepest descent is the slowest minimization method. The conjugate gradient algorithm and Newton's method generally provide faster convergence.
Variations. Heuristic modifications: momentum and rescaling of variables; variable learning rate. Standard numerical optimization: conjugate gradient; Newton's method (Levenberg-Marquardt).
Drawbacks of BP. We saw that the LMS algorithm is guaranteed to converge to a solution that minimizes the mean squared error, so long as the learning rate is not too large. Single-layer network: quadratic function, constant Hessian matrix, constant curvature. Steepest Descent Backpropagation (SDBP) is a generalization of the LMS algorithm. Multilayer nonlinear network: many local minimum points, and the curvature can vary widely in different regions of the parameter space.
Performance Surface Example. Network architecture: 1-2-1 network. Nominal function and parameter values: $w^1_{1,1} = 10$, $w^1_{2,1} = 10$, $b^1_1 = -5$, $b^1_2 = 5$, $w^2_{1,1} = 1$, $w^2_{1,2} = 1$, $b^2 = -1$.
Squared Error vs. $w^1_{1,1}$ and $w^2_{1,1}$. The curvature varies drastically over the parameter space, so it is difficult to choose an appropriate learning rate for the SD algorithm.
Squared Error vs. $w^1_{1,1}$ and $b^1_1$ (other parameters at their nominal values).
Squared Error vs. $b^1_1$ and $b^1_2$ (other parameters at their nominal values).
Convergence Example. We use a variation of the standard algorithm, called batching. In batching mode the parameters are updated only after the entire training set has been presented. The gradients calculated at each training example are averaged together to produce a more accurate estimate of the gradient. Batching smooths out training-sample outliers and makes learning independent of the order of sample presentation, but it is usually slower than sequential (incremental) mode.
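As a minimal sketch of batching (not from the text), the code below averages per-example gradients of a squared error for a hypothetical linear model; the data and the helper name `batch_gradient` are illustrative assumptions:

```python
import numpy as np

def batch_gradient(w, X, t):
    """Batching mode: average the per-example gradients of the squared
    error e_q = (t_q - x_q . w)**2 over the whole training set."""
    grads = [-2 * (tq - xq @ w) * xq for xq, tq in zip(X, t)]
    return np.mean(grads, axis=0)

# Hypothetical training set
X = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, -1.0]])
t = np.array([1.0, 2.0, 0.0])
w = np.zeros(2)

# The batch gradient equals the gradient of the mean squared error,
# so it does not depend on the order of sample presentation.
g1 = batch_gradient(w, X, t)
g2 = batch_gradient(w, X[::-1], t[::-1])
assert np.allclose(g1, g2)
```

Because the average is taken over the full set, one parameter update per epoch uses the exact gradient of the mean squared error rather than a noisy single-sample estimate.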
Trajectory a: converges to the optimal solution, but the convergence is slow. Trajectory b: converges to a local minimum ($w^1_{1,1} = 0.88$, $w^2_{1,1} = 38.6$).
Learning Rate Too Large. Demos: nnd12sd1, nnd12sd2.
Momentum Filter. $y(k) = \gamma\, y(k-1) + (1-\gamma)\, w(k)$, with $0 \le \gamma < 1$. Example input: $w(k) = 1 + \sin(2\pi k / 16)$.
Observations. The oscillation of the filter output is less than the oscillation of the filter input (low-pass filter). As γ is increased, the oscillation in the filter output is reduced. The average filter output is the same as the average filter input, although as γ is increased the filter output is slower to respond. To summarize, the filter tends to reduce the amount of oscillation while still tracking the average value.
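These observations can be checked numerically; here is a minimal Python sketch of the first-order filter applied to the example input above (the code is illustrative, not from the text):

```python
import numpy as np

def momentum_filter(w, gamma):
    """First-order low-pass filter: y[k] = gamma*y[k-1] + (1-gamma)*w[k]."""
    y = np.zeros_like(w)
    for k in range(1, len(w)):
        y[k] = gamma * y[k - 1] + (1 - gamma) * w[k]
    return y

k = np.arange(200)
w = 1 + np.sin(2 * np.pi * k / 16)      # oscillating input with average 1
y_low = momentum_filter(w, 0.9)
y_hi = momentum_filter(w, 0.98)

# Peak-to-peak oscillation, measured after the initial transient
osc = lambda y: y[100:].max() - y[100:].min()
assert osc(y_hi) < osc(y_low) < w.max() - w.min()   # larger gamma, less oscillation
```

The DC gain of the filter is 1, so the output settles around the input's average value of 1 while the sinusoidal component is attenuated.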
Momentum Backpropagation. Steepest Descent Backpropagation (SDBP): $\Delta W^m(k) = -\alpha\, s^m (a^{m-1})^T$, $\Delta b^m(k) = -\alpha\, s^m$. Momentum Backpropagation (MOBP): $\Delta W^m(k) = \gamma\, \Delta W^m(k-1) - (1-\gamma)\,\alpha\, s^m (a^{m-1})^T$, $\Delta b^m(k) = \gamma\, \Delta b^m(k-1) - (1-\gamma)\,\alpha\, s^m$. Example: $\gamma = 0.8$.
The batching form of MOBP, in which the parameters are updated only after the entire example set has been presented. The same initial condition and learning rate were used as in the previous example, in which the algorithm was not stable. The algorithm is now stable, and it tends to accelerate convergence when the trajectory is moving in a consistent direction. Demo: nnd12mo.
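The stabilizing effect of momentum can be seen on a simple surrogate problem. The sketch below (an illustration, not the text's network example) applies the MOBP-style update to an ill-conditioned quadratic, where plain steepest descent with the same learning rate diverges:

```python
import numpy as np

def sd_momentum(x0, grad, alpha, gamma, steps):
    """Steepest descent with momentum:
    dx[k] = gamma*dx[k-1] - (1-gamma)*alpha*grad(x[k])."""
    x = np.array(x0, float)
    dx = np.zeros_like(x)
    for _ in range(steps):
        dx = gamma * dx - (1 - gamma) * alpha * grad(x)
        x = x + dx
    return x

A = np.diag([1.0, 50.0])          # ill-conditioned quadratic F = 0.5 x'Ax
grad = lambda x: A @ x

x_plain = sd_momentum([1.0, 1.0], grad, alpha=0.05, gamma=0.0, steps=200)
x_mom = sd_momentum([1.0, 1.0], grad, alpha=0.05, gamma=0.9, steps=200)
# Without momentum, 0.05 * 50 > 2 makes the fast mode diverge;
# with gamma = 0.9 the same learning rate is stable.
assert np.linalg.norm(x_plain) > 1e3
assert np.linalg.norm(x_mom) < 1e-3
```

This matches the slide's point: momentum keeps the trajectory stable and lets it keep accumulating speed in a consistent direction.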
Variable Learning Rate (VLBP). If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is set to zero. If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1; if γ has previously been set to zero, it is reset to its original value. If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
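The three rules above can be sketched as a short loop; the code below is an illustrative VLBP-style implementation on a hypothetical quadratic (parameter names and defaults are assumptions, not the text's):

```python
import numpy as np

def vlbp(x0, F, grad, alpha=0.4, eta=1.05, rho=0.7, zeta=0.04,
         gamma0=0.9, steps=500):
    """Variable learning rate with momentum.
    zeta is the allowed fractional error increase (4% -> 0.04)."""
    x = np.array(x0, float)
    dx = np.zeros_like(x)
    gamma = gamma0
    err = F(x)
    for _ in range(steps):
        dx_try = gamma * dx - (1 - gamma) * alpha * grad(x)
        x_try = x + dx_try
        err_try = F(x_try)
        if err_try > err * (1 + zeta):   # rule 1: reject, shrink alpha, kill momentum
            alpha *= rho
            gamma = 0.0
        else:
            if err_try < err:            # rule 2: accept, grow alpha, restore momentum
                alpha *= eta
                gamma = gamma0
            # rule 3 (small increase): accept with alpha, gamma unchanged
            x, dx, err = x_try, dx_try, err_try
    return x, err

A = np.diag([1.0, 50.0])
F = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x, err = vlbp([1.0, 1.0], F, grad)
assert err < 1e-6
```

The deliberately large initial learning rate is cut back automatically by the rejection rule, then re-grown whenever steps succeed.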
Example: η = 1.05, ρ = 0.7, ζ = 4%. Demo: nnd12vl.
Convergence characteristics of variable learning rate: plots of the squared error and the learning rate α versus iteration number.
Other algorithms. Adaptive learning rate (delta-bar-delta method) [Jacobs 88]. Each weight $w_{jk}$ has its own rate $\alpha_{jk}$. If $\Delta w_{jk}$ remains in the same direction, increase $\alpha_{jk}$ (F has a smooth curve in the vicinity of the current W). If $\Delta w_{jk}$ changes direction, decrease $\alpha_{jk}$ (F has a rough curve in the vicinity of the current W). Delta-bar-delta also involves a momentum term.
Quickprop algorithm of Fahlman (1988): it assumes that the error surface is parabolic and concave upward around the minimum point, and that the effect of each weight can be considered independently. SuperSAB algorithm of Tollenaere (1990): it has more complex rules for adjusting the learning rates. Drawbacks: in SDBP we have only one parameter to select, but in the heuristic modifications we sometimes have six parameters to select. Sometimes the modifications fail to converge, while SDBP will eventually find a solution.
Experimental Comparison. Training for the XOR problem (batch mode). 25 simulations; a run counts as a success if E, averaged over 50 consecutive epochs, is less than 0.04. Results:

Method                     Simulations   Successes   Mean epochs
BP                         25            24          6,859.8
BP with momentum           25            25          2,056.3
BP with delta-bar-delta    25            22          447.3
Conjugate Gradient. We saw that SD is the simplest optimization method but is often slow in converging. Newton's method is much faster, but requires that the Hessian matrix and its inverse be calculated. The conjugate gradient method is a compromise: it does not require the calculation of second derivatives, and yet it still has the quadratic convergence property (it converges to the minimum of a quadratic function in a finite number of iterations). Now we describe how the conjugate gradient algorithm can be used to train multilayer networks. This algorithm is called Conjugate Gradient Backpropagation (CGBP).
Review of the CG Algorithm.
1. The first search direction is steepest descent: $p_0 = -g_0$, where $g_k \equiv \nabla F(x)\big|_{x = x_k}$.
2. Take a step, choosing the learning rate $\alpha_k$ to minimize the function along the search direction: $x_{k+1} = x_k + \alpha_k p_k$.
3. Select the next search direction according to $p_k = -g_k + \beta_k p_{k-1}$, where
$\beta_k = \dfrac{\Delta g_{k-1}^T g_k}{\Delta g_{k-1}^T p_{k-1}}$ or $\beta_k = \dfrac{g_k^T g_k}{g_{k-1}^T g_{k-1}}$ or $\beta_k = \dfrac{\Delta g_{k-1}^T g_k}{g_{k-1}^T g_{k-1}}$.
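For a quadratic function the three steps above can be written in a few lines. The sketch below (illustrative Python, not the text's MATLAB demos) uses the exact line-minimizing step and the middle (Fletcher-Reeves) choice of $\beta_k$:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Minimize F(x) = 0.5 x'Ax - b'x by conjugate gradients.
    For a quadratic, the exact line step is alpha = -g'p / (p'Ap)."""
    x = np.array(x0, float)
    g = A @ x - b              # gradient of F
    p = -g                     # step 1: first direction is steepest descent
    for _ in range(len(b)):
        if np.linalg.norm(g) < tol:
            break
        alpha = -(g @ p) / (p @ A @ p)     # step 2: exact line minimization
        x = x + alpha * p
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)   # step 3: Fletcher-Reeves beta
        p = -g_new + beta * p
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, [0.0, 0.0])
assert np.allclose(A @ x, b)    # quadratic: exact minimum reached in n steps
```

The minimum of this quadratic satisfies $Ax = b$, which the loop reaches in at most n = 2 iterations.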
This cannot be applied directly to neural network training, because the performance index is not quadratic. We cannot use
$\alpha_k = -\dfrac{g_k^T p_k}{p_k^T A_k p_k}$, where $A_k \equiv \nabla^2 F(x)\big|_{x = x_k}$,
to minimize the function along a line. Also, the exact minimum will not normally be reached in a finite number of steps, so the algorithm will need to be reset after some set number of iterations. Locating the minimum of a function along a line requires two stages: interval location, then interval reduction.
Interval Location
Interval Reduction
Golden Section Search (τ = 0.618)
Set c1 = a1 + (1-τ)(b1 - a1), Fc = F(c1)
    d1 = b1 - (1-τ)(b1 - a1), Fd = F(d1)
For k = 1, 2, ... repeat
  If Fc < Fd then
    Set a(k+1) = a(k); b(k+1) = d(k); d(k+1) = c(k)
        c(k+1) = a(k+1) + (1-τ)(b(k+1) - a(k+1))
        Fd = Fc; Fc = F(c(k+1))
  else
    Set a(k+1) = c(k); b(k+1) = b(k); c(k+1) = d(k)
        d(k+1) = b(k+1) - (1-τ)(b(k+1) - a(k+1))
        Fc = Fd; Fd = F(d(k+1))
  end
until b(k+1) - a(k+1) < tolerance
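A direct Python transcription of this pseudocode (an illustrative sketch, assuming a unimodal function on the starting interval):

```python
import math

def golden_section(F, a, b, tol=1e-6):
    """Golden-section interval reduction for a unimodal F on [a, b].
    Each pass reuses one of the two interior evaluations."""
    tau = (math.sqrt(5) - 1) / 2          # ~0.618
    c = a + (1 - tau) * (b - a)
    d = b - (1 - tau) * (b - a)
    Fc, Fd = F(c), F(d)
    while b - a > tol:
        if Fc < Fd:                       # minimum lies in [a, d]
            b, d, Fd = d, c, Fc
            c = a + (1 - tau) * (b - a)
            Fc = F(c)
        else:                             # minimum lies in [c, b]
            a, c, Fc = c, d, Fd
            d = b - (1 - tau) * (b - a)
            Fd = F(d)
    return (a + b) / 2

x_min = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
assert abs(x_min - 2.0) < 1e-5
```

The golden-ratio property $\tau^2 = 1 - \tau$ is what makes the surviving interior point land exactly where the next iteration needs it, so only one new function evaluation is required per step.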
For quadratic functions the algorithm will converge to the minimum in at most n iterations (where n is the number of parameters); this normally does not happen for multilayer networks. The development of the CG algorithm does not indicate what search direction to use once a cycle of n iterations has been completed. The simplest method is to reset the search direction to the steepest descent direction after n iterations. In the following function approximation example we use the BP algorithm to compute the gradient and the CG algorithm to determine the weight updates. This is a batch mode algorithm.
Conjugate Gradient BP (CGBP). Demos: nnd12ls, nnd12cg.
Newton's Method. $x_{k+1} = x_k - A_k^{-1} g_k$, where $A_k \equiv \nabla^2 F(x)\big|_{x = x_k}$ and $g_k \equiv \nabla F(x)\big|_{x = x_k}$. If the performance index is a sum of squares function:
$F(x) = \sum_{i=1}^{N} v_i^2(x) = v^T(x)\, v(x)$,
then the jth element of the gradient is
$[\nabla F(x)]_j = \dfrac{\partial F(x)}{\partial x_j} = 2 \sum_{i=1}^{N} v_i(x)\, \dfrac{\partial v_i(x)}{\partial x_j}$.
Matrix Form. The gradient can be written in matrix form: $\nabla F(x) = 2 J^T(x)\, v(x)$, where J is the $N \times n$ Jacobian matrix:
$J(x) = \begin{bmatrix} \partial v_1 / \partial x_1 & \partial v_1 / \partial x_2 & \cdots & \partial v_1 / \partial x_n \\ \partial v_2 / \partial x_1 & \partial v_2 / \partial x_2 & \cdots & \partial v_2 / \partial x_n \\ \vdots & \vdots & & \vdots \\ \partial v_N / \partial x_1 & \partial v_N / \partial x_2 & \cdots & \partial v_N / \partial x_n \end{bmatrix}$
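The identity $\nabla F = 2 J^T v$ is easy to verify numerically. The sketch below (a hypothetical residual vector, finite-difference helpers are assumptions) compares $2 J^T v$ against a direct finite difference of $F = v^T v$:

```python
import numpy as np

def numerical_jacobian(v, x, h=1e-6):
    """Forward-difference Jacobian of the residual vector v(x)."""
    v0 = v(x)
    J = np.zeros((len(v0), len(x)))
    for j in range(len(x)):
        xp = x.copy()
        xp[j] += h
        J[:, j] = (v(xp) - v0) / h
    return J

# Hypothetical sum-of-squares index F(x) = v(x)'v(x)
v = lambda x: np.array([x[0] - 1.0, x[0] * x[1], x[1] ** 2 - 2.0])
x = np.array([0.5, 1.5])

J = numerical_jacobian(v, x)
grad_F = 2 * J.T @ v(x)                  # gradient via the Jacobian identity

# Direct central difference of F itself for comparison
F = lambda x: v(x) @ v(x)
h = 1e-6
grad_fd = np.array([(F(x + h * e) - F(x - h * e)) / (2 * h)
                    for e in np.eye(2)])
assert np.allclose(grad_F, grad_fd, atol=1e-4)
```

This is the key structural fact exploited below: for sum-of-squares indices, first derivatives of the residuals are enough to form the gradient.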
Now we want to find the Hessian matrix:
$[\nabla^2 F(x)]_{k,j} = \dfrac{\partial^2 F(x)}{\partial x_k\, \partial x_j} = 2 \sum_{i=1}^{N} \left\{ \dfrac{\partial v_i(x)}{\partial x_k}\, \dfrac{\partial v_i(x)}{\partial x_j} + v_i(x)\, \dfrac{\partial^2 v_i(x)}{\partial x_k\, \partial x_j} \right\}$
In matrix form: $\nabla^2 F(x) = 2 J^T(x)\, J(x) + 2 S(x)$, where $S(x) = \sum_{i=1}^{N} v_i(x)\, \nabla^2 v_i(x)$.
Gauss-Newton Method. Approximate the Hessian matrix as $\nabla^2 F(x) \approx 2 J^T(x) J(x)$ (if we assume that S(x) is small). We had $\nabla F(x) = 2 J^T(x) v(x)$. Newton's method $x_{k+1} = x_k - A_k^{-1} g_k$ then becomes:
$x_{k+1} = x_k - [2 J^T(x_k) J(x_k)]^{-1}\, 2 J^T(x_k) v(x_k) = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k)$.
We call this the Gauss-Newton method. Note that the advantage of Gauss-Newton over the standard Newton's method is that it does not require the calculation of second derivatives.
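A compact sketch of the Gauss-Newton iteration on a hypothetical curve-fitting problem (the exponential model and analytic Jacobian are illustrative assumptions, not from the text):

```python
import numpy as np

def gauss_newton(v, jac, x0, iters=20):
    """Gauss-Newton: x <- x - (J'J)^{-1} J'v, first derivatives only."""
    x = np.array(x0, float)
    for _ in range(iters):
        J, r = jac(x), v(x)
        x = x - np.linalg.solve(J.T @ J, J.T @ r)
    return x

# Fit y = a*exp(b*t); residuals v_i(x) = x[0]*exp(x[1]*t_i) - y_i
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * t)                 # data generated with a=2, b=0.5
v = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t),
                                 x[0] * t * np.exp(x[1] * t)])

x = gauss_newton(v, jac, [1.0, 1.0])
assert np.allclose(x, [2.0, 0.5], atol=1e-6)
```

On this zero-residual problem the S(x) term vanishes at the solution, so Gauss-Newton converges as fast as full Newton without ever forming second derivatives.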
Levenberg-Marquardt. Gauss-Newton approximates the Hessian by $H = J^T J$. This matrix may be singular, but it can be made invertible as follows: $G = H + \mu I$. If the eigenvalues and eigenvectors of H are $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{z_1, z_2, \ldots, z_n\}$, then
$G z_i = [H + \mu I] z_i = H z_i + \mu z_i = \lambda_i z_i + \mu z_i = (\lambda_i + \mu) z_i$,
so the eigenvalues of G are $\lambda_i + \mu$. G can be made positive definite by increasing μ until $\lambda_i + \mu > 0$ for all i. This gives the LM iteration:
$x_{k+1} = x_k - [J^T(x_k) J(x_k) + \mu_k I]^{-1} J^T(x_k) v(x_k)$.
Adjustment of $\mu_k$. As $\mu_k \to 0$, LM becomes Gauss-Newton: $x_{k+1} = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k)$. As $\mu_k \to \infty$, LM becomes steepest descent with a small learning rate: $x_{k+1} \approx x_k - \frac{1}{\mu_k} J^T(x_k) v(x_k) = x_k - \frac{1}{2\mu_k} \nabla F(x_k)$. Therefore, begin with a small $\mu_k$ to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased $\mu_k$ until F(x) is decreased. F(x) must decrease eventually, since we will then be taking a very small step in the steepest descent direction.
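This μ-adjustment loop can be sketched directly; the code below is an illustrative LM implementation on a hypothetical exponential-fit problem (the factor `u`, the cap on μ, and the model are assumptions):

```python
import numpy as np

def levenberg_marquardt(v, jac, x0, mu=0.01, u=10.0, iters=50):
    """LM step: dx = -(J'J + mu*I)^{-1} J'v.
    Shrink mu after a successful step (toward Gauss-Newton);
    grow it after a failed one (toward small steepest-descent steps)."""
    x = np.array(x0, float)
    err = v(x) @ v(x)
    for _ in range(iters):
        J, r = jac(x), v(x)
        n = len(x)
        while True:
            dx = -np.linalg.solve(J.T @ J + mu * np.eye(n), J.T @ r)
            x_try = x + dx
            err_try = v(x_try) @ v(x_try)
            if err_try < err:
                x, err, mu = x_try, err_try, mu / u
                break
            mu *= u                  # failed step: back toward steepest descent
            if mu > 1e10:            # no further improvement possible
                return x
    return x

# Fit y = a*exp(b*t); residuals v_i(x) = x[0]*exp(x[1]*t_i) - y_i
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * t)
v = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t),
                                 x[0] * t * np.exp(x[1] * t)])

x = levenberg_marquardt(v, jac, [1.0, 1.0])
assert np.allclose(x, [2.0, 0.5], atol=1e-6)
```

Because every accepted step strictly decreases F(x), the iteration is guaranteed not to diverge: when a Gauss-Newton-like step fails, growing μ shortens the step and rotates it toward the downhill gradient direction.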
Application to Multilayer Networks. The performance index for the multilayer network (assuming equal probability of the input/target pairs) is:
$F(x) = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = \sum_{q=1}^{Q} e_q^T e_q = \sum_{q=1}^{Q} \sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} v_i^2$,
where $e_{j,q}$ is the jth element of the error for the qth input/target pair. This is similar to the performance index for which LM was designed. In standard BP we compute the derivatives of the squared errors with respect to the weights and biases; to create the matrix J we need to compute the derivatives of the errors themselves.
The error vector is: $v^T = [v_1\; v_2\; \cdots\; v_N] = [e_{1,1}\; e_{2,1}\; \cdots\; e_{S^M,1}\; e_{1,2}\; \cdots\; e_{S^M,Q}]$. The parameter vector is: $x^T = [x_1\; x_2\; \cdots\; x_n] = [w^1_{1,1}\; w^1_{1,2}\; \cdots\; w^1_{S^1,R}\; b^1_1\; \cdots\; b^1_{S^1}\; w^2_{1,1}\; \cdots\; b^M_{S^M}]$. The dimensions of the two vectors are: $N = Q \times S^M$ and $n = S^1(R+1) + S^2(S^1+1) + \cdots + S^M(S^{M-1}+1)$. If we make these substitutions into the Jacobian matrix, for multilayer network training we have:
Jacobian Matrix ($N \times n$):
$J(x) = \begin{bmatrix} \dfrac{\partial e_{1,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\ \dfrac{\partial e_{2,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\ \vdots & \vdots & & \vdots & \vdots & \\ \dfrac{\partial e_{S^M,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{S^M,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{S^M,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{S^M,1}}{\partial b^1_1} & \cdots \\ \dfrac{\partial e_{1,2}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,2}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,2}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,2}}{\partial b^1_1} & \cdots \end{bmatrix}$
Computing the Jacobian. SDBP computes terms like $\dfrac{\partial \hat F(x)}{\partial x_l} = \dfrac{\partial\, e_q^T e_q}{\partial x_l}$, using the chain rule: $\dfrac{\partial \hat F}{\partial w^m_{i,j}} = \dfrac{\partial \hat F}{\partial n^m_i} \cdot \dfrac{\partial n^m_i}{\partial w^m_{i,j}}$, where the sensitivity $s^m_i \equiv \dfrac{\partial \hat F}{\partial n^m_i}$ is computed using backpropagation. For the Jacobian we need to compute terms like $[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial x_l}$.
Marquardt Sensitivity. If we define a Marquardt sensitivity $\tilde s^m_{i,h} \equiv \dfrac{\partial v_h}{\partial n^m_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}}$, where $h = (q-1) S^M + k$, we can compute the Jacobian as follows.
For a weight: $[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \cdot a^{m-1}_{j,q}$.
For a bias: $[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \dfrac{\partial e_{k,q}}{\partial n^m_{i,q}} \cdot \dfrac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h}$.
Computing the Sensitivities: Initialization.
$\tilde s^M_{i,h} = \dfrac{\partial v_h}{\partial n^M_{i,q}} = \dfrac{\partial e_{k,q}}{\partial n^M_{i,q}} = \dfrac{\partial (t_{k,q} - a^M_{k,q})}{\partial n^M_{i,q}} = -\dfrac{\partial a^M_{k,q}}{\partial n^M_{i,q}} = \begin{cases} -\dot f^M(n^M_{i,q}) & \text{for } i = k \\ 0 & \text{for } i \ne k \end{cases}$
Therefore, when the input $p_q$ has been applied to the network and the corresponding network output $a^M_q$ has been computed, the LMBP is initialized with $\tilde S^M_q = -\dot F^M(n^M_q)$.
where
$\dot F^m(n^m) = \begin{bmatrix} \dot f^m(n^m_1) & 0 & \cdots & 0 \\ 0 & \dot f^m(n^m_2) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \dot f^m(n^m_{S^m}) \end{bmatrix}$
Each column of the matrix $\tilde S^M_q$ must be backpropagated through the network using the following equation (Ch. 11) to produce one row of the Jacobian matrix:
$s^m = \dot F^m(n^m)\, (W^{m+1})^T s^{m+1}$.
The columns can also be backpropagated together using $\tilde S^m_q = \dot F^m(n^m_q)\, (W^{m+1})^T \tilde S^{m+1}_q$. The total Marquardt sensitivity matrices for each layer are then created by augmenting the matrices computed for each input: $\tilde S^m = [\tilde S^m_1 \mid \tilde S^m_2 \mid \cdots \mid \tilde S^m_Q]$. Note that for each input we will backpropagate $S^M$ sensitivity vectors, because we compute the derivatives of each individual error rather than the derivative of the sum of squares of the errors. For every input we have $S^M$ errors, and for each error there will be one row of the Jacobian matrix.
After the sensitivities have been backpropagated, the Jacobian matrix is computed using:
weights: $[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \cdot a^{m-1}_{j,q}$;
biases: $[J]_{h,l} = \dfrac{\partial v_h}{\partial x_l} = \dfrac{\partial e_{k,q}}{\partial b^m_i} = \tilde s^m_{i,h}$.
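To make the sensitivity-based Jacobian concrete, here is a sketch for a hypothetical 1-2-1 network (tanh hidden layer, linear output, one output error per input), checked against finite differences. The network, data, and helper names are illustrative assumptions, not the text's implementation:

```python
import numpy as np

def forward(p, W1, b1, W2, b2):
    """1-2-1 network: tanh hidden layer, linear output."""
    n1 = W1 * p + b1            # W1, b1: shape (2,); scalar input p
    a1 = np.tanh(n1)
    a2 = W2 @ a1 + b2           # scalar output
    return a1, a2

def lm_jacobian(P, W1, b1, W2, b2):
    """Jacobian of the errors e_q = t_q - a2_q w.r.t. x = [W1, b1, W2, b2],
    built from Marquardt sensitivities (one row per error)."""
    rows = []
    for p in P:
        a1, _ = forward(p, W1, b1, W2, b2)
        s2 = -1.0                       # initialization: -f'(n2), linear output
        s1 = (1 - a1**2) * W2 * s2      # backpropagate the sensitivity column
        # weight entries get s * (previous-layer output); bias entries get s
        rows.append(np.concatenate([s1 * p, s1, s2 * a1, [s2]]))
    return np.array(rows)

W1, b1 = np.array([0.5, -0.3]), np.array([0.1, 0.2])
W2, b2 = np.array([1.2, -0.7]), 0.4
P = np.array([-1.0, 0.0, 1.0])
T = np.sin(P)                            # hypothetical targets

def errors(x):
    return np.array([T[q] - forward(P[q], x[0:2], x[2:4], x[4:6], x[6])[1]
                     for q in range(len(P))])

x0 = np.concatenate([W1, b1, W2, [b2]])
J = lm_jacobian(P, W1, b1, W2, b2)
h = 1e-6
J_fd = np.column_stack([(errors(x0 + h * e) - errors(x0 - h * e)) / (2 * h)
                        for e in np.eye(len(x0))])
assert np.allclose(J, J_fd, atol=1e-8)
```

With $S^M = 1$ here, each input contributes exactly one row; for a network with $S^M$ outputs, each input would contribute $S^M$ rows, one per backpropagated sensitivity column.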
LMBP (summarized).
1. Present all inputs to the network and compute the corresponding network outputs and the errors $e_q = t_q - a^M_q$. Compute the sum of squared errors over all inputs: $F(x) = \sum_{q=1}^{Q} e_q^T e_q = \sum_{i=1}^{N} v_i^2$.
2. Compute the Jacobian matrix: calculate the sensitivities with the backpropagation algorithm after initializing; augment the individual matrices into the Marquardt sensitivities; compute the elements of the Jacobian matrix.
$\tilde S^M_q = -\dot F^M(n^M_q)$; $\tilde S^m_q = \dot F^m(n^m_q)\,(W^{m+1})^T \tilde S^{m+1}_q$ for $m = M-1, \ldots, 2, 1$; $\tilde S^m = [\tilde S^m_1 \mid \tilde S^m_2 \mid \cdots \mid \tilde S^m_Q]$; $[J]_{h,l} = \tilde s^m_{i,h}\, a^{m-1}_{j,q}$ (weights), $[J]_{h,l} = \tilde s^m_{i,h}$ (biases).
3. Solve the following equation to obtain the change in the weights: $\Delta x_k = -[J^T(x_k) J(x_k) + \mu_k I]^{-1} J^T(x_k) v(x_k)$.
4. Recompute the sum of squared errors with the new weights $x_k + \Delta x_k$. If this new sum of squares is smaller than that computed in step 1, then divide $\mu_k$ by the factor ϑ, update the weights ($x_{k+1} = x_k + \Delta x_k$), and go back to step 1. If the sum of squares is not reduced, then multiply $\mu_k$ by ϑ and go back to step 3.
The algorithm is assumed to have converged when the norm of the gradient is less than some predetermined value, or when the sum of squares has been reduced to some error goal. See P12.5 for a numerical illustration of the Jacobian computation.
Example LMBP Step. Black arrow: small $\mu_k$ (Gauss-Newton direction). Blue arrow: large $\mu_k$ (steepest descent direction). Blue curve: LM path for intermediate values of $\mu_k$.
LMBP Trajectory. Demos: nnd12ms, nnd12m. Storage requirement: n × n for the Hessian matrix approximation $J^T J$. HW9, Ch 12: 3, 6, 8, 13, 15.