Variations on Backpropagation


Sherilyn Cummings
5 years ago


Variations

Heuristic Modifications
  Momentum
  Variable Learning Rate

Standard Numerical Optimization
  Conjugate Gradient
  Newton's Method (Levenberg-Marquardt)

Performance Surface Example

Network Architecture: a 1-2-1 network with two log-sigmoid layers,

$$a^1 = \mathrm{logsig}(W^1 p + b^1), \qquad a^2 = \mathrm{logsig}(W^2 a^1 + b^2)$$

Nominal Function: the same network with fixed parameter values for $w^1_{1,1}$, $w^1_{2,1}$, $b^1_1$, $b^1_2$, $w^2_{1,1}$, $w^2_{1,2}$ and $b^2$ (the numeric values are not legible in the source).

Squared Error vs. $w^1_{1,1}$ and $w^2_{1,1}$ [figure]

Squared Error vs. $w^1_{1,1}$ and $b^1_1$ [figure]

Squared Error vs. $b^1_1$ and $b^1_2$ [figure]

Convergence Example [figure: trajectory on the $w^1_{1,1}$, $w^2_{1,1}$ contour plot]

Learning Rate Too Large [figure: trajectory on the $w^1_{1,1}$, $w^2_{1,1}$ contour plot]

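Momentum

Filter

$$y(k) = \gamma\, y(k-1) + (1-\gamma)\, w(k), \qquad 0 \le \gamma < 1$$

Example

$$w(k) = 1 + \sin\!\left(\frac{2\pi k}{16}\right)$$

[Figures: filter input and output for $\gamma = 0.9$ and $\gamma = 0.98$]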
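The smoothing effect of the filter can be checked numerically. The sketch below is illustrative only: the function names, the zero initial state `y0`, and the choice of 400 samples are assumptions, not part of the slides.

```python
import math

def momentum_filter(w, gamma, y0=0.0):
    """Apply y(k) = gamma*y(k-1) + (1-gamma)*w(k) to a sequence w."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    y, out = y0, []
    for wk in w:
        y = gamma * y + (1.0 - gamma) * wk
        out.append(y)
    return out

def swing(seq):
    """Peak-to-peak amplitude over the last 64 samples (after the transient)."""
    tail = seq[-64:]
    return max(tail) - min(tail)

# Assumed input: sinusoid with mean 1 and a period of 16 samples.
w = [1.0 + math.sin(2.0 * math.pi * k / 16.0) for k in range(400)]
y_small = momentum_filter(w, gamma=0.9)
y_large = momentum_filter(w, gamma=0.98)

# A larger gamma shrinks the oscillation while the average stays near 1.
print(swing(w), swing(y_small), swing(y_large))
```

The average of the filter output matches the average of the input, while the oscillation amplitude shrinks as γ grows. This is the property momentum exploits when it smooths an oscillating weight trajectory.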

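Momentum Backpropagation

Steepest Descent Backpropagation (SDBP)

$$\Delta W^m(k) = -\alpha\, s^m (a^{m-1})^T, \qquad \Delta b^m(k) = -\alpha\, s^m$$

Momentum Backpropagation (MOBP)

$$\Delta W^m(k) = \gamma\, \Delta W^m(k-1) - (1-\gamma)\,\alpha\, s^m (a^{m-1})^T$$

$$\Delta b^m(k) = \gamma\, \Delta b^m(k-1) - (1-\gamma)\,\alpha\, s^m$$

[Figure: MOBP trajectory on the $w^1_{1,1}$, $w^2_{1,1}$ contour plot, $\gamma = 0.8$]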
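A minimal sketch of the two update rules, applied to a plain parameter vector rather than to network weights. The quadratic test function, the starting point and the learning rate are assumptions chosen so that steepest descent alone is unstable; setting `gamma=0` recovers the SDBP form of the update.

```python
def F(x):
    """Assumed test function: F(x) = 0.5*(x1^2 + 25*x2^2)."""
    return 0.5 * (x[0] ** 2 + 25.0 * x[1] ** 2)

def grad(x):
    """Gradient of the assumed test function."""
    return [x[0], 25.0 * x[1]]

def run(alpha, gamma, steps=200, x0=(1.0, 1.0)):
    """Iterate delta(k) = gamma*delta(k-1) - (1-gamma)*alpha*g(k)."""
    x = list(x0)
    delta = [0.0, 0.0]
    for _ in range(steps):
        g = grad(x)
        delta = [gamma * d - (1.0 - gamma) * alpha * gi
                 for d, gi in zip(delta, g)]
        x = [xi + di for xi, di in zip(x, delta)]
    return x

# alpha = 0.1 exceeds the stability limit 2/25 of plain steepest descent here.
x_sd = run(alpha=0.1, gamma=0.0)   # steepest descent: diverges
x_mo = run(alpha=0.1, gamma=0.8)   # same alpha with momentum: converges
print(F(x_sd), F(x_mo))
```

On this assumed function the momentum run tolerates a learning rate that makes the plain steepest descent run diverge, which mirrors the slide's point that momentum permits a larger learning rate while keeping the trajectory stable.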

Variable Learning Rate (VLBP)

If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is set to zero.

If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has been previously set to zero, it is reset to its original value.

If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
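The three rules can be sketched as a small training loop. Everything concrete here is an assumption for illustration: the quadratic test function, the starting point, and the default values of `alpha`, `gamma`, `eta`, `rho` and `zeta`.

```python
def F(x):
    """Assumed test function: F(x) = 0.5*(x1^2 + 25*x2^2)."""
    return 0.5 * (x[0] ** 2 + 25.0 * x[1] ** 2)

def grad(x):
    return [x[0], 25.0 * x[1]]

def vlbp(F, grad, x, alpha=0.05, gamma=0.8, eta=1.05, rho=0.7,
         zeta=0.04, steps=100):
    """Momentum descent with the variable-learning-rate rules."""
    gamma0 = gamma                      # remember the original momentum
    delta = [0.0] * len(x)
    fx = F(x)
    history = [fx]
    for _ in range(steps):
        g = grad(x)
        trial_delta = [gamma * d - (1.0 - gamma) * alpha * gi
                       for d, gi in zip(delta, g)]
        trial_x = [xi + di for xi, di in zip(x, trial_delta)]
        f_trial = F(trial_x)
        if f_trial > (1.0 + zeta) * fx:
            # Rule 1: discard the update, shrink the rate, switch momentum off.
            alpha *= rho
            gamma = 0.0
        else:
            if f_trial < fx:
                # Rule 2: error decreased, grow the rate, restore momentum.
                alpha *= eta
                gamma = gamma0
            # Rules 2 and 3: the update is accepted.
            x, delta, fx = trial_x, trial_delta, f_trial
        history.append(fx)
    return x, alpha, history

x_final, alpha_final, history = vlbp(F, grad, [1.0, 1.0])
print(history[0], min(history), alpha_final)
```

By construction the recorded error never rises by more than the fraction ζ in a single step, because larger increases are rejected before the weights change.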

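Example

$\rho = 0.7$, $\zeta = 4\%$; the value of $\eta$ is not legible in the source.

[Figures: VLBP trajectory on the $w^1_{1,1}$, $w^2_{1,1}$ contour plot; two plots against iteration number]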

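Conjugate Gradient

1. The first search direction is steepest descent.

$$p_0 = -g_0, \qquad g_k \equiv \nabla F(x)\big|_{x = x_k}$$

2. Take a step and choose the learning rate to minimize the function along the search direction.

$$x_{k+1} = x_k + \alpha_k p_k$$

3. Select the next search direction according to:

$$p_k = -g_k + \beta_k p_{k-1}$$

where

$$\beta_k = \frac{\Delta g_{k-1}^T g_k}{\Delta g_{k-1}^T p_{k-1}} \quad\text{or}\quad \beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}} \quad\text{or}\quad \beta_k = \frac{\Delta g_{k-1}^T g_k}{g_{k-1}^T g_{k-1}}$$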
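The three steps can be sketched on a quadratic function, where the minimizing step along each direction has a closed form. This is an assumption made to keep the example short: for a network the line minimization has to be done numerically. The matrix `A`, the vector `d`, the starting point and the use of the second formula for $\beta_k$ are illustrative choices.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient(A, d, x, iters):
    """Minimize F(x) = 0.5*x^T A x + d^T x (A symmetric positive definite)."""
    g = [gi + di for gi, di in zip(matvec(A, x), d)]   # gradient A x + d
    p = [-gi for gi in g]                              # step 1: p_0 = -g_0
    for _ in range(iters):
        if dot(g, g) == 0.0:
            break                                      # already at the minimum
        Ap = matvec(A, p)
        alpha = -dot(g, p) / dot(p, Ap)                # step 2: exact line minimum
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        g_new = [gi + alpha * api for gi, api in zip(g, Ap)]
        beta = dot(g_new, g_new) / dot(g, g)           # step 3: second beta formula
        p = [-gn + beta * pi for gn, pi in zip(g_new, p)]
        g = g_new
    return x, g

A = [[2.0, 1.0], [1.0, 2.0]]
x_min, g_min = conjugate_gradient(A, [0.0, 0.0], [0.8, -0.25], iters=2)
print(x_min, g_min)
```

With exact line minimization, conjugate gradient reaches the minimum of a quadratic in n variables in at most n iterations, so two iterations suffice for this two-parameter case.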

Interval Location [figure: $F(x_0 + \alpha p_0)$ against $\alpha$, with successive evaluation points spaced $\varepsilon$, $2\varepsilon$, $4\varepsilon$, $8\varepsilon$ apart]
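A minimal sketch of interval location, assuming the function along the search direction is available as a one-argument callable `phi(alpha)` and has a single minimum. The name `locate_interval`, the default `eps` and the evaluation cap are assumptions.

```python
def locate_interval(phi, eps=0.075, max_evals=60):
    """Bracket a minimum of phi(alpha) for alpha >= 0.

    The gap between successive evaluation points doubles each time
    (eps, 2*eps, 4*eps, ...) until the function value increases.
    """
    points = [0.0, eps]
    values = [phi(0.0), phi(eps)]
    step = eps
    while values[-1] < values[-2]:
        if len(points) >= max_evals:
            raise RuntimeError("no increase found; phi may be unbounded below")
        step *= 2.0
        points.append(points[-1] + step)
        values.append(phi(points[-1]))
    # The minimum lies between the point two evaluations back and the last one.
    lower = points[-3] if len(points) >= 3 else points[0]
    return lower, points[-1]

a, b = locate_interval(lambda alpha: (alpha - 1.0) ** 2)
print(a, b)
```

For this assumed parabola with its minimum at 1 the bracket comes out as roughly [0.525, 2.325]. The bracket is wide on purpose, and narrowing it is the job of the interval reduction step.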

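Interval Reduction

[Figure: two plots of $F(x_0 + \alpha p_0)$ against $\alpha$]
(a) One interior point $c$ between $a$ and $b$: interval is not reduced.
(b) Two interior points $c$ and $d$ between $a$ and $b$: minimum must occur between $c$ and $b$.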

Golden Section Search

τ = 0.618
Set c_1 = a_1 + (1-τ)(b_1 - a_1), F_c = F(c_1)
    d_1 = b_1 - (1-τ)(b_1 - a_1), F_d = F(d_1)
For k = 1, 2, ... repeat
    If F_c < F_d then
        Set a_{k+1} = a_k ; b_{k+1} = d_k ; d_{k+1} = c_k
            c_{k+1} = a_{k+1} + (1-τ)(b_{k+1} - a_{k+1})
            F_d = F_c ; F_c = F(c_{k+1})
    else
        Set a_{k+1} = c_k ; b_{k+1} = b_k ; c_{k+1} = d_k
            d_{k+1} = b_{k+1} - (1-τ)(b_{k+1} - a_{k+1})
            F_c = F_d ; F_d = F(d_{k+1})
    end
end until b_{k+1} - a_{k+1} < tol
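The pseudocode translates almost line for line into Python. In this sketch the function along the search direction is passed as a callable `phi`, which is an assumption, and τ is computed as $(\sqrt{5}-1)/2$ rather than rounded to 0.618 so that the reused interior point stays consistent with the formulas.

```python
import math

def golden_section(phi, a, b, tol=1e-6):
    """Shrink the bracket [a, b] around a minimum of a unimodal phi."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0      # about 0.618
    c = a + (1.0 - tau) * (b - a)
    d = b - (1.0 - tau) * (b - a)
    Fc, Fd = phi(c), phi(d)
    while b - a >= tol:
        if Fc < Fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d = d, c
            c = a + (1.0 - tau) * (b - a)
            Fd, Fc = Fc, phi(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c = c, d
            d = b - (1.0 - tau) * (b - a)
            Fc, Fd = Fd, phi(d)
    return 0.5 * (a + b)

alpha_min = golden_section(lambda alpha: (alpha - 1.0) ** 2, 0.525, 2.325)
print(alpha_min)
```

Each pass costs one new function evaluation and shrinks the bracket by the factor τ, which is why one interior point is carried over instead of recomputing both.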

Conjugate Gradient BP (CGBP) [figures: Intermediate Steps; Complete Trajectory, both on the $w^1_{1,1}$, $w^2_{1,1}$ contour plot]

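Newton's Method

$$x_{k+1} = x_k - A_k^{-1} g_k$$

$$A_k \equiv \nabla^2 F(x)\big|_{x = x_k}, \qquad g_k \equiv \nabla F(x)\big|_{x = x_k}$$

If the performance index is a sum of squares function:

$$F(x) = \sum_{i=1}^{N} v_i^2(x) = v^T(x)\, v(x)$$

then the jth element of the gradient is

$$[\nabla F(x)]_j = \frac{\partial F(x)}{\partial x_j} = 2 \sum_{i=1}^{N} v_i(x) \frac{\partial v_i(x)}{\partial x_j}$$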

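Matrix Form

The gradient can be written in matrix form:

$$\nabla F(x) = 2 J^T(x)\, v(x)$$

where J is the Jacobian matrix:

$$J(x) = \begin{bmatrix}
\dfrac{\partial v_1(x)}{\partial x_1} & \dfrac{\partial v_1(x)}{\partial x_2} & \cdots & \dfrac{\partial v_1(x)}{\partial x_n} \\
\dfrac{\partial v_2(x)}{\partial x_1} & \dfrac{\partial v_2(x)}{\partial x_2} & \cdots & \dfrac{\partial v_2(x)}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial v_N(x)}{\partial x_1} & \dfrac{\partial v_N(x)}{\partial x_2} & \cdots & \dfrac{\partial v_N(x)}{\partial x_n}
\end{bmatrix}$$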
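The identity $\nabla F(x) = 2J^T(x)v(x)$ can be checked numerically against finite differences. The two residual functions below are hypothetical and were picked only because their Jacobian is easy to write by hand.

```python
def residuals(x):
    """Hypothetical residual vector v(x) with two elements."""
    return [x[0] ** 2 - x[1], 1.0 - x[0]]

def jacobian(x):
    """Analytic Jacobian: row i holds the partial derivatives of v_i."""
    return [[2.0 * x[0], -1.0],
            [-1.0, 0.0]]

def F(x):
    return sum(v * v for v in residuals(x))

def gradient(x):
    """Gradient computed as 2 * J^T v."""
    v, J = residuals(x), jacobian(x)
    return [2.0 * sum(J[i][j] * v[i] for i in range(len(v)))
            for j in range(len(x))]

def numeric_gradient(x, h=1e-6):
    """Central-difference approximation of the gradient of F."""
    out = []
    for j in range(len(x)):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        out.append((F(xp) - F(xm)) / (2.0 * h))
    return out

x0 = [0.5, -0.3]
print(gradient(x0), numeric_gradient(x0))
```

The two results should agree to within the finite-difference error, which is a handy sanity check before a hand-coded Jacobian is used inside an optimizer.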

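Hessian

$$[\nabla^2 F(x)]_{k,j} = \frac{\partial^2 F(x)}{\partial x_k\, \partial x_j} = 2 \sum_{i=1}^{N} \left\{ \frac{\partial v_i(x)}{\partial x_k} \frac{\partial v_i(x)}{\partial x_j} + v_i(x) \frac{\partial^2 v_i(x)}{\partial x_k\, \partial x_j} \right\}$$

$$\nabla^2 F(x) = 2 J^T(x)\, J(x) + 2 S(x)$$

$$S(x) = \sum_{i=1}^{N} v_i(x)\, \nabla^2 v_i(x)$$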

Gauss-Newton Method

Approximate the Hessian matrix as:

$$\nabla^2 F(x) \cong 2 J^T(x)\, J(x)$$

Newton's method becomes:

$$x_{k+1} = x_k - [2 J^T(x_k) J(x_k)]^{-1}\, 2 J^T(x_k)\, v(x_k) = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k)\, v(x_k)$$
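A minimal sketch of one Gauss-Newton step for a two-parameter problem, solving the 2x2 normal equations by hand. The Rosenbrock-style residuals and the starting point are assumptions chosen for illustration and do not come from the slides.

```python
def residuals(x):
    """Assumed residuals: v1 = 10*(x2 - x1^2), v2 = 1 - x1."""
    return [10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]]

def jacobian(x):
    return [[-20.0 * x[0], 10.0],
            [-1.0, 0.0]]

def sum_sq(x):
    return sum(v * v for v in residuals(x))

def gauss_newton_step(x):
    """Return x - (J^T J)^{-1} J^T v for a two-parameter problem."""
    v, J = residuals(x), jacobian(x)
    h11 = sum(r[0] * r[0] for r in J)           # elements of J^T J
    h12 = sum(r[0] * r[1] for r in J)
    h22 = sum(r[1] * r[1] for r in J)
    g1 = sum(r[0] * vi for r, vi in zip(J, v))  # elements of J^T v
    g2 = sum(r[1] * vi for r, vi in zip(J, v))
    det = h11 * h22 - h12 * h12
    dx1 = (-g1 * h22 + g2 * h12) / det          # solve (J^T J) dx = -J^T v
    dx2 = (-g2 * h11 + g1 * h12) / det
    return [x[0] + dx1, x[1] + dx2]

x0 = [-1.2, 1.0]
x1 = gauss_newton_step(x0)
x2 = gauss_newton_step(x1)
print(x1, sum_sq(x1))
print(x2, sum_sq(x2))
```

On this assumed problem the first step lands near (1, -3.84), where the sum of squares is larger than at the start, and the second step reaches the minimum at (1, 1). A Gauss-Newton step is therefore not guaranteed to reduce F.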

Levenberg-Marquardt

Gauss-Newton approximates the Hessian by:

$$H = J^T J$$

This matrix may be singular, but can be made invertible as follows:

$$G = H + \mu I$$

If the eigenvalues and eigenvectors of H are:

$$\{\lambda_1, \lambda_2, \ldots, \lambda_n\}, \qquad \{z_1, z_2, \ldots, z_n\}$$

then

$$G z_i = [H + \mu I] z_i = H z_i + \mu z_i = \lambda_i z_i + \mu z_i = (\lambda_i + \mu) z_i$$

so the eigenvalues of G are $\lambda_i + \mu$. The Levenberg-Marquardt update is:

$$x_{k+1} = x_k - [J^T(x_k) J(x_k) + \mu_k I]^{-1} J^T(x_k)\, v(x_k)$$
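The eigenvalue shift can be seen on a small example. The rank-deficient matrix `J` below is an assumed example, and `sym_eigenvalues` is a helper written for this sketch using the closed form for symmetric 2x2 matrices.

```python
import math

def sym_eigenvalues(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], ascending."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return (mean - radius, mean + radius)

# Assumed Jacobian with linearly dependent columns, so J^T J is singular.
J = [[1.0, 2.0],
     [2.0, 4.0]]
H = [[sum(r[i] * r[j] for r in J) for j in range(2)] for i in range(2)]

mu = 0.01
G = [[H[i][j] + (mu if i == j else 0.0) for j in range(2)] for i in range(2)]

print(sym_eigenvalues(H))   # smallest eigenvalue is zero: H is not invertible
print(sym_eigenvalues(G))   # every eigenvalue shifted up by mu
```

Since $J^TJ$ is positive semidefinite, any $\mu > 0$ makes all eigenvalues of G strictly positive, so the linear system in the update always has a unique solution.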

Adjustment of $\mu_k$

As $\mu_k \to 0$, LM becomes Gauss-Newton.

$$x_{k+1} = x_k - [J^T(x_k) J(x_k)]^{-1} J^T(x_k)\, v(x_k)$$

As $\mu_k \to \infty$, LM becomes Steepest Descent with small learning rate.

$$x_{k+1} \cong x_k - \frac{1}{\mu_k} J^T(x_k)\, v(x_k) = x_k - \frac{1}{2\mu_k} \nabla F(x)$$

Therefore, begin with a small $\mu_k$ to use Gauss-Newton and speed convergence. If a step does not yield a smaller F(x), then repeat the step with an increased $\mu_k$ until F(x) is decreased. F(x) must decrease eventually, since we will be taking a very small step in the steepest descent direction.
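The adjustment strategy can be sketched as a loop that divides $\mu_k$ by a factor after a successful step and multiplies it by the same factor after a failed one. The residuals, the starting point, the factor `upsilon = 10`, the initial `mu` and the stopping thresholds are all assumptions for illustration.

```python
def residuals(x):
    """Assumed residuals: v1 = 10*(x2 - x1^2), v2 = 1 - x1."""
    return [10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]]

def jacobian(x):
    return [[-20.0 * x[0], 10.0],
            [-1.0, 0.0]]

def sum_sq(x):
    return sum(v * v for v in residuals(x))

def lm_minimize(x, mu=0.01, upsilon=10.0, max_iter=100,
                f_tol=1e-20, mu_max=1e12):
    """Levenberg-Marquardt for a two-parameter least-squares problem."""
    F = sum_sq(x)
    for _ in range(max_iter):
        if F < f_tol:
            break
        v, J = residuals(x), jacobian(x)
        h11 = sum(r[0] * r[0] for r in J)
        h12 = sum(r[0] * r[1] for r in J)
        h22 = sum(r[1] * r[1] for r in J)
        g1 = sum(r[0] * vi for r, vi in zip(J, v))
        g2 = sum(r[1] * vi for r, vi in zip(J, v))
        while True:
            a11, a22 = h11 + mu, h22 + mu        # J^T J + mu*I
            det = a11 * a22 - h12 * h12
            dx1 = (-g1 * a22 + g2 * h12) / det
            dx2 = (-g2 * a11 + g1 * h12) / det
            x_new = [x[0] + dx1, x[1] + dx2]
            F_new = sum_sq(x_new)
            if F_new < F:                        # success: move toward Gauss-Newton
                x, F = x_new, F_new
                mu /= upsilon
                break
            mu *= upsilon                        # failure: move toward steepest descent
            if mu > mu_max:
                return x, F, mu                  # no further decrease found
    return x, F, mu

x_lm, F_lm, mu_lm = lm_minimize([-1.2, 1.0])
print(x_lm, F_lm, mu_lm)
```

Because a step is accepted only when it lowers the sum of squares, the error sequence is monotonically decreasing, unlike the undamped Gauss-Newton iteration.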

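Application to Multilayer Network

The performance index for the multilayer network is:

$$F(x) = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) = \sum_{q=1}^{Q} e_q^T e_q = \sum_{q=1}^{Q} \sum_{j=1}^{S^M} (e_{j,q})^2 = \sum_{i=1}^{N} (v_i)^2$$

The error vector is:

$$v^T = [\, v_1 \;\; v_2 \;\; \cdots \;\; v_N \,] = [\, e_{1,1} \;\; e_{2,1} \;\; \cdots \;\; e_{S^M,1} \;\; e_{1,2} \;\; \cdots \;\; e_{S^M,Q} \,]$$

The parameter vector is:

$$x^T = [\, x_1 \;\; x_2 \;\; \cdots \;\; x_n \,] = [\, w^1_{1,1} \;\; w^1_{1,2} \;\; \cdots \;\; w^1_{S^1,R} \;\; b^1_1 \;\; \cdots \;\; b^1_{S^1} \;\; w^2_{1,1} \;\; \cdots \;\; b^M_{S^M} \,]$$

The dimensions of the two vectors are:

$$N = Q \times S^M, \qquad n = S^1(R+1) + S^2(S^1+1) + \cdots + S^M(S^{M-1}+1)$$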
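The two dimension formulas are easy to evaluate for a given architecture. The function names below are assumptions, and the 1-2-1 network with five training pairs is used purely as an illustration.

```python
def parameter_count(R, layer_sizes):
    """n = S1*(R+1) + S2*(S1+1) + ... + SM*(S(M-1)+1)."""
    sizes = [R] + list(layer_sizes)
    return sum(s_out * (s_in + 1) for s_in, s_out in zip(sizes, sizes[1:]))

def error_count(Q, output_size):
    """N = Q * S^M: one error per output neuron per training pair."""
    return Q * output_size

n = parameter_count(R=1, layer_sizes=[2, 1])   # 1-2-1 network
N = error_count(Q=5, output_size=1)
print(n, N)   # the Jacobian for this case has N rows and n columns
```

For the 1-2-1 network this gives n = 7 parameters: four in the first layer and three in the second.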

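Jacobian Matrix

$$J(x) = \begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,1}}{\partial b^1_1} & \cdots \\
\dfrac{\partial e_{2,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{2,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{2,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{2,1}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots & \\
\dfrac{\partial e_{S^M,1}}{\partial w^1_{1,1}} & \dfrac{\partial e_{S^M,1}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{S^M,1}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{S^M,1}}{\partial b^1_1} & \cdots \\
\dfrac{\partial e_{1,2}}{\partial w^1_{1,1}} & \dfrac{\partial e_{1,2}}{\partial w^1_{1,2}} & \cdots & \dfrac{\partial e_{1,2}}{\partial w^1_{S^1,R}} & \dfrac{\partial e_{1,2}}{\partial b^1_1} & \cdots \\
\vdots & \vdots & & \vdots & \vdots &
\end{bmatrix}$$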

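Computing the Jacobian

SDBP computes terms like:

$$\frac{\partial \hat F(x)}{\partial x_l} = \frac{\partial\, e_q^T e_q}{\partial x_l}$$

using the chain rule:

$$\frac{\partial \hat F}{\partial w^m_{i,j}} = \frac{\partial \hat F}{\partial n^m_i} \times \frac{\partial n^m_i}{\partial w^m_{i,j}}$$

where the sensitivity

$$s^m_i \equiv \frac{\partial \hat F}{\partial n^m_i}$$

is computed using backpropagation.

For the Jacobian we need to compute terms like:

$$[J]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial x_l}$$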

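Marquardt Sensitivity

If we define a Marquardt sensitivity:

$$\tilde s^m_{i,h} \equiv \frac{\partial v_h}{\partial n^m_{i,q}} = \frac{\partial e_{k,q}}{\partial n^m_{i,q}}, \qquad h = (q-1)S^M + k$$

We can compute the Jacobian as follows:

weight

$$[J]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial w^m_{i,j}} = \frac{\partial e_{k,q}}{\partial n^m_{i,q}} \times \frac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \times \frac{\partial n^m_{i,q}}{\partial w^m_{i,j}} = \tilde s^m_{i,h} \times a^{m-1}_{j,q}$$

bias

$$[J]_{h,l} = \frac{\partial v_h}{\partial x_l} = \frac{\partial e_{k,q}}{\partial b^m_i} = \frac{\partial e_{k,q}}{\partial n^m_{i,q}} \times \frac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h} \times \frac{\partial n^m_{i,q}}{\partial b^m_i} = \tilde s^m_{i,h}$$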

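Computing the Sensitivities

Initialization

$$\tilde s^M_{i,h} = \frac{\partial v_h}{\partial n^M_{i,q}} = \frac{\partial e_{k,q}}{\partial n^M_{i,q}} = \frac{\partial (t_{k,q} - a^M_{k,q})}{\partial n^M_{i,q}} = -\frac{\partial a^M_{k,q}}{\partial n^M_{i,q}}$$

$$\tilde s^M_{i,h} = \begin{cases} -\dot f^M(n^M_{i,q}) & \text{for } i = k \\ 0 & \text{for } i \ne k \end{cases}$$

$$\tilde S^M_q = -\dot F^M(n^M_q)$$

Backpropagation

$$\tilde S^m_q = \dot F^m(n^m_q)\, (W^{m+1})^T\, \tilde S^{m+1}_q$$

$$\tilde S^m = [\, \tilde S^m_1 \,|\, \tilde S^m_2 \,|\, \cdots \,|\, \tilde S^m_Q \,]$$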
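For a 1-2-1 network with log-sigmoid layers the initialization and backpropagation steps reduce to a few scalar products, and the resulting Jacobian can be compared with finite differences. The parameter ordering, the parameter values and the inputs below are assumptions for illustration.

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

def forward(x, p):
    """x = [w1_11, w1_21, b1_1, b1_2, w2_11, w2_12, b2] (assumed ordering)."""
    a1 = [logsig(x[0] * p + x[2]), logsig(x[1] * p + x[3])]
    a2 = logsig(x[4] * a1[0] + x[5] * a1[1] + x[6])
    return a1, a2

def errors(x, inputs, targets):
    return [t - forward(x, p)[1] for p, t in zip(inputs, targets)]

def jacobian(x, inputs):
    """One row per input q, built from the Marquardt sensitivities."""
    rows = []
    for p in inputs:
        a1, a2 = forward(x, p)
        s2 = -a2 * (1.0 - a2)                       # initialization: -f'(n2)
        s1 = [a1[i] * (1.0 - a1[i]) * x[4 + i] * s2  # backpropagation to layer 1
              for i in range(2)]
        rows.append([s1[0] * p, s1[1] * p,           # weights: sensitivity * input
                     s1[0], s1[1],                   # biases: sensitivity
                     s2 * a1[0], s2 * a1[1], s2])
    return rows

def numeric_jacobian(x, inputs, targets, h=1e-6):
    """Central differences of each error with respect to each parameter."""
    cols = []
    for l in range(len(x)):
        xp = list(x); xp[l] += h
        xm = list(x); xm[l] -= h
        ep, em = errors(xp, inputs, targets), errors(xm, inputs, targets)
        cols.append([(u - v) / (2.0 * h) for u, v in zip(ep, em)])
    return [[cols[l][q] for l in range(len(x))] for q in range(len(inputs))]

x = [0.5, -0.3, 0.1, 0.2, 0.8, -0.6, 0.05]
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
targets = [0.0] * len(inputs)
J_bp = jacobian(x, inputs)
J_fd = numeric_jacobian(x, inputs, targets)
print(max(abs(a - b) for ra, rb in zip(J_bp, J_fd) for a, b in zip(ra, rb)))
```

The targets do not enter the Jacobian, since each error is $t - a$ and the target is constant with respect to the parameters. That is why `jacobian` only needs the inputs.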

LMBP

1. Present all inputs to the network and compute the corresponding network outputs and the errors. Compute the sum of squared errors over all inputs.

2. Compute the Jacobian matrix. Calculate the sensitivities with the backpropagation algorithm, after initializing. Augment the individual matrices into the Marquardt sensitivities. Compute the elements of the Jacobian matrix.

3. Solve to obtain the change in the weights.

4. Recompute the sum of squared errors with the new weights. If this new sum of squares is smaller than that computed in step 1, then divide $\mu_k$ by υ, update the weights and go back to step 1. If the sum of squares is not reduced, then multiply $\mu_k$ by υ and go back to step 3.

Example LMBP Step [figure: single step on the $w^1_{1,1}$, $w^2_{1,1}$ contour plot]

LMBP Trajectory [figure: trajectory on the $w^1_{1,1}$, $w^2_{1,1}$ contour plot]

Ch 12: Variations on Backpropagation

The basic backpropagation algorithm is too slow for most practical applications. It may take days or weeks of computer time. We demonstrate why the backpropagation algorithm