Conjugate gradient training algorithm


1 Conjugate gradient training algorithm
So far: heuristic improvements to gradient descent (momentum), the steepest descent training algorithm. Can we do better?
Next: the conjugate gradient training algorithm. Overview, derivation, examples.

Steepest descent algorithm
Definitions: w(j) is the weight vector at step j, g(j) = ∇E[w(j)] is the gradient at step j, d(j) is the search direction at step j.
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j), i.e. find η_j such that E(w(j) + η_j d(j)) ≤ E(w(j) + η d(j)) for all η.
3. Let w(j+1) = w(j) + η_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1).
6. Let j = j + 1 and go to step 2.

Remember previous examples
[Figure: contour plots of the quadratic and non-quadratic error surfaces E(ω1, ω2) used in earlier lectures.]
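As a concrete illustration of the steepest descent algorithm above, here is a minimal Python/numpy sketch, assuming a small hand-picked quadratic error surface E(w) = E_0 + b^T w + (1/2) w^T H w; the matrices H and b and the function names are illustrative, not taken from the slides. For a quadratic surface the exact line minimization has the closed form η = -(g^T d) / (d^T H d), which the sketch uses in place of a numerical 1-D search.

```python
import numpy as np

# Illustrative quadratic error surface E(w) = E0 + b^T w + 0.5 w^T H w (not from the slides).
H = np.array([[3.0, 1.0], [1.0, 2.0]])    # positive-definite "Hessian"
b = np.array([-1.0, 1.0])
E0 = 0.0

def E(w):      return E0 + b @ w + 0.5 * w @ H @ w
def grad_E(w): return b + H @ w            # g(w) = b + H w

w = np.array([2.0, 2.0])                   # step 1: initial weight vector w(1)
for j in range(100):
    g = grad_E(w)
    d = -g                                 # steepest-descent direction d(j) = -g(j)
    eta = -(g @ d) / (d @ H @ d)           # step 2: exact line minimization (quadratic case)
    w = w + eta * d                        # step 3: weight update
    if np.linalg.norm(grad_E(w)) < 1e-10:  # step 4: evaluate the new gradient
        break

print("converged after", j + 1, "steps; w =", w, "; exact minimum =", np.linalg.solve(H, -b))
```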

2 Steepest descent algorithm examples
[Figure: steepest descent trajectories on the quadratic error surface and on the non-quadratic surface E(ω1, ω2); in both cases the path zig-zags and many steps are needed to converge.]

Conjugate gradient algorithm (a sneak peek)
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j), i.e. find η_j such that E(w(j) + η_j d(j)) ≤ E(w(j) + η d(j)) for all η.
3. Let w(j+1) = w(j) + η_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j)).
6. Let j = j + 1 and go to step 2.
[Figure: conjugate gradient trajectories on the same quadratic and non-quadratic surfaces; convergence takes far fewer steps.]
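A quick numerical preview of the difference, on an illustrative ill-conditioned 2-D quadratic (not the slides' example): both loops below use the exact quadratic line-minimization step, and the conjugate gradient branch uses the β_j from step 5 above. On a W-dimensional quadratic, conjugate gradient finishes in W steps, while steepest descent keeps zig-zagging.

```python
import numpy as np

# Illustrative 2-D quadratic with minimum at the origin (not the slides' example).
H = np.array([[4.0, 0.0], [0.0, 1.0]])
b = np.zeros(2)
grad = lambda w: b + H @ w

def run(use_cg, w0=(1.0, 10.0), tol=1e-10, max_iter=200):
    w = np.array(w0)
    g = grad(w)
    d = -g
    for it in range(1, max_iter + 1):
        eta = -(g @ d) / (d @ H @ d)                 # exact line minimization on a quadratic
        w = w + eta * d
        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            return it
        if use_cg:
            beta = (g_new @ (g_new - g)) / (g @ g)   # Polak-Ribiere beta
            d = -g_new + beta * d
        else:
            d = -g_new                               # steepest descent
        g = g_new
    return max_iter

print("steepest descent:", run(False), "steps;  conjugate gradient:", run(True), "steps")
```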

3 SD vs CG
In steepest descent, the line minimization makes the new gradient orthogonal to the old search direction: g[w(t+1)]^T d(t) = 0. (What does this mean?)
Key question: why/how does conjugate gradient improve on this?
Key difference: a new choice of search direction. Very little additional computation (over steepest descent). No more oscillation back and forth. Exploits knowledge of the local quadratic properties of the error surface.

Conjugate gradients: a first look
Non-interfering directions: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η. (What the #$@!# does this mean?)
How do we achieve non-interfering directions?
FACT: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η implies d(t+1)^T H d(t) = 0 (H-orthogonality, conjugacy).
Hmmm... we need to pay attention to 2nd-order properties of the error surface:
E(w) ≈ E_0 + b^T w + (1/2) w^T H w
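Both conditions can be checked numerically. A small sketch, again assuming a made-up quadratic surface: after an exact line minimization the new gradient is indeed orthogonal to the old steepest-descent direction, but consecutive steepest-descent directions are not H-orthogonal, which is exactly what conjugate gradients will fix.

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])     # illustrative quadratic: g(w) = b + H w
b = np.array([-1.0, 1.0])
grad_E = lambda w: b + H @ w

w = np.array([2.0, 2.0])
g = grad_E(w)
d_old = -g
eta = -(g @ d_old) / (d_old @ H @ d_old)   # exact line minimization along d_old
w_new = w + eta * d_old
g_new = grad_E(w_new)
d_new = -g_new                             # next steepest-descent direction

print("g[w(t+1)]^T d(t) =", g_new @ d_old)      # ~0: line minimization stops where the
                                                # new gradient is orthogonal to d(t)
print("d(t+1)^T H d(t)  =", d_new @ H @ d_old)  # generally NOT 0: steepest-descent
                                                # directions are not H-orthogonal
```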

4 Show the H-orthogonality requirement
Approximate g(w) about w(t+1) by a 1st-order Taylor approximation:
g(w) ≈ g[w(t+1)] + H [w - w(t+1)]
Evaluate g(w) at w = w(t+1) + η d(t+1):
g[w(t+1) + η d(t+1)] ≈ g[w(t+1)] + η H d(t+1)
Post-multiply (the transpose) by d(t).
Left-hand side: g[w(t+1) + η d(t+1)]^T d(t) = 0 (assumption of non-interference).
Right-hand side: g[w(t+1)]^T d(t) + η d(t+1)^T H d(t); the first term is 0 (implication of line minimization), so η d(t+1)^T H d(t) = 0 for all η, and therefore d(t+1)^T H d(t) = 0.
Proven fact: g[w(t+1) + η d(t+1)]^T d(t) = 0 for all η implies d(t+1)^T H d(t) = 0 (H-orthogonality).

Derivation of conjugate gradient algorithm
Local quadratic assumption: E(w) ≈ E_0 + b^T w + (1/2) w^T H w.
Assume W mutually conjugate vectors d(j). Key: we need to construct consecutive search directions d(j) that are conjugate (H-orthogonal)!
(Side note: what is the implicit assumption of SD?)
Question: starting from the initial weight vector w(1), how do we converge to the minimum w* in (at most) W steps?

5 Step-wise optimization
Expand the difference between the minimum and the initial weight vector in terms of the conjugate directions:
w* - w(1) = Σ_{i=1}^{W} α_i d(i)   (why can I do this?)
so that w(j) = w(1) + Σ_{i=1}^{j-1} α_i d(i) and w(j+1) = w(j) + α_j d(j).

Linear independence of conjugate directions
Theorem: for a positive-definite square matrix H, H-orthogonal vectors {d(1), d(2), ..., d(k)} are linearly independent.
Proof. Linear independence means that α_1 d(1) + α_2 d(2) + ... + α_k d(k) = 0 iff α_i = 0 for all i.
Pre-multiply by d(i)^T H:
α_1 d(i)^T H d(1) + α_2 d(i)^T H d(2) + ... + α_k d(i)^T H d(k) = 0
Note (by assumption): d(i)^T H d(j) = 0 for i ≠ j, so this reduces to α_i d(i)^T H d(i) = 0.
However, d(i)^T H d(i) > 0 because H is positive definite. Therefore α_i = 0 for all i ∈ {1, 2, ..., k}.

6 Linear independence of conjugate directions
From linear independence: W H-orthogonal vectors form a complete basis set, so any vector v can be expressed as v = Σ_{i=1}^{W} α_i d(i).
So, why did we need this result?

Step-wise optimization
w* - w(1) = Σ_{i=1}^{W} α_i d(i)
w(j) = w(1) + Σ_{i=1}^{j-1} α_i d(i),   w(j+1) = w(j) + α_j d(j)   (Ah-ha!)

So where are we now?
On a locally quadratic surface we can converge to the minimum in, at most, W steps using w(j+1) = w(j) + α_j d(j), j ∈ {1, 2, ..., W}.
Big questions: How to choose the step size α_j? How to construct conjugate directions? How can we do everything without computing H?

Computing the correct step size
Given a set of W conjugate vectors d(i), start from w* - w(1) = Σ_{i=1}^{W} α_i d(i) and pre-multiply by d(j)^T H:
d(j)^T H (w* - w(1)) = Σ_{i=1}^{W} α_i d(j)^T H d(i)

7 Computing the correct step size
By H-orthogonality (conjugacy), all terms with i ≠ j vanish (why?):
d(j)^T H (w* - w(1)) = α_j d(j)^T H d(j),   so   α_j = d(j)^T H (w* - w(1)) / (d(j)^T H d(j))
Also, from E(w) = E_0 + b^T w + (1/2) w^T H w we have g(w) = b + H w, and at the minimum g(w*) = b + H w* = 0, i.e. H w* = -b. Substituting:
α_j = d(j)^T (-b - H w(1)) / (d(j)^T H d(j))   (what's the problem?)
Now use w(j) = w(1) + Σ_{i=1}^{j-1} α_i d(i) and pre-multiply by d(j)^T H: the cross terms d(j)^T H d(i), i < j, all vanish, so d(j)^T H w(j) = d(j)^T H w(1). Therefore
α_j = -d(j)^T (b + H w(j)) / (d(j)^T H d(j))

8 Computing the correct step size
Since g(j) = b + H w(j), the step size becomes
α_j = -d(j)^T g(j) / (d(j)^T H d(j))   (woo-hoo!)

Important consequence
Theorem: assuming a W-dimensional quadratic error surface E(w) = E_0 + b^T w + (1/2) w^T H w and H-orthogonal vectors d(i), i ∈ {1, 2, ..., W}, the update
w(j+1) = w(j) + α_j d(j),   α_j = -d(j)^T g(j) / (d(j)^T H d(j))
will converge in at most W steps to the minimum w*. (For what error surface? Why is this so?)

Orthogonality of gradient to previous search directions
FACT: g(j+1)^T d(k) = (b + H w(j+1))^T d(k) = 0 for all k ≤ j.
How is this important? How is this different from steepest descent? Let's show that this is true. Starting from w(j+1) = w(j) + α_j d(j) and applying H:
H (w(j+1) - w(j)) = α_j H d(j),   i.e.   g(j+1) - g(j) = α_j H d(j)
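Before continuing with the proof of the orthogonality fact, a numerical sanity check of the W-step theorem above, under stated assumptions: H and b below are an arbitrary positive-definite quadratic (not from the slides), and the W conjugate directions are taken to be the eigenvectors of H, which are mutually H-orthogonal because H is symmetric.

```python
import numpy as np

# Illustrative W = 3 quadratic surface: g(w) = b + H w, minimum at H w* = -b.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
H = A @ A.T + 3 * np.eye(3)            # positive-definite Hessian
b = rng.standard_normal(3)
grad = lambda w: b + H @ w

# Eigenvectors of a symmetric H are mutually H-orthogonal, so they serve as a
# ready-made set of W conjugate directions for this check.
_, V = np.linalg.eigh(H)
directions = [V[:, i] for i in range(3)]

w = rng.standard_normal(3)             # arbitrary initial weight vector w(1)
for d in directions:
    g = grad(w)
    alpha = -(d @ g) / (d @ H @ d)     # alpha_j = -d(j)^T g(j) / (d(j)^T H d(j))
    w = w + alpha * d

print("after W = 3 steps:", w)
print("true minimum     :", np.linalg.solve(H, -b))
```

The two printed vectors agree to machine precision, regardless of the initial w(1) or of the order in which the conjugate directions are used.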

9 Orthogonality of gradient to previous search directions
Pre-multiply g(j+1) - g(j) = α_j H d(j) by d(j)^T:
d(j)^T (g(j+1) - g(j)) = α_j d(j)^T H d(j) = -d(j)^T g(j)
(substituting α_j = -d(j)^T g(j) / (d(j)^T H d(j))), so d(j)^T g(j+1) = 0.
We still need to show this for all k < j. Pre-multiply g(j+1) - g(j) = α_j H d(j) by d(k)^T, k < j:
d(k)^T (g(j+1) - g(j)) = α_j d(k)^T H d(j) = 0   (why?)
so d(k)^T g(j+1) = d(k)^T g(j) for k < j. By induction, d(k)^T g(j+1) = 0 for all k ≤ j.
For example: d(j-1)^T g(j+1) = d(j-1)^T g(j) = 0, d(j-2)^T g(j+1) = d(j-2)^T g(j) = d(j-2)^T g(j-1) = 0, and so on.
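The same kind of check also illustrates the FACT just proven: after step j, the new gradient is orthogonal to every previous search direction, not only the most recent one (the quadratic and the eigenvector-based conjugate directions below are again illustrative).

```python
import numpy as np

A = np.array([[2.0, 0.5, 0.0], [0.5, 1.5, 0.3], [0.0, 0.3, 1.0]])
H = A @ A.T + np.eye(3)                 # illustrative positive-definite Hessian
b = np.array([1.0, -2.0, 0.5])
grad = lambda w: b + H @ w

_, V = np.linalg.eigh(H)                # eigenvectors: a mutually conjugate direction set
dirs = [V[:, i] for i in range(3)]

w = np.array([1.0, 1.0, 1.0])
for j, d in enumerate(dirs):
    g = grad(w)
    w = w + (-(d @ g) / (d @ H @ d)) * d
    g_new = grad(w)
    # g(j+1)^T d(k) for every k <= j: all entries are ~0
    print([round(float(g_new @ dirs[k]), 10) for k in range(j + 1)])
```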

10 So where are we now?
On a locally quadratic surface we can converge to the minimum in, at most, W steps using
w(j+1) = w(j) + α_j d(j),   α_j = -d(j)^T g(j) / (d(j)^T H d(j)),   j ∈ {1, 2, ..., W}.
Remaining big questions: How to construct conjugate directions? How can we do everything without computing H?

Theorem: let d(1) = -g(1), let g(j) be the gradient at step j, and let d(j+1) be defined as follows:
d(j+1) = -g(j+1) + β_j d(j),   β_j = g(j+1)^T H d(j) / (d(j)^T H d(j))
This construction generates W mutually H-orthogonal vectors, such that d(i)^T H d(j) = 0 for i ≠ j.

Proof, Part I
First goal: show that d(j+1)^T H d(j) = 0. Begin with d(j+1) = -g(j+1) + β_j d(j), transpose, and post-multiply by H d(j):
d(j+1)^T H d(j) = -g(j+1)^T H d(j) + β_j d(j)^T H d(j) = -g(j+1)^T H d(j) + g(j+1)^T H d(j) = 0
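The construction can already be exercised numerically (the rest of the proof follows on the next pages). A sketch on an illustrative quadratic, using the H-dependent form of β_j above; the H-free forms come later.

```python
import numpy as np

# Illustrative W = 4 quadratic: g(w) = b + H w (not from the slides).
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)
b = rng.standard_normal(4)
grad = lambda w: b + H @ w

w = rng.standard_normal(4)
g = grad(w)
d = -g                                       # d(1) = -g(1)
D = [d]
for j in range(3):                           # build d(2), d(3), d(4)
    alpha = -(d @ g) / (d @ H @ d)           # exact step along d(j)
    w = w + alpha * d
    g_new = grad(w)
    beta = (g_new @ (H @ d)) / (d @ H @ d)   # beta_j = g(j+1)^T H d(j) / (d(j)^T H d(j))
    d = -g_new + beta * d                    # d(j+1) = -g(j+1) + beta_j d(j)
    D.append(d)
    g = g_new

# Mutual H-orthogonality: D^T H D should be diagonal (off-diagonal entries ~0).
Dmat = np.column_stack(D)
print(np.round(Dmat.T @ H @ Dmat, 8))
```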

11 Proof, Part II
Second goal: show that d(j+1)^T H d(i) = 0 for i < j.
Remember that w(i+1) = w(i) + α_i d(i) and g(w) = b + H w, so
g(i+1) - g(i) = H (w(i+1) - w(i)) = α_i H d(i),   i.e.   H d(i) = (1/α_i) (g(i+1) - g(i))   (from before).
Begin with d(j+1) = -g(j+1) + β_j d(j), transpose, and post-multiply by H d(i) for i < j:
d(j+1)^T H d(i) = -g(j+1)^T H d(i) + β_j d(j)^T H d(i)
The second term is 0 by assumption (the previously constructed directions are conjugate) (why?). For the first term, substitute H d(i):
g(j+1)^T H d(i) = (1/α_i) g(j+1)^T (g(i+1) - g(i)) = (1/α_i) (g(j+1)^T g(i+1) - g(j+1)^T g(i)),   i < j

12 Proof, Part II (continued)
So d(j+1)^T H d(i) vanishes for i < j provided g(j+1)^T g(k) = 0 for k ≤ j; once we show that, d(j+1)^T H d(i) = 0, i < j, is proven.
From the construction, d(1) = -g(1) and d(k) = -g(k) + β_{k-1} d(k-1), so each direction is a linear combination of the current gradient and all previous gradients:
d(k) = -g(k) + Σ_{l=1}^{k-1} γ_l g(l)   (why?)
For example:
d(1) = -g(1)
d(2) = -g(2) + β_1 d(1) = -g(2) - β_1 g(1)
d(3) = -g(3) + β_2 d(2) = -g(3) - β_2 g(2) - β_2 β_1 g(1)
Transpose and post-multiply by g(j+1), for k ≤ j:
d(k)^T g(j+1) = -g(k)^T g(j+1) + Σ_{l=1}^{k-1} γ_l g(l)^T g(j+1)
and d(k)^T g(j+1) = 0 for k ≤ j (because the gradient is orthogonal to all previous search directions).

13 Proof, Part II (continued)
Since d(k)^T g(j+1) = 0 for k ≤ j, we get
g(k)^T g(j+1) = Σ_{l=1}^{k-1} γ_l g(l)^T g(j+1),   k ≤ j
Since d(1) = -g(1), the base case is g(1)^T g(j+1) = -d(1)^T g(j+1) = 0 for all j ≥ 1 (why?). By induction on k, g(k)^T g(j+1) = 0 for all k ≤ j. For example:
g(1)^T g(2) = 0,   g(1)^T g(3) = 0,   g(1)^T g(4) = 0   (why?)
g(2)^T g(3) = γ_1 g(1)^T g(3) = 0,   g(2)^T g(4) = γ_1 g(1)^T g(4) = 0   (why?)
g(3)^T g(4) = γ_1 g(1)^T g(4) + γ_2 g(2)^T g(4) = 0   (why?)
Thus g(j+1)^T g(k) = 0 for k ≤ j.

Where we are:
g(j+1)^T g(k) = 0, k ≤ j
d(j+1)^T H d(j) = 0   (Part I)
d(j+1)^T H d(i) = -(1/α_i) (g(j+1)^T g(i+1) - g(j+1)^T g(i)) = 0, i < j   (Part II)
Does this show what we want, d(i)^T H d(j) = 0 for all i ≠ j? Kinda.

14 The home stretch
d(j+1)^T H d(j) = 0 (Part I) and d(j+1)^T H d(i) = 0, i < j (Part II). By induction:
d(2)^T H d(1) = 0   (Part I)
d(3)^T H d(1) = 0   (Part II),   d(3)^T H d(2) = 0   (Part I)
d(4)^T H d(1) = 0   (Part II),   d(4)^T H d(2) = 0   (Part II),   d(4)^T H d(3) = 0   (Part I)
Etc., etc., etc.

So where are we now?
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Update the weight vector: w(j+1) = w(j) + α_j d(j), α_j = -d(j)^T g(j) / (d(j)^T H d(j)), j ∈ {1, 2, ..., W}.
3. Evaluate g(j+1).
4. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)).
5. Let j = j + 1 and go to step 2.
What's the problem? Both α_j and β_j still require the Hessian H.

Computing without H
Remaining big question: how can we do everything without computing H?
From earlier: H d(j) = (1/α_j) (g(j+1) - g(j)). Two areas to fix: the step size α_j and the coefficient β_j. For β_j, substitute:
β_j = g(j+1)^T H d(j) / (d(j)^T H d(j)) = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))   (Hestenes-Stiefel)

15 Computing without H
Simplify the denominator. Since d(j) = -g(j) + β_{j-1} d(j-1), transpose and post-multiply by g(j+1) and by g(j):
d(j)^T g(j+1) = 0   and   d(j)^T g(j) = -g(j)^T g(j)   (because d(j-1)^T g(j) = 0)
so d(j)^T (g(j+1) - g(j)) = g(j)^T g(j), giving
β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribiere)
Finally, since g(j+1)^T g(k) = 0 for k ≤ j,
β_j = g(j+1)^T g(j+1) / (g(j)^T g(j))   (Fletcher-Reeves)
Three choices:
β_j = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))   (Hestenes-Stiefel)
β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribiere)
β_j = g(j+1)^T g(j+1) / (g(j)^T g(j))   (Fletcher-Reeves)
Which is best?
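Written as code, the three H-free choices look like this; g and g_new stand for g(j) and g(j+1), d for d(j), and the function names are mine, not from the slides. On an exactly quadratic surface with exact line minimization all three coincide; on real, non-quadratic error surfaces they differ, and the complete algorithm on the next pages uses Polak-Ribiere.

```python
def beta_hestenes_stiefel(g, g_new, d):
    # beta_j = g(j+1)^T (g(j+1) - g(j)) / (d(j)^T (g(j+1) - g(j)))
    return (g_new @ (g_new - g)) / (d @ (g_new - g))

def beta_polak_ribiere(g, g_new):
    # beta_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))
    return (g_new @ (g_new - g)) / (g @ g)

def beta_fletcher_reeves(g, g_new):
    # beta_j = g(j+1)^T g(j+1) / (g(j)^T g(j))
    return (g_new @ g_new) / (g @ g)
```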

16 Computing without H
Key: replace the explicit step size α_j (which requires H) with a line minimization. Under the local quadratic assumption E(w) = E_0 + b^T w + (1/2) w^T H w,
E(w(j) + α d(j)) = E_0 + b^T (w(j) + α d(j)) + (1/2) (w(j) + α d(j))^T H (w(j) + α d(j))
Differentiating with respect to α (and using the symmetry of H):
dE/dα = b^T d(j) + d(j)^T H w(j) + α d(j)^T H d(j)
Setting dE/dα = 0 gives
α = -(b + H w(j))^T d(j) / (d(j)^T H d(j)) = -d(j)^T g(j) / (d(j)^T H d(j))
which is exactly the step size α_j derived earlier.
Conclusion: a line minimization along d(j) computes the correct α_j without any Hessian computation.
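A quick numerical confirmation of the conclusion: on a quadratic surface, a generic 1-D line minimization (here scipy's minimize_scalar, used purely as an illustration) recovers the same α as the closed-form expression that needed H.

```python
import numpy as np
from scipy.optimize import minimize_scalar

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative quadratic from the earlier sketches
b = np.array([-1.0, 1.0])
E = lambda w: b @ w + 0.5 * w @ H @ w

w = np.array([2.0, 2.0])
g = b + H @ w
d = -g
line = minimize_scalar(lambda a: E(w + a * d))    # 1-D minimization: no Hessian needed

print("line-search alpha :", line.x)
print("closed-form alpha :", -(g @ d) / (d @ H @ d))
```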

17 Complete conjugate gradient algorithm
1. Choose an initial weight vector w(1) and let d(1) = -g(1).
2. Perform a line minimization along d(j), i.e. find α_j such that E(w(j) + α_j d(j)) ≤ E(w(j) + α d(j)) for all α.
3. Let w(j+1) = w(j) + α_j d(j).
4. Evaluate g(j+1).
5. Let d(j+1) = -g(j+1) + β_j d(j), where β_j = g(j+1)^T (g(j+1) - g(j)) / (g(j)^T g(j))   (Polak-Ribiere).
6. Let j = j + 1 and go to step 2.

Comments
Exploitation of a reasonable assumption about the local quadratic nature of the error surface.
Little additional computation beyond steepest descent.
No Hessian computation required.
No hand-tuning of the learning rate.
In practice, the conjugate gradient algorithm must be reset every W steps. (Why? What about violations of the H > 0 assumption?)

Quadratic and non-quadratic examples
[Figure: steepest descent vs. conjugate gradient on the quadratic error surface and on the non-quadratic surface E(ω1, ω2); conjugate gradient converges in far fewer steps.]
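A minimal sketch of the complete algorithm in Python, assuming a numerical 1-D line search for step 2 and the common practice of restarting with d = -g every W steps; the helper names and the small test quadratic are illustrative, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(E, grad_E, w0, max_iter=200, tol=1e-8):
    """Sketch of the complete algorithm: line minimization along d(j),
    Polak-Ribiere beta, and a restart (d = -g) every W steps."""
    w = np.asarray(w0, dtype=float)
    W = w.size
    g = grad_E(w)
    d = -g                                              # step 1: d(1) = -g(1)
    for j in range(max_iter):
        alpha = minimize_scalar(lambda a: E(w + a * d)).x   # step 2: line minimization
        w = w + alpha * d                                   # step 3
        g_new = grad_E(w)                                   # step 4
        if np.linalg.norm(g_new) < tol:
            break
        if (j + 1) % W == 0:
            d = -g_new                                      # periodic reset every W steps
        else:
            beta = (g_new @ (g_new - g)) / (g @ g)          # step 5: Polak-Ribiere
            d = -g_new + beta * d
        g = g_new                                           # step 6: next iteration
    return w

# Usage on a small illustrative quadratic:
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
w_min = conjugate_gradient(lambda w: b @ w + 0.5 * w @ H @ w,
                           lambda w: b + H @ w,
                           [2.0, 2.0])
print(w_min, "vs", np.linalg.solve(H, -b))
```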

18 Nonquadratic example
[Figure: steepest descent vs. conjugate gradient trajectories on the non-quadratic surface E(ω1, ω2) for particular initial weights (why these initial weights?); the region where H > 0 is contrasted with the region where H < 0.]

Simple NN training example
[Figure: a small feedforward network with hidden-unit outputs z trained to fit a sinusoid y = sin(πx); steepest descent, conjugate gradient, and plain gradient descent are compared by the number of steps to convergence.]
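As a sketch of the kind of simple NN training example shown in the figures, the sum-of-squares error of a tiny 1-3-1 tanh network can be handed to an off-the-shelf nonlinear conjugate gradient routine (scipy's method='CG', a Polak-Ribiere variant). The network size, the data, and the target y = sin(πx) below are assumptions for illustration, not the slides' exact setup.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative 1-3-1 tanh network fit to a sinusoid (assumed setup, not the slides' exact one).
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(np.pi * x)

def unpack(w):
    W1, b1 = w[0:3], w[3:6]          # input -> 3 hidden units
    W2, b2 = w[6:9], w[9]            # 3 hidden units -> output
    return W1, b1, W2, b2

def predict(w, x):
    W1, b1, W2, b2 = unpack(w)
    z = np.tanh(np.outer(x, W1) + b1)    # hidden-unit outputs z1, z2, z3
    return z @ W2 + b2

def error(w):                            # sum-of-squares error E
    return 0.5 * np.sum((predict(w, x) - y) ** 2)

w0 = 0.5 * rng.standard_normal(10)
result = minimize(error, w0, method='CG')    # nonlinear conjugate gradient
print("final E =", result.fun, "after", result.nit, "iterations")
```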

19 Simple NN training example: error convergence
[Figure: log(E/N) versus number of training epochs for the gradient descent, steepest descent, and conjugate gradient algorithms.]

A closer look at convergence
[Figure: the network's approximation of the target function at several intermediate epochs of training.]

20 A closer look at convergence
[Figure: the conjugate gradient network's output at several intermediate epochs, approaching the target function.]

Final NN approximation: a closer look
[Figure: the hidden-unit outputs z1, z2, and z3 and the final network approximation, plotted against the input x.]

Conjugate gradient conclusions
Exploitation of a reasonable assumption about the local quadratic nature of the error surface.
Little additional computation beyond steepest descent.
No Hessian computation required.
No hand-tuning of the learning rate.
Much faster rate of convergence.
