IV. Performance Optimization
A. Steepest descent algorithm: definition; how to set up bounds on the learning rate; minimization along a line (varying learning rate); momentum learning; examples
B. Newton's method: definition; Gauss-Newton method; Levenberg-Marquardt method
C. Conjugate gradient method: definition; conjugate direction theorem; method implementation; example
References: [Hagan], [Moon]
7/4/6 EC446.SuFy6/MPF
Performance Optimization
Goal: how do we find the optimum (minimum) points located on the performance (error) surface F(x)?
NN context: the network progressively trains (learns) as it is presented feature vectors; learning is iterative, and the optimization schemes are iterative as well:
x_{k+1} = x_k + α_k p_k
where x collects the weights w and biases b, α_k is the learning rate, and p_k is the search direction.
Schemes investigated:
A. Steepest descent (initial scheme; minimization along a line)
B. Newton's method (Gauss-Newton; Levenberg-Marquardt)
C. Conjugate gradient
A. Steepest Descent
Goal: find p_k so that F(x_{k+1}) < F(x_k), where x_{k+1} = x_k + α_k p_k and p_k is the search direction.
Use a Taylor series expansion to find p_k (stop at the first-order approximation):
F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + g_k^T Δx_k, with g_k = ∇F(x)|_{x = x_k} and Δx_k = x_{k+1} − x_k = α_k p_k.
For a one-dimensional case: F(x_{k+1}) ≈ F(x_k) + F'(x_k)(x_{k+1} − x_k).
Pick p_k so that g_k^T Δx_k < 0, i.e., F(x_{k+1}) < F(x_k); the steepest descent choice is p_k = −g_k = −∇F(x_k).
For F(x) = ½ x^T A x + d^T x + c:
∇F(x) = A x + d
∇²F(x) = A
Example: F(x) = x_1² + 5x_2². Find ∇F(x), ∇²F(x), and the iterative expression for x_{k+1}.
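As a sketch of the fixed-rate steepest descent iteration on this example (the starting point, step size, and iteration count below are assumptions for illustration):

```python
# Steepest descent with a fixed learning rate on F(x) = x1^2 + 5*x2^2,
# whose gradient is [2*x1, 10*x2]. Start point and alpha are illustrative.

def grad(x):
    return [2.0 * x[0], 10.0 * x[1]]

def steepest_descent(x0, alpha, iters):
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # x_{k+1} = x_k - alpha * grad F(x_k)
        x = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
    return x

x = steepest_descent([1.0, 1.0], alpha=0.05, iters=200)
# For a sufficiently small alpha, x approaches the minimum at the origin.
```

With alpha = 0.05 both modes contract (factors 0.9 and 0.5 per step), so the iterates decay toward [0, 0]; a larger alpha would make the x_2 mode oscillate or diverge, as discussed next.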
A.2 What is the effect of α on the iterative scheme behavior?
[Plots omitted: as α increases, the trajectory goes from overdamped behavior, to underdamped (oscillatory) behavior, to unstable behavior.]
A.3 How to set up bounds on the learning rate α?
For the quadratic F(x) = ½ x^T A x + d^T x + c, ∇F(x) = A x + d, and the steepest descent iteration becomes:
x_{k+1} = x_k − α(A x_k + d) = (I − αA) x_k − αd
Overdamped/Underdamped Behavior
Define c_k = x_k − x_opt, where x_opt = −A^{-1}d is the minimum.
x_{k+1} = (I − αA) x_k − αd
c_{k+1} = x_{k+1} − x_opt = (I − αA)(c_k + x_opt) − αd − x_opt = (I − αA) c_k
(the constant terms cancel: (I − αA)x_opt − αd − x_opt = −A^{-1}d + αd − αd + A^{-1}d = 0).
With the eigendecomposition A = QΣQ^H, let v_k = Q^H c_k:
v_{k+1} = Q^H c_{k+1} = Q^H (I − αA) Q v_k = (I − αΣ) v_k
x_k = c_k + x_opt = Q v_k + x_opt = Σ_i q_i v_k(i) + x_opt, with v_k(i) = (1 − αλ_i)^k v_0(i).
The i-th mode changes sign at each iteration if (1 − αλ_i) < 0.
To ensure overdamped behavior, select α < 1/λ_i for all i, so that (1 − αλ_i)^k doesn't flip sign depending on whether k is even or odd. (Stability requires |1 − αλ_i| < 1 for all i, i.e., α < 2/λ_max.)
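The eigenvalue bounds above can be checked numerically; a minimal sketch for a 2×2 symmetric Hessian (here the Hessian diag(2, 10) of the earlier example F = x_1² + 5x_2², chosen for illustration):

```python
import math

# Learning-rate bounds from the Hessian eigenvalues (2x2 symmetric case).

def eig_2x2_sym(a, b, d):
    # Eigenvalues of [[a, b], [b, d]] via the trace/determinant formula.
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr / 4.0 - det)
    return tr / 2.0 - disc, tr / 2.0 + disc

lam_min, lam_max = eig_2x2_sym(2.0, 0.0, 10.0)
alpha_overdamped = 1.0 / lam_max   # below this, no mode flips sign
alpha_stable = 2.0 / lam_max       # stability limit: |1 - alpha*lam| < 1
```

For this Hessian, λ_max = 10, so the scheme stays overdamped for α < 0.1 and diverges for α > 0.2.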
Example: F(x) = x_1² + x_2² + x_1x_2 + x_1. Find the upper bound on α.
[Plots omitted: trajectories for α = .39 and α = .4.]
A.4 Minimization along a line
Alternative for estimating α: minimize F(x_{k+1}) = F(x_k + α p_k) at each iteration with respect to α, with p_k = −∇F(x_k).
For an arbitrary function this is difficult → look at the quadratic case F(x) = ½ x^T A x + d^T x + c first.
d/dα F(x_k + α p_k) = ∇F(x)^T|_{x = x_k + α p_k} p_k = [∇F(x_k) + α A p_k]^T p_k (expanding the gradient of the quadratic)
= ∇F(x_k)^T p_k + α p_k^T A p_k
Setting the derivative to zero gives:
α_k = −∇F(x_k)^T p_k / (p_k^T A p_k)
x_{k+1} = x_k − α_k ∇F(x_k)
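A minimal sketch of one exact line-search step on a quadratic (the matrix A and starting point are illustrative assumptions), which also exhibits the orthogonality of successive gradients discussed next:

```python
# Minimization along a line for F(x) = 1/2 x^T A x with p_k = -g_k:
# the exact step is alpha_k = (g_k^T g_k) / (g_k^T A g_k).

A = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical P.D. Hessian

def matvec(M, v):
    return [M[0][0]*v[0] + M[0][1]*v[1], M[1][0]*v[0] + M[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

x = [1.0, -0.5]
g = matvec(A, x)                      # gradient of 1/2 x^T A x is A x
alpha = dot(g, g) / dot(g, matvec(A, g))
x1 = [x[0] - alpha * g[0], x[1] - alpha * g[1]]
g1 = matvec(A, x1)
# The new gradient is orthogonal to the previous one: g1^T g = 0.
```

The orthogonality g1^T g = 0 is exactly the condition d/dα F(x_k + α p_k) = 0 at the minimizing step.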
[Contour plot omitted.]
Recall: α_k is computed so that F(x_k + α_k p_k) is minimum along the gradient line. At the minimizer x_{k+1}, d/dα F(x_k + α p_k) = ∇F(x_{k+1})^T p_k = 0, so the gradient at x_{k+1} is orthogonal to the gradient at x_k.
Example: for the given quadratic F(x) and initial condition x_0, do 2 iterations using minimization along a line, computing ∇F(x_k) = A x_k + d and α_k at each step.
Example: pattern recognition (2 classes)
Steepest descent for a 1-3-1 NN; two step sizes compared.
[Figure 4.7: pattern-recognition problem for a neural network. Decision output (solid line), NN output (dashed line).]
A.5 Momentum learning
The speed of convergence for steepest descent may improve if oscillations in the iteration scheme are reduced. Oscillations may be viewed as high-frequency noise, which can be smoothed out by a low-pass filter.
Basic steepest descent iteration: x_{k+1} = x_k + Δx_k, with Δx_k = −α∇F(x_k).
Modify as follows:
Δx_k = γΔx_{k−1} − (1 − γ)α∇F(x_k), with γ ∈ [0, 1]
γ = 0: Δx_k = −α∇F(x_k) → basic steepest descent
γ = 1: Δx_k = Δx_{k−1} → no slope update
Impact of momentum:
when both derivatives are of the same sign, accelerate in that direction;
when both derivatives have different signs, momentum provides a drag, which tends to minimize oscillations and stabilize behavior.
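The modified update can be sketched as follows (reusing the earlier F = x_1² + 5x_2² example; the values of α, γ, and the start point are assumptions):

```python
# Momentum learning: dx_k = gamma*dx_{k-1} - (1 - gamma)*alpha*grad F(x_k)

def grad(x):
    return [2.0 * x[0], 10.0 * x[1]]   # gradient of x1^2 + 5*x2^2

def momentum_descent(x0, alpha, gamma, iters):
    x, dx = list(x0), [0.0, 0.0]
    for _ in range(iters):
        g = grad(x)
        dx = [gamma * dx[i] - (1.0 - gamma) * alpha * g[i] for i in range(2)]
        x = [x[i] + dx[i] for i in range(2)]
    return x

x = momentum_descent([1.0, 1.0], alpha=0.1, gamma=0.8, iters=300)
# The low-pass-filtered steps damp the oscillatory x2 mode while still converging.
```

Note that gamma = 0 reduces the loop to the basic steepest descent update.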
Why does momentum learning work?
[Plots omitted: trajectory with momentum on the contour plot; error vs. iteration number.]
Effects of momentum learning and step size on the pattern-recognition NN example
α: step size; µ: momentum constant
B. Newton's Method
Recall the steepest descent scheme is based on a first-order expansion:
F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + g_k^T Δx_k
Newton's method is an extension of the expansion to 2nd order:
F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + g_k^T Δx_k + ½ Δx_k^T ∇²F(x_k) Δx_k
Restricting to quadratic functions F(x) = ½ x^T A x + d^T x + c:
∇F(x) = A x + d = 0 at the minimum ⇒ x* = −A^{-1}d
∇²F(x) = A
Find Δx_k so that F(x_k + Δx_k) is minimum:
d/dΔx_k [F(x_k) + g_k^T Δx_k + ½ Δx_k^T ∇²F(x_k) Δx_k] = g_k + ∇²F(x_k) Δx_k = 0
⇒ Δx_k = −[∇²F(x_k)]^{-1} ∇F(x_k)
The iteration becomes:
x_{k+1} = x_k + Δx_k = x_k − [∇²F(x_k)]^{-1} ∇F(x_k)
For quadratic functions F(x) = ½ x^T A x + d^T x + c:
x_{k+1} = x_k − A^{-1}(A x_k + d) = −A^{-1}d = x* (one step)
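The one-step convergence on a quadratic can be verified directly; a minimal sketch with an assumed 2×2 P.D. Hessian and linear term:

```python
# Newton's method on F(x) = 1/2 x^T A x + d^T x:
# x_{k+1} = x_k - A^{-1}(A x_k + d) lands on x* = -A^{-1} d in one step.

A = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical P.D. Hessian
d = [-1.0, 1.0]

def solve_2x2(M, b):
    # Solve M z = b by Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

x = [3.0, -2.0]                              # arbitrary start
g = [A[0][0]*x[0] + A[0][1]*x[1] + d[0],     # gradient A x + d
     A[1][0]*x[0] + A[1][1]*x[1] + d[1]]
step = solve_2x2(A, g)
x_new = [x[0] - step[0], x[1] - step[1]]     # one Newton step
x_star = solve_2x2(A, [-d[0], -d[1]])        # true minimum -A^{-1} d
```

Regardless of the starting point, x_new coincides with x_star, illustrating 1-step convergence on quadratics.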
B.1 What happens when F(x) is not quadratic and we use Newton's method in the iterative scheme?
[Plots omitted: true vs. approximated (quadratic) function; 1 iteration of Newton's scheme from two different initial conditions.]
Newton's method summary
Newton's method is based on a local approximation of F(x) by a quadratic function:
if F(x) is a quadratic → converges in 1 step;
if F(x) is not a quadratic → may converge to a local minimum or a saddle point, or may oscillate.
Newton's method is expensive:
need to compute ∇²F(x_k) at each iteration;
need to solve a linear system in ∇²F(x_k) at each iteration.
B.2 Gauss-Newton Method
x_{k+1} = x_k − [∇²F(x_k)]^{-1} ∇F(x_k): ∇²F(x_k) is expensive to compute → need to approximate it.
Assume F(x) is a sum of squared error terms:
F(x) = Σ_{i=1}^N v_i²(x) = v^T(x) v(x)
Then the j-th component of the gradient is:
[∇F(x)]_j = ∂F(x)/∂x_j = 2 Σ_{i=1}^N v_i(x) ∂v_i(x)/∂x_j
In matrix form, with the Jacobian J(x) = [∂v_i(x)/∂x_j]:
∇F(x) = 2 J^T(x) v(x)
For the Hessian:
[∇²F(x)]_{j,l} = 2 Σ_{i=1}^N { [∂v_i(x)/∂x_j][∂v_i(x)/∂x_l] + v_i(x) ∂²v_i(x)/(∂x_j ∂x_l) }
∇²F(x) = 2 J^T(x) J(x) + 2 S(x)
S(x) involves the 2nd-order derivative terms, which can be neglected, leading to the Gauss-Newton iteration:
x_{k+1} = x_k − [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k)
B.3 Levenberg-Marquardt Scheme
Gauss-Newton: x_{k+1} = x_k − [J^T(x_k) J(x_k)]^{-1} J^T(x_k) v(x_k)
J^T J may be singular or ill-conditioned; add a diagonal term for additional robustness in the numerical implementation:
x_{k+1} = x_k − [J^T(x_k) J(x_k) + µ_k I]^{-1} J^T(x_k) v(x_k)
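A minimal sketch of the Levenberg-Marquardt update on a tiny least-squares problem (the line-fit model, data, fixed µ, and iteration count below are all illustrative assumptions; practical codes adapt µ_k per step):

```python
# Levenberg-Marquardt on residuals v_i(x) = (x0 + x1*t_i) - y_i,
# so the Jacobian J has rows [1, t_i] (constant for this linear model).

t = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # data generated by y = 1 + 2t

def residuals(x):
    return [x[0] + x[1] * ti - yi for ti, yi in zip(t, y)]

def lm_step(x, mu):
    # Form J^T J + mu*I (2x2) and J^T v, then solve by Cramer's rule.
    a = len(t) + mu
    b = sum(t)
    c = sum(ti * ti for ti in t) + mu
    v = residuals(x)
    g0 = sum(v)
    g1 = sum(ti * vi for ti, vi in zip(t, v))
    det = a * c - b * b
    return [x[0] - (c * g0 - b * g1) / det,
            x[1] - (a * g1 - b * g0) / det]

x = [0.0, 0.0]
for _ in range(50):
    x = lm_step(x, mu=0.1)
# x approaches the least-squares fit [1, 2]; mu -> 0 recovers Gauss-Newton.
```

With µ = 0 this single step would solve the linear problem exactly (Gauss-Newton); the µI term trades a little bias per step for robustness when J^T J is near-singular.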
C. Conjugate Gradient Method (CG method)
Using 2nd-order information is often too expensive → go back to the 1st-order approximation.
For small problems: CG is less efficient than the Newton scheme.
For large problems: CG is a leading contender.
Notes:
1) Assume F(x) = ½ x^T A x + d^T x + c and that we want to compute the minimum of F(x).
2) Definition: mutually conjugate vectors with respect to a matrix A (A-orthogonal). A set of vectors {p_k} is mutually conjugate with respect to a P.D. Hessian matrix A iff:
p_k^T A p_j = 0, k ≠ j
Consequence: the eigenvectors of A are A-conjugate.
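The eigenvector consequence is easy to check numerically; a small sketch with an assumed 2×2 symmetric P.D. matrix:

```python
# Eigenvectors of a symmetric P.D. matrix A are A-conjugate: p1^T A p2 = 0.

A = [[2.0, 1.0], [1.0, 2.0]]
p1 = [1.0, 1.0]    # eigenvector for lambda = 3
p2 = [1.0, -1.0]   # eigenvector for lambda = 1

Ap2 = [A[0][0]*p2[0] + A[0][1]*p2[1], A[1][0]*p2[0] + A[1][1]*p2[1]]
conj = p1[0]*Ap2[0] + p1[1]*Ap2[1]   # p1^T A p2
```

Since A p2 = 1·p2 and p1 ⊥ p2 (symmetric matrices have orthogonal eigenvectors), the product p1^T A p2 vanishes.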
3) If the vectors {p_k}, k = 1, …, K, are non-zero and A-conjugate for a P.D. matrix A, then the set of vectors {p_k} is linearly independent.
Consequences:
1) One can minimize a quadratic by searching along the eigenvectors, as they are the main axes of the contour ellipsoids (however, eigenvectors require the Hessian, which is expensive).
2) With a set of exact line searches along a set of conjugate vectors, the minimum can be reached in n steps → the problem comes down to computing conjugate directions.
C.1 Conjugate direction theorem
Assume F(x) = ½ x^T A x + d^T x + c, where x ∈ R^n.
Let {p_0, …, p_{n−1}} be a set of A-conjugate vectors. For any initial condition x_0, the iteration:
x_{k+1} = x_k + α_k p_k, where α_k = −g_k^T p_k / (p_k^T A p_k), g_k = ∇F(x_k) = A x_k + d,
converges to the unique minimum x* of F(x) in n steps.
Proof:
Since the {p_k} are A-conjugate, they are linearly independent, so we may expand:
x* − x_0 = Σ_{j=0}^{n−1} α_j p_j
Multiplying by p_k^T A and using A-conjugacy (all cross terms vanish):
p_k^T A (x* − x_0) = α_k p_k^T A p_k ⇒ α_k = p_k^T A (x* − x_0) / (p_k^T A p_k), k = 0, …, n−1
Also x_k − x_0 = Σ_{j=0}^{k−1} α_j p_j, so p_k^T A (x_k − x_0) = 0, and therefore:
p_k^T A (x* − x_0) = p_k^T A (x* − x_k) = p_k^T ((A x* + d) − (A x_k + d)) = −p_k^T g_k
(using A x* + d = ∇F(x*) = 0). Hence:
α_k = −g_k^T p_k / (p_k^T A p_k)
which is exactly the step size of the stated iteration; after n steps every coefficient of the expansion has been applied, so x_n = x*.
C.2 CG method implementation
CG requires knowledge of the conjugate direction vectors p_k; usually the p_k are computed as the method progresses (not beforehand).
Recall x_{k+1} = x_k + α_k p_k, with α_k = −g_k^T p_k / (p_k^T A p_k) chosen to minimize F(x) in the direction p_k.
We need to select the p_j so that p_k^T A p_j = 0 for k ≠ j.
Note:
Δg_k = g_{k+1} − g_k = (A x_{k+1} + d) − (A x_k + d) = A(x_{k+1} − x_k) = α_k A p_k
so the conjugacy condition p_k^T A p_j = 0 can be rewritten without A as Δg_k^T p_j = 0.
Iteration k = 0: p_0 = −g_0 (SD direction)
x_1 = x_0 + α_0 p_0, α_0 = −g_0^T p_0 / (p_0^T A p_0), g_0 = ∇F(x_0) = A x_0 + d
k = 1: pick p_1 of the form p_1 = −g_1 + β_1 p_0, with β_1 chosen so that Δg_0^T p_1 = 0:
Δg_0^T (−g_1 + β_1 p_0) = 0 ⇒ β_1 = Δg_0^T g_1 / (Δg_0^T p_0)
Using g_1^T g_0 = 0 (successive gradients are orthogonal), this reduces to:
β_1 = g_1^T g_1 / (g_0^T g_0)
Overall iteration scheme:
x_{k+1} = x_k + α_k p_k, α_k = −g_k^T p_k / (p_k^T A p_k)
g_{k+1} = A x_{k+1} + d
p_{k+1} = −g_{k+1} + β_{k+1} p_k, with β_{k+1} = g_{k+1}^T g_{k+1} / (g_k^T g_k)
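The overall scheme above can be sketched end to end on a 2-d quadratic (the matrix A, vector d, and start point are assumptions chosen so the run takes the full n = 2 steps):

```python
# Conjugate gradient on F(x) = 1/2 x^T A x + d^T x; for n = 2 the exact
# minimum x* = -A^{-1} d is reached in 2 steps.

A = [[2.0, 1.0], [1.0, 2.0]]
d = [-1.0, 0.0]

def matvec(M, v):
    return [M[0][0]*v[0] + M[0][1]*v[1], M[1][0]*v[0] + M[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

x = [0.0, 0.0]
g = [matvec(A, x)[i] + d[i] for i in range(2)]   # g_0 = A x_0 + d
p = [-g[0], -g[1]]                                # p_0 = -g_0 (SD direction)
for _ in range(2):                                # n = 2 steps
    Ap = matvec(A, p)
    alpha = -dot(g, p) / dot(p, Ap)               # exact line search
    x = [x[i] + alpha * p[i] for i in range(2)]
    g_new = [matvec(A, x)[i] + d[i] for i in range(2)]
    beta = dot(g_new, g_new) / dot(g, g)          # Fletcher-Reeves beta
    p = [-g_new[i] + beta * p[i] for i in range(2)]
    g = g_new
# x equals the minimum x* = -A^{-1} d = [2/3, -1/3].
```

After the two steps the gradient is zero to machine precision, matching the conjugate direction theorem.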
Example: for the given quadratic F(x) and initial condition x_0, implement the conjugate gradient scheme.
[Contour plots omitted: CG trajectory vs. steepest descent trajectory.]