15.093 Optimization Methods
Lecture 19: Line Searches and Newton's Method
1 Last Lecture

Necessary conditions for optimality (identify candidates):
x* local min  =>  grad f(x*) = 0 and  grad^2 f(x*) PSD.

Sufficient conditions for optimality:
grad f(x*) = 0 and grad^2 f(x) PSD for all x in B_eps(x*)  =>  x* is a local min.

Characterizations of convexity:
a. f convex  <=>  f(y) >= f(x) + grad f(x)^T (y - x) for all x, y.
b. f convex  <=>  grad^2 f(x) PSD for all x.
If f is convex, then x* is a global min  <=>  grad f(x*) = 0.

2 Steepest descent

2.1 The algorithm

Step 0: Given x^0, set k := 0.
Step 1: d^k := -grad f(x^k). If ||d^k|| <= eps, then stop.
Step 2: Solve min_lambda h(lambda) := f(x^k + lambda d^k) for the step-length lambda_k, perhaps chosen by an exact or inexact line search.
Step 3: Set x^{k+1} := x^k + lambda_k d^k, k := k + 1. Go to Step 1.
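The loop above translates directly into code. A minimal sketch in Python (not from the lecture; `f`, `grad`, and `line_search` are placeholder callables supplied by the user):

```python
import numpy as np

def steepest_descent(f, grad, x0, line_search, eps=1e-6, max_iter=1000):
    """Steepest descent: follow d_k = -grad f(x_k) with a step from line_search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -grad(x)                       # Step 1: descent direction
        if np.linalg.norm(d) <= eps:       # stopping test ||d_k|| <= eps
            break
        lam = line_search(f, grad, x, d)   # Step 2: choose the step length
        x = x + lam * d                    # Step 3: update the iterate
    return x
```

Any of the step-size rules listed in Section 4 below can be plugged in as `line_search`.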
3 Outline

1. Bisection method; Armijo's rule
2. Motivation for Newton's method
3. Newton's method
4. Quadratic rate of convergence
5. Modification for global convergence

4 Choices of step sizes

Exact minimization: min_lambda f(x^k + lambda d^k)
Limited minimization: min_{lambda in [0, s]} f(x^k + lambda d^k)
Constant stepsize: lambda_k = s constant
Diminishing stepsize: lambda_k -> 0, sum_k lambda_k = infinity
Armijo rule (Section 4.2)

4.1 Bisection Line-Search Algorithm

4.1.1 Convex functions

lambda-bar := arg min_lambda h(lambda) := arg min_lambda f(x-bar + lambda d-bar)

If f(x) is convex, h(lambda) is convex. We look for lambda-bar with h'(lambda-bar) = 0, where
h'(lambda) = grad f(x-bar + lambda d-bar)^T d-bar.

4.1.2 Algorithm

Step 0: Set k = 0. Set lambda_L := 0 and lambda_U := lambda-hat.
Step k: Set lambda-tilde = (lambda_U + lambda_L)/2 and compute h'(lambda-tilde).
  If h'(lambda-tilde) > 0, re-set lambda_U := lambda-tilde. Set k <- k + 1.
  If h'(lambda-tilde) < 0, re-set lambda_L := lambda-tilde. Set k <- k + 1.
  If h'(lambda-tilde) = 0, stop.

4.1.3 Analysis

At the kth iteration of the bisection algorithm, the current interval [lambda_L, lambda_U] must contain a point lambda-bar with h'(lambda-bar) = 0.

At the kth iteration of the bisection algorithm, the length of the current interval is
length = (1/2)^k lambda-hat.

A value of lambda such that |lambda - lambda-bar| <= eps can be found in at most
ceil( log2( lambda-hat / eps ) )
steps of the bisection algorithm.
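A minimal sketch of this bisection line search in Python (illustrative only; it assumes h'(0) < 0 and h'(lambda-hat) > 0, as in the convex setting above):

```python
def bisection_line_search(hprime, lam_hat, eps=1e-8, max_iter=100):
    """Bisection on h'(lam): keep an interval [lam_L, lam_U] bracketing a zero."""
    lam_L, lam_U = 0.0, lam_hat
    for _ in range(max_iter):
        lam = 0.5 * (lam_L + lam_U)      # midpoint of current interval
        slope = hprime(lam)
        if slope > 0:                    # minimizer lies to the left
            lam_U = lam
        elif slope < 0:                  # minimizer lies to the right
            lam_L = lam
        else:
            return lam                   # exact stationary point found
        if lam_U - lam_L <= eps:         # interval halves each iteration
            break
    return 0.5 * (lam_L + lam_U)
```

Here `hprime` evaluates h'(lambda) = grad f(x + lambda d)^T d for the current iterate and direction.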
4.1.4 Example

h(lambda) = (x_1 + lambda d_1) - 0.6 (x_2 + lambda d_2) + 4 (x_3 + lambda d_3) + 0.25 (x_4 + lambda d_4)
            - sum_{i=1}^{4} log(x_i + lambda d_i) - log( 5 - sum_{i=1}^{4} (x_i + lambda d_i) ),

where (x_1, x_2, x_3, x_4) = (1, 1, 1, 1) and (d_1, d_2, d_3, d_4) = (-1, 0.6, -4, -0.25) is the steepest-descent direction. Substituting,

h(lambda) = 4.65 - 17.4225 lambda - log(1 - lambda) - log(1 + 0.6 lambda) - log(1 - 4 lambda)
            - log(1 - 0.25 lambda) - log(1 + 4.65 lambda)

and

h'(lambda) = -17.4225 + 1/(1 - lambda) - 0.6/(1 + 0.6 lambda) + 4/(1 - 4 lambda)
             + 0.25/(1 - 0.25 lambda) - 4.65/(1 + 4.65 lambda).

The term log(1 - 4 lambda) restricts h to 0 <= lambda < 0.25, so the bisection starts with [lambda_L, lambda_U] = [0, 0.25].

[Figure: plot of h(lambda) for 0 <= lambda < 0.25.]

[Table: bisection iterates (k, lambda_L, lambda_U, h'(lambda-tilde)). Starting from [0, 0.25], the interval shrinks around the stationary point lambda-bar ~ 0.197, with h'(lambda-tilde) -> 0 after roughly 45 iterations.]
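As a sanity check (illustrative; this reuses the `bisection_line_search` sketch above with the coefficients of h'(lambda) as written here):

```python
def hprime(lam):
    # h'(lambda) for the example above; defined on 0 <= lambda < 0.25
    return (-17.4225 + 1/(1 - lam) - 0.6/(1 + 0.6*lam) + 4/(1 - 4*lam)
            + 0.25/(1 - 0.25*lam) - 4.65/(1 + 4.65*lam))

print(bisection_line_search(hprime, lam_hat=0.25))   # ~ 0.197
```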
4.2 Armijo Rule

Bisection is accurate but may be expensive in practice. We need a cheap method that still guarantees sufficient accuracy: an inexact line-search method. It requires two parameters: eps in (0, 1) and sigma > 1. Define

h-hat(lambda) = h(0) + lambda eps h'(0).

lambda-bar is acceptable by Armijo's rule if:

h(lambda-bar) <= h-hat(lambda-bar)
  (i.e. f(x^k + lambda-bar d^k) <= f(x^k) + lambda-bar eps grad f(x^k)^T d^k, sufficient decrease), and
h(sigma lambda-bar) >= h-hat(sigma lambda-bar)
  (prevents the step size from being too small).

We get a range of acceptable stepsizes.

Step 0: Set k = 0, lambda_0 = lambda-bar > 0.
Step k: If h(lambda_k) <= h-hat(lambda_k), choose lambda_k and stop.
        If h(lambda_k) > h-hat(lambda_k), let lambda_{k+1} = (1/sigma) lambda_k (for example, sigma = 2).
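A minimal sketch of this backtracking procedure in Python (illustrative; the defaults eps = 0.2 and lam0 = 1.0 are example choices, with sigma = 2 as suggested above):

```python
def armijo(f, grad, x, d, lam0=1.0, eps=0.2, sigma=2.0, max_iter=50):
    """Armijo backtracking: return the first lam with
    f(x + lam d) <= f(x) + lam * eps * grad(x)^T d."""
    fx = f(x)
    slope = float(grad(x) @ d)   # h'(0) = grad f(x)^T d, negative for a descent d
    lam = lam0
    for _ in range(max_iter):
        if f(x + lam * d) <= fx + lam * eps * slope:   # h(lam) <= h-hat(lam)
            return lam
        lam /= sigma                                    # lam <- lam / sigma
    return lam
```

Because backtracking starts from lambda_0 and only shrinks, the accepted step automatically satisfies the second condition as well.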
5 Newton's method

5.1 History

Steepest descent is simple but slow; Newton's method is more complex but fast. Its origins are not entirely clear: Raphson became a member of the Royal Society in 1691 for his book Analysis Aequationum Universalis, which contained the Newton method. Raphson published it 50 years before Newton did.

6 Newton's method

6.1 Motivation

Consider min f(x). Take the Taylor series expansion around x-bar:

f(x) ~ g(x) := f(x-bar) + grad f(x-bar)^T (x - x-bar) + (1/2)(x - x-bar)^T grad^2 f(x-bar) (x - x-bar).

Instead of min f(x), solve grad g(x) = 0, i.e.

grad g(x) = grad f(x-bar) + grad^2 f(x-bar)(x - x-bar) = 0
  =>  x - x-bar = -( grad^2 f(x-bar) )^{-1} grad f(x-bar).

The direction d = -( grad^2 f(x-bar) )^{-1} grad f(x-bar) is the Newton direction.

6.2 The algorithm

Step 0: Given x^0, set k := 0.
Step 1: d^k := -( grad^2 f(x^k) )^{-1} grad f(x^k). If ||d^k|| <= eps, then stop.
Step 2: Set x^{k+1} := x^k + d^k, k := k + 1. Go to Step 1.

6.3 Comments

The method assumes that grad^2 f(x^k) is nonsingular at each iteration.
There is no guarantee that f(x^{k+1}) <= f(x^k).
We can augment the algorithm with a line search: min_lambda f(x^k + lambda d^k).

6.4 Properties

Theorem: If H = grad^2 f(x) is PD, then d = -H^{-1} grad f(x) is a descent direction: grad f(x)^T d < 0.

Proof: grad f(x)^T d = -grad f(x)^T H^{-1} grad f(x) < 0 if H^{-1} is PD. But for any v != 0,
0 < (H^{-1} v)^T H (H^{-1} v) = v^T H^{-1} v,
so H^{-1} is indeed PD.
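Before the examples, a minimal sketch of the pure Newton iteration in Python (illustrative; it solves the linear system rather than forming the inverse, and the unit step could be damped with a line search from Section 4):

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-10, max_iter=100):
    """Pure Newton iteration: x_{k+1} = x_k - H(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -np.linalg.solve(hess(x), grad(x))  # Newton direction (solve, don't invert)
        if np.linalg.norm(d) <= eps:
            break
        x = x + d                               # unit step; no guarantee f decreases
    return x
```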
6.5 Example 1

f(x) = 7x - log x, with minimizer x* = 1/7 = 0.142857143.

f'(x) = 7 - 1/x, f''(x) = 1/x^2.

d = -( f''(x) )^{-1} f'(x) = -x^2 (7 - 1/x) = x - 7x^2,

so the Newton iteration is

x^{k+1} = x^k + ( x^k - 7(x^k)^2 ) = 2x^k - 7(x^k)^2.

k    x^k (x^0 = 1)    x^k (x^0 = 0.01)    x^k (x^0 = 0.1)
0    1                0.01                0.1
1    -5               0.0193              0.13
2    -185             0.03599257          0.1417
3    -239945          0.06291688          0.14284777
4    ~ -4.03E+11      0.09812401          0.14285714
5    ~ -1.14E+24      0.12884978          0.14285714
6                     0.14148370          0.14285714
7                     0.14284394          0.14285714
8                     0.14285714          0.14285714

From x^0 = 1 the iterates diverge; from x^0 = 0.01 and x^0 = 0.1 they converge to x* = 1/7.
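The one-dimensional recurrence can be checked directly; a small script (illustrative) reproduces the three columns of the table:

```python
# Newton recurrence x_{k+1} = 2 x_k - 7 x_k^2 for f(x) = 7x - log x.
for x0 in (1.0, 0.01, 0.1):
    x = x0
    for _ in range(8):
        x = 2 * x - 7 * x * x
    # diverges from x0 = 1.0; converges to 1/7 from x0 = 0.01 and x0 = 0.1
    print(x0, x)
```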
1 x.85 1.71768.51975 3.3547858 4.338449 5.3333377 6.33333334 7.33333333 x.5.9659864.17647971.734878.36387.3335933.33333333.33333333 x x.5895565.458316.384835.63613.874717 7.4133E-5 1.1953E-8 1.571E-16 7 Quadratic convergence Recall z n+1 z z = lim z n, limsup = δ < n n z n z Slide 4 Example: z n = a n, < a < 1, z = z n+1 z a n+1 z n z = = 1 a n+1 7.1 Intuitive statement Theorem Suppose f(x) C 3 (thrice cont. diff/ble), x : f(x ) = and f(x ) is nonsingular. If Newton s method is started sufficiently close to x, the sequence of iterates converges quadratically to x. Lemma: Suppose f(x) C 3 on R n and x a given pt. Then ǫ >, β > : x y ǫ Slide 5 Slide 6 f(x) f(y) H(y)(x y) β x y 7. Formal statement Theorem Suppose f twice cont. diff/ble. Let g(x) = f(x), H(x) = f(x), and x : g(x ) =. Suppose f(x) is twice differentiable and its Hessian satisfies: H(x ) h Slide 7 H(x) H(x ) L x x, x : x x β (Reminder: If A is a square matrix: A = max Ax = max{ λ 1,..., λ n }) x: x =1 7
7 Quadratic convergence

Recall: a sequence z_n -> z-bar converges quadratically if

z-bar = lim_n z_n   and   limsup_n ||z_{n+1} - z-bar|| / ||z_n - z-bar||^2 = delta < infinity.

Example: z_n = a^(2^n), 0 < a < 1, z-bar = 0. Then

|z_{n+1} - z-bar| / |z_n - z-bar|^2 = a^(2^{n+1}) / ( a^(2^n) )^2 = 1.

7.1 Intuitive statement

Theorem: Suppose f(x) in C^3 (thrice continuously differentiable) and x* is such that grad f(x*) = 0 and grad^2 f(x*) is nonsingular. If Newton's method is started sufficiently close to x*, the sequence of iterates converges quadratically to x*.

Lemma: Suppose f(x) in C^3 on R^n and x-bar is a given point. Then for every eps > 0 there exists beta > 0 such that ||x - y|| <= eps implies

|| grad f(x) - grad f(y) - H(y)(x - y) || <= beta ||x - y||^2.

7.2 Formal statement

Theorem: Suppose f is twice continuously differentiable. Let g(x) = grad f(x), H(x) = grad^2 f(x), and let x* satisfy g(x*) = 0. Suppose the Hessian satisfies:

||H(x*)|| >= h,
||H(x) - H(x*)|| <= L ||x - x*||  for all x with ||x - x*|| <= beta.

(Reminder: if A is a square matrix, ||A|| = max_{||x||=1} ||Ax||, which for symmetric A equals max{ |lambda_1|, ..., |lambda_n| }.)

Suppose

||x - x*|| <= min( beta, 2h/(3L) ) = gamma.

Let x_N := x - H(x)^{-1} g(x) be the next iterate of Newton's method. Then

||x_N - x*|| <= (3L/(2h)) ||x - x*||^2   and   ||x_N - x*|| < ||x - x*|| <= gamma.
7.3 Proof

x_N - x* = x - H(x)^{-1} g(x) - x*
         = (x - x*) + H(x)^{-1} ( g(x*) - g(x) )                       (using g(x*) = 0)
         = (x - x*) + H(x)^{-1} int_0^1 H(x + t(x* - x)) (x* - x) dt   (Lemma 1)
         = H(x)^{-1} int_0^1 [ H(x + t(x* - x)) - H(x) ] (x* - x) dt.

Hence

||x_N - x*|| <= ||H(x)^{-1}|| int_0^1 || H(x + t(x* - x)) - H(x) || * ||x* - x|| dt
             <= ||H(x)^{-1}|| * ||x - x*||^2 int_0^1 L t dt
             = (L/2) ||H(x)^{-1}|| * ||x - x*||^2.

Now

h <= ||H(x*)|| = ||H(x) + H(x*) - H(x)|| <= ||H(x)|| + ||H(x*) - H(x)|| <= ||H(x)|| + L ||x - x*||,

so ||H(x)|| >= h - L ||x - x*|| and therefore

||H(x)^{-1}|| <= 1 / ( h - L ||x - x*|| ).

Combining, and using ||x - x*|| <= 2h/(3L),

||x_N - x*|| <= ||x - x*||^2 * L / ( 2( h - L ||x - x*|| ) )
             <= ||x - x*||^2 * L / ( 2( h - (2/3) h ) )
             = (3L/(2h)) ||x - x*||^2.

7.4 Lemma 1

Fix w. Let phi(t) = w^T g( x + t(x* - x) ). Then

phi(0) = w^T g(x),   phi(1) = w^T g(x*),
phi'(t) = w^T H( x + t(x* - x) ) (x* - x),
phi(1) = phi(0) + int_0^1 phi'(t) dt.

Therefore, for all w:

w^T ( g(x*) - g(x) ) = w^T int_0^1 H( x + t(x* - x) ) (x* - x) dt,

and since w is arbitrary,

g(x*) - g(x) = int_0^1 H( x + t(x* - x) ) (x* - x) dt.

7.5 Critical comments

The iterates from Newton's method are equally attracted to local minima and local maxima.
We do not know beta, h, L in general. Note, however, that they are only used in the proof, not in the algorithm.
We do not assume convexity, only that H(x*) is nonsingular and not badly behaved near x*.
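The first comment is easy to see on a one-line example: Newton's iteration only solves grad f(x) = 0, so started near a maximizer it converges to it. A small illustration (the function f(x) = x^3 - 3x is not from the lecture; it has a local max at x = -1 and a local min at x = +1):

```python
# Newton step x - f'(x)/f''(x) with f(x) = x**3 - 3*x, started near the maximizer.
x = -1.5
for _ in range(6):
    x = x - (3*x**2 - 3) / (6*x)   # converges to the local MAXIMIZER x = -1
print(x)
```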
7.6 Properties of Convergence

Proposition: Let r_k = ||x^k - x*|| and C = 3L/(2h). If ||x^0 - x*|| < gamma, then

r_k <= (1/C) (C r_0)^(2^k).

Proposition: If ||x^0 - x*|| < gamma, then ||x^k - x*|| <= eps after at most

log2( log( 1/(C eps) ) / log( 1/(C r_0) ) )

iterations.

8 Solving systems of equations

To solve g(x) = 0, linearize:

0 = g(x^{t+1}) ~ g(x^t) + grad g(x^t)(x^{t+1} - x^t)
  =>  x^{t+1} = x^t - ( grad g(x^t) )^{-1} g(x^t).

Application in optimization: g(x) = grad f(x).

9 Modifications for global convergence

Perform a line search.
When the Hessian is singular or near-singular, use

d^k = -( grad^2 f(x^k) + D^k )^{-1} grad f(x^k),

where D^k is a diagonal matrix such that grad^2 f(x^k) + D^k is PD.
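A minimal sketch of the modified direction in Python (illustrative; it uses the simplest diagonal choice D^k = delta * I, grown until a Cholesky factorization certifies positive definiteness — the lecture only requires some diagonal D^k making the matrix PD):

```python
import numpy as np

def modified_newton_direction(g, H, tau=1e-3):
    """Return d = -(H + delta*I)^{-1} g with the smallest tried delta >= 0
    such that H + delta*I is PD (H is assumed symmetric)."""
    n = H.shape[0]
    delta = 0.0
    while True:
        try:
            np.linalg.cholesky(H + delta * np.eye(n))  # succeeds iff PD
            break
        except np.linalg.LinAlgError:
            delta = max(2 * delta, tau)                # grow the shift and retry
    return -np.linalg.solve(H + delta * np.eye(n), g)
```

With delta = 0 this is the pure Newton direction; for large delta it bends toward steepest descent, which is what restores a descent direction when the Hessian is indefinite.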
10 Summary

1. Line search methods: bisection method, Armijo's rule
2. Newton's method
3. Quadratic rate of convergence
4. Modification for global convergence

MIT OpenCourseWare
http://ocw.mit.edu

15.093J / 6.255J Optimization Methods
Fall 2009

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.