15.093 Optimization Methods
Lecture 8: Optimality Conditions and Gradient Methods for Unconstrained Optimization
1 Outline

1. Necessary and sufficient optimality conditions
2. Gradient methods
3. The steepest descent algorithm
4. Rate of convergence
5. Line search algorithms

2 Last Lecture

Nonlinear optimization applications:
- Portfolio selection
- Facility location (geometry problems)
- Traffic assignment, routing

3 The general problem

f : ℝ^n → ℝ is a continuous (usually differentiable) function of n variables;
g_i : ℝ^n → ℝ, i = 1, ..., m; and h_j : ℝ^n → ℝ, j = 1, ..., l.

(NLP):  min f(x)
        s.t. g_1(x) ≤ 0, ..., g_m(x) ≤ 0
             h_1(x) = 0, ..., h_l(x) = 0

3.1 Local vs Global Minima

x* ∈ F is a local minimum of (NLP) if there exists ε > 0 such that f(x*) ≤ f(y) for all y ∈ B(x*, ε) ∩ F.
x* ∈ F is a global minimum of (NLP) if f(x*) ≤ f(y) for all y ∈ F.
4 Convex Sets and Functions

A subset S ⊆ ℝ^n is a convex set if

x, y ∈ S  ⇒  λx + (1 − λ)y ∈ S,  ∀λ ∈ [0, 1].

A function f(x) is a convex function if

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y),  ∀x, y,  ∀λ ∈ [0, 1].

5 Convex Optimization

5.1 Convexity and Minima

(COP) is called a convex optimization problem if f(x), g_1(x), ..., g_m(x) are convex functions. This implies that the objective function is convex and the feasible region F is a convex set.

Implication: If (COP) is a convex optimization problem, then any local minimum is a global minimum.
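The convexity inequality is easy to spot-check numerically. The following sketch (not part of the original notes; standard-library Python only) samples random point pairs and random λ for a convex quadratic; a finite sample can refute convexity but never certify it.

    # Spot-check of f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    # for a sample convex function (Hessian [[10,4],[4,2]] is positive definite).
    import random

    def f(x1, x2):
        return 5*x1**2 + x2**2 + 4*x1*x2

    random.seed(0)
    for _ in range(10000):
        x = (random.uniform(-10, 10), random.uniform(-10, 10))
        y = (random.uniform(-10, 10), random.uniform(-10, 10))
        lam = random.random()
        z = (lam*x[0] + (1-lam)*y[0], lam*x[1] + (1-lam)*y[1])
        # small tolerance guards against floating-point rounding
        assert f(*z) <= lam*f(*x) + (1-lam)*f(*y) + 1e-9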
6 Optimality Conditions

Necessary conditions for local optima: "If x* is a local optimum, then x* must satisfy ..." These identify all candidates for local optima.
Sufficient conditions for local optima: "If x* satisfies ..., then x* must be a local optimum."

7 Optimality Conditions

7.1 Necessary conditions

Consider min_{x ∈ ℝ^n} f(x): zero first-order variation along all directions.

Theorem. Let f(x) be continuously differentiable. If x* ∈ ℝ^n is a local minimum of f(x), then ∇f(x*) = 0 and ∇²f(x*) is PSD.

7.2 Proof

Zero slope at a local minimum x*: f(x*) ≤ f(x* + λd) for all d ∈ ℝ^n and λ ∈ ℝ. Pick λ > 0:

0 ≤ (f(x* + λd) − f(x*)) / λ.

Take limits as λ → 0: 0 ≤ ∇f(x*)ᵀd for all d ∈ ℝ^n. Since d is arbitrary, replacing d with −d gives ∇f(x*) = 0.

Nonnegative curvature at a local minimum x*:

f(x* + λd) − f(x*) = λ∇f(x*)ᵀd + (λ²/2) dᵀ∇²f(x*)d + λ²‖d‖² R(x*, λd),

where R(x, y) → 0 as y → 0. Since ∇f(x*) = 0,

0 ≤ (f(x* + λd) − f(x*)) / λ² = (1/2) dᵀ∇²f(x*)d + ‖d‖² R(x*, λd).

If ∇²f(x*) is not PSD, there exists d̄ with d̄ᵀ∇²f(x*)d̄ < 0, and then f(x* + λd̄) < f(x*) for all λ > 0 sufficiently small, a contradiction. QED

7.3 Example

f(x) = (1/2)x₁² + x₁x₂ + 2x₂² − 4x₁ − 4x₂ − x₂³
∇f(x) = (x₁ + x₂ − 4, x₁ + 4x₂ − 4 − 3x₂²)

Candidates: x̄ = (4, 0) and x̃ = (3, 1).

∇²f(x) = [[1, 1], [1, 4 − 6x₂]]

∇²f(x̄) = [[1, 1], [1, 4]] is PSD; ∇²f(x̃) = [[1, 1], [1, −2]] is an indefinite matrix. So x̄ = (4, 0) is the only candidate for a local minimum.
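The two candidates are easy to verify numerically. A small sketch (assuming numpy; not from the original notes) evaluates ∇f and the eigenvalues of ∇²f at both candidates:

    import numpy as np

    def grad(x):
        x1, x2 = x
        return np.array([x1 + x2 - 4.0, x1 + 4.0*x2 - 4.0 - 3.0*x2**2])

    def hess(x):
        x2 = x[1]
        return np.array([[1.0, 1.0], [1.0, 4.0 - 6.0*x2]])

    for cand in [np.array([4.0, 0.0]), np.array([3.0, 1.0])]:
        print(cand, "grad =", grad(cand),
              "eig(hess) =", np.linalg.eigvalsh(hess(cand)))
    # At (4,0) the eigenvalues are about (0.70, 4.30): both positive, PSD.
    # At (3,1) they are about (-2.30, 1.30): indefinite, so (3,1) is ruled out.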
7.4 Sufficient conditions

Theorem. Let f be twice continuously differentiable. If ∇f(x*) = 0 and ∇²f(x) is PSD for all x ∈ B(x*, ε), then x* is a local minimum.

Proof: By the Taylor series expansion, for all x ∈ B(x*, ε) and some λ ∈ [0, 1],

f(x) = f(x*) + ∇f(x*)ᵀ(x − x*) + (1/2)(x − x*)ᵀ∇²f(x* + λ(x − x*))(x − x*)
⇒ f(x) ≥ f(x*).

7.5 Example Continued

At x̄ = (4, 0): ∇f(x̄) = 0 and ∇²f(x) = [[1, 1], [1, 4 − 6x₂]] is PSD for all x ∈ B(x̄, ε), so x̄ is a local minimum.

PSD at the point alone is not enough: f(x) = x₁² + x₂³ has ∇f(x) = (2x₁, 3x₂²), so x* = (0, 0) is a candidate, but ∇²f(x) = [[2, 0], [0, 6x₂]] is not PSD in B(0, ε):

f(0, −ε) = −ε³ < 0 = f(x*).
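The counterexample can be checked directly. This snippet (an illustration, not from the notes) evaluates f along the negative x₂-axis:

    # f(x1,x2) = x1^2 + x2^3 has grad f(0,0) = 0 and Hessian diag(2, 0),
    # which is PSD at the origin, yet f is negative arbitrarily close to it.
    for eps in [1e-1, 1e-2, 1e-3]:
        x = (0.0, -eps)
        print(x, "f =", x[0]**2 + x[1]**3)   # f(0,-eps) = -eps^3 < 0 = f(0,0)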
7.6 Characterization of convex functions

Theorem. Let f(x) be continuously differentiable. Then f(x) is convex if and only if

∇f(x̄)ᵀ(x − x̄) ≤ f(x) − f(x̄)  for all x, x̄.

7.7 Proof

By convexity,

f(x̄ + λ(x − x̄)) = f(λx + (1 − λ)x̄) ≤ λf(x) + (1 − λ)f(x̄),

so

(f(x̄ + λ(x − x̄)) − f(x̄)) / λ ≤ f(x) − f(x̄).

As λ → 0, the left-hand side tends to ∇f(x̄)ᵀ(x − x̄), giving

∇f(x̄)ᵀ(x − x̄) ≤ f(x) − f(x̄).

7.8 Convex functions

Theorem. Let f(x) be a continuously differentiable convex function. Then x* is a minimum of f if and only if ∇f(x*) = 0.

Proof: If f is convex and ∇f(x*) = 0, then by Section 7.6,

f(x) − f(x*) ≥ ∇f(x*)ᵀ(x − x*) = 0.

Conversely, if ∇f(x*) ≠ 0, take d with ∇f(x*)ᵀd < 0 (e.g., d = −∇f(x*)); then for small λ > 0, f(x* + λd) < f(x*) (see Section 7.9), so x* is not a minimum.

7.9 Descent Directions

Interesting observation: let f be differentiable at x̄. If there exists d with ∇f(x̄)ᵀd < 0, then for all λ > 0 sufficiently small, f(x̄ + λd) < f(x̄). (Such a d is a descent direction.)

7.10 Proof

f(x̄ + λd) = f(x̄) + λ∇f(x̄)ᵀd + λ‖d‖ R(x̄, λd), where R(x̄, λd) → 0 as λ → 0. Hence

(f(x̄ + λd) − f(x̄)) / λ = ∇f(x̄)ᵀd + ‖d‖ R(x̄, λd).

Since ∇f(x̄)ᵀd < 0 and R(x̄, λd) → 0 as λ → 0, for all λ > 0 sufficiently small, f(x̄ + λd) < f(x̄). QED

8 Algorithms for unconstrained optimization

8.1 Gradient Methods - Motivation

Decrease f(x) until ∇f(x*) = 0. For small λ > 0,

f(x̄ + λd) ≈ f(x̄) + λ∇f(x̄)ᵀd,

so any d with ∇f(x̄)ᵀd < 0 yields f(x̄ + λd) < f(x̄).
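The "sufficiently small λ" qualifier matters. The sketch below (assuming numpy, and borrowing the quadratic from Section 10.2 further on) confirms that d = −∇f(x̄) gives a strict decrease for small steps but not necessarily for large ones:

    import numpy as np

    f = lambda x: 5*x[0]**2 + x[1]**2 + 4*x[0]*x[1] - 14*x[0] - 6*x[1] + 20
    grad = lambda x: np.array([10*x[0] + 4*x[1] - 14, 4*x[0] + 2*x[1] - 6])

    xbar = np.array([0.0, 10.0])
    d = -grad(xbar)                  # steepest-descent direction
    assert grad(xbar) @ d < 0        # d is a descent direction
    for lam in [1.0, 0.5, 0.1, 0.01]:
        print(lam, f(xbar + lam*d) < f(xbar))
    # Prints False for lam = 1.0 and 0.5, True for 0.1 and 0.01:
    # small steps decrease f, overly long steps can overshoot and increase it.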
9 Gradient Methods

9.1 A generic algorithm

x^{k+1} = x^k + λ_k d^k.

If ∇f(x^k) ≠ 0, the direction d^k satisfies ∇f(x^k)ᵀd^k < 0, and the step-length is λ_k > 0.

Principal example: x^{k+1} = x^k − λ_k D^k ∇f(x^k), with D^k a positive definite symmetric matrix.

9.2 Principal directions

Steepest descent: x^{k+1} = x^k − λ_k ∇f(x^k).
Newton's method: x^{k+1} = x^k − λ_k (∇²f(x^k))⁻¹ ∇f(x^k).

9.3 Other directions

Diagonally scaled steepest descent: D^k = diagonal approximation to (∇²f(x^k))⁻¹.
Modified Newton's method: D^k = (∇²f(x^0))⁻¹ (the Hessian is evaluated once, at x^0).
Gauss-Newton method for least squares problems f(x) = ‖g(x)‖²: D^k = (∇g(x^k)∇g(x^k)ᵀ)⁻¹.

10 Steepest descent

10.1 The algorithm

Step 0: Given x^0, set k := 0.
Step 1: d^k := −∇f(x^k). If ‖d^k‖ ≤ ε, then stop.
Step 2: Solve min_λ h(λ) := f(x^k + λd^k) for the step-length λ_k, perhaps chosen by an exact or inexact line-search.
Step 3: Set x^{k+1} ← x^k + λ_k d^k, k ← k + 1. Go to Step 1.
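A minimal sketch of this loop in Python (assumptions: numpy, and a crude halving line-search standing in for Step 2; the lecture leaves the choice of line-search open):

    import numpy as np

    def steepest_descent(f, grad, x0, eps=1e-6, max_iter=10000):
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            d = -grad(x)                          # Step 1: d^k = -grad f(x^k)
            if np.linalg.norm(d) <= eps:          # stop: gradient (almost) zero
                return x, k
            lam = 1.0                             # Step 2: crude inexact search
            while f(x + lam*d) >= f(x) and lam > 1e-16:
                lam *= 0.5                        # halve until f decreases
            x = x + lam*d                         # Step 3: update and repeat
        return x, max_iter

Applied to the example of Section 10.2 below, steepest_descent(f, grad, [0.0, 10.0]) returns a point close to x* = (1, 1).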
10.2 An example

minimize f(x₁, x₂) = 5x₁² + x₂² + 4x₁x₂ − 14x₁ − 6x₂ + 20

x* = (x₁*, x₂*) = (1, 1), f(x*) = 10.

Given x^k, the steepest descent direction is

d^k = (d₁^k, d₂^k) = −∇f(x₁^k, x₂^k) = (−10x₁^k − 4x₂^k + 14, −4x₁^k − 2x₂^k + 6),

and

h(λ) = f(x^k + λd^k) = 5(x₁^k + λd₁^k)² + (x₂^k + λd₂^k)² + 4(x₁^k + λd₁^k)(x₂^k + λd₂^k) − 14(x₁^k + λd₁^k) − 6(x₂^k + λd₂^k) + 20,

so the exact line-search step is

λ_k = ((d₁^k)² + (d₂^k)²) / (2(5(d₁^k)² + (d₂^k)² + 4d₁^k d₂^k)).

Start at x^0 = (0, 10). The first iterations:

k | x₁^k      | x₂^k      | d₁^k       | d₂^k       | ‖d^k‖     | λ_k    | f(x^k)
0 |  0.000000 | 10.000000 | -26.000000 | -14.000000 | 29.529646 | 0.0866 | 60.000000
1 | -2.252782 |  8.786963 |   1.379968 |  -2.562798 |  2.910712 | 2.1800 | 22.222262
2 |  0.755548 |  3.200064 |  -6.355739 |  -3.422320 |  7.218566 | 0.0866 | 12.987870
⋮

The iterates zig-zag toward the optimum, with λ_k alternating between about 0.0866 and 2.18; x^k → x* = (1, 1) and f(x^k) → f(x*) = 10.
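For a quadratic f(x) = ½xᵀQx − cᵀx + const, the exact line-search step has the closed form λ = dᵀd / (dᵀQd), which is what the formula above amounts to. A short sketch (assuming numpy) reproduces the first rows of the table:

    import numpy as np

    Q = np.array([[10.0, 4.0], [4.0, 2.0]])     # Hessian of f
    c = np.array([14.0, 6.0])
    f = lambda x: 0.5*x @ Q @ x - c @ x + 20.0  # = 5x1^2+x2^2+4x1x2-14x1-6x2+20
    x = np.array([0.0, 10.0])
    for k in range(5):
        d = c - Q @ x                            # d = -grad f(x)
        lam = (d @ d) / (d @ Q @ d)              # exact minimizer of h(lam)
        print(k, x, np.linalg.norm(d), lam, f(x))
        x = x + lam*d
    # k=0 prints x=(0,10), ||d||=29.5296, lam=0.0866, f=60, matching the table.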
[Figure: surface and contour plots of a quadratic function of two variables.]

10.3 Important Properties

f(x^{k+1}) < f(x^k) < ⋯ < f(x^0), because the d^k are descent directions.
Under reasonable assumptions on f(x), the sequence x^0, x^1, ... will have at least one cluster point x̄.
Every cluster point x̄ will satisfy ∇f(x̄) = 0.
Implication: If f(x) is a convex function, x̄ will be an optimal solution.
11 Global Convergence Result

Theorem. Suppose f : ℝ^n → ℝ is continuously differentiable on F = {x ∈ ℝ^n : f(x) ≤ f(x^0)}, a closed, bounded set. Then every cluster point x̄ of {x^k} satisfies ∇f(x̄) = 0.

11.1 Work Per Iteration

Two computational tasks at each iteration of steepest descent:
- Compute ∇f(x^k) (for quadratic objective functions, this takes O(n²) steps) to determine d^k = −∇f(x^k).
- Perform a line-search on h(λ) = f(x^k + λd^k) to determine λ_k = arg min_λ h(λ) = arg min_λ f(x^k + λd^k).

12 Rate of convergence of algorithms

Let z₁, ..., z_k, ... → z̄ be a convergent sequence. We say that the order of convergence of this sequence is

p* = sup { p : lim sup_{k→∞} |z_{k+1} − z̄| / |z_k − z̄|^p < ∞ }.

Let

β = lim sup_{k→∞} |z_{k+1} − z̄| / |z_k − z̄|^{p*}.

The larger p*, the faster the convergence.

12.1 Types of convergence

1. p* = 1, 0 < β < 1: linear (or geometric) rate of convergence.
2. p* = 1, β = 0: super-linear convergence.
3. p* = 1, β = 1: sub-linear convergence.
4. p* = 2: quadratic convergence.
12.2 Examples

- z_k = a^k, 0 < a < 1, converges linearly to zero; β = a.
- z_k = a^{2^k}, 0 < a < 1, converges quadratically to zero.
- z_k = 1/k converges sub-linearly to zero.
- z_k = (1/k)^k converges super-linearly to zero.

12.3 Steepest descent

Let z_k = f(x^k) and z̄ = f(x*), where x* = arg min f(x). Then the algorithm exhibits linear convergence if there is a constant δ < 1 such that

(f(x^{k+1}) − f(x*)) / (f(x^k) − f(x*)) ≤ δ

for all k sufficiently large, where x* is an optimal solution.

12.3.1 Discussion

(f(x^{k+1}) − f(x*)) / (f(x^k) − f(x*)) ≤ δ < 1.

If δ = 0.1, every iteration adds another digit of accuracy to the optimal objective value.
If δ = 0.9, every 22 iterations add another digit of accuracy to the optimal objective value, because (0.9)²² ≈ 0.1.
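These rates can be observed empirically. The sketch below (an illustration, not from the notes) prints the ratio |z_{k+1}| / |z_k|^p at a representative index for each sequence (here z̄ = 0):

    seqs = {
        "a^k (linear)":        lambda k: 0.5**k,
        "a^(2^k) (quadratic)": lambda k: 0.5**(2**k),
        "1/k (sub-linear)":    lambda k: 1.0/k,
        "(1/k)^k (super-lin)": lambda k: (1.0/k)**k,
    }
    for name, z in seqs.items():
        k = 6                                # a representative index
        r1 = z(k+1) / z(k)                   # ratio for p = 1
        r2 = z(k+1) / z(k)**2                # ratio for p = 2
        print(f"{name:22s} p=1: {r1:.3g}   p=2: {r2:.3g}")
    # a^k: the p=1 ratio stays at a = 0.5 (linear, beta = a).
    # a^(2^k): the p=2 ratio stays at 1 (quadratic).
    # 1/k: the p=1 ratio tends to 1 (sub-linear).
    # (1/k)^k: the p=1 ratio tends to 0 (super-linear).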
13 Rate of convergence of steepest descent

13.1 Quadratic Case

13.1.1 Theorem

Suppose f(x) = (1/2)xᵀQx − cᵀx, where Q is PSD. Let λ_max = the largest eigenvalue of Q and λ_min = the smallest eigenvalue of Q.

Linear Convergence Theorem: If f(x) is a quadratic function and Q is PSD, then

(f(x^{k+1}) − f(x*)) / (f(x^k) − f(x*)) ≤ ((λ_max/λ_min − 1) / (λ_max/λ_min + 1))².

13.1.2 Discussion

κ(Q) := λ_max/λ_min is the condition number of Q; note κ(Q) ≥ 1. κ(Q) plays an extremely important role in analyzing computation involving Q. The bound reads

(f(x^{k+1}) − f(x*)) / (f(x^k) − f(x*)) ≤ ((κ(Q) − 1) / (κ(Q) + 1))².

κ(Q)  | Upper bound δ on the convergence constant | Iterations to reduce the optimality gap by a factor of 10
1.1   | 0.0023 | 1
3.0   | 0.25   | 2
10.0  | 0.67   | 6
100.0 | 0.96   | 58
200.0 | 0.98   | 116
400.0 | 0.99   | 231

For κ(Q) = O(1), steepest descent converges fast. For large κ(Q),

δ = ((κ(Q) − 1) / (κ(Q) + 1))² = (1 − 2/(κ(Q) + 1))² ≈ 1 − 4/κ(Q).

Therefore

f(x^k) − f(x*) ≤ δ^k (f(x^0) − f(x*)) ≈ (1 − 4/κ(Q))^k (f(x^0) − f(x*)),

so in k ≈ κ(Q)(−ln ε) iterations, steepest descent finds x^k with f(x^k) − f(x*) ≤ ε (f(x^0) − f(x*)).
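The table is straightforward to recompute. A small sketch (standard-library Python only): for each κ, form δ = ((κ − 1)/(κ + 1))² and take the smallest n with δ^n ≤ 0.1:

    import math

    for kappa in [1.1, 3.0, 10.0, 100.0, 200.0, 400.0]:
        delta = ((kappa - 1.0) / (kappa + 1.0))**2
        iters = math.ceil(math.log(0.1) / math.log(delta))
        print(f"kappa = {kappa:6.1f}   delta = {delta:.4f}   iterations = {iters}")
    # Reproduces the rows above: (1.1, 0.0023, 1) ... (400.0, 0.9900, 231).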
13.2 Example 2

f(x) = (1/2)xᵀQx − cᵀx + 10,  Q = [[20, 5], [5, 2]],  c = (14, 6).

κ(Q) = 30.234 and ((κ(Q) − 1)/(κ(Q) + 1))² = 0.876. Here x* = Q⁻¹c = (−0.1333, 3.3333) and f(x*) = 0.9333.

Start at x^0 = (40, −100):

k | x₁^k    | x₂^k      | ‖d^k‖    | λ_k    | f(x^k)
0 | 40.0000 | -100.0000 | 286.0629 | 0.0506 | 6050.0000
1 | 25.5427 |  -99.6967 |  77.6970 | 0.4509 | 3981.6951
⋮

x^k → x* and f(x^k) → f(x*) = 0.9333. The observed ratio (f(x^k) − f(x*)) / (f(x^{k−1}) − f(x*)) stays near 0.658, below the guaranteed bound of 0.876; at that rate, dozens of iterations are needed for six-digit agreement with f(x*).
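The numbers above can be verified numerically. A sketch (assuming numpy) computes κ(Q) and the bound, then runs exact-step steepest descent and prints the observed gap ratio:

    import numpy as np

    Q = np.array([[20.0, 5.0], [5.0, 2.0]])
    c = np.array([14.0, 6.0])
    f = lambda x: 0.5*x @ Q @ x - c @ x + 10.0

    evals = np.linalg.eigvalsh(Q)              # ascending eigenvalues
    kappa = evals[-1] / evals[0]
    delta = ((kappa - 1) / (kappa + 1))**2
    print("kappa =", kappa, "bound delta =", delta)   # ~30.23 and ~0.876

    xstar = np.linalg.solve(Q, c)              # (-0.1333, 3.3333)
    fstar = f(xstar)                           # ~0.9333
    x = np.array([40.0, -100.0])
    for k in range(8):
        d = c - Q @ x                          # steepest-descent direction
        lam = (d @ d) / (d @ Q @ d)            # exact line-search step
        gap_before = f(x) - fstar
        x = x + lam*d
        print(k, (f(x) - fstar) / gap_before)  # observed ratio, about 0.66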
13.3 Example 3

f(x) = (1/2)xᵀQx − cᵀx + 10,  Q = [[20, 5], [5, 16]],  c = (14, 6).

κ(Q) = 1.854 and ((κ(Q) − 1)/(κ(Q) + 1))² = 0.0896. Here x* = Q⁻¹c = (0.6576, 0.1695) and f(x*) = 4.888136.

Start at x^0 = (40, −100):

k | x₁^k    | x₂^k      | ‖d^k‖     | λ_k    | f(x^k)
0 | 40.0000 | -100.0000 | 1434.7934 | 0.0704 | 76050.0000
1 | 19.8671 |   -1.0251 |  385.9657 | 0.0459 |  3591.6154
⋮

x^k → x* and f(x^k) → f(x*) = 4.888136. The observed ratio (f(x^k) − f(x*)) / (f(x^{k−1}) − f(x*)) is about 0.0472, below the bound of 0.0896, and f(x^k) agrees with f(x*) to six decimals within about ten iterations.

13.4 Empirical behavior

The convergence-constant bound is not just theoretical; it is typically experienced in practice.

The analysis is due to Leonid Kantorovich, who won the Nobel Memorial Prize in Economic Science in 1975 for his contributions to optimization and economic planning.

What about non-quadratic functions?
- Suppose x* = arg min_x f(x).
- ∇²f(x*) is the Hessian of f(x) at x = x*.
- The rate of convergence will depend on κ(∇²f(x*)).

14 Summary

1. Optimality conditions
2. The steepest descent algorithm - convergence
3. Rate of convergence of steepest descent

15 Choices of step sizes

- Minimization rule (exact line search): λ_k = arg min_{λ ≥ 0} f(x^k + λd^k)
- Limited minimization: λ_k = arg min_{λ ∈ [0, s]} f(x^k + λd^k)
- Constant stepsize: λ_k = s for some constant s > 0
- Diminishing stepsize: λ_k → 0 with Σ_k λ_k = ∞
- Armijo rule (see the sketch below)
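The notes stop at naming the Armijo rule, so the following is a minimal sketch of one common form of it; the parameters s, σ (sigma), and β (beta) below are conventional defaults, not values from the lecture. Accept λ = s·β^m for the smallest m = 0, 1, 2, ... such that f(x + λd) ≤ f(x) + σλ∇f(x)ᵀd:

    import numpy as np

    def armijo_step(f, grad, x, d, s=1.0, sigma=1e-4, beta=0.5):
        slope = grad(x) @ d        # directional derivative; < 0 for a descent d
        lam = s
        while f(x + lam*d) > f(x) + sigma*lam*slope:
            lam *= beta            # backtrack until sufficient decrease holds
        return lam

    # Example with the quadratic of Section 10.2:
    f = lambda x: 5*x[0]**2 + x[1]**2 + 4*x[0]*x[1] - 14*x[0] - 6*x[1] + 20
    grad = lambda x: np.array([10*x[0] + 4*x[1] - 14, 4*x[0] + 2*x[1] - 6])
    x = np.array([0.0, 10.0])
    print(armijo_step(f, grad, x, -grad(x)))   # returns 0.125 for these inputs

Unlike exact minimization, the Armijo rule needs only a few function evaluations per iteration, which is why it is a common default in practice.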