Differentiable Convex Functions

The following picture motivates Theorem 11: for a differentiable convex function, the tangent line at $x$ lies below the graph, i.e. $f(\hat{x}) \ge f(x) + f'(x)(\hat{x} - x)$, with the points $x$ and $\hat{x}$ marked in the figure.

Theorem 11: Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be differentiable. Then $f$ is convex on the convex set $C \subseteq \mathbb{R}^n$ if, and only if, for all $x, \hat{x} \in C$,

$$f(\hat{x}) \ge f(x) + \nabla f(x)^T (\hat{x} - x).$$

Proof: Assume $f$ is convex on $C$. Then, for $x, \hat{x} \in C$ we have

$$f(\lambda \hat{x} + (1-\lambda)x) \le \lambda f(\hat{x}) + (1-\lambda) f(x) \quad \text{for all } \lambda \in [0,1],$$

or

$$f(x + \lambda(\hat{x} - x)) \le f(x) + \lambda\big(f(\hat{x}) - f(x)\big) \quad \text{for all } \lambda \in [0,1],$$

or

$$\frac{f(x + \lambda(\hat{x} - x)) - f(x)}{\lambda} \le f(\hat{x}) - f(x) \quad \text{for all } \lambda \in (0,1].$$

Letting $\lambda \to 0^+$,

$$\nabla f(x)^T (\hat{x} - x) = \lim_{\lambda \to 0^+} \frac{f(x + \lambda(\hat{x} - x)) - f(x)}{\lambda} \le f(\hat{x}) - f(x).$$

Conversely, assume $f(\hat{x}) \ge f(x) + \nabla f(x)^T (\hat{x} - x)$ for all $x, \hat{x} \in C$. Let $x^1, x^2 \in C$ and let $\lambda \in [0,1]$. Define $x = \lambda x^1 + (1-\lambda) x^2$; then

$$f(x^1) \ge f(x) + \nabla f(x)^T (x^1 - x) \quad \text{and} \quad f(x^2) \ge f(x) + \nabla f(x)^T (x^2 - x),$$

so

$$\lambda f(x^1) + (1-\lambda) f(x^2) \ge f(x) + \nabla f(x)^T \big(\lambda x^1 + (1-\lambda)x^2 - x\big) = f(x) = f\big(\lambda x^1 + (1-\lambda)x^2\big). \qquad \blacksquare$$
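As a quick numerical illustration of Theorem 11 (not part of the original slides), the sketch below samples random point pairs and checks the gradient inequality for an assumed convex quadratic; the matrix Q and the sampling scheme are illustrative choices.

    import numpy as np

    # Convex test function f(x) = x'Qx with Q symmetric positive definite.
    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
    f = lambda x: x @ Q @ x
    grad = lambda x: 2.0 * (Q @ x)

    rng = np.random.default_rng(0)
    for _ in range(1000):
        x, x_hat = rng.normal(size=2), rng.normal(size=2)
        # Theorem 11: f(x_hat) >= f(x) + grad f(x)'(x_hat - x) for convex f
        assert f(x_hat) >= f(x) + grad(x) @ (x_hat - x) - 1e-9
    print("Theorem 11 inequality held at all sampled pairs")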

HW31: We already know, for any convex function $f : \mathbb{R}^n \to \mathbb{R}^1$ and any convex set $C$, that if $x^*$ is a local minimum then $x^*$ is a global minimum. Without using this knowledge, show directly that the first-order necessary conditions of Theorem 4 are also sufficient for $x^*$ to be a global minimum for $\min_{x \in C} f(x)$ when $f$ and $C$ are convex.

We now provide a necessary and sufficient condition for $f : \mathbb{R}^n \to \mathbb{R}^1$ to be convex when, in addition, $f$ has continuous second partials. Before doing this, note the following rather obvious geometric fact: if $f : \mathbb{R}^1 \to \mathbb{R}^1$ is convex and $f$ has a second derivative $f''$, then $f''(x) \ge 0$ for all $x \in \mathbb{R}^1$. We will not prove this result.

Theorem 12: Let $C \subseteq \mathbb{R}^n$ be convex and open and let $f : C \to \mathbb{R}^1$ have continuous second partials on $C$. Then $f$ is convex on $C$ if, and only if, the Hessian matrix of $f$, $H$, is PSD on all of $C$ (i.e., $H(x)$ is PSD for all $x \in C$).

Proof: Assume $H$ is PSD on $C$. By Taylor's Theorem we have, for any $x, \hat{x} \in C$ and some $\alpha \in (0,1)$,

$$f(\hat{x}) = f(x) + \nabla f(x)^T (\hat{x} - x) + \tfrac{1}{2}(\hat{x} - x)^T H\big(\alpha x + (1-\alpha)\hat{x}\big)(\hat{x} - x) \ \ge\ f(x) + \nabla f(x)^T (\hat{x} - x),$$

and therefore, by Theorem 11, $f$ is convex on $C$.

Conversely, assume $f$ is convex on $C$. Let $\bar{x} \in C$ and $d \in \mathbb{R}^n$, and define $g$ by $g(\lambda) = f(\bar{x} + \lambda d)$. Since $C$ is open and $f$ is convex on $C$, it follows that $g$ is convex in some neighborhood of $0$. Therefore it must be the case that $g''(0) \ge 0$ and, in fact, $g''(\lambda) \ge 0$ in this neighborhood of $0$. Therefore

$$0 \le g''(0) = d^T H(\bar{x})\, d,$$

so $H(\bar{x})$ is PSD for all $\bar{x} \in C$. $\blacksquare$
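Theorem 12 also gives a practical convexity test: check that the smallest eigenvalue of the Hessian is nonnegative at the points of interest. A minimal sketch, with an assumed test function; hessian_is_psd is a hypothetical helper name.

    import numpy as np

    def hessian_is_psd(hess, x, tol=1e-10):
        # Theorem 12's condition at a single point: all eigenvalues of H(x) >= 0.
        H = np.asarray(hess(x))
        return np.linalg.eigvalsh((H + H.T) / 2).min() >= -tol

    # Assumed example: f(x) = x1^4 + x1*x2 + x2^2, whose Hessian is
    # [[12*x1^2, 1], [1, 2]].
    hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])
    print(hessian_is_psd(hess, np.array([1.0, 0.0])))  # True
    print(hessian_is_psd(hess, np.array([0.0, 0.0])))  # False: f is not convex near 0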

HW32: In addition to the hypotheses of Theorem 12, assume $H$ is PD on $C$. Does this imply $f$ is strictly convex on $C$? If $f$ is strictly convex on $C$, is it true that $H$ is PD on $C$?

Feasible Direction Philosophy

The concept of directional derivative is used very strongly in many types of algorithms for solving the problem

$$\min_{x \in F \subseteq \mathbb{R}^n} f(x)$$

where $f : \mathbb{R}^n \to \mathbb{R}^1$ is differentiable. The simple idea is merely: when at a point $x \in F$, move away from $x$ in the best direction and as far as possible.

Step 0: Set $k = 0$ and select $x^0 \in F$.
Step 1: Find a $d^k \in D(x^k, F)$ which provides a strict improvement for $f$. If none, go to Step 3.
Step 2: Let $\lambda_k > 0$ be chosen so that $f(x^k + \lambda_k d^k) < f(x^k)$ and $x^k + \lambda_k d^k \in F$. Set $x^{k+1} = x^k + \lambda_k d^k$, $k \leftarrow k+1$, and return to Step 1.
Step 3: Stop.
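The four steps translate directly into a generic loop. A minimal sketch (not from the slides), assuming the direction-finding and step-length rules are supplied as callables and that iterates are numpy arrays; all names are illustrative.

    def feasible_direction_method(f, x0, find_direction, step_length, max_iter=100):
        # find_direction(x): an improving d in D(x, F), or None if none exists (Step 1).
        # step_length(x, d): lam > 0 with f(x + lam*d) < f(x) and x + lam*d in F (Step 2).
        x = x0
        for _ in range(max_iter):
            d = find_direction(x)
            if d is None:        # no improving feasible direction: Step 3, stop
                break
            x = x + step_length(x, d) * d
        return x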

Remarks on the Above Procedure

Some remarks are in order.

R1 (Step 1): This step is often carried out by solving the optimization problem (which we'll call the direction finding problem, denoted by $\mathrm{D}_k$)

$$\min \ \nabla f(x^k)^T d \quad \text{s.t. } d \in D(x^k, F), \ \|d\| \le 1. \tag{$\mathrm{D}_k$}$$

Note that if (P) is an unconstrained problem (i.e., $F = \mathbb{R}^n$) and if we take

$$\|d\| = \Big(\sum_{i=1}^n d_i^2\Big)^{1/2},$$

then the optimal vector, or direction, for $\mathrm{D}_k$ is, of course,

$$d^k = -\frac{\nabla f(x^k)}{\|\nabla f(x^k)\|}.$$

HW33: Assume, again, $F = \mathbb{R}^n$ and we are using the so-called sup norm $\|d\|_\infty = \max_{j=1,\dots,n} |d_j|$. Write an expression for $d^k$, an optimal solution for $\mathrm{D}_k$. [Note: $\mathrm{D}_k$ can equivalently be written by replacing $\|d\| \le 1$ with $\|d\| = 1$. Why?]

HW34: Repeat HW33 using $\|d\|_1 = \sum_{j=1}^n |d_j|$.

If we let $d^k \in D(x^k, F)$ denote an optimal vector for $\mathrm{D}_k$, and if it turns out that $\nabla f(x^k)^T d^k > 0$, then the directional derivative, from $x^k$, in every feasible direction $d \in D(x^k, F)$ is positive and, therefore, a move in any such direction will actually increase the objective from its current value $f(x^k)$. Therefore, by our above procedure, we go to Step 3, i.e., we stop at $x^k \in F$.

HW35: Does $\nabla f(x^k)^T d^k = 0$ imply there is no feasible direction of strict decrease for $f$?

Note that $\mathrm{D}_k$ may not be easily solved. However, if $F$ is convex, then since $D(x^k, F)$ may be taken to be $\{y - x^k : y \in F,\ y \ne x^k\}$, we see that $\mathrm{D}_k$ can then be written as

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } y \in F, \ \|y - x^k\| \le 1 \tag{$\mathrm{DC}_k$}$$

where $(\mathrm{DC}_k)$ means the direction finding problem when $F$ is convex.

Note also that if $F = \{x \in X : Ax \le b\}$, where $X = \{x \in \mathbb{R}^n : x \ge 0\}$, we have that $(\mathrm{DC}_k)$ becomes

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } Ay \le b, \ y \ge 0, \ \|y - x^k\| \le 1,$$

and further note that, if we use $\|z\| = \max_{i=1,\dots,n} |z_i|$, this direction finding problem is a linear programming problem.
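Because the sup-norm constraint $\|y - x^k\|_\infty \le 1$ reduces to simple bounds on each coordinate of $y$, this direction finding problem can be handed to any LP solver. A hedged sketch using SciPy's linprog; the helper name is hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    def direction_finding_lp(grad_xk, xk, A, b):
        # min grad'(y - x^k) over Ay <= b, y >= 0, ||y - x^k||_inf <= 1;
        # the constant grad'x^k is dropped, and the norm ball becomes bounds.
        bounds = [(max(0.0, xi - 1.0), xi + 1.0) for xi in xk]
        res = linprog(c=grad_xk, A_ub=A, b_ub=b, bounds=bounds, method="highs")
        return res.x - xk   # direction d^k = y* - x^k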

R2 (Step 2): If a $d^k \in D(x^k, F)$ is found at Step 1 which does lead to a strict decrease for $f$ from $x^k$, then $\lambda_k$ is often chosen as a minimizer for the one-dimensional optimization problem, often also called the line-search problem,

$$\min \ f(x^k + \lambda d^k) \quad \text{s.t. } \lambda \ge 0, \ x^k + \lambda d^k \in F. \tag{$\mathrm{LS}_k$}$$

Note: If $d^k \in D(x^k, F)$ is such that $\nabla f(x^k)^T d^k < 0$, then $\lambda_k$ (if it exists) is such that $\lambda_k > 0$.
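Since $(\mathrm{LS}_k)$ is one-dimensional, any scalar minimizer will do. A sketch assuming the feasible step lengths form a known interval $[0, \lambda_{\max}]$, which is an illustrative simplification.

    from scipy.optimize import minimize_scalar

    def line_search(f, xk, dk, lam_max=1.0):
        # Approximately solve (LS_k), assuming x^k + lam*d^k stays in F
        # for all lam in [0, lam_max].
        res = minimize_scalar(lambda lam: f(xk + lam * dk),
                              bounds=(0.0, lam_max), method="bounded")
        return res.x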

R3 (Step 3): When we enter Step 3 (if we ever do), there is certainly no guarantee that $x^k$ is optimal.

HW36: Provide an example where $\nabla f(x^k)^T d \ge 0$ for all $d \in D(x^k, F)$ but $x^k$ is not globally optimal for $\min_{x \in F} f(x)$. What about locally optimal?

However, because of HW31, if $f$ is convex on $F$ convex and if $\nabla f(x^k)^T d \ge 0$ for all $d \in D(x^k, F)$, then it does follow that $x^k$ is globally optimal.

R4: Note also that we have not discussed convergence; i.e., if we never enter Step 3, does the algorithm produce a sequence of points which eventually comes close to some point from which we cannot escape (by using the algorithm)? Such topics will be discussed later.

Cauchy's Steepest Descent Algorithm

For the unconstrained differentiable problem

$$\min_{x \in F = \mathbb{R}^n} f(x),$$

the above feasible direction procedure is called Cauchy's Steepest Descent Algorithm when (i) we use $\|x\| = \big(\sum_{i=1}^n x_i^2\big)^{1/2}$ (i.e., the direction vector at Step 1 is taken to be $d^k = -\nabla f(x^k)/\|\nabla f(x^k)\|$) and (ii) at Step 2 we solve $\min_{\lambda \ge 0} f(x^k + \lambda d^k)$.
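A compact sketch of the algorithm as just defined (not from the slides), run on an assumed ill-conditioned quadratic so that the zig-zag behavior of HW37(b) below is visible; the test function is an illustrative choice.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def cauchy_steepest_descent(f, grad, x0, tol=1e-8, max_iter=500):
        x = x0
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = -g / np.linalg.norm(g)                        # Step 1
            lam = minimize_scalar(lambda t: f(x + t * d)).x   # Step 2: exact line search
            x = x + lam * d
        return x

    f = lambda x: x[0]**2 + 10.0 * x[1]**2
    grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
    print(cauchy_steepest_descent(f, grad, np.array([1.0, 1.0])))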

HW37:
(a) Show that $\nabla f(x^k) \ne 0$ implies $d^k = -\nabla f(x^k)/\|\nabla f(x^k)\|$ is a (feasible) direction of strict decrease for $f$ at $x^k$ (for the unconstrained problem).
(b) Show that Cauchy's method for the unconstrained problem zig-zags (i.e., $(d^k)^T d^{k+1} = 0$).

HW38:
(a) Do 2 or 3 iterations of Cauchy's method for solving $\min_{(x_1, x_2) \in \mathbb{R}^2} x_1^2 + 2x_2^2$, starting with the point $x^0 = (1, 1)$.
(b) Now make the substitution $y = \sqrt{2}\, x_2$ and use Cauchy's method for $\min_{(x_1, y) \in \mathbb{R}^2} x_1^2 + y^2$. What happens?

Newton-Raphson Algorithm

Again, consider the unconstrained differentiable optimization problem

$$\min_{x \in F = \mathbb{R}^n} f(x).$$

Let $A$ be an $n \times n$ symmetric positive definite matrix. Consider the function $\|\cdot\|_A : \mathbb{R}^n \to \mathbb{R}^1$ defined by

$$\|x\|_A = (x^T A x)^{1/2}.$$

HW39: Show that $\|\cdot\|_A$ is a norm. That is, show
(a) $\|x\|_A \ge 0$ for all $x \in \mathbb{R}^n$, and $\|x\|_A = 0 \iff x = 0$.
(b) $\|\alpha x\|_A = |\alpha| \|x\|_A$ for all $x \in \mathbb{R}^n$ and all $\alpha \in \mathbb{R}^1$.
(c) $\|x + y\|_A \le \|x\|_A + \|y\|_A$.

HW40: Let $x^k \in \mathbb{R}^n$ and consider the direction finding problem

$$\min \ \nabla f(x^k)^T d \quad \text{s.t. } \|d\|_A \le 1,$$

where $A$ is symmetric and PD. Show that

$$d^k = -\frac{A^{-1} \nabla f(x^k)}{\|A^{-1} \nabla f(x^k)\|_A} = -\frac{A^{-1} \nabla f(x^k)}{\big(\nabla f(x^k)^T A^{-1} \nabla f(x^k)\big)^{1/2}}$$

is an optimal vector for $\mathrm{D}_k$.

If, each time we enter Step 1, we solve the direction finding problem $\mathrm{D}_k$ using as our norm on $\mathbb{R}^n$ the function $\|\cdot\|_{H(x^k)}$ (if the Hessian is PD at $x^k$), we see that an optimal direction (because of HW40) is

$$d^k = -\frac{H^{-1}(x^k)\, \nabla f(x^k)}{\|H^{-1}(x^k)\, \nabla f(x^k)\|_{H(x^k)}}.$$

When this direction is used for the unconstrained differentiable optimization problem, the feasible direction algorithm above is called the Newton-Raphson procedure. Also, when Step 2 is entered, $\lambda_k$ is chosen according to $\min_{\lambda \ge 0} f(x^k + \lambda d^k)$.
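A sketch of one possible implementation (illustrative names, not from the slides). The $\|\cdot\|_{H(x^k)}$ normalization of $d^k$ only rescales $\lambda$, so the code uses the unnormalized Newton direction and assumes $H(x^k)$ is PD.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def newton_raphson(f, grad, hess, x0, tol=1e-10, max_iter=50):
        x = x0
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = -np.linalg.solve(hess(x), g)                  # Newton direction; H(x) PD
            lam = minimize_scalar(lambda t: f(x + t * d)).x   # Step 2
            x = x + lam * d
        return x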

HW41:
(a) Show that if $A$ is PD then so is $A^{-1}$.
(b) Show that if $\nabla f(x^k) \ne 0$ and if $H(x^k)$ is PD, then $d^k$ is a direction of strict decrease for $f$ at $x^k$ and, therefore, at Step 2 we'll have $\lambda_k > 0$. What may happen if $\nabla f(x^k) \ne 0$ and if $H^{-1}(x^k)$ is not PSD, that is, if we still use $d^k = -H^{-1}(x^k)\, \nabla f(x^k)$? Is it necessarily true that $d^k$ will be a direction of strict decrease for $f$?

HW42: Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be differentiable and consider the Newton-Raphson procedure for $\min_{x \in F = \mathbb{R}^n} f(x)$. Show that at each iteration (assume $H(x^k)$ is PD on $\mathbb{R}^n$) the direction $d^k = -H^{-1}(x^k)\, \nabla f(x^k)$ points, from $x^k$, to the vector which minimizes the quadratic approximation for $f$ at $x^k$.

HW43: Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be defined by $f(x) = c^T x + \tfrac{1}{2} x^T Q x$ where $Q$ is a symmetric positive definite matrix. Consider $\min_{x \in F = \mathbb{R}^n} f(x)$ starting from any point $x^0 \in \mathbb{R}^n$. Show that the Newton-Raphson procedure terminates in one step. (Note: the problem of HW38 is of this form: $c = 0$, $Q = 2\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$.)

The Frank-Wolfe Algorithm

Let $A$ be an $m \times n$ matrix and let $X = \{x \in \mathbb{R}^n : x \ge 0\}$. Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be differentiable and consider the problem

$$\min f(x) \quad \text{s.t. } Ax \le b, \ x \ge 0.$$

In this case, the direction finding problem, at iteration $k$, is

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } Ay \le b, \ y \ge 0, \ \|y - x^k\| \le 1.$$

Note that if $F = \{x \ge 0 : Ax \le b\}$ is bounded, then we can also take $\mathrm{D}_k$ to be

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } Ay \le b, \ y \ge 0,$$

since feasible $y$'s cannot get too large in magnitude (i.e., $\|y\|$ is bounded on $F$).

HW44: Suppose $F$ is not bounded and at some iteration $k$,

$$\min_{y \ge 0,\ Ay \le b} \nabla f(x^k)^T (y - x^k) = -\infty.$$

Show that we can still choose a direction $d^k$ which yields a strict decrease for $f$ at $x^k$. [Hint: Consider the extreme ray found by the simplex algorithm when solving $\mathrm{D}_k$.]

If $y^k \in F$ is optimal for $\mathrm{D}_k$ and $\nabla f(x^k)^T (y^k - x^k) < 0$, then Step 2 is entered with $d^k = y^k - x^k$ and we then solve for $\lambda_k \ge 0$ by solving

$$\min_{\lambda \in [0,1]} f\big(x^k + \lambda (y^k - x^k)\big).$$

Note: By convexity of $F$ we have $x^k + \lambda(y^k - x^k) \in F$ for all $\lambda \in [0,1]$. If $\nabla f(x^k)^T (y^k - x^k) \ge 0$, we enter Step 3 with $x^k$.

Note: If $f$ is convex, then $x^k$ is optimal if $\nabla f(x^k)^T (y^k - x^k) \ge 0$.

Proof: By definition of $y^k$ we have that, for all $y \in F$, $\nabla f(x^k)^T (y - x^k) \ge 0$. By convexity of $f$ we have, for all $y \in F$,

$$f(y) \ge f(x^k) + \nabla f(x^k)^T (y - x^k) \ge f(x^k). \qquad \blacksquare$$

To be specific, we formally state the F.-W. algorithm for

$$\min f(x) \quad \text{s.t. } Ax \le b, \ x \ge 0.$$

The emphasis is upon the notion of feasible direction; we are not necessarily recommending this particular method for solving nonlinear problems with linear constraints.

Step 0: Set $k = 0$ and select $x^0 \in F = \{x \ge 0 : Ax \le b\}$.
Step 1: Let $y^k$ solve

$$\min_{Ay \le b,\ y \ge 0} \nabla f(x^k)^T (y - x^k) \tag{$\mathrm{D}_k$}$$

and set $d^k = y^k - x^k$. If $\nabla f(x^k)^T d^k \ge 0$, go to Step 4. If $\nabla f(x^k)^T d^k < 0$, go to Step 3. If, on the other hand, $\mathrm{D}_k$ has no optimal solution, go to Step 2.

Step 2: Let $y^k$ be the extreme ray located by the simplex algorithm when solving the linear program $\mathrm{D}_k$. Set $d^k = y^k - x^k$ and go to Step 3.
Step 3: Let $\lambda_k \ge 0$ solve $\min_{\lambda \in [0,1]} f(x^k + \lambda d^k)$ and set $x^{k+1} = x^k + \lambda_k d^k$. Set $k \leftarrow k + 1$ and return to Step 1.
Step 4: Stop at the vector $x^k$.
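Putting Steps 0 through 4 together, here is a hedged sketch for the case where F is bounded (so Step 2, the extreme-ray branch, never occurs); it uses SciPy's linprog for $\mathrm{D}_k$ and a bounded scalar search for Step 3. All names are illustrative.

    import numpy as np
    from scipy.optimize import linprog, minimize_scalar

    def frank_wolfe(f, grad, A, b, x0, max_iter=200, tol=1e-8):
        x = x0
        for _ in range(max_iter):
            g = grad(x)
            # Step 1: solve D_k, i.e. min g'(y - x^k) over Ay <= b, y >= 0.
            res = linprog(c=g, A_ub=A, b_ub=b,
                          bounds=[(0.0, None)] * len(x), method="highs")
            y = res.x
            d = y - x
            if g @ d >= -tol:     # Step 4: stop
                break
            # Step 3: a line search over [0, 1] keeps x^{k+1} in F (F convex).
            lam = minimize_scalar(lambda t: f(x + t * d),
                                  bounds=(0.0, 1.0), method="bounded").x
            x = x + lam * d
        return x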

Note that if we do enter Step 4 at some iteration $k$, then $x^k$ is globally optimal if $f$ is convex. If $f$ is not convex, then it will generally be the case that $x^k$ is not globally optimal.

Definition: A differentiable function $f : \mathbb{R}^n \to \mathbb{R}^1$ (or $f : C \to \mathbb{R}^1$, where $C$ is an open convex set) is said to be pseudoconvex if $\nabla f(\bar{x})^T (x - \bar{x}) \ge 0$ implies $f(x) \ge f(\bar{x})$ (or, equivalently, $f(x) < f(\bar{x})$ implies $\nabla f(\bar{x})^T (x - \bar{x}) < 0$).

[Figure: one function that is pseudo-convex and one that is not pseudo-convex.]

It follows immediately from the definition that a necessary and sufficient condition for $x^* \in F$, $F$ convex, to globally optimize $\min_{x \in F} f(x)$, where $f$ is pseudoconvex on $\mathbb{R}^n$ (or on some open convex set containing $F$), is

$$\nabla f(x^*)^T (x - x^*) \ge 0 \quad \text{for all } x \in F.$$

HW45:
(a) Show this latter claim.
(b) Show that Step 4 of the F.-W. algorithm implies global optimality of $x^k$ if $f$ is pseudoconvex.
(c) Show that a differentiable convex function is pseudoconvex.

Note: $f$ is pseudoconcave if $-f$ is pseudoconvex (or $\nabla f(\bar{x})^T (x - \bar{x}) \le 0 \Rightarrow f(x) \le f(\bar{x})$).

Note: We have not yet discussed convergence of the F.-W. algorithm. That is, if we never enter Step 4, does the algorithm eventually come close to some point at which we may wish to stop? We'll discuss this later in the course.

Note that in the F.-W. algorithm (as well as for Cauchy's Steepest Descent and the Newton-Raphson method, if $H(x^k)$ is PD) we have $f(x^{k+1}) < f(x^k)$ for all $k$. Therefore, an upper bound for $\min_{x \in F} f(x)$ at iteration $k$ is $f(x^k)$. For the F.-W. algorithm, when $f$ is convex, we can also develop a lower bound.

By convexity we have

$$f(x^k) + \nabla f(x^k)^T (y - x^k) \le f(y) \quad \text{for all } y \in F,$$

so

$$f(x^k) + \nabla f(x^k)^T (y^k - x^k) \le \min_{y \in F} f(y)$$

at each iteration $k$ and, therefore, at iteration $K$ we may take

$$L^K = \max_{k = 0, 1, \dots, K} \Big\{ f(x^k) + \nabla f(x^k)^T (y^k - x^k) \Big\}.$$

Therefore, at iteration $K$ of the F.-W. algorithm we have a lower bound $L^K$ and an upper bound $U^K = f(x^K)$.

Definition: Consider an arbitrary optimization problem $\min_{x \in F} f(x)$. The vector $x^1 \in F$ is said to be an $\varepsilon$-optimal solution ($\varepsilon > 0$) if $f(x^1) \le f(x^*) + \varepsilon$, where $x^*$ is optimal.

Therefore, when $f$ is convex and we are applying the F.-W. algorithm, if ever we have $U^K - L^K \le \varepsilon$ for some preassigned tolerance parameter $\varepsilon > 0$, then we may stop at the vector $x^K$ with the assurance that $f(x^K)$ differs from the optimal value by no more than $\varepsilon$.
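A sketch of the resulting stopping test, meant to slot into the Frank-Wolfe loop sketched earlier; fw_gap_stop and its calling convention are illustrative assumptions, with y the current solution of $\mathrm{D}_k$.

    def fw_gap_stop(f, grad, x, y, L_best, eps):
        # Update the lower bound L^K (valid when f is convex) and test
        # U^K - L^K <= eps; if the test passes, f(x) is within eps of optimal.
        L_best = max(L_best, f(x) + grad(x) @ (y - x))
        U = f(x)    # upper bound U^K = f(x^K)
        return L_best, (U - L_best <= eps)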