Differentiable Convex Functions

The following picture motivates Theorem 11: for a differentiable convex function, the tangent line at $x$ lies below the graph, i.e. $f(\hat{x}) \ge f(x) + f'(x)(\hat{x} - x)$, with the points $x$ and $\hat{x}$ marked in the figure.

Theorem 11: Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be differentiable. Then $f$ is convex on the convex set $C \subseteq \mathbb{R}^n$ if, and only if, for all $x, \hat{x} \in C$,

$$f(\hat{x}) \ge f(x) + \nabla f(x)^T (\hat{x} - x).$$

Proof: Assume $f$ is convex on $C$. Then, for $x, \hat{x} \in C$ we have

$$f(\lambda \hat{x} + (1-\lambda)x) \le \lambda f(\hat{x}) + (1-\lambda) f(x) \quad \text{for all } \lambda \in [0,1],$$

or

$$f(x + \lambda(\hat{x} - x)) \le f(x) + \lambda\big(f(\hat{x}) - f(x)\big) \quad \text{for all } \lambda \in [0,1],$$

or

$$\frac{f(x + \lambda(\hat{x} - x)) - f(x)}{\lambda} \le f(\hat{x}) - f(x) \quad \text{for all } \lambda \in (0,1].$$

Letting $\lambda \to 0^+$,

$$\nabla f(x)^T (\hat{x} - x) = \lim_{\lambda \to 0^+} \frac{f(x + \lambda(\hat{x} - x)) - f(x)}{\lambda} \le f(\hat{x}) - f(x).$$

Conversely, assume $f(\hat{x}) \ge f(x) + \nabla f(x)^T (\hat{x} - x)$ for all $x, \hat{x} \in C$. Let $x^1, x^2 \in C$ and let $\lambda \in [0,1]$. Define $x = \lambda x^1 + (1-\lambda) x^2$; then

$$f(x^1) \ge f(x) + \nabla f(x)^T (x^1 - x) \quad \text{and} \quad f(x^2) \ge f(x) + \nabla f(x)^T (x^2 - x),$$

so

$$\lambda f(x^1) + (1-\lambda) f(x^2) \ge f(x) + \nabla f(x)^T \big(\lambda x^1 + (1-\lambda)x^2 - x\big) = f(x) = f\big(\lambda x^1 + (1-\lambda)x^2\big). \qquad \blacksquare$$
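As a quick numerical illustration of Theorem 11 (not part of the original slides), the sketch below samples random point pairs and checks the gradient inequality for an assumed convex quadratic; the matrix Q and the sampling scheme are illustrative choices.

    import numpy as np

    # Convex test function f(x) = x'Qx with Q symmetric positive definite.
    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
    f = lambda x: x @ Q @ x
    grad = lambda x: 2.0 * (Q @ x)

    rng = np.random.default_rng(0)
    for _ in range(1000):
        x, x_hat = rng.normal(size=2), rng.normal(size=2)
        # Theorem 11: f(x_hat) >= f(x) + grad f(x)'(x_hat - x) for convex f
        assert f(x_hat) >= f(x) + grad(x) @ (x_hat - x) - 1e-9
    print("Theorem 11 inequality held at all sampled pairs")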

HW31: We already know, for any convex function $f : \mathbb{R}^n \to \mathbb{R}^1$ and any convex set $C$, that if $x^*$ is a local minimum then $x^*$ is a global minimum. Without using this knowledge, show directly that the first-order necessary conditions of Theorem 4 are also sufficient for $x^*$ to be a global minimum for $\min_{x \in C} f(x)$ when $f$ and $C$ are convex.

We now provide a necessary and sufficient condition for $f : \mathbb{R}^n \to \mathbb{R}^1$ to be convex when, in addition, $f$ has continuous second partials. Before doing this, note the following rather obvious geometric fact: if $f : \mathbb{R}^1 \to \mathbb{R}^1$ is convex and $f$ has a second derivative $f''$, then $f''(x) \ge 0$ for all $x \in \mathbb{R}^1$. We will not prove this result.

Theorem 12: Let $C \subseteq \mathbb{R}^n$ be convex and open and let $f : C \to \mathbb{R}^1$ have continuous second partials on $C$. Then $f$ is convex on $C$ if, and only if, the Hessian matrix of $f$, $H$, is PSD on all of $C$ (i.e., $H(x)$ is PSD for all $x \in C$).

Proof: Assume $H$ is PSD on $C$. By Taylor's Theorem we have, for any $x, \hat{x} \in C$ and some $\alpha \in (0,1)$,

$$f(\hat{x}) = f(x) + \nabla f(x)^T (\hat{x} - x) + \tfrac{1}{2}(\hat{x} - x)^T H\big(\alpha x + (1-\alpha)\hat{x}\big)(\hat{x} - x) \ \ge\ f(x) + \nabla f(x)^T (\hat{x} - x),$$

and therefore, by Theorem 11, $f$ is convex on $C$.

Conversely, assume $f$ is convex on $C$. Let $\bar{x} \in C$ and $d \in \mathbb{R}^n$, and define $g$ by $g(\lambda) = f(\bar{x} + \lambda d)$. Since $C$ is open and $f$ is convex on $C$, it follows that $g$ is convex in some neighborhood of $0$. Therefore it must be the case that $g''(0) \ge 0$ and, in fact, $g''(\lambda) \ge 0$ in this neighborhood of $0$. Therefore

$$0 \le g''(0) = d^T H(\bar{x})\, d,$$

so $H(\bar{x})$ is PSD for all $\bar{x} \in C$. $\blacksquare$
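Theorem 12 also gives a practical convexity test: check that the smallest eigenvalue of the Hessian is nonnegative at the points of interest. A minimal sketch, with an assumed test function; hessian_is_psd is a hypothetical helper name.

    import numpy as np

    def hessian_is_psd(hess, x, tol=1e-10):
        # Theorem 12's condition at a single point: all eigenvalues of H(x) >= 0.
        H = np.asarray(hess(x))
        return np.linalg.eigvalsh((H + H.T) / 2).min() >= -tol

    # Assumed example: f(x) = x1^4 + x1*x2 + x2^2, whose Hessian is
    # [[12*x1^2, 1], [1, 2]].
    hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])
    print(hessian_is_psd(hess, np.array([1.0, 0.0])))  # True
    print(hessian_is_psd(hess, np.array([0.0, 0.0])))  # False: f is not convex near 0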

HW32: In addition to the hypotheses of Theorem 12, assume $H$ is PD on $C$. Does this imply $f$ is strictly convex on $C$? If $f$ is strictly convex on $C$, is it true that $H$ is PD on $C$?

Feasible Direction Philosophy

The concept of directional derivative is used very strongly in many types of algorithms for solving the problem

$$\min_{x \in F \subseteq \mathbb{R}^n} f(x)$$

where $f : \mathbb{R}^n \to \mathbb{R}^1$ is differentiable. The simple idea is merely: when at a point $x \in F$, move away from $x$ in the best direction and as far as possible.

Step 0: Set $k = 0$ and select $x^0 \in F$.
Step 1: Find a $d^k \in D(x^k, F)$ which provides a strict improvement for $f$. If none, go to Step 3.
Step 2: Let $\lambda_k > 0$ be chosen so that $f(x^k + \lambda_k d^k) < f(x^k)$ and $x^k + \lambda_k d^k \in F$. Set $x^{k+1} = x^k + \lambda_k d^k$, $k \leftarrow k+1$, and return to Step 1.
Step 3: Stop.
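The four steps translate directly into a generic loop. A minimal sketch (not from the slides), assuming the direction-finding and step-length rules are supplied as callables and that iterates are numpy arrays; all names are illustrative.

    def feasible_direction_method(f, x0, find_direction, step_length, max_iter=100):
        # find_direction(x): an improving d in D(x, F), or None if none exists (Step 1).
        # step_length(x, d): lam > 0 with f(x + lam*d) < f(x) and x + lam*d in F (Step 2).
        x = x0
        for _ in range(max_iter):
            d = find_direction(x)
            if d is None:        # no improving feasible direction: Step 3, stop
                break
            x = x + step_length(x, d) * d
        return x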

Remarks on the Above Procedure

Some remarks are in order.

R1 (Step 1): This step is often carried out by solving the optimization problem (which we'll call the direction finding problem, denoted by $\mathrm{D}_k$)

$$\min \ \nabla f(x^k)^T d \quad \text{s.t. } d \in D(x^k, F), \ \|d\| \le 1. \tag{$\mathrm{D}_k$}$$

Note that if (P) is an unconstrained problem (i.e., $F = \mathbb{R}^n$) and if we take

$$\|d\| = \Big(\sum_{i=1}^n d_i^2\Big)^{1/2},$$

then the optimal vector, or direction, for $\mathrm{D}_k$ is, of course,

$$d^k = -\frac{\nabla f(x^k)}{\|\nabla f(x^k)\|}.$$

HW33: Assume, again, $F = \mathbb{R}^n$ and we are using the so-called sup norm $\|d\|_\infty = \max_{j=1,\dots,n} |d_j|$. Write an expression for $d^k$, an optimal solution for $\mathrm{D}_k$. [Note: $\mathrm{D}_k$ can equivalently be written by replacing $\|d\| \le 1$ with $\|d\| = 1$. Why?]

HW34: Repeat HW33 using $\|d\|_1 = \sum_{j=1}^n |d_j|$.

If we let $d^k \in D(x^k, F)$ denote an optimal vector for $\mathrm{D}_k$, and if it turns out that $\nabla f(x^k)^T d^k > 0$, then the directional derivative, from $x^k$, in every feasible direction $d \in D(x^k, F)$ is positive and, therefore, a move in any such direction will actually increase the objective from its current value $f(x^k)$. Therefore, by our above procedure, we go to Step 3, i.e., we stop at $x^k \in F$.

HW35: Does $\nabla f(x^k)^T d^k = 0$ imply there is no feasible direction of strict decrease for $f$?

Note that $\mathrm{D}_k$ may not be easily solved. However, if $F$ is convex, then since $D(x^k, F)$ may be taken to be $\{y - x^k : y \in F,\ y \ne x^k\}$, we see that $\mathrm{D}_k$ can then be written as

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } y \in F, \ \|y - x^k\| \le 1 \tag{$\mathrm{DC}_k$}$$

where $(\mathrm{DC}_k)$ means the direction finding problem when $F$ is convex.

Note also that if $F = \{x \in X : Ax \le b\}$, where $X = \{x \in \mathbb{R}^n : x \ge 0\}$, we have that $(\mathrm{DC}_k)$ becomes

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } Ay \le b, \ y \ge 0, \ \|y - x^k\| \le 1,$$

and further note that, if we use $\|z\| = \max_{i=1,\dots,n} |z_i|$, this direction finding problem is a linear programming problem.
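Because the sup-norm constraint $\|y - x^k\|_\infty \le 1$ reduces to simple bounds on each coordinate of $y$, this direction finding problem can be handed to any LP solver. A hedged sketch using SciPy's linprog; the helper name is hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    def direction_finding_lp(grad_xk, xk, A, b):
        # min grad'(y - x^k) over Ay <= b, y >= 0, ||y - x^k||_inf <= 1;
        # the constant grad'x^k is dropped, and the norm ball becomes bounds.
        bounds = [(max(0.0, xi - 1.0), xi + 1.0) for xi in xk]
        res = linprog(c=grad_xk, A_ub=A, b_ub=b, bounds=bounds, method="highs")
        return res.x - xk   # direction d^k = y* - x^k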

R2 (Step 2): If a $d^k \in D(x^k, F)$ is found at Step 1 which does lead to a strict decrease for $f$ from $x^k$, then $\lambda_k$ is often chosen as a minimizer for the one-dimensional optimization problem, often also called the line-search problem,

$$\min \ f(x^k + \lambda d^k) \quad \text{s.t. } \lambda \ge 0, \ x^k + \lambda d^k \in F. \tag{$\mathrm{LS}_k$}$$

Note: If $d^k \in D(x^k, F)$ is such that $\nabla f(x^k)^T d^k < 0$, then $\lambda_k$ (if it exists) is such that $\lambda_k > 0$.
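Since $(\mathrm{LS}_k)$ is one-dimensional, any scalar minimizer will do. A sketch assuming the feasible step lengths form a known interval $[0, \lambda_{\max}]$, which is an illustrative simplification.

    from scipy.optimize import minimize_scalar

    def line_search(f, xk, dk, lam_max=1.0):
        # Approximately solve (LS_k), assuming x^k + lam*d^k stays in F
        # for all lam in [0, lam_max].
        res = minimize_scalar(lambda lam: f(xk + lam * dk),
                              bounds=(0.0, lam_max), method="bounded")
        return res.x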

R3 (Step 3): When we enter Step 3 (if we ever do), there is certainly no guarantee that $x^k$ is optimal.

HW36: Provide an example where $\nabla f(x^k)^T d \ge 0$ for all $d \in D(x^k, F)$ but $x^k$ is not globally optimal for $\min_{x \in F} f(x)$. What about locally optimal?

However, because of HW31, if $f$ is convex on $F$ convex and if $\nabla f(x^k)^T d \ge 0$ for all $d \in D(x^k, F)$, then it does follow that $x^k$ is globally optimal.

R4: Note also that we have not discussed convergence; i.e., if we never enter Step 3, does the algorithm produce a sequence of points which eventually comes close to some point from which we cannot escape (by using the algorithm)? Such topics will be discussed later.

Cauchy's Steepest Descent Algorithm

For the unconstrained differentiable problem

$$\min_{x \in F = \mathbb{R}^n} f(x),$$

the above feasible direction procedure is called Cauchy's Steepest Descent Algorithm when (i) we use $\|x\| = \big(\sum_{i=1}^n x_i^2\big)^{1/2}$ (i.e., the direction vector at Step 1 is taken to be $d^k = -\nabla f(x^k)/\|\nabla f(x^k)\|$) and (ii) at Step 2 we solve $\min_{\lambda \ge 0} f(x^k + \lambda d^k)$.
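A compact sketch of the algorithm as just defined (not from the slides), run on an assumed ill-conditioned quadratic so that the zig-zag behavior of HW37(b) below is visible; the test function is an illustrative choice.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def cauchy_steepest_descent(f, grad, x0, tol=1e-8, max_iter=500):
        x = x0
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = -g / np.linalg.norm(g)                        # Step 1
            lam = minimize_scalar(lambda t: f(x + t * d)).x   # Step 2: exact line search
            x = x + lam * d
        return x

    f = lambda x: x[0]**2 + 10.0 * x[1]**2
    grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
    print(cauchy_steepest_descent(f, grad, np.array([1.0, 1.0])))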

HW37:
(a) Show that $\nabla f(x^k) \ne 0$ implies $d^k = -\nabla f(x^k)/\|\nabla f(x^k)\|$ is a (feasible) direction of strict decrease for $f$ at $x^k$ (for the unconstrained problem).
(b) Show that Cauchy's method for the unconstrained problem zig-zags (i.e., $(d^k)^T d^{k+1} = 0$).

HW38:
(a) Do 2 or 3 iterations of Cauchy's method for solving $\min_{(x_1, x_2) \in \mathbb{R}^2} x_1^2 + 2x_2^2$, starting with the point $x^0 = (1, 1)$.
(b) Now make the substitution $y = \sqrt{2}\, x_2$ and use Cauchy's method for $\min_{(x_1, y) \in \mathbb{R}^2} x_1^2 + y^2$. What happens?

Newton-Raphson Algorithm

Again, consider the unconstrained differentiable optimization problem

$$\min_{x \in F = \mathbb{R}^n} f(x).$$

Let $A$ be an $n \times n$ symmetric positive definite matrix. Consider the function $\|\cdot\|_A : \mathbb{R}^n \to \mathbb{R}^1$ defined by

$$\|x\|_A = (x^T A x)^{1/2}.$$

HW39: Show that $\|\cdot\|_A$ is a norm. That is, show
(a) $\|x\|_A \ge 0$ for all $x \in \mathbb{R}^n$, and $\|x\|_A = 0 \iff x = 0$.
(b) $\|\alpha x\|_A = |\alpha| \|x\|_A$ for all $x \in \mathbb{R}^n$ and all $\alpha \in \mathbb{R}^1$.
(c) $\|x + y\|_A \le \|x\|_A + \|y\|_A$.

HW40: Let $x^k \in \mathbb{R}^n$ and consider the direction finding problem

$$\min \ \nabla f(x^k)^T d \quad \text{s.t. } \|d\|_A \le 1,$$

where $A$ is symmetric and PD. Show that

$$d^k = -\frac{A^{-1} \nabla f(x^k)}{\|A^{-1} \nabla f(x^k)\|_A} = -\frac{A^{-1} \nabla f(x^k)}{\big(\nabla f(x^k)^T A^{-1} \nabla f(x^k)\big)^{1/2}}$$

is an optimal vector for $\mathrm{D}_k$.

If, each time we enter Step 1, we solve the direction finding problem $\mathrm{D}_k$ using as our norm on $\mathbb{R}^n$ the function $\|\cdot\|_{H(x^k)}$ (if the Hessian is PD at $x^k$), we see that an optimal direction (because of HW40) is

$$d^k = -\frac{H^{-1}(x^k)\, \nabla f(x^k)}{\|H^{-1}(x^k)\, \nabla f(x^k)\|_{H(x^k)}}.$$

When this direction is used for the unconstrained differentiable optimization problem, the feasible direction algorithm above is called the Newton-Raphson procedure. Also, when Step 2 is entered, $\lambda_k$ is chosen according to $\min_{\lambda \ge 0} f(x^k + \lambda d^k)$.
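A sketch of one possible implementation (illustrative names, not from the slides). The $\|\cdot\|_{H(x^k)}$ normalization of $d^k$ only rescales $\lambda$, so the code uses the unnormalized Newton direction and assumes $H(x^k)$ is PD.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def newton_raphson(f, grad, hess, x0, tol=1e-10, max_iter=50):
        x = x0
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            d = -np.linalg.solve(hess(x), g)                  # Newton direction; H(x) PD
            lam = minimize_scalar(lambda t: f(x + t * d)).x   # Step 2
            x = x + lam * d
        return x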

HW41:
(a) Show that if $A$ is PD then so is $A^{-1}$.
(b) Show that if $\nabla f(x^k) \ne 0$ and if $H(x^k)$ is PD, then $d^k$ is a direction of strict decrease for $f$ at $x^k$ and, therefore, at Step 2 we'll have $\lambda_k > 0$. What may happen if $\nabla f(x^k) \ne 0$ and if $H^{-1}(x^k)$ is not PSD, that is, if we still use $d^k = -H^{-1}(x^k)\, \nabla f(x^k)$? Is it necessarily true that $d^k$ will be a direction of strict decrease for $f$?

HW42: Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be differentiable and consider the Newton-Raphson procedure for $\min_{x \in F = \mathbb{R}^n} f(x)$. Show that at each iteration (assume $H(x^k)$ is PD on $\mathbb{R}^n$) the direction $d^k = -H^{-1}(x^k)\, \nabla f(x^k)$ points, from $x^k$, to the vector which minimizes the quadratic approximation for $f$ at $x^k$.

HW43: Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be defined by $f(x) = c^T x + \tfrac{1}{2} x^T Q x$ where $Q$ is a symmetric positive definite matrix. Consider $\min_{x \in F = \mathbb{R}^n} f(x)$ starting from any point $x^0 \in \mathbb{R}^n$. Show that the Newton-Raphson procedure terminates in one step. (Note: the problem of HW38 is of this form: $c = 0$, $Q = 2\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$.)

The Frank-Wolfe Algorithm

Let $A$ be an $m \times n$ matrix and let $X = \{x \in \mathbb{R}^n : x \ge 0\}$. Let $f : \mathbb{R}^n \to \mathbb{R}^1$ be differentiable and consider the problem

$$\min f(x) \quad \text{s.t. } Ax \le b, \ x \ge 0.$$

In this case, the direction finding problem, at iteration $k$, is

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } Ay \le b, \ y \ge 0, \ \|y - x^k\| \le 1.$$

Note that if $F = \{x \ge 0 : Ax \le b\}$ is bounded, then we can also take $\mathrm{D}_k$ to be

$$\min \ \nabla f(x^k)^T (y - x^k) \quad \text{s.t. } Ay \le b, \ y \ge 0,$$

since feasible $y$'s cannot get too large in magnitude (i.e., $\|y\|$ is bounded on $F$).

HW44: Suppose $F$ is not bounded and at some iteration $k$,

$$\min_{y \ge 0,\ Ay \le b} \nabla f(x^k)^T (y - x^k) = -\infty.$$

Show that we can still choose a direction $d^k$ which yields a strict decrease for $f$ at $x^k$. [Hint: Consider the extreme ray found by the simplex algorithm when solving $\mathrm{D}_k$.]

If $y^k \in F$ is optimal for $\mathrm{D}_k$ and $\nabla f(x^k)^T (y^k - x^k) < 0$, then Step 2 is entered with $d^k = y^k - x^k$ and we then solve for $\lambda_k \ge 0$ by solving

$$\min_{\lambda \in [0,1]} f\big(x^k + \lambda (y^k - x^k)\big).$$

Note: By convexity of $F$ we have $x^k + \lambda(y^k - x^k) \in F$ for all $\lambda \in [0,1]$. If $\nabla f(x^k)^T (y^k - x^k) \ge 0$, we enter Step 3 with $x^k$.

Note: If $f$ is convex, then $x^k$ is optimal if $\nabla f(x^k)^T (y^k - x^k) \ge 0$.

Proof: By definition of $y^k$ we have that, for all $y \in F$, $\nabla f(x^k)^T (y - x^k) \ge 0$. By convexity of $f$ we have, for all $y \in F$,

$$f(y) \ge f(x^k) + \nabla f(x^k)^T (y - x^k) \ge f(x^k). \qquad \blacksquare$$

To be specific, we formally state the F.-W. algorithm for

$$\min f(x) \quad \text{s.t. } Ax \le b, \ x \ge 0.$$

The emphasis is upon the notion of feasible direction; we are not necessarily recommending this particular method for solving nonlinear problems with linear constraints.

Step 0: Set $k = 0$ and select $x^0 \in F = \{x \ge 0 : Ax \le b\}$.
Step 1: Let $y^k$ solve

$$\min_{Ay \le b,\ y \ge 0} \nabla f(x^k)^T (y - x^k) \tag{$\mathrm{D}_k$}$$

and set $d^k = y^k - x^k$. If $\nabla f(x^k)^T d^k \ge 0$, go to Step 4. If $\nabla f(x^k)^T d^k < 0$, go to Step 3. If, on the other hand, $\mathrm{D}_k$ has no optimal solution, go to Step 2.

Step 2: Let $y^k$ be the extreme ray located by the simplex algorithm when solving the linear program $\mathrm{D}_k$. Set $d^k = y^k - x^k$ and go to Step 3.
Step 3: Let $\lambda_k \ge 0$ solve $\min_{\lambda \in [0,1]} f(x^k + \lambda d^k)$ and set $x^{k+1} = x^k + \lambda_k d^k$. Set $k \leftarrow k + 1$ and return to Step 1.
Step 4: Stop at the vector $x^k$.
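Putting Steps 0 through 4 together, here is a hedged sketch for the case where F is bounded (so Step 2, the extreme-ray branch, never occurs); it uses SciPy's linprog for $\mathrm{D}_k$ and a bounded scalar search for Step 3. All names are illustrative.

    import numpy as np
    from scipy.optimize import linprog, minimize_scalar

    def frank_wolfe(f, grad, A, b, x0, max_iter=200, tol=1e-8):
        x = x0
        for _ in range(max_iter):
            g = grad(x)
            # Step 1: solve D_k, i.e. min g'(y - x^k) over Ay <= b, y >= 0.
            res = linprog(c=g, A_ub=A, b_ub=b,
                          bounds=[(0.0, None)] * len(x), method="highs")
            y = res.x
            d = y - x
            if g @ d >= -tol:     # Step 4: stop
                break
            # Step 3: a line search over [0, 1] keeps x^{k+1} in F (F convex).
            lam = minimize_scalar(lambda t: f(x + t * d),
                                  bounds=(0.0, 1.0), method="bounded").x
            x = x + lam * d
        return x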

Note that if we do enter Step 4 at some iteration $k$, then $x^k$ is globally optimal if $f$ is convex. If $f$ is not convex, then it will generally be the case that $x^k$ is not globally optimal.

Definition: A differentiable function $f : \mathbb{R}^n \to \mathbb{R}^1$ (or $f : C \to \mathbb{R}^1$, where $C$ is an open convex set) is said to be pseudoconvex if $\nabla f(\bar{x})^T (x - \bar{x}) \ge 0$ implies $f(x) \ge f(\bar{x})$ (or, equivalently, $f(x) < f(\bar{x})$ implies $\nabla f(\bar{x})^T (x - \bar{x}) < 0$).

[Figure: one function that is pseudo-convex and one that is not pseudo-convex.]

It follows immediately from the definition that a necessary and sufficient condition for $x^* \in F$, $F$ convex, to globally optimize $\min_{x \in F} f(x)$, where $f$ is pseudoconvex on $\mathbb{R}^n$ (or on some open convex set containing $F$), is

$$\nabla f(x^*)^T (x - x^*) \ge 0 \quad \text{for all } x \in F.$$

HW45:
(a) Show this latter claim.
(b) Show that Step 4 of the F.-W. algorithm implies global optimality of $x^k$ if $f$ is pseudoconvex.
(c) Show that a differentiable convex function is pseudoconvex.

Note: $f$ is pseudoconcave if $-f$ is pseudoconvex (or $\nabla f(\bar{x})^T (x - \bar{x}) \le 0 \Rightarrow f(x) \le f(\bar{x})$).

Note: We have not yet discussed convergence of the F.-W. algorithm. That is, if we never enter Step 4, does the algorithm eventually come close to some point at which we may wish to stop? We'll discuss this later in the course.

Note that in the F.-W. algorithm (as well as for Cauchy's Steepest Descent and the Newton-Raphson method, if $H(x^k)$ is PD) we have $f(x^{k+1}) < f(x^k)$ for all $k$. Therefore, an upper bound for $\min_{x \in F} f(x)$ at iteration $k$ is $f(x^k)$. For the F.-W. algorithm, when $f$ is convex, we can also develop a lower bound.

By convexity we have

$$f(x^k) + \nabla f(x^k)^T (y - x^k) \le f(y) \quad \text{for all } y \in F,$$

so

$$f(x^k) + \nabla f(x^k)^T (y^k - x^k) \le \min_{y \in F} f(y)$$

at each iteration $k$ and, therefore, at iteration $K$ we may take

$$L^K = \max_{k = 0, 1, \dots, K} \Big\{ f(x^k) + \nabla f(x^k)^T (y^k - x^k) \Big\}.$$

Therefore, at iteration $K$ of the F.-W. algorithm we have a lower bound $L^K$ and an upper bound $U^K = f(x^K)$.

Definition: Consider an arbitrary optimization problem $\min_{x \in F} f(x)$. The vector $x^1 \in F$ is said to be an $\varepsilon$-optimal solution ($\varepsilon > 0$) if $f(x^1) \le f(x^*) + \varepsilon$, where $x^*$ is optimal.

Therefore, when $f$ is convex and we are applying the F.-W. algorithm, if ever we have $U^K - L^K \le \varepsilon$ for some preassigned tolerance parameter $\varepsilon > 0$, then we may stop at the vector $x^K$ with the assurance that $f(x^K)$ differs from the optimal value by no more than $\varepsilon$.
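A sketch of the resulting stopping test, meant to slot into the Frank-Wolfe loop sketched earlier; fw_gap_stop and its calling convention are illustrative assumptions, with y the current solution of $\mathrm{D}_k$.

    def fw_gap_stop(f, grad, x, y, L_best, eps):
        # Update the lower bound L^K (valid when f is convex) and test
        # U^K - L^K <= eps; if the test passes, f(x) is within eps of optimal.
        L_best = max(L_best, f(x) + grad(x) @ (y - x))
        U = f(x)    # upper bound U^K = f(x^K)
        return L_best, (U - L_best <= eps)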