Optimization and Machine Learning
Chih-Jen Lin (National Taiwan Univ.)
1-5: Least-squares I
A is k × n. Usually k > n; otherwise the minimum is easily zero. Analytical solution:
f(x) = (Ax − b)^T (Ax − b) = x^T A^T A x − 2 b^T A x + b^T b
∇f(x) = 2 A^T A x − 2 A^T b = 0

1-5: Least-squares II
Regularization and weights:
(1/2) λ x^T x + w_1 (Ax − b)_1^2 + ... + w_k (Ax − b)_k^2
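A minimal numerical sketch of the formulas above, assuming NumPy and made-up data (not from the lecture): the normal equations give the analytical least-squares solution, and the weighted/regularized variant is solved the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 20, 5                      # k > n, so the minimum is generally nonzero
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)

# Analytical solution from the normal equations A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # library solver for comparison
print(np.allclose(x_normal, x_lstsq))             # True

# Weighted + regularized variant: (1/2) lam x^T x + sum_i w_i (Ax - b)_i^2
lam, w = 0.1, rng.uniform(0.5, 2.0, size=k)
W = np.diag(w)
x_reg = np.linalg.solve(2 * A.T @ W @ A + lam * np.eye(n), 2 * A.T @ W @ b)
```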

2-4: Convex Combination and Convex Hull I
The convex hull is convex. Let
x = θ_1 x_1 + ... + θ_k x_k,  x̄ = θ̄_1 x_1 + ... + θ̄_k x_k.
Then
αx + (1 − α)x̄ = αθ_1 x_1 + ... + αθ_k x_k + (1 − α)θ̄_1 x_1 + ... + (1 − α)θ̄_k x_k.

2-4: Convex Combination and Convex Hull II
Each coefficient is nonnegative and
αθ_1 + ... + αθ_k + (1 − α)θ̄_1 + ... + (1 − α)θ̄_k = α + (1 − α) = 1.

2-7: Euclidean Balls and Ellipsoid I
We prove that any x = x_c + Au with ||u||_2 ≤ 1 satisfies
(x − x_c)^T P^{−1} (x − x_c) ≤ 1.
Let A = P^{1/2}, which exists because P is symmetric positive definite. Then
(x − x_c)^T P^{−1} (x − x_c) = u^T A^T P^{−1} A u = u^T P^{1/2} P^{−1} P^{1/2} u = u^T u ≤ 1.

2-10: Positive Semidefinite Cone I
S^n_+ is a convex cone: for any θ_1 ≥ 0, θ_2 ≥ 0 and X_1, X_2 ∈ S^n_+,
z^T (θ_1 X_1 + θ_2 X_2) z = θ_1 z^T X_1 z + θ_2 z^T X_2 z ≥ 0.

2-10: Positive Semidefinite Cone II
Example:
[ x  y ]
[ y  z ]  ∈ S^2_+
is equivalent to x ≥ 0, z ≥ 0, xz − y^2 ≥ 0.
If x > 0 (or z > 0) is fixed, we can see that z ≥ y^2/x has a parabolic shape.
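A quick numeric sanity check of the 2×2 example, assuming NumPy and randomly drawn entries (not part of the slides): membership in S^2_+ via eigenvalues agrees with the three inequalities.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y, z = rng.uniform(-1, 1, size=3)
    M = np.array([[x, y], [y, z]])
    psd_eig = np.all(np.linalg.eigvalsh(M) >= -1e-12)              # PSD by definition
    psd_ineq = (x >= 0) and (z >= 0) and (x * z - y**2 >= -1e-12)  # slide's conditions
    assert psd_eig == psd_ineq
```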

2-12: Intersection I
When t is fixed,
{(x_1, x_2) | −1 ≤ x_1 cos t + x_2 cos 2t ≤ 1}
gives a region between two parallel lines. This region is convex.

2-13: Affine Function I
f(S) is convex: let f(x_1), f(x_2) ∈ f(S). Then
αf(x_1) + (1 − α)f(x_2) = α(Ax_1 + b) + (1 − α)(Ax_2 + b) = A(αx_1 + (1 − α)x_2) + b ∈ f(S).

2-13: Affine Function II
f^{−1}(C) convex: x_1, x_2 ∈ f^{−1}(C) means
Ax_1 + b ∈ C,  Ax_2 + b ∈ C.
Because C is convex,
α(Ax_1 + b) + (1 − α)(Ax_2 + b) = A(αx_1 + (1 − α)x_2) + b ∈ C.
Thus αx_1 + (1 − α)x_2 ∈ f^{−1}(C).

2-13: Affine Function III
Scaling: αS = {αx | x ∈ S}
Translation: S + a = {x + a | x ∈ S}
Projection: T = {x_1 ∈ R^m | (x_1, x_2) ∈ S for some x_2 ∈ R^n}, S ⊆ R^m × R^n
Scaling, translation, and projection are all affine functions.

2-13: Affine Function IV
For example, for projection,
f(x) = [ I  0 ] [x_1; x_2] + 0,  I: identity matrix.
Solution set of a linear matrix inequality: C = {S | S ⪯ 0} is convex,
f(x) = x_1 A_1 + ... + x_m A_m − B  (the "Ax + b" form),
f^{−1}(C) = {x | f(x) ⪯ 0} is convex.

2-13: Affine Function V
But this isn't rigorous because of some problems in arguing that f(x) = Ax + b. A more formal explanation:
C = {s ∈ R^{p^2} | mat(s) ∈ S^p and mat(s) ⪯ 0} is convex,
f(x) = x_1 vec(A_1) + ... + x_m vec(A_m) − vec(B) = [vec(A_1) ... vec(A_m)] x + (−vec(B)).

2-13: Affine Function VI
f^{−1}(C) = {x | mat(f(x)) ∈ S^p and mat(f(x)) ⪯ 0} is convex.
Hyperbolic cone: C = {(z, t) | z^T z ≤ t^2, t ≥ 0} is convex (seen by drawing a figure in 2- or 3-dimensional space).

2-13: Affine Function VII
We have that
f(x) = [P^{1/2} x; c^T x] = [P^{1/2}; c^T] x
is affine. Then
f^{−1}(C) = {x | f(x) ∈ C} = {x | x^T P x ≤ (c^T x)^2, c^T x ≥ 0}
is convex.

Perspective and linear-fractional function I
Image convex: if S is convex, check whether {P(x, t) | (x, t) ∈ S} is convex.
Note that S is in the domain of P.

Perspective and linear-fractional function II
Assume (x_1, t_1), (x_2, t_2) ∈ S. We hope
α x_1/t_1 + (1 − α) x_2/t_2 = P(A, B), where (A, B) ∈ S.

Perspective and linear-fractional function III
We have
α x_1/t_1 + (1 − α) x_2/t_2
= (α t_2 x_1 + (1 − α) t_1 x_2) / (t_1 t_2)
= (α t_2 x_1 + (1 − α) t_1 x_2) / (α t_1 t_2 + (1 − α) t_1 t_2)
= [ (αt_2/(αt_2 + (1 − α)t_1)) x_1 + ((1 − α)t_1/(αt_2 + (1 − α)t_1)) x_2 ]
  / [ (αt_2/(αt_2 + (1 − α)t_1)) t_1 + ((1 − α)t_1/(αt_2 + (1 − α)t_1)) t_2 ].

Perspective and linear-fractional function IV
Let
θ = αt_2 / (αt_2 + (1 − α)t_1).
We have
(θx_1 + (1 − θ)x_2) / (θt_1 + (1 − θ)t_2) = A/B.
Further, (A, B) ∈ S because (x_1, t_1), (x_2, t_2) ∈ S

Perspective and linear-fractional function V
and S is convex.
Inverse image is convex: given a convex set C,
P^{−1}(C) = {(x, t) | P(x, t) = x/t ∈ C}
is convex.

Perspective and linear-fractional function VI
Let
(x_1, t_1): x_1/t_1 ∈ C,  (x_2, t_2): x_2/t_2 ∈ C.
Do we have
θ(x_1, t_1) + (1 − θ)(x_2, t_2) ∈ P^{−1}(C)?
That is,
(θx_1 + (1 − θ)x_2) / (θt_1 + (1 − θ)t_2) ∈ C?

Perspective and linear-fractional function VII
Earlier we had
(θx_1 + (1 − θ)x_2) / (θt_1 + (1 − θ)t_2) = α x_1/t_1 + (1 − α) x_2/t_2,  θ = αt_2 / (αt_2 + (1 − α)t_1).
Then
(α(t_2 − t_1) + t_1)θ = αt_2,

Perspective and linear-fractional function VIII
t_1 θ = αt_2 − αt_2 θ + αt_1 θ,
α = t_1 θ / (t_1 θ + (1 − θ)t_2).

2-16: Generalized inequalities I
K contains no line: (x ∈ K and −x ∈ K) ⟹ x = 0.
Nonnegative polynomials on [0, 1]: when n = 2 the cone is
{(x_1, x_2) | x_1 + t x_2 ≥ 0 for all t ∈ [0, 1]}.
The figure shows the half-plane obtained at t = 1.

2-16: Generalized inequalities II
The figure shows the half-plane obtained at t = 0.

2-16: Generalized inequalities III
Intersecting over all t ∈ [0, 1], it really becomes a proper cone.

2-17: I
Properties: x ⪯_K y, u ⪯_K v implies x + u ⪯_K y + v.
We have y − x ∈ K and v − u ∈ K. From the definition of a convex cone,
(y − x) + (v − u) ∈ K.
Then x + u ⪯_K y + v.

2-18: Minimum and minimal elements I
The minimum element: S ⊆ x_1 + K.
A minimal element: (x_2 − K) ∩ S = {x_2}.

2-19: Separating hyperplane theorem I
We consider a simplified situation and omit part of the proof. Assume
inf_{u ∈ C, v ∈ D} ||u − v|| > 0
and the minimum is attained at c, d. We will show that
a = d − c,  b = (||d||^2 − ||c||^2)/2
forms a separating hyperplane a^T x = b.

2-19: Separating hyperplane theorem II
(figure: c, d, and the hyperplane a^T x = b)
a = d − c,  b = (β_1 + β_2)/2 with β_1 = (d − c)^T d and β_2 = (d − c)^T c,
so b = (d^T d − c^T c)/2.

2-19: Separating hyperplane theorem III
Assume the result is wrong, so there is u ∈ D such that
a^T u − b < 0.
We will derive a point ū in D that is closer to c than d, that is, ||ū − c|| < ||d − c||. Then we have a contradiction. The concept:

2-19: Separating hyperplane theorem IV
(figure: c, ū, u, d, and a^T x = b)
a^T u − b = (d − c)^T u − (d^T d − c^T c)/2 < 0
implies that
(d − c)^T (u − d) + (1/2)||d − c||^2 < 0.

2-19: Separating hyperplane theorem V
d/dt ||d + t(u − d) − c||^2 |_{t=0} = 2(d + t(u − d) − c)^T (u − d) |_{t=0} = 2(d − c)^T (u − d) < 0.
There exists a small t ∈ (0, 1) such that
||d + t(u − d) − c|| < ||d − c||.
However, d + t(u − d) ∈ D, so there is a contradiction.

2-20: Supporting hyperplane theorem I
Case 1: C has an interior.
Consider two sets: the interior of C versus {x_0}, where x_0 is any boundary point.
If C is convex, then the interior of C is also convex, so both sets are convex.
We can apply the results on slide 2-19, so there exists a such that
a^T x ≤ a^T x_0 for all x in the interior of C.

2-20: Supporting hyperplane theorem II
Then for every boundary point x we also have a^T x ≤ a^T x_0, because any boundary point is the limit of interior points.
Case 2: C has no interior.
In this situation, C is like a line in R^3 (so it has no interior). Then of course it has a supporting hyperplane.

3-4: Examples on R^n and R^{m×n} I
A, X ∈ R^{m×n}:
tr(A^T X) = Σ_j (A^T X)_{jj} = Σ_j Σ_i (A^T)_{ji} X_{ij} = Σ_i Σ_j A_{ij} X_{ij}

3-7: First-order Condition I
An open set: for any x, there is a ball around x contained in the set.
Global underestimator: the tangent line at x satisfies
(z − f(x)) / (y − x) = f'(x),  i.e.,  z = f(x) + f'(x)(y − x).

3-7: First-order Condition II
⟹: Because dom f is convex, x + t(y − x) ∈ dom f for all 0 < t ≤ 1. From convexity,
f(x + t(y − x)) ≤ (1 − t)f(x) + t f(y),
f(y) ≥ f(x) + (f(x + t(y − x)) − f(x)) / t.
Letting t → 0,
f(y) ≥ f(x) + ∇f(x)^T (y − x).

3-7: First-order Condition III
⟸: For any 0 ≤ θ ≤ 1, let z = θx + (1 − θ)y. Then
f(x) ≥ f(z) + ∇f(z)^T (x − z) = f(z) + ∇f(z)^T (1 − θ)(x − y),
f(y) ≥ f(z) + ∇f(z)^T (y − z) = f(z) + ∇f(z)^T θ(y − x),
θf(x) + (1 − θ)f(y) ≥ f(z).

3-7: First-order Condition IV
First-order condition for strictly convex functions: f is strictly convex if and only if
f(y) > f(x) + ∇f(x)^T (y − x) for all x ≠ y.
⟸: easy, by directly modifying ≥ to > above:
f(x) > f(z) + ∇f(z)^T (x − z) = f(z) + ∇f(z)^T (1 − θ)(x − y),
f(y) > f(z) + ∇f(z)^T (y − z) = f(z) + ∇f(z)^T θ(y − x).

3-7: First-order Condition V
⟹: Assume the result is wrong. From the first-order condition of a convex function, there exist x, y with x ≠ y such that
∇f(x)^T (y − x) = f(y) − f(x).   (1)
For this (x, y), from strict convexity,
f(x + t(y − x)) − f(x) < t f(y) − t f(x) = ∇f(x)^T t(y − x), ∀t ∈ (0, 1).

3-7: First-order Condition VI
Therefore
f(x + t(y − x)) < f(x) + ∇f(x)^T t(y − x), ∀t ∈ (0, 1).
However, this contradicts the first-order condition
f(x + t(y − x)) ≥ f(x) + ∇f(x)^T t(y − x), ∀t ∈ (0, 1).
This proof was given by a student of this course before.

3-8: Second-order condition I
Proof of the 2nd-order condition. We consider only the simpler case n = 1.
⟹: From f(x + t) ≥ f(x) + f'(x)t,
lim_{t→0} 2(f(x + t) − f(x) − f'(x)t) / t^2 = lim_{t→0} 2(f'(x + t) − f'(x)) / (2t) = f''(x) ≥ 0.

3-8: Second-order condition II
⟸:
f(x + t) = f(x) + f'(x)t + (1/2) f''(x̃) t^2 ≥ f(x) + f'(x)t,
so f is convex by the 1st-order condition. The extension to general n is straightforward.
If ∇^2 f(x) ≻ 0, then f is strictly convex:

3-8: Second-order condition III
Using the 1st-order condition for strictly convex functions:
f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x̃)(y − x) > f(x) + ∇f(x)^T (y − x).
Note that in our proof of the 1st-order condition for strictly convex functions we used the 2nd-order condition of convex functions, so there is no contradiction here.

3-8: Second-order condition IV
It's possible that f is strictly convex but ∇^2 f(x) is not positive definite. Example:
f(x) = x^4.
Details omitted.

3-9: Examples I
Quadratic-over-linear: f(x, y) = x^2/y.
∂f/∂x = 2x/y, ∂f/∂y = −x^2/y^2,
∂^2f/∂x^2 = 2/y, ∂^2f/∂x∂y = −2x/y^2, ∂^2f/∂y^2 = 2x^2/y^3,
∇^2 f(x, y) = (2/y^3) [y; −x][y; −x]^T = (2/y^3) [y^2, −xy; −xy, x^2] ⪰ 0.

3-10 I
f(x) = log Σ_k exp x_k,
∇f(x) = (1/Σ_k e^{x_k}) [e^{x_1}, ..., e^{x_n}]^T,
∇^2_{ii} f = ((Σ_k e^{x_k}) e^{x_i} − e^{x_i} e^{x_i}) / (Σ_k e^{x_k})^2,
∇^2_{ij} f = −e^{x_i} e^{x_j} / (Σ_k e^{x_k})^2, i ≠ j.
Note that if z_k = exp x_k

3-10 II
then (zz^T)_{ij} = z_i (z^T)_j = z_i z_j.
Cauchy-Schwarz inequality:
(a_1 b_1 + ... + a_n b_n)^2 ≤ (a_1^2 + ... + a_n^2)(b_1^2 + ... + b_n^2).
Note that a_k = v_k √z_k, b_k = √z_k, z_k > 0.
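A small numeric check of the Hessian derived above, assuming NumPy and a random test point (not from the slides): with z_k = exp x_k, the Hessian diag(z)/(1^T z) − zz^T/(1^T z)^2 is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(6)
z = np.exp(x)
s = z.sum()
H = np.diag(z) / s - np.outer(z, z) / s**2      # Hessian of log-sum-exp at x
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))  # True: H is PSD
```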

3-12: Jensen's inequality I
General form:
f(∫ p(z) z dz) ≤ ∫ p(z) f(z) dz.
Discrete situation:
f(Σ p_i z_i) ≤ Σ p_i f(z_i),  Σ p_i = 1.

3-12: Jensen's inequality II
Proof (three points):
f(p_1 z_1 + p_2 z_2 + p_3 z_3)
≤ (1 − p_3) f((p_1 z_1 + p_2 z_2) / (1 − p_3)) + p_3 f(z_3)
≤ (1 − p_3) ((p_1/(1 − p_3)) f(z_1) + (p_2/(1 − p_3)) f(z_2)) + p_3 f(z_3)
= p_1 f(z_1) + p_2 f(z_2) + p_3 f(z_3).
Note that p_1/(1 − p_3) + p_2/(1 − p_3) = (1 − p_3)/(1 − p_3) = 1.

Positive weighted sum & composition with affine function I
Composition with affine function: we know f(x) is convex. Is g(x) = f(Ax + b)

Positive weighted sum & composition with affine function II
convex?
g((1 − α)x_1 + αx_2) = f(A((1 − α)x_1 + αx_2) + b)
= f((1 − α)(Ax_1 + b) + α(Ax_2 + b))
≤ (1 − α)f(Ax_1 + b) + αf(Ax_2 + b)
= (1 − α)g(x_1) + αg(x_2)

3-15: Pointwise maximum I
Proof of the convexity:
f((1 − α)x_1 + αx_2)
= max(f_1((1 − α)x_1 + αx_2), ..., f_m((1 − α)x_1 + αx_2))
≤ max((1 − α)f_1(x_1) + αf_1(x_2), ..., (1 − α)f_m(x_1) + αf_m(x_2))
≤ (1 − α) max(f_1(x_1), ..., f_m(x_1)) + α max(f_1(x_2), ..., f_m(x_2))
= (1 − α)f(x_1) + αf(x_2)

3-15: Pointwise maximum II
For
f(x) = x_[1] + ... + x_[r]
consider all (n choose r) combinations.

3-16: Pointwise supremum I
The proof is similar to pointwise maximum.
Support function of a set C: when y is fixed, f(x, y) = y^T x is linear (convex) in x.
Maximum eigenvalue of a symmetric matrix: f(X, y) = y^T X y is a linear function of X when y is fixed.

3-19: Minimization I
Proof: Let ε > 0. There exist y_1, y_2 ∈ C such that
f(x_1, y_1) ≤ g(x_1) + ε,  f(x_2, y_2) ≤ g(x_2) + ε.
Then
g(θx_1 + (1 − θ)x_2) = inf_{y ∈ C} f(θx_1 + (1 − θ)x_2, y)
≤ f(θx_1 + (1 − θ)x_2, θy_1 + (1 − θ)y_2)
≤ θf(x_1, y_1) + (1 − θ)f(x_2, y_2)
≤ θg(x_1) + (1 − θ)g(x_2) + ε.

3-19: Minimization II
Note that the first inequality uses the property that C is convex, so that
θy_1 + (1 − θ)y_2 ∈ C.
Because the above inequality holds for all ε > 0,
g(θx_1 + (1 − θ)x_2) ≤ θg(x_1) + (1 − θ)g(x_2).

3-19: Minimization III
First example: the goal is to prove
A − BC^{−1}B^T ⪰ 0.
Instead of a direct proof, here we use the property on this slide. First we have that f(x, y) is convex in (x, y) because
[A, B; B^T, C] ⪰ 0.
Consider min_y f(x, y).

3-19: Minimization IV
Because C ≻ 0, the minimum occurs at
2Cy + 2B^T x = 0.
Then y = −C^{−1}B^T x and
g(x) = x^T A x − 2x^T BC^{−1}B^T x + x^T BC^{−1}CC^{−1}B^T x = x^T (A − BC^{−1}B^T) x

3-19: Minimization V
is convex. The second-order condition implies that
A − BC^{−1}B^T ⪰ 0.

3-21: the conjugate function I
This function is useful later.
When y is fixed, the maximum happens at
y = f'(x)   (2)
by taking the derivative with respect to x.

3-21: the conjugate function II
Explanation of the figure: when y is fixed, z = xy is a straight line passing through the origin, where y is the slope of the line. Check under which x the distance between yx and f(x) is largest. From the figure, the largest distance happens when (2) holds.

3-21: the conjugate function III
About the point (0, −f*(y)): the tangent line is
(z − f(x_0)) / (x − x_0) = f'(x_0),
where x_0 is the point satisfying y = f'(x_0). When x = 0,
z = −x_0 f'(x_0) + f(x_0) = −x_0 y + f(x_0) = −f*(y).

3-21: the conjugate function IV
f* is convex: given x, y^T x − f(x) is linear (convex) in y. Then we apply the property of pointwise supremum.

3-22: examples I
Negative logarithm: f(x) = −log x.
∂/∂x (xy + log x) = y + 1/x = 0.
If y < 0, the picture of xy + log x (figure).

3-22: examples II
Then
max_x (xy + log x) = −1 − log(−y).

3-22: examples III
Strictly convex quadratic: f(x) = (1/2) x^T Q x.
Qx = y, x = Q^{−1}y,
y^T x − (1/2) x^T Q x = y^T Q^{−1} y − (1/2) y^T Q^{−1} Q Q^{−1} y = (1/2) y^T Q^{−1} y.
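A numeric check of the quadratic example, assuming NumPy/SciPy and made-up data (not from the slides): computing f*(y) = sup_x (y^T x − f(x)) by a generic optimizer matches the closed form (1/2) y^T Q^{−1} y.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 4 * np.eye(4)          # symmetric positive definite
y = rng.standard_normal(4)

neg_obj = lambda x: -(y @ x - 0.5 * x @ Q @ x)        # negate to maximize
conj_numeric = -minimize(neg_obj, np.zeros(4)).fun
conj_formula = 0.5 * y @ np.linalg.solve(Q, y)
print(np.isclose(conj_numeric, conj_formula))         # True
```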

3-23: quasiconvex functions I
Figure on the slide: S_α = [a, b], S_β = (−∞, c]. Both are convex.
The figure is an example showing that a quasiconvex function may not be convex.

3-26: properties of quasiconvex functions I
Modified Jensen inequality: f is quasiconvex if and only if
f(θx + (1 − θ)y) ≤ max{f(x), f(y)}, ∀x, y, θ ∈ [0, 1].
⟹: Let α = max{f(x), f(y)}. S_α is convex, x ∈ S_α, y ∈ S_α, so
θx + (1 − θ)y ∈ S_α.

3-26: properties of quasiconvex functions II
Hence f(θx + (1 − θ)y) ≤ α and the result is obtained.
⟸: If the result is wrong, there exists α such that S_α is not convex: there are x, y, θ with x, y ∈ S_α, θ ∈ [0, 1] such that
θx + (1 − θ)y ∉ S_α.

3-26: properties of quasiconvex functions III
Then
f(θx + (1 − θ)y) > α ≥ max{f(x), f(y)}.
This violates the assumption.
First-order condition (this is exercise 3.43):

3-26: properties of quasiconvex functions IV
⟹: Assume f(y) ≤ f(x). Then
f((1 − t)x + ty) ≤ max(f(x), f(y)) = f(x),
(f(x + t(y − x)) − f(x)) / t ≤ 0,
lim_{t→0} (f(x + t(y − x)) − f(x)) / t = ∇f(x)^T (y − x) ≤ 0.

3-26: properties of quasiconvex functions V
⟸: If the result is wrong, there exists α such that S_α is not convex: there are x, y, θ with x, y ∈ S_α, θ ∈ [0, 1] such that
θx + (1 − θ)y ∉ S_α.
Then
f(θx + (1 − θ)y) > α ≥ max{f(x), f(y)}.   (3)

3-26: properties of quasiconvex functions VI
Because f is differentiable, it is continuous. Without loss of generality, we have
f(z) ≥ f(x), f(y) for z between x and y.
Then for z = x + θ(y − x), θ ∈ (0, 1), the first-order condition gives
∇f(z)^T (−θ(y − x)) ≤ 0,  ∇f(z)^T (y − x − θ(y − x)) ≤ 0,
so ∇f(z)^T (y − x) = 0, θ ∈ (0, 1).

3-26: properties of quasiconvex functions VII
This contradicts (3):
f(x + θ(y − x)) = f(x) + ∇f(z)^T θ(y − x) = f(x), ∀θ ∈ [0, 1).

3-27: Log-concave and log-convex functions I
Powers: log(x^a) = a log x, and log x is concave.
Probability densities:
log f(x) = −(1/2)(x − x̄)^T Σ^{−1} (x − x̄) + constant.
Σ^{−1} is positive definite, thus log f(x) is concave.

3-27: Log-concave and log-convex functions II
Cumulative Gaussian distribution: Φ(x) ∝ ∫_{−∞}^x e^{−u^2/2} du, so
d/dx log Φ(x) = e^{−x^2/2} / ∫_{−∞}^x e^{−u^2/2} du,
d^2/dx^2 log Φ(x) = ((∫_{−∞}^x e^{−u^2/2} du) e^{−x^2/2} (−x) − e^{−x^2/2} e^{−x^2/2}) / (∫_{−∞}^x e^{−u^2/2} du)^2.

3-27: Log-concave and log-convex functions III
Need to prove that
(∫_{−∞}^x e^{−u^2/2} du) x + e^{−x^2/2} > 0.
Because x ≥ u for all u ∈ (−∞, x],

3-27: Log-concave and log-convex functions IV
we have
(∫_{−∞}^x e^{−u^2/2} du) x + e^{−x^2/2}
= ∫_{−∞}^x x e^{−u^2/2} du + e^{−x^2/2}
≥ ∫_{−∞}^x u e^{−u^2/2} du + e^{−x^2/2}
= [−e^{−u^2/2}]_{−∞}^x + e^{−x^2/2}
= −e^{−x^2/2} + e^{−x^2/2} = 0.

3-27: Log-concave and log-convex functions V
This proof was given by a student (and polished by another student) of this course before.
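A numeric illustration of the inequality above, assuming NumPy/SciPy (not from the slides): x ∫_{−∞}^x e^{−u²/2} du + e^{−x²/2} stays positive on a grid, which is what makes the Gaussian CDF log-concave.

```python
import numpy as np
from scipy.stats import norm

xs = np.linspace(-6, 6, 2001)
integral = norm.cdf(xs) * np.sqrt(2 * np.pi)   # int_{-inf}^x e^{-u^2/2} du
vals = xs * integral + np.exp(-xs**2 / 2)
print(vals.min() > 0)                          # True
```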

4-3: Optimal and locally optimal points I
f_0(x) = 1/x;  f_0(x) = x log x.

4-3: Optimal and locally optimal points II
f_0(x) = x^3 − 3x.
For x log x: f_0'(x) = 1 + log x = 0, so x = e^{−1} = 1/e.

4-3: Optimal and locally optimal points III
For x^3 − 3x: f_0'(x) = 3x^2 − 3 = 0, so x = ±1.

4-9: Optimality criterion for differentiable f_0 I
⟸: easy. From the first-order condition,
f_0(y) ≥ f_0(x) + ∇f_0(x)^T (y − x).
Together with ∇f_0(x)^T (y − x) ≥ 0 we have
f_0(y) ≥ f_0(x) for all feasible y.

4-9: Optimality criterion for differentiable f_0 II
⟹: Assume the result is wrong. Then ∇f_0(x)^T (y − x) < 0 for some feasible y. Let
z(t) = ty + (1 − t)x.
d/dt f_0(z(t)) = ∇f_0(z(t))^T (y − x),
d/dt f_0(z(t)) |_{t=0} = ∇f_0(x)^T (y − x) < 0.

4-9: Optimality criterion for differentiable f_0 III
There exists t such that f_0(z(t)) < f_0(x). Note that z(t) is feasible because
f_i(z(t)) ≤ t f_i(y) + (1 − t) f_i(x) ≤ 0
and
A(ty + (1 − t)x) = tAy + (1 − t)Ax = tb + (1 − t)b = b.

4-9: Optimality criterion for differentiable f_0 IV
(figure)

4-10 I
Unconstrained problem: let y = x − t∇f_0(x). It is feasible (unconstrained problem). The optimality condition implies
∇f_0(x)^T (y − x) = −t ||∇f_0(x)||^2 ≥ 0.
Thus ∇f_0(x) = 0.
Equality constrained problem:

4-10 II
⟸: easy. For any feasible y, Ay = b, so
∇f_0(x)^T (y − x) = −ν^T A(y − x) = −ν^T (b − b) = 0 ≥ 0.
So x is optimal.
⟹: more complicated. We only give a rough explanation.

4-10 III
From the optimality condition,
∇f_0(x)^T ν = ∇f_0(x)^T ((x + ν) − x) ≥ 0, ∀ν ∈ N(A).
N(A) is a subspace (illustrated in 2-D), thus
ν ∈ N(A) ⟹ −ν ∈ N(A).

4-10 IV
(figure: ∇f_0 is orthogonal to N(A))

4-10 V
We have
∇f_0(x)^T ν = 0, ∀ν ∈ N(A),
so ∇f_0(x) ⟂ N(A) and ∇f_0(x) ∈ R(A^T):
there exists ν such that ∇f_0(x) + A^T ν = 0.
Minimization over the nonnegative orthant:
⟸: easy.

4-10 VI
For any y ⪰ 0,
∇_i f_0(x)(y_i − x_i) = { ∇_i f_0(x) y_i ≥ 0 if x_i = 0;  0 if x_i > 0 }.
Therefore
∇f_0(x)^T (y − x) ≥ 0
and x is optimal.

4-10 VII
⟹: If x_i = 0, we claim ∇_i f_0(x) ≥ 0. Otherwise, ∇_i f_0(x) < 0. Let y = x except y_i > 0; then
∇f_0(x)^T (y − x) = ∇_i f_0(x)(y_i − x_i) < 0.
This violates the optimality condition.

4-10 VIII
If x_i > 0, we claim ∇_i f_0(x) = 0. Otherwise, assume ∇_i f_0(x) > 0. Consider y = x except
y_i = x_i/2 > 0.

4-10 IX
It is feasible. Then
∇f_0(x)^T (y − x) = ∇_i f_0(x)(y_i − x_i) = −∇_i f_0(x) x_i/2 < 0
violates the optimality condition. The situation for ∇_i f_0(x) < 0 is similar.

4-23: examples I
c̄ = E(C),  Σ = E_C((C − c̄)(C − c̄)^T),
Var(C^T x) = E_C((C^T x − c̄^T x)(C^T x − c̄^T x)) = E_C(x^T (C − c̄)(C − c̄)^T x) = x^T Σ x

4-25: second-order cone programming I
The cone was defined on slide 2-8:
{(x, t) | ||x|| ≤ t}

4-35: generalized inequality constraint I
f_i : R^n → R^{k_i} is K_i-convex:
f_i(θx + (1 − θ)y) ⪯_{K_i} θf_i(x) + (1 − θ)f_i(y).
See page 3-31.

4-37: LP and SOCP as SDP I
LP and equivalent SDP: with
Ax = [a_11 ... a_1n; ...; a_m1 ... a_mn][x_1; ...; x_n],
the constraint Ax ⪯ b is the same as
x_1 diag(a_11, ..., a_m1) + ... + x_n diag(a_1n, ..., a_mn) ⪯ diag(b_1, ..., b_m).

4-37: LP and SOCP as SDP II
For SOCP as SDP we will use the result proved on 4-39:
[t I_{p×p}, A_{p×q}; A^T, t I_{q×q}] ⪰ 0 ⟺ A^T A ⪯ t^2 I_{q×q}, t ≥ 0.
Now
p = m, q = 1, A = A_i x + b_i, t = c_i^T x + d_i.
Thus
||A_i x + b_i||^2 ≤ (c_i^T x + d_i)^2, c_i^T x + d_i ≥ 0 ⟺ ||A_i x + b_i|| ≤ c_i^T x + d_i.

4-39: matrix norm minimization I
Following 4-38, we have the equivalent problem
min t  subject to ||A||_2 ≤ t.
We then use
||A||_2 ≤ t ⟺ A^T A ⪯ t^2 I, t ≥ 0 ⟺ [tI, A; A^T, tI] ⪰ 0

4-39: matrix norm minimization II
to get the SDP
min t  subject to [tI, A(x); A(x)^T, tI] ⪰ 0.
Next we prove
[t I_{p×p}, A_{p×q}; A^T, t I_{q×q}] ⪰ 0 ⟺ A^T A ⪯ t^2 I_{q×q}, t ≥ 0.

4-39: matrix norm minimization III
⟹: we immediately have t ≥ 0. If t > 0, take the vector (Av, −tv):
[Av; −tv]^T [tI, A; A^T, tI] [Av; −tv]
= [Av; −tv]^T [tAv − tAv; A^T A v − t^2 v]
= t(t^2 v^T v − v^T A^T A v) ≥ 0,
so v^T (t^2 I − A^T A) v ≥ 0, ∀v,

4-39: matrix norm minimization IV
and hence
t^2 I − A^T A ⪰ 0.
If t = 0, take (−Av, v):
[−Av; v]^T [0, A; A^T, 0] [−Av; v] = [−Av; v]^T [Av; −A^T A v] = −2 v^T A^T A v ≥ 0, ∀v.
Therefore A^T A ⪯ 0.

4-39: matrix norm minimization V
⟸: Consider
[u; v]^T [tI, A; A^T, tI] [u; v] = [u; v]^T [tu + Av; A^T u + tv] = t u^T u + 2 v^T A^T u + t v^T v.
We hope to have
t u^T u + 2 v^T A^T u + t v^T v ≥ 0, ∀(u, v).

4-39: matrix norm minimization VI
If t > 0,
min_u t u^T u + 2 v^T A^T u + t v^T v
has its optimum at
u = −Av/t.
We have
t u^T u + 2 v^T A^T u + t v^T v = t v^T v − v^T A^T A v / t = (1/t) v^T (t^2 I − A^T A) v ≥ 0.

4-39: matrix norm minimization VII
Hence
[tI, A; A^T, tI] ⪰ 0.
If t = 0, A^T A ⪯ 0. Thus
v^T A^T A v ≤ 0, but v^T A^T A v = ||Av||^2 ≥ 0, so Av = 0, ∀v.
Then
[u; v]^T [0, A; A^T, 0] [u; v] = [u; v]^T [Av; A^T u] = u^T Av + v^T A^T u = 0.

4-39: matrix norm minimization VIII
Thus
[0, A; A^T, 0] ⪰ 0.
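A numeric check of the equivalence just proved, assuming NumPy and random matrices (not part of the slides): positive semidefiniteness of the block matrix agrees with ||A||_2 ≤ t.

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(1000):
    A = rng.standard_normal((3, 2))
    t = rng.uniform(0, 4)
    M = np.block([[t * np.eye(3), A], [A.T, t * np.eye(2)]])
    lhs = np.all(np.linalg.eigvalsh(M) >= -1e-9)    # [tI A; A^T tI] >= 0
    rhs = np.linalg.norm(A, 2) <= t + 1e-9          # ||A||_2 <= t
    assert lhs == rhs
```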

4-40: Vector optimization I
Though f_0(x) is a vector, note that each f_i(x) is still R^n → R.
K-convex: see 3-31, though we didn't discuss it earlier.

4-41: optimal and pareto optimal points I
Optimal: O ⊆ {x} + K.
Pareto optimal: ({x} − K) ∩ O = {x}.

5-3: Lagrange dual function I
Note that g is concave no matter whether the original problem is convex or not:
f_0(x) + Σ λ_i f_i(x) + Σ ν_i h_i(x)
is convex (linear) in (λ, ν) for each x.

5-3: Lagrange dual function II
Use pointwise supremum on 3-16:
sup_{x ∈ D} (−f_0(x) − Σ λ_i f_i(x) − Σ ν_i h_i(x))
is convex. Hence
inf_{x ∈ D} (f_0(x) + Σ λ_i f_i(x) + Σ ν_i h_i(x))
is concave. Note that sup(−···) = convex ⟹ inf(···) = concave.

5-8: Lagrange dual and conjugate function I
f_0^*(−A^T λ − C^T ν) = sup_x ((−A^T λ − C^T ν)^T x − f_0(x)) = −inf_x (f_0(x) + (A^T λ + C^T ν)^T x)

5-10: weak and strong duality I
We don't discuss the SDP problem on this slide because we omitted 5-7.

5-11: Slater's constraint qualification I
We omit the proof because of no time.
"Linear inequalities do not need to hold with strict inequality": for linear inequalities we DO NOT need a constraint qualification. We will see some explanation later.

5-12: inequality form LP I
If we have only linear constraints, then the constraint qualification holds.

5-15: geometric interpretation I
Explanation of g(λ): when λ is fixed,
λu + t = Δ
is a line. We lower Δ until the line touches the boundary of G. That value of Δ then becomes g(λ). When u = 0, t = Δ, so we see the point marked as g(λ) on the t-axis.

5-15: geometric interpretation II
We have λ ≥ 0, so the line λu + t = Δ must be tilted as in the first figure rather than the second.

5-15: geometric interpretation III
Explanation of p*: in G, only points satisfying u ≤ 0 are feasible.

5-16 I
We do not discuss a formal proof of "Slater condition ⟹ strong duality". Instead, we explain this result by figures.
Reason for using A: G may not be convex. Example:
min x^2 subject to x + 2 ≤ 0

5-16 II
This is a convex optimization problem.
G = {(x + 2, x^2) | x ∈ R}
is only a quadratic curve.

5-16 III
The curve is not convex. However, A is convex. (figure: A)

5-16 IV
Primal problem: x* = −2, optimal objective value = 4.
Dual problem:
g(λ) = min_x x^2 + λ(x + 2),  x = −λ/2,
max_{λ ≥ 0} −λ^2/4 + 2λ,  optimal λ = 4,

5-16 V
optimal objective value = −16/4 + 8 = 4.
Proving that A is convex: let (u_1, t_1) ∈ A, (u_2, t_2) ∈ A. There exist x_1, x_2 such that
f_1(x_1) ≤ u_1, f_0(x_1) ≤ t_1,
f_1(x_2) ≤ u_2, f_0(x_2) ≤ t_2.
Consider
x = θx_1 + (1 − θ)x_2.

5-16 VI
We have
f_1(x) ≤ θu_1 + (1 − θ)u_2,  f_0(x) ≤ θt_1 + (1 − θ)t_2.
So
[u; t] = θ[u_1; t_1] + (1 − θ)[u_2; t_2] ∈ A.
Note that we have "Slater condition ⟹ strong duality". However, it's possible that the Slater condition doesn't hold but strong duality holds.

5-16 VII
Example from exercise 5.22:
min x subject to x^2 ≤ 0.
The Slater condition doesn't hold because no x satisfies x^2 < 0.
G = {(x^2, x) | x ∈ R}

5-16 VIII
(figure: G in the (u, t) plane)
There is only one feasible point, (0, 0).

5-16 IX
(figure: A in the (u, t) plane)

5-16 X
Dual problem:
g(λ) = min_x x + λx^2,
x = −1/(2λ) if λ > 0;  g(λ) = −∞ if λ = 0,
max_{λ ≥ 0} −1/(4λ),  objective value → 0.
Strong duality holds: d* = 0, p* = 0.
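A numeric illustration of the first worked example above, assuming NumPy and a simple grid search (not part of the slides): the primal optimum of min x² s.t. x + 2 ≤ 0 and the dual maximum of g(λ) = −λ²/4 + 2λ both equal 4, so strong duality holds.

```python
import numpy as np

lam = np.linspace(0, 10, 100001)
g = -lam**2 / 4 + 2 * lam                 # dual function
print(g.max(), lam[g.argmax()])           # ~4.0 at lambda ~ 4.0

x = np.linspace(-10, 10, 100001)
feasible = x + 2 <= 0
print((x[feasible]**2).min())             # 4.0, attained at x = -2
```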

5-17: complementary slackness I
In deriving the inequality we use h_i(x*) = 0 and f_i(x*) ≤ 0.
Complementary slackness: compare with the earlier results in 4-10.

5-19: KKT conditions for convex problem I
For the problem on p5-16, neither the Slater condition nor the KKT conditions hold: the stationarity condition 1 + λ·0 = 0 has no solution.
Therefore, for convex problems, KKT ⟹ optimality but not vice versa.
Next we explain why for linear constraints we don't need a constraint qualification.

5-19: KKT conditions for convex problem II
Consider the situation of inequality constraints only:
min f_0(x) subject to f_i(x) ≤ 0, i = 1, ..., m.
Consider an optimal solution x. We would like to prove that x satisfies the KKT conditions.
Because x is optimal, from the optimality condition on slide 4-9, for any feasible direction δx,
∇f_0(x)^T δx ≥ 0.

5-19: KKT conditions for convex problem III
A feasible δx means
f_i(x + δx) ≤ 0, ∀i.
Because
f_i(x + δx) ≈ f_i(x) + ∇f_i(x)^T δx
and f_i(x) ≤ 0, ∀i, a feasible direction satisfies
∇f_i(x)^T δx ≤ 0 if f_i(x) = 0.

5-19: KKT conditions for convex problem IV
We claim that
∇f_0(x) = −Σ_{λ_i ≥ 0, f_i(x)=0} λ_i ∇f_i(x).   (4)
Assume the result is wrong. First let's consider
∇f_0(x) = linear combination of {∇f_i(x) | f_i(x) = 0} + Δ,
where Δ ≠ 0 and Δ^T ∇f_i(x) = 0, ∀i: f_i(x) = 0.

5-19: KKT conditions for convex problem V
Then there exists α < 0 such that
δx = αΔ
satisfies
∇f_i(x)^T δx = 0 if f_i(x) = 0,
f_i(x + δx) ≤ 0 if f_i(x) < 0,
and
∇f_0(x)^T δx = αΔ^T Δ < 0.
This contradicts the optimality condition.

5-19: KKT conditions for convex problem VI
By a similar setting we can further prove (4), i.e., that the coefficients can be taken nonnegative. Assume
∇f_0(x) = −Σ_{i: f_i(x)=0} λ_i ∇f_i(x)
and there exists ī such that
λ_ī < 0, ∇f_ī(x) ≠ 0, and f_ī(x) = 0.

5-19: KKT conditions for convex problem VII
Let
λ̄ = arg min_λ ||∇f_ī(x) − Σ_{i: i ≠ ī, f_i(x)=0} λ_i ∇f_i(x)||
and
Δ = ∇f_ī(x) − Σ_{i: i ≠ ī, f_i(x)=0} λ̄_i ∇f_i(x),
so that
∇f_i(x)^T Δ = 0, ∀i ≠ ī, f_i(x) = 0.   (5)

5-19: KKT conditions for convex problem VIII
We have
∇f_ī(x)^T Δ ≥ 0.
If
∇f_ī(x)^T Δ = 0,
then λ_ī ∇f_ī(x) can be rearranged as a linear combination of {∇f_i(x) | i ≠ ī, f_i(x) = 0}.

5-19: KKT conditions for convex problem IX
Otherwise
∇f_ī(x)^T Δ > 0.
Let
δx = αΔ, α < 0.
From (5),
∇f_i(x)^T δx = 0, ∀i ≠ ī, f_i(x) = 0,
∇f_ī(x)^T δx = α ∇f_ī(x)^T Δ < 0.
Hence δx is a feasible direction.

5-19: KKT conditions for convex problem X
However,
∇f_0(x)^T δx = −αλ_ī ∇f_ī(x)^T Δ < 0
contradicts the optimality condition.
This proof is not rigorous because of the first-order approximation of f_i(x + δx); for linear constraints the approximation is exact and the proof becomes rigorous.

5-25 I
Explanation of f_0^*(ν):
inf_y (f_0(y) − ν^T y) = −sup_y (ν^T y − f_0(y)) = −f_0^*(ν),
where f_0^*(ν) is the conjugate function.

5-26 I
The original problem: minimize ||Ax − b||. In the dual function, the key term is inf_y (||y|| + ν^T y).
Dual norm:
||ν||_* = sup{ν^T y | ||y|| ≤ 1}.
If ||ν||_* > 1, there exists ȳ with ||ȳ|| ≤ 1 and ν^T ȳ > 1.

5-26 II
Take y* = −ȳ: ||y*|| + ν^T y* = ||ȳ|| − ν^T ȳ < 0. Hence
||ty*|| + ν^T (ty*) → −∞ as t → ∞,
so inf_y (||y|| + ν^T y) = −∞.
If ||ν||_* ≤ 1, we claim that
inf_y (||y|| + ν^T y) = 0:
y = 0 gives ||y|| + ν^T y = 0.

5-26 III
If there were y such that ||y|| + ν^T y < 0, then ||y|| < −ν^T y. We could scale y so that
sup{ν^T y | ||y|| ≤ 1} > 1,
but this causes a contradiction.

5-27: implicit constraint I
The dual function:
c^T x + ν^T (Ax − b) = −b^T ν + x^T (A^T ν + c),
inf_{−1 ≤ x_i ≤ 1} x_i (A^T ν + c)_i = −|(A^T ν + c)_i|.

5-30: semidefinite program I
From 5-29 we need Z to be nonnegative in the dual cone of S^k_+.
The dual cone of S^k_+ is S^k_+ (we didn't discuss dual cones, so we assume this result).
Why tr(Z(···))? We are supposed to do a component-wise product between Z and
x_1 F_1 + ... + x_n F_n − G.

5-30: semidefinite program II
The trace is the component-wise product:
tr(AB) = Σ_i (AB)_{ii} = Σ_i Σ_j A_{ij} B_{ji} = Σ_i Σ_j A_{ij} B_{ij}.
Note that we use the property that B is symmetric.
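A quick numeric check of the trace identity above, assuming NumPy and random matrices (not part of the slides): for symmetric B, tr(AB) equals the sum of the component-wise products.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
S = rng.standard_normal((4, 4))
B = S + S.T                                          # symmetric
print(np.isclose(np.trace(A @ B), np.sum(A * B)))    # True
```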

7-4 I
Uniform noise:
p(z) = 1/(2a) if |z| ≤ a, 0 otherwise.

8-10: Dual of maximum margin problem I
Lagrangian:
||a||/2 − Σ_i λ_i (a^T x_i + b − 1) + Σ_i μ_i (a^T y_i + b + 1)
= ||a||/2 + a^T (−Σ_i λ_i x_i + Σ_i μ_i y_i) + b(−Σ_i λ_i + Σ_i μ_i) + Σ_i λ_i + Σ_i μ_i.

8-10: Dual of maximum margin problem II
Because of the term b(−Σ_i λ_i + Σ_i μ_i),
inf_{a,b} L = { inf_a (||a||/2 − Σ_i λ_i a^T x_i + Σ_i μ_i a^T y_i) if Σ_i λ_i = Σ_i μ_i;  −∞ if Σ_i λ_i ≠ Σ_i μ_i }.
For
inf_a (||a||/2 − Σ_i λ_i a^T x_i + Σ_i μ_i a^T y_i)

8-10: Dual of maximum margin problem III
we can denote it as
inf_a (||a||/2 + v^T a),
where v is a vector. We cannot take the derivative because ||a|| is not differentiable. Formal solution:

8-10: Dual of maximum margin problem IV
Case 1: if ||v|| ≤ 1/2:
|a^T v| ≤ ||a|| ||v|| ≤ ||a||/2,
so
||a||/2 + v^T a ≥ 0.
However, a = 0 gives ||a||/2 + v^T a = 0.

8-10: Dual of maximum margin problem V
Therefore
inf_a (||a||/2 + v^T a) = 0.
If ||v|| > 1/2, let a = −tv:
||a||/2 + v^T a = t||v||/2 − t||v||^2 = t||v||(1/2 − ||v||) → −∞ as t → ∞.

8-10: Dual of maximum margin problem VI
Thus
inf_a (||a||/2 + v^T a) = { 0 if ||v|| ≤ 1/2;  −∞ otherwise }.

8-14. θ = vec(p) z i z j q, F (z) =. r z i. 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 157 / 199

10-3: initial point and sublevel set I
The condition that S is closed holds if f(x) → ∞ as x → the boundary of dom f.
Proof: if S were not closed, we could take {x_i} ⊆ S with x_i → a boundary point of dom f. Then f(x_i) → ∞ > f(x_0), contradicting x_i ∈ S.

10-3: initial point and sublevel set II
Example with dom f = R^n:
f(x) = log(Σ_{i=1}^m exp(a_i^T x + b_i)).

10-3: initial point and sublevel set III
Example:
f(x) = −Σ_i log(b_i − a_i^T x).
Here dom f ≠ R^n and we use the condition that f(x) → ∞ as x → the boundary of dom f.

10-4: strong convexity and implications I
S is bounded. Otherwise, there exists a sequence {y_i | y_i = x + Δ_i} ⊆ S with ||Δ_i|| → ∞. Then
f(y_i) ≥ f(x) + ∇f(x)^T Δ_i + (m/2)||Δ_i||^2 → ∞.
This contradicts
f(y_i) ≤ f(x_0).

10-4: strong convexity and implications II
Proof of
p* > −∞  and  f(x) − p* ≤ (1/(2m))||∇f(x)||^2:
From
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)||x − y||^2,
minimize the right-hand side with respect to y:
∇f(x) + m(y − x) = 0,

10-4: strong convexity and implications III
ỹ = x − ∇f(x)/m,
f(y) ≥ f(x) + ∇f(x)^T (ỹ − x) + (m/2)||ỹ − x||^2 = f(x) − (1/(2m))||∇f(x)||^2, ∀y.
Then
p* ≥ f(x) − (1/(2m))||∇f(x)||^2
and
f(x) − p* ≤ (1/(2m))||∇f(x)||^2.

10-5: descent methods I
If
f(x + tΔx) < f(x)
then
∇f(x)^T Δx < 0.
Proof: from the first-order condition of a convex function,
f(x + tΔx) ≥ f(x) + t∇f(x)^T Δx.
Then
t∇f(x)^T Δx ≤ f(x + tΔx) − f(x) < 0.

10-6: line search types I
Why α ∈ (0, 1/2)? The use of 1/2 is for convergence, though we won't discuss the details.
Finite termination of backtracking line search: we argue that there is t̄ > 0 such that
f(x + tΔx) < f(x) + αt∇f(x)^T Δx, ∀t ∈ (0, t̄).
Otherwise, there is a sequence {t_k} → 0

10-6: line search types II
such that
f(x + t_k Δx) ≥ f(x) + αt_k ∇f(x)^T Δx, ∀k.
However,
lim_{t_k → 0} (f(x + t_k Δx) − f(x)) / t_k = ∇f(x)^T Δx ≥ α∇f(x)^T Δx
causes a contradiction, since ∇f(x)^T Δx < 0 and α > 0.

10-6: line search types III
Geometric interpretation: the tangent line passes through (0, f(x)), so its equation is
(y − f(x)) / (t − 0) = ∇f(x)^T Δx.
Because ∇f(x)^T Δx < 0, we see that the line of
f(x) + αt∇f(x)^T Δx

10-6: line search types IV
is above that of
f(x) + t∇f(x)^T Δx.

10-7 I
Linear convergence. We consider exact line search; the proof for backtracking line search is more complicated.
S closed and bounded, ∇^2 f(x) ⪯ MI, ∀x ∈ S:
f(y) ≤ f(x) + ∇f(x)^T (y − x) + (M/2)||y − x||^2.
Solve
min_t f(x) − t∇f(x)^T ∇f(x) + (t^2 M/2) ∇f(x)^T ∇f(x).

10-7 II
t = 1/M:
f(x_next) ≤ f(x − (1/M)∇f(x)) ≤ f(x) − (1/(2M))∇f(x)^T ∇f(x).
The first inequality is from the fact that we use exact line search.
f(x_next) − p* ≤ f(x) − p* − (1/(2M))∇f(x)^T ∇f(x).
From slide 10-4,
||∇f(x)||^2 ≥ 2m(f(x) − p*).

10-7 III
Hence
f(x_next) − p* ≤ (1 − m/M)(f(x) − p*).
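A small sketch of the bound above, assuming NumPy and a made-up strongly convex quadratic (not from the slides): gradient descent with exact line search satisfies f(x_next) − p* ≤ (1 − m/M)(f(x) − p*) with m, M the extreme eigenvalues of the Hessian.

```python
import numpy as np

Q = np.diag([1.0, 10.0])                 # Hessian with m = 1, M = 10, p* = 0
m, M = 1.0, 10.0
f = lambda x: 0.5 * x @ Q @ x

x = np.array([10.0, 1.0])
for _ in range(20):
    g = Q @ x
    t = (g @ g) / (g @ Q @ g)            # exact line search step for a quadratic
    x_next = x - t * g
    assert f(x_next) <= (1 - m / M) * f(x) + 1e-12
    x = x_next
print(f(x))                              # close to p* = 0
```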

10-8 I
Assume
x_1^k = γ((γ − 1)/(γ + 1))^k,  x_2^k = (−(γ − 1)/(γ + 1))^k,
∇f(x_1, x_2) = [x_1; γx_2].
Exact line search:
min_t (1/2)((x_1 − tx_1)^2 + γ(x_2 − tγx_2)^2) = min_t (1/2)(x_1^2(1 − t)^2 + γx_2^2(1 − tγ)^2).

10-8 II
Setting the derivative to zero:
x_1^2(1 − t) + γ^2 x_2^2(1 − tγ) = 0,
t(x_1^2 + γ^3 x_2^2) = x_1^2 + γ^2 x_2^2,
t = (x_1^2 + γ^2 x_2^2) / (x_1^2 + γ^3 x_2^2)
= (γ^2((γ − 1)/(γ + 1))^{2k} + γ^2((γ − 1)/(γ + 1))^{2k}) / (γ^2((γ − 1)/(γ + 1))^{2k} + γ^3((γ − 1)/(γ + 1))^{2k})
= 2γ^2/(γ^2 + γ^3) = 2/(1 + γ).
x^{k+1} = x^k − t∇f(x^k) = [x_1^k(1 − t); x_2^k(1 − γt)].

10-8 III
x_1^{k+1} = γ((γ − 1)/(γ + 1))^k (1 − 2/(1 + γ)) = γ((γ − 1)/(γ + 1))^{k+1},
x_2^{k+1} = (−(γ − 1)/(γ + 1))^k (1 − 2γ/(1 + γ)) = (−(γ − 1)/(γ + 1))^k (−(γ − 1)/(1 + γ)) = (−(γ − 1)/(γ + 1))^{k+1}.
Why is the gradient orthogonal to the tangent line of the contour curve?

10-8 IV
Assume f(g(t)) is the contour with g(0) = x. Then
0 = f(g(t)) − f(g(0)),
0 = lim_{t→0} (f(g(t)) − f(g(0))) / t = ∇f(g(0))^T g'(0) = ∇f(x)^T g'(0),

10-8 V
where
x + t g'(0)
is the tangent line.
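A numeric check of the closed-form iterates derived above, assuming NumPy (not part of the slides): exact-line-search gradient descent on f(x) = (1/2)(x_1² + γx_2²), started at (γ, 1), reproduces x^k = (γ r^k, (−r)^k) with r = (γ − 1)/(γ + 1).

```python
import numpy as np

gamma = 5.0
x = np.array([gamma, 1.0])
r = (gamma - 1) / (gamma + 1)
for k in range(1, 11):
    g = np.array([x[0], gamma * x[1]])                                 # gradient
    t = (x[0]**2 + gamma**2 * x[1]**2) / (x[0]**2 + gamma**3 * x[1]**2)  # exact step
    x = x - t * g
    assert np.allclose(x, [gamma * r**k, (-r)**k])
print(x)
```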

10-10 I
Linear convergence: from slide 10-7,
f(x^k) − p* ≤ c^k (f(x^0) − p*),
log(c^k (f(x^0) − p*)) = k log c + log(f(x^0) − p*)
is a straight line. Note that now k is the x-axis.

10-11: steepest descent method I
(Unnormalized) steepest descent direction:
Δx_sd = ||∇f(x)||_* Δx_nsd.
Here ||·||_* is the dual norm. We didn't discuss much about the dual norm, but we can still explain some examples on 10-12.

10-12 I
Euclidean norm: Δx_nsd is obtained by solving
min ∇f^T v subject to ||v|| = 1.
∇f^T v = ||∇f|| ||v|| cos θ = −||∇f|| when θ = π:
Δx_nsd = −∇f(x)/||∇f(x)||,
Δx_sd = ||∇f(x)|| Δx_nsd = −∇f(x).

10-12 II
Quadratic norm: Δx_nsd is obtained by solving
min ∇f^T v subject to v^T P v = 1.
Now ||v||_P = (v^T P v)^{1/2}, where P is symmetric positive definite.

10-12 III
Let
w = P^{1/2} v.
The optimization problem becomes
min_w ∇f^T P^{−1/2} w subject to ||w|| = 1,
optimal w = −P^{−1/2}∇f / ||P^{−1/2}∇f|| = −P^{−1/2}∇f / √(∇f^T P^{−1} ∇f).

10-12 IV
Optimal
v = −P^{−1}∇f / √(∇f^T P^{−1} ∇f).
Dual norm: ||z||_* = ||P^{−1/2} z||, so ||∇f||_* = √(∇f^T P^{−1} ∇f). Therefore
Δx_sd = √(∇f^T P^{−1} ∇f) · (−P^{−1}∇f / √(∇f^T P^{−1} ∇f)) = −P^{−1}∇f.

10-12 V
Explanation of the figure:
−∇f(x)^T Δx_nsd = ||∇f(x)|| ||Δx_nsd|| cos θ.
||∇f(x)|| is a constant. From a point Δx_nsd on the boundary, the projection onto −∇f(x) indicates ||Δx_nsd|| cos θ. In the figure, we see that the chosen Δx_nsd has the largest ||Δx_nsd|| cos θ.
We omit the discussion of the ℓ1-norm.

10-13 I
The two figures are obtained with two different P matrices; the left one has faster convergence.
Gradient descent after the change of variables x̄ = P^{1/2} x, x = P^{−1/2} x̄:
min_x f(x) becomes min_x̄ f(P^{−1/2} x̄),
x̄ ← x̄ − α P^{−1/2} ∇f(P^{−1/2} x̄),
P^{−1/2} x̄ ← P^{−1/2} x̄ − α P^{−1/2} P^{−1/2} ∇f(x),
x ← x − α P^{−1} ∇f(x).

10-14 I
(figure: f(x) versus x)

10-14 II
Solve f(x) = 0.
Find the tangent line at x_k, the current iterate:
(y − f(x_k)) / (x − x_k) = f'(x_k).
Let y = 0:
x_{k+1} = x_k − f(x_k)/f'(x_k).
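A minimal sketch of this scalar Newton iteration, assuming Python and the made-up example f(x) = x² − 2 (not from the slides):

```python
# Newton iteration x_{k+1} = x_k - f(x_k)/f'(x_k) for f(x) = x^2 - 2
f = lambda x: x**2 - 2.0
fprime = lambda x: 2.0 * x

x = 1.0
for _ in range(6):
    x = x - f(x) / fprime(x)
print(x)          # converges to sqrt(2) ~ 1.41421356...
```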

10-16 I
f̂(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x)(y − x),
∇f̂(y) = 0 ⟹ ∇f(x) + ∇^2 f(x)(y − x) = 0 ⟹ y − x = −∇^2 f(x)^{−1} ∇f(x),
inf_y f̂(y) = f(x) − (1/2)∇f(x)^T ∇^2 f(x)^{−1} ∇f(x),
f(x) − inf_y f̂(y) = (1/2)λ(x)^2.

10-16 II
Norm of the Newton step in the quadratic Hessian norm:
Δx_nt = −∇^2 f(x)^{−1} ∇f(x),
Δx_nt^T ∇^2 f(x) Δx_nt = ∇f(x)^T ∇^2 f(x)^{−1} ∇f(x) = λ(x)^2.
Directional derivative in the Newton direction:
lim_{t→0} (f(x + tΔx_nt) − f(x)) / t = ∇f(x)^T Δx_nt = −∇f(x)^T ∇^2 f(x)^{−1} ∇f(x) = −λ(x)^2.

10-16 III
Affine invariance: let
f̄(y) = f(Ty) = f(x),
where T is an invertible square matrix. Then λ̄(y) = λ(Ty).
Proof:
∇f̄(y) = T^T ∇f(Ty),  ∇^2 f̄(y) = T^T ∇^2 f(Ty) T,

10-16 IV
λ̄(y)^2 = ∇f̄(y)^T ∇^2 f̄(y)^{−1} ∇f̄(y)
= ∇f(Ty)^T T T^{−1} ∇^2 f(Ty)^{−1} T^{−T} T^T ∇f(Ty)
= ∇f(Ty)^T ∇^2 f(Ty)^{−1} ∇f(Ty) = λ(Ty)^2.

10-17 I
Affine invariance:
Δy_nt = −∇^2 f̄(y)^{−1} ∇f̄(y) = −T^{−1} ∇^2 f(Ty)^{−1} ∇f(Ty) = T^{−1} Δx_nt.
Note that y^k = T^{−1} x^k, so y^{k+1} = T^{−1} x^{k+1}.

10-17 II
But how about line search?
∇f̄(y)^T Δy_nt = ∇f(Ty)^T T T^{−1} Δx_nt = ∇f(x)^T Δx_nt.

10-19 I
η ∈ (0, m^2/L): if
||∇f(x_k)|| ≤ η ≤ m^2/L,
then
(L/(2m^2)) ||∇f(x_k)|| ≤ 1/2.

10-20 I
f(x_l) − f(x*) ≤ (1/(2m))||∇f(x_l)||^2 ≤ (1/(2m))(4m^4/L^2)(1/2)^{2^{l−k+1}} = (2m^3/L^2)(1/2)^{2^{l−k+1}} ≤ ε,
with ε_0 = 2m^3/L^2 (the first inequality is from p10-4).

10-20 II
In at most the following number of iterations we have f(x_l) − f(x*) ≤ ε:
log_2 ε_0 − 2^{l−k+1} ≤ log_2 ε,
2^{l−k+1} ≥ log_2(ε_0/ε),
l ≥ k − 1 + log_2 log_2(ε_0/ε),
k ≤ (f(x_0) − p*)/γ,
so
(f(x_0) − p*)/γ + log_2 log_2(ε_0/ε)
iterations suffice.

10-29: implementation I
λ(x) = (∇f(x)^T ∇^2 f(x)^{−1} ∇f(x))^{1/2} = (g^T L^{−T} L^{−1} g)^{1/2} = ||L^{−1} g||_2.
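A minimal sketch of this computation, assuming NumPy/SciPy and a made-up positive definite H standing in for the Hessian (not from the slides): the Newton decrement computed via the Cholesky factor matches the direct formula.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(6)
M = rng.standard_normal((5, 5))
H = M @ M.T + 5 * np.eye(5)            # stands in for the Hessian (positive definite)
g = rng.standard_normal(5)             # stands in for the gradient

L = cholesky(H, lower=True)            # H = L L^T
w = solve_triangular(L, g, lower=True) # w = L^{-1} g
lam_chol = np.linalg.norm(w)
lam_direct = np.sqrt(g @ np.linalg.solve(H, g))
print(np.isclose(lam_chol, lam_direct))   # True
```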

10-30: example of dense Newton systems with structure I
∇f(x) = [ψ_1'(x_1); ...; ψ_n'(x_n)] + A^T ∇ψ_0(Ax + b),
∇^2 f(x) = diag(ψ_1''(x_1), ..., ψ_n''(x_n)) + A^T ∇^2 ψ_0(Ax + b) A.
Method 2:
Δx = −D^{−1}(g + A^T L_0 w).

10-30: example of dense Newton systems with structure II
L_0^T A D^{−1}(−g − A^T L_0 w) = w, i.e.,
(I + L_0^T A D^{−1} A^T L_0) w = −L_0^T A D^{−1} g.
Cost:
L_0: p × p,
A^T L_0: n × p, cost O(np),
(L_0^T A) D^{−1} (A^T L_0): O(p^2 n).

10-30: example of dense Newton systems with structure III
Note that the Cholesky factorization of H_0 costs about (1/3)p^3, which is no more than p^2 n since p ≤ n.
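A small numeric sketch of "method 2" above, assuming NumPy and made-up dimensions and data (not from the slides): solving (D + A^T H_0 A) Δx = −g by block elimination with H_0 = L_0 L_0^T agrees with a dense reference solve.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 10
A = rng.standard_normal((p, n))
D = np.diag(rng.uniform(1.0, 2.0, size=n))     # diagonal part, positive
M0 = rng.standard_normal((p, p))
H0 = M0 @ M0.T + np.eye(p)
L0 = np.linalg.cholesky(H0)                    # H0 = L0 L0^T
g = rng.standard_normal(n)

Dinv = np.diag(1.0 / np.diag(D))
B = A.T @ L0                                   # n x p
w = np.linalg.solve(np.eye(p) + B.T @ Dinv @ B, -B.T @ Dinv @ g)
dx = -Dinv @ (g + B @ w)                       # block-elimination Newton step

dx_direct = np.linalg.solve(D + A.T @ H0 @ A, -g)   # dense reference
print(np.allclose(dx, dx_direct))              # True
```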