
1-5: Least-squares I A : k × n. Usually k > n, otherwise the minimum is easily zero. Analytical solution: f(x) = (Ax − b)^T (Ax − b) = x^T A^T A x − 2 b^T A x + b^T b, ∇f(x) = 2 A^T A x − 2 A^T b = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 1 / 228

1-5: Least-squares II Regularization, weights: (1/2) λ x^T x + w_1 (Ax − b)_1^2 + ... + w_k (Ax − b)_k^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 2 / 228
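
A small numpy sketch (mine, not from the slides; the data, weights, and λ are illustrative) that solves the regularized, weighted least-squares problem above through its normal equations and checks that the gradient vanishes at the solution:

import numpy as np

rng = np.random.default_rng(0)
k, n = 20, 5                        # k > n, as assumed on slide 1-5
A = rng.standard_normal((k, n))
b = rng.standard_normal(k)
w = rng.uniform(0.5, 2.0, size=k)   # per-residual weights w_1, ..., w_k
lam = 0.1                           # regularization weight lambda

# objective: (1/2)*lam*x'x + sum_i w_i*(Ax - b)_i^2
# setting the gradient lam*x + 2*A'W(Ax - b) to zero gives the normal equations
W = np.diag(w)
x = np.linalg.solve(0.5 * lam * np.eye(n) + A.T @ W @ A, A.T @ W @ b)

grad = lam * x + 2 * A.T @ (w * (A @ x - b))
print(np.allclose(grad, 0))         # True: x is the analytical solution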

2-4: Convex Combination and Convex Hull I Convex hull is convex Then x = θ 1 x 1 + + θ k x k x = θ 1 x 1 + + θ k x k αx + (1 α) x =αθ 1 x 1 + + αθ k x k + (1 α) θ 1 x 1 + + (1 α) θ k x k Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 3 / 228

2-4: Convex Combination and Convex Hull II Each coefficient is nonnegative and αθ 1 + + αθ k + (1 α) θ 1 + + (1 α) θ k = α + (1 α) = 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 4 / 228

2-7: Euclidean Balls and Ellipsoid I We prove that any x = x_c + Au with ||u||_2 ≤ 1 satisfies (x − x_c)^T P^{-1} (x − x_c) ≤ 1. Let A = P^{1/2}, which is possible because P is symmetric positive definite. Then u^T A^T P^{-1} A u = u^T P^{1/2} P^{-1} P^{1/2} u = u^T u ≤ 1. Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 5 / 228

2-10: Positive Semidefinite Cone I S^n_+ is a convex cone: let z be arbitrary. For any θ_1 ≥ 0, θ_2 ≥ 0, X_1, X_2 ∈ S^n_+, z^T (θ_1 X_1 + θ_2 X_2) z = θ_1 z^T X_1 z + θ_2 z^T X_2 z ≥ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 6 / 228

2-10: Positive Semidefinite Cone II Example: [x y; y z] ∈ S^2_+ is equivalent to x ≥ 0, z ≥ 0, xz − y^2 ≥ 0 If x > 0 (or z > 0) is fixed, we can see that the region z ≥ y^2/x has a parabolic shape Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 7 / 228

2-12: Intersection I When t is fixed, {(x_1, x_2) : −1 ≤ x_1 cos t + x_2 cos 2t ≤ 1} is a region between two parallel lines This region is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 8 / 228

2-13: Affine Function I f (S) is convex: Let f (x 1 ) f (S), f (x 2 ) f (S) αf (x 1 ) + (1 α)f (x 2 ) =α(ax 1 + b) + (1 α)(ax 2 + b) =A(αx 1 + (1 α)x 2 ) + b f (S) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 9 / 228

2-13: Affine Function II f 1 (C) convex: means that Because C is convex, x 1, x 2 f 1 (C) Ax 1 + b C, Ax 2 + b C α(ax 1 + b) + (1 α)(ax 2 + b) =A(αx 1 + (1 α)x 2 ) + b C Thus αx 1 + (1 α)x 2 f 1 (C) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 10 / 228

2-13: Affine Function III Scaling: Translation αs = {αx x S} S + a = {x + a x S} Projection T = {x 1 R m (x 1, x 2 ) S, x 2 R n }, S R m R n Scaling, translation, and projection are all affine functions Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 11 / 228

2-13: Affine Function IV For example, for projection f (x) = [ I 0 ] [ ] x 1 + 0 x 2 I : identity matrix Solution set of linear matrix inequality C = {S S 0} is convex f (x) = x 1 A 1 + + x m A m B = Ax + b f 1 (C) = {x f (x) 0} is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 12 / 228

2-13: Affine Function V But this isn't rigorous because of some problems in arguing f(x) = Ax + b A more formal explanation: C = {s ∈ R^{p^2} : mat(s) ∈ S^p and mat(s) ⪰ 0} is convex f(x) = x_1 vec(A_1) + ... + x_m vec(A_m) − vec(B) = [vec(A_1) ... vec(A_m)] x + (−vec(B)) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 13 / 228

2-13: Affine Function VI f 1 (C) = {x mat(f (x)) S p and mat(f (x)) 0} is convex Hyperbolic cone: C = {(z, t) z T z t 2, t 0} is convex (by drawing a figure in 2 or 3 dimensional space) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 14 / 228

2-13: Affine Function VII We have that is affine. Then is convex f (x) = [ ] P 1/2 x c T = x f 1 (C) = {x f (x) C} [ P 1/2 c T ] x = {x x T Px (c T x) 2, c T x 0} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 15 / 228

Perspective and linear-fractional function I Image convex: if S is convex, check if {P(x, t) (x, t) S} convex or not Note that S is in the domain of P Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 16 / 228

Perspective and linear-fractional function II Assume We hope (x 1, t 1 ), (x 2, t 2 ) S α x 1 t 1 + (1 α) x 2 t 2 = P(A, B), where (A, B) S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 17 / 228

Perspective and linear-fractional function III We have α x 1 t 1 + (1 α) x 2 t 2 = αt 2x 1 + (1 α)t 1 x 2 t 1 t 2 = αt 2x 1 + (1 α)t 1 x 2 αt 1 t 2 + (1 α)t 1 t 2 αt 2 αt = 2 +(1 α)t 1 x 1 + (1 α)t 1 αt 2 +(1 α)t 1 x 2 αt 2 αt 2 +(1 α)t 1 t 1 + (1 α)t 1 αt 2 +(1 α)t 1 t 2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 18 / 228

Perspective and linear-fractional function IV Let We have Further because θ = αt 2 αt 2 + (1 α)t 1 θx 1 + (1 θ)x 2 θt 1 + (1 θ)t 2 = A B (A, B) S (x 1, t 1 ), (x 2, t 2 ) S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 19 / 228

Perspective and linear-fractional function V and Inverse image is convex Given C a convex set is convex S is convex P 1 (C) = {(x, t) P(x, t) = x/t C} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 20 / 228

Perspective and linear-fractional function VI Let Do we have (x 1, t 1 ) : x 1 /t 1 C (x 2, t 2 ) : x 2 /t 2 C θ(x 1, t 1 ) + (1 θ)(x 2, t 2 ) P 1 (C)? That is, θx 1 + (1 θ)x 2 θt 1 + (1 θ)t 2 C? Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 21 / 228

Perspective and linear-fractional function VII Let Earlier we had θx 1 + (1 θ)x 2 θt 1 + (1 θ)t 2 = α x 1 t 1 + (1 α) x 2 t 2, θ = αt 2 αt 2 + (1 α)t 1 Then (α(t 2 t 1 ) + t 1 )θ = αt 2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 22 / 228

Perspective and linear-fractional function VIII t 1 θ = αt 2 αt 2 θ + αt 1 θ t 1 θ α = t 1 θ + (1 θ)t 2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 23 / 228

2-16: Generalized inequalities I K contains no line: x with x K and x K x = 0 Nonnegative orthant Clearly all properties are satisfied Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 24 / 228

2-16: Generalized inequalities II Positive semidefinite cone: PD matrices are interior Nonnegative polynomial on [0, 1] When n = 2 t = 1 x 1 tx 2, t [0, 1] Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 25 / 228

2-16: Generalized inequalities III t = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 26 / 228

2-16: Generalized inequalities IV t [0, 1] It really becomes a proper cone Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 27 / 228

2-17: I Properties: implies that x K y, u K v y x K v u K From the definition of a convex cone, (y x) + (v u) K Then x + u K y + v Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 28 / 228

2-18: Minimum and minimal elements I The minimum element S x 1 + K A minimal element (x 2 K) S = {x 2 } Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 29 / 228

2-19: Separating hyperplane theorem I We consider a simplified situation and omit part of the proof Assume inf_{u ∈ C, v ∈ D} ||u − v||_2 > 0 and the minimum is attained at c, d We will show that a ≜ d − c, b ≜ (||d||^2 − ||c||^2)/2 form a separating hyperplane a^T x = b Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 30 / 228

2-19: Separating hyperplane theorem II (Figure: points c and d with the hyperplane a^T x = b between them.) a = d − c, b = ((d − c)^T d + (d − c)^T c)/2 = (d^T d − c^T c)/2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 31 / 228

2-19: Separating hyperplane theorem III Assume the result is wrong so there is u D such that a T u b < 0 We will derive a point u in D but it is closer to c than d. That is, u c < d c Then we have a contradiction The concept Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 32 / 228

2-19: Separating hyperplane theorem IV (Figure: c, d, u and the hyperplane a^T x = b.) a^T u − b = (d − c)^T u − (d^T d − c^T c)/2 < 0 implies that (d − c)^T (u − d) + (1/2)||d − c||^2 < 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 33 / 228

2-19: Separating hyperplane theorem V (d/dt) ||d + t(u − d) − c||^2 |_{t=0} = 2(d + t(u − d) − c)^T (u − d) |_{t=0} = 2(d − c)^T (u − d) < 0 There exists a small t ∈ (0, 1) such that ||d + t(u − d) − c|| < ||d − c|| However, d + t(u − d) ∈ D, so there is a contradiction Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 34 / 228

2-19: Separating hyperplane theorem VI Strict separation C D They are disjoint convex sets. However, no a, b such that a T x < b, x C and a T x > b, x D Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 35 / 228

2-20: Supporting hyperplane theorem I Case 1: C has an interior region Consider 2 sets: interior of C versus {x 0 }, where x 0 is any boundary point If C is convex, then interior of C is also convex Then both sets are convex We can apply results in slide 2-19 so that there exists a such that a T x a T x 0, x interior of C Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 36 / 228

2-20: Supporting hyperplane theorem II Then for all boundary points x we also have a^T x ≤ a^T x_0, because any boundary point is the limit of interior points Case 2: C has no interior region In this situation, C is like a line in R^3 (so no interior). Then of course it has a supporting hyperplane We don't do a rigorous proof here Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 37 / 228

3-3: Examples on R I Example: x 3, x 0 Example: x 1, x > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 38 / 228

3-3: Examples on R II Example: x 1/2, x 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 39 / 228

3-3: Examples on R III Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 40 / 228

3-4: Examples on R^n and R^{m×n} I A, X ∈ R^{m×n}: tr(A^T X) = Σ_j (A^T X)_{jj} = Σ_j Σ_i (A^T)_{ji} X_{ij} = Σ_i Σ_j A_{ij} X_{ij} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 41 / 228
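
A quick numpy check (mine, not from the slides) of the identity tr(A^T X) = Σ_i Σ_j A_ij X_ij used above:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((3, 4))

# trace of A^T X equals the entrywise inner product of A and X
print(np.isclose(np.trace(A.T @ X), np.sum(A * X)))   # True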

3-7: First-order Condition I An open set: for any x, there is a ball covering x such that this ball is in the set Global underestimator: z f (x) y x = f (x) z = f (x) + f (x)(y x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 42 / 228

3-7: First-order Condition II : Because domain f is convex, for all 0 < t 1, x + t(y x) domain f when t 0, : f (x + t(y x)) (1 t)f (x) + tf (y) f (y) f (x) + f (x + t(y x)) f (x) t f (y) f (x) + f (x) T (y x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 43 / 228

3-7: First-order Condition III For any 0 θ 1, z = θx + (1 θ)y f (x) f (z) + f (z) T (x z) =f (z) + f (z) T (1 θ)(x y) f (y) f (z) + f (z) T (y z) =f (z) + f (z) T θ(y x) θf (x) + (1 θ)f (y) f (z) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 44 / 228

3-7: First-order Condition IV First-order condition for strictly convex function: f is strictly convex if and only if f (y) > f (x) + f (x) T (y x) : it s easy by directly modifying to > f (x) >f (z) + f (z) T (x z) =f (z) + f (z) T (1 θ)(x y) f (y) > f (z)+ f (z) T (y z) = f (z)+ f (z) T θ(y x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 45 / 228

3-7: First-order Condition V : Assume the result is wrong. From the 1st-order condition of a convex function, x, y such that x y and f (x) T (y x) = f (y) f (x) (1) For this (x, y), from the strict convexity f (x + t(y x)) f (x) <tf (y) tf (x) = f (x) T t(y x), t (0, 1) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 46 / 228

3-7: First-order Condition VI Therefore, f (x +t(y x)) < f (x)+ f (x) T t(y x), t (0, 1) However, this contradicts the first-order condition: f (x +t(y x)) f (x)+ f (x) T t(y x), t (0, 1) This proof was given by a student of this course before Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 47 / 228

3-8: Second-order condition I Proof of the 2nd-order condition: We consider only the simpler condition of n = 1 f (x + t) f (x) + f (x)t lim 2f (x + t) f (x) f (x)t t 0 t 2 2(f (x + t) f (x)) = lim = f (x) 0 t 0 2t Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 48 / 228

3-8: Second-order condition II f (x + t) = f (x) + f (x)t + 1 2 f ( x)t 2 f (x) + f (x)t by 1st-order condition The extension to general n is straightforward If 2 f (x) 0, then f is strictly convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 49 / 228

3-8: Second-order condition III Using 1st-order condition for strictly convex function: f (y) =f (x) + f (x) T (y x) + 1 2 (y x)t 2 f ( x)(y x) >f (x) + f (x) T (y x) It s possible that f is strictly convex but 2 f (x) 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 50 / 228

3-8: Second-order condition IV Example: Details omitted f (x) = x 4 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 51 / 228

3-9: Examples I Quadratic-over-linear: f(x, y) = x^2/y, ∂f/∂x = 2x/y, ∂f/∂y = −x^2/y^2, ∂^2f/∂x^2 = 2/y, ∂^2f/∂x∂y = −2x/y^2, ∂^2f/∂y^2 = 2x^2/y^3, ∇^2 f(x, y) = (2/y^3) [y; −x][y; −x]^T = (2/y^3) [[y^2, −xy], [−xy, x^2]] ⪰ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 52 / 228

3-10 I f(x) = log Σ_k exp x_k, ∇f(x) = (1/Σ_k e^{x_k}) [e^{x_1}, ..., e^{x_n}]^T, ∇^2_{ii} f = ((Σ_k e^{x_k}) e^{x_i} − e^{x_i} e^{x_i}) / (Σ_k e^{x_k})^2, ∇^2_{ij} f = −e^{x_i} e^{x_j} / (Σ_k e^{x_k})^2, i ≠ j Note that if z_k = exp x_k Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 53 / 228

3-10 II then (zz^T)_{ij} = z_i (z^T)_j = z_i z_j Cauchy-Schwarz inequality: (a_1 b_1 + ... + a_n b_n)^2 ≤ (a_1^2 + ... + a_n^2)(b_1^2 + ... + b_n^2) Note that a_k = v_k √z_k, b_k = √z_k, z_k > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 54 / 228
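
A numpy sketch (mine, not from the slides) that forms the gradient and Hessian of f(x) = log Σ_k exp x_k as written above, compares the Hessian with finite differences, and confirms it is positive semidefinite at a random point:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

z = np.exp(x)                                   # z_k = exp(x_k)
s = z.sum()
grad = z / s                                    # gradient from the slide
hess = np.diag(z) / s - np.outer(z, z) / s**2   # (diag(z)*s - z z^T) / s^2

# compare with a forward-difference approximation of the gradient
eps = 1e-6
num_hess = np.column_stack([
    (np.exp(x + eps * e) / np.exp(x + eps * e).sum() - grad) / eps
    for e in np.eye(5)])
print(np.allclose(hess, num_hess, atol=1e-5))       # True
print(np.all(np.linalg.eigvalsh(hess) >= -1e-12))   # PSD up to rounding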

3-12: Jensen s inequality I General form f ( p(z)zdz) p(z)f (z)dz Discrete situation f ( p i z i ) p i f (z i ), p i = 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 55 / 228

3-12: Jensen s inequality II Proof: f (p 1 z 1 + p 2 z 2 + p 3 z 3 ) ( ( )) p1 z 1 + p 2 z 2 (1 p 3 ) f + p 3 f (z 3 ) 1 p ( 3 p1 (1 p 3 ) f (z 1 ) + p ) 2 f (z 2 ) + p 3 f (z 3 ) 1 p 3 1 p 3 =p 1 f (z 1 ) + p 2 f (z 2 ) + p 3 f (z 3 ) Note that p 1 1 p 3 + p 2 1 p 3 = 1 p 3 1 p 3 = 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 56 / 228

3-14: Positive weighted sum & composition with affine function I Composition with affine function: We know f (x) is convex Is g(x) = f (Ax + b) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 57 / 228

3-14: Positive weighted sum & composition with affine function II convex? g((1 α)x 1 + αx 2 ) =f (A((1 α)x 1 + αx 2 ) + b) =f ((1 α)(ax 1 + b) + α(ax 2 + b)) (1 α)f (Ax 1 + b) + αf (Ax 2 + b) =(1 α)g(x 1 ) + αg(x 2 ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 58 / 228

3-15: Pointwise maximum I Proof of the convexity f ((1 α)x 1 + αx 2 ) = max(f 1 ((1 α)x 1 + αx 2 ),..., f m ((1 α)x 1 + αx 2 )) max((1 α)f 1 (x 1 ) + αf 1 (x 2 ),..., (1 α)f m (x 1 ) + αf m (x 2 )) (1 α) max(f 1 (x 1 ),..., f m (x 1 ))+ α max(f 1 (x 2 ),..., f m (x 2 )) (1 α)f (x 1 ) + αf (x 2 ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 59 / 228

3-15: Pointwise maximum II For f(x) = x_[1] + ... + x_[r] (sum of the r largest components), consider the maximum over all (n choose r) combinations Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 60 / 228

3-16: Pointwise supremum I The proof is similar to pointwise maximum Support function of a set C: when y is fixed, f(x, y) = y^T x is linear (convex) in x Maximum eigenvalue of a symmetric matrix: f(X, y) = y^T X y is a linear function of X when y is fixed Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 61 / 228

3-19: Minimization I Proof: Let ɛ > 0. y 1, y 2 C such that f (x 1, y 1 ) g(x 1 ) + ɛ f (x 2, y 2 ) g(x 2 ) + ɛ g(θx 1 + (1 θ)x 2 ) = inf y C f (θx 1 + (1 θ)x 2, y) f (θx 1 + (1 θ)x 2, θy 1 + (1 θ)y 2 ) θf (x 1, y 1 ) + (1 θ)f (x 2, y 2 ) θg(x 1 ) + (1 θ)g(x 2 ) + ɛ Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 62 / 228

3-19: Minimization II Note that the first inequality uses the property that C is convex to have θy_1 + (1 − θ)y_2 ∈ C Because the above inequality holds for all ɛ > 0, g(θx_1 + (1 − θ)x_2) ≤ θg(x_1) + (1 − θ)g(x_2) First example: Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 63 / 228

3-19: Minimization III The goal is to prove A BC 1 B T 0 Instead of a direct proof, here we use the property in this slide. First we have that f (x, y) is convex in (x, y) because [ ] A B B T 0 C Consider min y f (x, y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 64 / 228

3-19: Minimization IV Because C 0, the minimum occurs at 2Cy + 2B T x = 0 Then y = C 1 B T x g(x) = x T Ax 2x T BC 1 Bx + x T BC 1 CC 1 B T x = x T (A BC 1 B T )x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 65 / 228

3-19: Minimization V is convex. The second-order condition implies that A BC 1 B T 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 66 / 228

3-21: the conjugate function I This function is useful later When y is fixed, maximum happens at y = f (x) (2) by taking the derivative on x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 67 / 228

3-21: the conjugate function II Explanation of the figure: when y is fixed z = xy is a straight line passing through the origin, where y is the slope of the line. Check under which x, yx and f (x) have the largest distance From the figure, the largest distance happens when (2) holds Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 68 / 228

z = x 0 f (x 0 ) + f (x 0 ) = x 0 y + f (x 0 ) = f (y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 69 / 228 3-21: the conjugate function III About the point The tangent line is (0, f (y)) z f (x 0 ) x x 0 = f (x 0 ) where x 0 is the point satisfying When x = 0, y = f (x 0 )

3-21: the conjugate function IV f is convex: Given x, y T x f (x) is linear (convex) in y. Then we apply the property of pointwise supremum Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 70 / 228

3-22: examples I negative logarithm f(x) = −log x, (d/dx)(xy + log x) = y + 1/x = 0 If y < 0, the pictures of xy and log x are as in the figure Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 71 / 228

3-22: examples II Thus xy + log x has a maximum. Then max_x (xy + log x) = −1 − log(−y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 72 / 228

3-22: examples III strictly convex quadratic: Qx = y, x = Q^{-1} y, y^T x − (1/2) x^T Q x = y^T Q^{-1} y − (1/2) y^T Q^{-1} Q Q^{-1} y = (1/2) y^T Q^{-1} y Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 73 / 228

3-23: quasiconvex functions I Figure on slide: S α = [a, b], S β = (, c] Both are convex The figure is an example showing that quasi convex may not be convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 74 / 228

3-26: properties of quasiconvex functions I Modified Jensen inequality: f quasiconvex if and only if f (θx +(1 θ)y) max{f (x), f (y)}, x, y, θ [0, 1]. Let S is convex = max{f (x), f (y)} x S, y S θx + (1 θ)y S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 75 / 228

3-26: properties of quasiconvex functions II f (θx + (1 θ)y) and the result is obtained If results are wrong, there exists α such that S α is not convex. x, y, θ with x, y S α, θ [0, 1] such that θx + (1 θ)y / S α Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 76 / 228

3-26: properties of quasiconvex functions III Then f (θx + (1 θ)y) > α max{f (x), f (y)} This violates the assumption First-order condition (this is exercise 3.43): Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 77 / 228

3-26: properties of quasiconvex functions IV f ((1 t)x + ty) max(f (x), f (y)) = f (x) f (x + t(y x)) f (x) 0 t f (x + t(y x)) f (x) lim = f (x) T (y x) 0 t 0 t Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 78 / 228

3-26: properties of quasiconvex functions V : If results are wrong, there exists α such that S α is not convex. x, y, θ with x, y S α, θ [0, 1] such that Then θx + (1 θ)y / S α f (θx + (1 θ)y) > α max{f (x), f (y)} (3) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 79 / 228

3-26: properties of quasiconvex functions VI Because f is differentiable, it is continuous. Without loss of generality, we have f (z) f (x), f (y), z between x and y (4) Let s give a 1-D interpretation. From (3), we can find a ball surrounding θx + (1 θ)y Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 80 / 228

3-26: properties of quasiconvex functions VII and two points x, y such that f (z) f (x ) = f (y ), z between x and y With (4), z = x + θ(y x), θ (0, 1) f (z) T ( θ(y x)) 0 f (z) T (y x θ(y x)) 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 81 / 228

3-26: properties of quasiconvex functions VIII Then f (z) T (y x) = 0, θ (0, 1) This contradicts (3). f (x + θ(y x)) =f (x) + f (t) T θ(y x) =f (x), θ [0, 1) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 82 / 228

3-27: Log-concave and log-convex functions I Powers: log(x a ) = a log x log x is concave Probability densities: log f (x) = 1 2 (x x)t Σ 1 (x x) + constant Σ 1 is positive definite. Thus log f (x) is concave Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 83 / 228

3-27: Log-concave and log-convex functions II Cumulative Gaussian distribution: log Φ(x) = log ∫_{−∞}^x e^{−u^2/2} du, (d/dx) log Φ(x) = e^{−x^2/2} / ∫_{−∞}^x e^{−u^2/2} du, (d^2/dx^2) log Φ(x) = ((∫_{−∞}^x e^{−u^2/2} du) e^{−x^2/2} (−x) − e^{−x^2/2} e^{−x^2/2}) / (∫_{−∞}^x e^{−u^2/2} du)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 84 / 228

3-27: Log-concave and log-convex functions III Need to prove that (∫_{−∞}^x e^{−u^2/2} du) x + e^{−x^2/2} > 0 Because x ≥ u for all u ∈ (−∞, x], Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 85 / 228

3-27: Log-concave and log-convex functions IV we have (∫_{−∞}^x e^{−u^2/2} du) x + e^{−x^2/2} = ∫_{−∞}^x x e^{−u^2/2} du + e^{−x^2/2} ≥ ∫_{−∞}^x u e^{−u^2/2} du + e^{−x^2/2} = −e^{−u^2/2} |_{−∞}^x + e^{−x^2/2} = −e^{−x^2/2} + e^{−x^2/2} = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 86 / 228

3-27: Log-concave and log-convex functions V This proof was given by a student (and polished by another student) of this course before Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 87 / 228
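
A standard-library check (mine, not from the slides) of the inequality x ∫_{−∞}^x e^{−u^2/2} du + e^{−x^2/2} > 0 used above, writing the integral in terms of math.erf and testing a grid of points in [−6, 6]:

import math

def F(x):
    # integral of exp(-u^2/2) from -infinity to x, via the error function
    return math.sqrt(math.pi / 2) * (1.0 + math.erf(x / math.sqrt(2.0)))

# the inequality that makes log Phi concave, checked on a grid
xs = [i / 10.0 for i in range(-60, 61)]
print(all(x * F(x) + math.exp(-x * x / 2) > 0 for x in xs))   # True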

4-3: Optimal and locally optimal points I f 0 (x) = 1/x f 0 (x) = x log x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 88 / 228

4-3: Optimal and locally optimal points II f_0(x) = x^3 − 3x For f_0(x) = x log x: f_0'(x) = 1 + log x = 0, so x = e^{−1} = 1/e Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 89 / 228

4-3: Optimal and locally optimal points III f_0'(x) = 3x^2 − 3 = 0, so x = ±1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 90 / 228

4-7: example I x_1/(1 + x_2^2) ≤ 0 ⟺ x_1 ≤ 0, (x_1 + x_2)^2 = 0 ⟺ x_1 + x_2 = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 91 / 228

f 0 (y) f 0 (x), for all feasible y Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 92 / 228 4-9: Optimality criterion for differentiable f 0 I : easy From first-order condition Together with we have f 0 (y) f 0 (x) + f 0 (x) T (y x) f 0 (x) T (y x) 0

4-9: Optimality criterion for differentiable f 0 II Assume the result is wrong. Then f 0 (x) T (y x) < 0 Let z(t) = ty + (1 t)x d dt f 0(z(t)) = f 0 (z(t)) T (y x) d dt f 0(z(t)) = f 0 (x) T (y x) < 0 t=0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 93 / 228

A(tx + (1 − t)y) = tAx + (1 − t)Ay = tb + (1 − t)b = b Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 94 / 228 4-9: Optimality criterion for differentiable f_0 III There exists t such that f_0(z(t)) < f_0(x) Note that z(t) is feasible because f_i(z(t)) ≤ t f_i(x) + (1 − t) f_i(y) ≤ 0 and

4-9: Optimality criterion for differentiable f 0 IV Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 95 / 228

4-10 I Unconstrained problem: Let y = x t f 0 (x) It is feasible (unconstrained problem). Optimality condition implies Thus f 0 (x) T (y x) = t f 0 (x) 2 0 f 0 (x) = 0 Equality constrained problem Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 96 / 228

4-10 II Easy. For any feasible y, Ay = b f 0 (x) T (y x) = ν T A(y x) = ν T (b b) = 0 0 So x is optimal : more complicated. We only do a rough explanation Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 97 / 228

4-10 III From optimality condition f 0 (x) T ν = f 0 (x) T ((x + ν) x) 0, ν N(A) N(A) is a subspace in 2-D. Thus ν N(A) ν N(A) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 98 / 228

4-10 IV f 0 N(A) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 99 / 228

4-10 V We have f 0 (x) T ν = 0, ν N(A) f 0 (x) N(A), f 0 (x) R(A T ) ν such that f 0 (x) + A T ν = 0 Minimization over nonnegative orthant Easy Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 100 / 228

4-10 VI For any y 0, i f 0 (x)(y i x i ) = { i f 0 (x)y i 0 if x i = 0 0 if x i > 0. Therefore, and f 0 (x) T (y x) 0 x is optimal Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 101 / 228

4-10 VII If x i = 0, we claim i f 0 (x) 0 Otherwise, i f 0 (x) < 0 Let y = x except y i f 0 (x) T (y x) = i f 0 (x)(y i x i ) This violates the optimality condition Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 102 / 228

4-10 VIII If x i > 0, we claim i f 0 (x) = 0 Otherwise, assume i f 0 (x) > 0 Consider y = x except y i = x i /2 > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 103 / 228

4-10 IX It is feasible. Then f 0 (x) T (y x) = i f 0 (x)(y i x i ) = i f 0 (x)x i /2 < 0 violates the optimality condition. The situation for i f 0 (x) < 0 is similar Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 104 / 228

4-23: examples I least-squares: min x^T (A^T A) x − 2 b^T A x + b^T b A^T A may not be invertible ⟹ pseudo inverse linear program with random cost: c ≜ E(C), Σ ≜ E_C((C − c)(C − c)^T) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 105 / 228

4-23: examples II Var(C^T x) = E_C((C^T x − c^T x)(C^T x − c^T x)) = E_C(x^T (C − c)(C − c)^T x) = x^T Σ x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 106 / 228
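
A Monte Carlo sketch (mine; the Gaussian distribution of C and the random data are illustrative choices) confirming Var(C^T x) = x^T Σ x for a random cost vector:

import numpy as np

rng = np.random.default_rng(0)
n = 4
c = rng.standard_normal(n)                  # mean of C
M = rng.standard_normal((n, n))
Sigma = M @ M.T                             # covariance of C
x = rng.standard_normal(n)

# draw many costs C ~ N(c, Sigma) and compare empirical and exact variances
C = rng.multivariate_normal(c, Sigma, size=200_000)
print(np.var(C @ x), x @ Sigma @ x)         # the two numbers should be close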

4-25: second-order cone programming I Cone was defined on slide 2-8 {(x, t) x t} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 107 / 228

4-35: generalized inequality constraint I f i R n R k i K i -convex: f i (θx + (1 θ)y) Ki θf i (x) + (1 θ)f i (y) See page 3-31 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 108 / 228

4-37: LP and SOCP as SDP I LP and equivalent SDP a 11 a 1n x 1 Ax =.. a m1 a mn x n 11 1n a... + + x n a... x 1 1 b... a m1 0 a mn b m Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 109 / 228

4-37: LP and SOCP as SDP II For SOCP and SDP we will use results in 4-39: [ tip p A p q ] A T 0 A ti T A t 2 I q q, t 0 q q Now Thus p = m, q = 1 A = A i x + b i, t = c T i x + d i A i x + b i 2 (c T i x + d i ) 2, c T i x + d i 0 from t 0 A i x + b i c T i x + d i Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 110 / 228

4-39: matrix norm minimization I Following 4-38, we have the following equivalent problem min subject to t A 2 t We then use A 2 t A T A t 2 I, t 0 [ ] ti A 0 ti A T Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 111 / 228

4-39: matrix norm minimization II to have the SDP min subject to t[ ] ti A(x) A(x) T 0 ti Next we prove [ ] tip p A p q A T 0 A ti T A t 2 I q q, t 0 q q Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 112 / 228

4-39: matrix norm minimization III we immediately have t 0 If t > 0, [ v T A T tv ] [ ] [ ] T ti p p A p q Av A T ti q q tv = [ v T A T tv ] [ ] T tav + tav A T Av + t 2 v =t(t 2 v T v v T A T Av) 0 v T (t 2 I A T A)v 0, v Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 113 / 228

4-39: matrix norm minimization IV and hence If t = 0 t 2 I A T A 0 [ v T A T v ] [ ] [ ] T 0 A Av A T 0 v = [ v T A T v ] [ ] T Av A T Av = 2v T A T Av 0, v Therefore A T A 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 114 / 228

4-39: matrix norm minimization V Consider [ u T v ] [ ] [ ] T ti A u A T ti v = [ u T v ] [ ] T tu + Av A T u + tv =tu T u + 2v T A T u + tv T v We hope to have tu T u + 2v T A T u + tv T v 0, (u, v) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 115 / 228

4-39: matrix norm minimization VI If t > 0 has optimum at We have min tu T u + 2v T A T u + tv T v u u = Av t tu T u + 2v T A T u + tv T v =tv T v v T A T Av t = 1 t v T (t 2 I A T A)v 0. Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 116 / 228

4-39: matrix norm minimization VII Hence [ ] ti A A T 0 ti If t = 0 A T A 0 Thus v T A T Av 0, v T A T Av = Av 2 = 0 Av = 0, v [ u T v ] [ [ ] T 0 A u A 0] T v = [ u T v ] [ ] T 0 A T = 0 0 u Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 117 / 228

4-39: matrix norm minimization VIII Thus [ ] 0 A A T 0 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 118 / 228

4-40: Vector optimization I Though f 0 (x) is a vector note that f i (x) is still R n R 1 K-convex See 3-31 though we didn t discuss it earlier f 0 (θx + (1 θ)y) K θf 0 (x) + (1 θ)f 0 (y) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 119 / 228

4-41: optimal and pareto optimal points I See definition in slide 2-38 Optimal O {x} + K Pareto optimal (x K) O = {x} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 120 / 228

5-3: Lagrange dual function I Note that g is concave no matter if the original problem is convex or not f 0 (x) + λ i f i (x) + ν i h i (x) is convex (linear) in λ, ν for each x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 121 / 228

5-3: Lagrange dual function II Use pointwise supremum on 3-16: sup_{x ∈ D} (−f_0(x) − Σ λ_i f_i(x) − Σ ν_i h_i(x)) is convex. Hence inf_{x ∈ D} (f_0(x) + Σ λ_i f_i(x) + Σ ν_i h_i(x)) is concave. Note that sup(−·) convex ⟹ inf(·) concave Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 122 / 228

5-8: Lagrange dual and conjugate function I f 0 ( A T λ c T ν) = sup(( A T λ c T ν) T x f 0 (x)) x = inf (f 0(x) + (A T λ + c T ν) T x) x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 123 / 228

5-9: The dual problem I From 5-5, the dual problem is It can be simplified to max g(λ, ν) subject to λ 0 max b T ν subject to A T ν λ + c = 0 λ 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 124 / 228

5-9: The dual problem II Further, max b T ν subject to A T ν + c 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 125 / 228

5-10: weak and strong duality I We don t discuss the SDP problem on this slide because we omitted 5-7 on the two-way partitioning problem Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 126 / 228

5-11: Slater's constraint qualification I We omit the proof because of limited time "linear inequalities do not need to hold with strict inequality": for linear inequalities we DO NOT need a constraint qualification We will see some explanation later Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 127 / 228

5-12: inequality from LP I If we have only linear constraints, then constraint qualification holds Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 128 / 228

5-15: geometric interpretation I Explanation of g(λ): when λ is fixed, λu + t = c is a line. We lower c until the line touches the boundary of G; the value of c then becomes g(λ) When u = 0, t = g(λ), so we see the point marked as g(λ) on the t-axis Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 129 / 228

5-15: geometric interpretation II We have λ 0, so λu + t = must be like rather than Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 130 / 228

5-15: geometric interpretation III Explanation of p*: In G, only points satisfying u ≤ 0 are feasible Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 131 / 228

5-16 I We do not discuss a formal proof of Slater condition strong duality Instead, we explain this result by figures Reason of using A: G may not be convex Example: min x 2 subject to x + 2 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 132 / 228

5-16 II This is a convex optimization problem G = {(x + 2, x^2) : x ∈ R} is only a quadratic curve Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 133 / 228

5-16 III The curve is not convex However, A is convex Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 134 / 228

5-16 IV A Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 135 / 228

5-16 V Primal problem: x* = −2, optimal objective value = 4 Dual problem: g(λ) = min_x x^2 + λ(x + 2), x = −λ/2, max_{λ ≥ 0} −λ^2/4 + 2λ, optimal λ = 4 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 136 / 228

5-16 VI optimal objective value = −16/4 + 8 = 4 Proving that A is convex: (u_1, t_1) ∈ A, (u_2, t_2) ∈ A ⟹ ∃ x_1, x_2 such that f_1(x_1) ≤ u_1, f_0(x_1) ≤ t_1, f_1(x_2) ≤ u_2, f_0(x_2) ≤ t_2 Consider x = θx_1 + (1 − θ)x_2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 137 / 228

5-16 VII We have f 1 (x) θu 1 + (1 θ)u 2 So f 0 (x) θt 1 + (1 θ)t 2 [ ] [ ] [ ] u u1 u2 = θ + (1 θ) A t t 1 t 2 Why non-vertical supporting hyperplane? Then g(λ) is well defined. Note that we have Slater condition strong duality Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 138 / 228

5-16 VIII However, it's possible that the Slater condition doesn't hold but strong duality holds Example from exercise 5.22: min x subject to x^2 ≤ 0 The Slater condition doesn't hold because no x satisfies x^2 < 0 G = {(x^2, x) : x ∈ R} Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 139 / 228

5-16 IX t u There is only one feasible point (0, 0) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 140 / 228

5-16 X t A u Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 141 / 228

5-16 XI Dual problem: g(λ) = min_x x + λx^2, x = −1/(2λ) if λ > 0, g(λ) = −∞ if λ = 0 max_{λ ≥ 0} −1/(4λ): objective value → 0 d* = 0, p* = 0, so strong duality holds Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 142 / 228
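
A small numpy sketch (mine) that evaluates the primal and dual objectives of the two examples above on grids, illustrating p* = d* = 4 for the first problem and p* = d* = 0 for the second:

import numpy as np

# example 1: min x^2 s.t. x + 2 <= 0; dual g(lam) = -lam^2/4 + 2*lam
x = np.linspace(-10, -2, 1001)          # feasible points
lam = np.linspace(0, 10, 1001)
print(np.min(x**2), np.max(-lam**2 / 4 + 2 * lam))     # 4.0 4.0

# example 2: min x s.t. x^2 <= 0; only feasible point is 0, so p* = 0;
# dual g(lam) = -1/(4*lam) for lam > 0 approaches 0 from below
lam = np.linspace(1, 1e4, 1001)
print(0.0, np.max(-1 / (4 * lam)))      # 0.0 and a value just below 0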

5-17: complementary slackness I In deriving the inequality we use h i (x ) = 0 and f i (x ) 0 Complementary slackness compare the earlier results in 4-10 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 143 / 228

5-17: complementary slackness II 4-10: x is optimal of min subject to f 0 (x) x i 0, i if and only if x i 0, { i f 0 (x) 0 x i = 0 i f 0 (x) = 0 x i > 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 144 / 228

5-17: complementary slackness III From KKT condition i f 0 (x) = λ i λ i x i = 0 λ i 0, x i 0 If then x i > 0, λ i = 0 = i f 0 (x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 145 / 228

5-19: KKT conditions for convex problem I For the problem on p5-16, neither the Slater condition nor the KKT condition holds (the stationarity condition 1 + 2λx = 0 cannot be satisfied at x* = 0) Therefore, for convex problems, KKT ⟹ optimality but not vice versa. Next we explain why for linear constraints we don't need a constraint qualification Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 146 / 228

5-19: KKT conditions for convex problem II Consider the situation of inequality constraints only: min subject to f 0 (x) f i (x) 0, i = 1,..., m Consider an optimial solution x. We would like to see if x satisfies KKT condition We claim that f 0 (x) = λ i f i (x) (5) i:f i (x)=0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 147 / 228

5-19: KKT conditions for convex problem III Assume the result is wrong. Then, ∇f_0(x) = (a linear combination of {∇f_i(x) : f_i(x) = 0}) + ∆, where ∆ ≠ 0 and ∆^T ∇f_i(x) = 0, ∀i : f_i(x) = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 148 / 228

5-19: KKT conditions for convex problem IV Then there exists α < 0 such that δx α satisfies and f i (x) T δx = 0 if f i (x) = 0 f i (x + δx) 0 if f i (x) < 0 We claim that δx is feasible. That is, f i (x + δx) 0, i Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 149 / 228

5-19: KKT conditions for convex problem V We have f_i(x + δx) ≈ f_i(x) + ∇f_i(x)^T δx = 0 if f_i(x) = 0 However, ∇f_0(x)^T δx = α ∆^T ∆ < 0 This contradicts the optimality condition from slide 4-9: for any feasible direction δx, ∇f_0(x)^T δx ≥ 0. Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 150 / 228

5-19: KKT conditions for convex problem VI We do not continue to prove ∇f_0(x) = Σ_{i: f_i(x)=0} λ_i ∇f_i(x), λ_i ≥ 0 (6) because the proof is not trivial However, what we want to say is that in proving (5), the argument is not rigorous because of the ≈ in the first-order expansion For linear f_i the proof becomes rigorous This roughly gives you a feeling that linear is different from non-linear Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 151 / 228

5-25 I Explanation of f 0 (ν) inf y (f 0(y) ν T y) = sup(ν T y f 0 (y)) = f0 (ν) y where f0 (ν) is the conjugate function Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 152 / 228

5-26 I The original problem Dual norm: If ν > 1, g(λ, ν) = inf x Ax b = constant ν sup{ν T y y 1} ν T y > 1, y 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 153 / 228

5-26 II inf y + ν T y y ν T y < 0 Hence ty ν T (ty ) as t inf y If ν 1, we claim that y + νt y = inf y y + νt y = 0 y = 0 y + ν T y = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 154 / 228

5-26 III If y such that y + ν T y < 0 then y < ν T y We can scale y so that sup{ν T y y 1} > 1 but this causes a contradiction Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 155 / 228

5-27: implicit constraint I The dual function: c^T x + ν^T (Ax − b) = −b^T ν + x^T (A^T ν + c), inf_{−1 ≤ x_i ≤ 1} x_i (A^T ν + c)_i = −|(A^T ν + c)_i| Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 156 / 228
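
A numpy check (mine; A, c, ν are random illustrative data) that the infimum above over the box −1 ≤ x_i ≤ 1 equals −||A^T ν + c||_1, by brute force over the corners of the box:

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)
nu = rng.standard_normal(m)

v = A.T @ nu + c
# the minimum of a linear function over a box is attained at a corner
corners = np.array(list(product([-1.0, 1.0], repeat=n)))
brute = np.min(corners @ v)
print(np.isclose(brute, -np.abs(v).sum()))    # True: equals -||A^T nu + c||_1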

5-30: semidefinite program I From 5-29 we need that Z is non-negative in the dual cone of S^k_+ The dual cone of S^k_+ is S^k_+ (we didn't discuss dual cones, so we assume this result) Why tr(Z(·))? We are supposed to do a component-wise product between Z and x_1 F_1 + ... + x_n F_n − G Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 157 / 228

5-30: semidefinite program II Trace is the component-wise product: tr(AB) = Σ_i (AB)_{ii} = Σ_i Σ_j A_{ij} B_{ji} = Σ_i Σ_j A_{ij} B_{ij} Note that we use the property that B is symmetric Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 158 / 228

7-4 I Uniform noise p(z) = { 1 2a if z a 0 otherwise Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 159 / 228

8-10: Dual of maximum margin problem I Lagrangian: ||a||/2 − Σ_i λ_i (a^T x_i + b − 1) + Σ_i µ_i (a^T y_i + b + 1) = ||a||/2 + a^T (−Σ_i λ_i x_i + Σ_i µ_i y_i) + b(−Σ_i λ_i + Σ_i µ_i) + Σ_i λ_i + Σ_i µ_i Because of the term b(−Σ_i λ_i + Σ_i µ_i) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 160 / 228

8-10: Dual of maximum margin problem II we have inf L = λ i + µ i + a,b i i { a inf a 2 + at ( i λ ix i + i µ iy i ) if i λ i = i µ i if i λ i i µ i For inf a a 2 + at ( i λ i x i + i µ i y i ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 161 / 228

8-10: Dual of maximum margin problem III we can denote it as inf a a 2 + v T a where v is a vector. We cannot do derivative because a is not differentiable. Formal solution: Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 162 / 228

8-10: Dual of maximum margin problem IV Case 1: If v 1/2: so However, a T v a v a 2 inf a a 2 + v T a 0. a = 0 a 2 + v T a = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 163 / 228

8-10: Dual of maximum margin problem V Therefore If v > 1/2, let inf a a 2 + v T a = t 2 t v a 2 + v T a = 0. a = tv v =t( 1 2 v ) if t Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 164 / 228

8-10: Dual of maximum margin problem VI Thus Finally, inf a a 2 + v T a = inf L = λ i + µ i + a,b i i 0 if i λ i = i µ i and i λ ix i i µ iy i 1/2 otherwise Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 165 / 228

8-14. θ = vec(p) z i z j q, F (z) =. r z i. 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 166 / 228

10-3: initial point and sublevel set I The condition that S is closed holds if the domain of f is R^n Proof: By definition, S is closed if for every convergent sequence {x_i} with x_i ∈ S and lim_{i→∞} x_i = x*, we have x* ∈ S. Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 167 / 228

10-3: initial point and sublevel set II Because we have domain of f = R n x domain of f Thus by the continuity of f, lim f (x i) = f (x ) f (x 0 ) i and x S Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 168 / 228

10-3: initial point and sublevel set III The condition that S is closed if f (x) as x boundary of domain f Proof: if not, from the definition of the closeness of S, there exists {x i } S such that Thus x i x / domain f x is on the boundary Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 169 / 228

10-3: initial point and sublevel set IV Then violates f (x i ) > f (x 0 ) f (x i ) f (x 0 ), i Thus the assumption is wrong and S is closed Example f (x) = log( m exp(ai T x + b i )) i=1 domain = R n Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 170 / 228

10-3: initial point and sublevel set V Example f (x) = i log(b i a T i x) We use the condition that domain R n f (x) as x boundary of domain f Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 171 / 228

f (y) f (x 0 ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 172 / 228 10-4: strong convexity and implications I S is bounded. Otherwise, there exists a set {y i y i = x + i } S satisfying Then lim i = i f (y i ) f (x) + f (x) T i + m 2 i 2 This contradicts

10-4: strong convexity and implications II Proof of and From p > f (x) p 1 f (x) 2 2m f (y) f (x) + f (x) T (y x) + m x y 2 2 Minimize the right-hand side with respect to y f (x) + m(y x) = 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 173 / 228

10-4: strong convexity and implications III ỹ = x f (x) m f (y) f (x) + f (x) T (ỹ x) + m ỹ x 2 2 = f (x) 1 2m f (x) 2, y Then and p f (x) 1 2m f (x) 2 > f (x) p 1 f (x) 2 2m Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 174 / 228

10-5: descent methods I If then f (x + t x) < f (x) f (x) T x < 0 Proof: From the first-order condition of a convex function Then f (x + t x) f (x) + t f (x) T x t f (x) T x f (x + t x) f (x) < 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 175 / 228

10-6: line search types I Why α ∈ (0, 1/2)? The use of 1/2 is for convergence, though we won't discuss details Finite termination of backtracking line search: we argue that ∃ t̄ > 0 such that f(x + tΔx) < f(x) + αt ∇f(x)^T Δx, ∀t ∈ (0, t̄) Otherwise, ∃ {t_k} → 0 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 176 / 228

10-6: line search types II such that f (x + t k x) f (x) + αt k f (x) T x, k However, f (x + t k x) f (x) lim t k 0 t k = f (x) T x α f (x) T x cause a contradiction f (x) T x < 0 and α (0, 1) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 177 / 228

10-6: line search types III Graphical interpretation: the tangent line passes through (0, f (x)), so the equation is y f (x) t 0 = f (x) T x Because f (x) T x < 0, we see that the line of f (x) + αt f (x) T x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 178 / 228

10-6: line search types IV is above that of f (x) + t f (x) T x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 179 / 228
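
A minimal numpy sketch (mine, not from the slides) of the backtracking rule discussed in 10-6; the parameters α = 0.25, β = 0.5 and the quadratic test function are illustrative choices:

import numpy as np

def backtracking(f, grad_f, x, dx, alpha=0.25, beta=0.5):
    """Shrink t until f(x + t*dx) < f(x) + alpha*t*grad_f(x)^T dx."""
    t = 1.0
    fx, g = f(x), grad_f(x)
    while f(x + t * dx) >= fx + alpha * t * g @ dx:
        t *= beta
    return t

# example: one gradient step on f(x) = x_1^2 + 10*x_2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])

x = np.array([1.0, 1.0])
dx = -grad_f(x)                      # a descent direction
t = backtracking(f, grad_f, x, dx)
print(t, f(x + t * dx) < f(x))       # a step size in (0, 1] and True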

10-7 I Linear convergence. We consider exact line search; proof for backtracking line search is more complicated S closed and bounded 2 f (x) MI, x S f (y) f (x) + f (x) T (y x) + M 2 y x 2 Solve min t f (x) t f (x) T f (x) + t2 M 2 f (x)t f (x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 180 / 228

10-7 II t = 1 M f (x next ) f (x 1 1 f (x)) f (x) M 2M f (x)t f (x) The first inequality is from the fact that we use exact line search f (x next ) p f (x) p 1 2M f (x)t f (x) From slide 10-4, f (x) 2 2m(f (x) p ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 181 / 228

10-7 III Hence f (x next ) p (1 m M )(f (x) p ) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 182 / 228

10-8 I Assume x_1^k = γ((γ − 1)/(γ + 1))^k, x_2^k = (−(γ − 1)/(γ + 1))^k, ∇f(x_1, x_2) = [x_1; γx_2], min_t (1/2)((x_1 − tx_1)^2 + γ(x_2 − tγx_2)^2) = min_t (1/2)(x_1^2 (1 − t)^2 + γ x_2^2 (1 − tγ)^2) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 183 / 228

10-8 II x 2 1 (1 t) + γx 2 2 (1 tγ)( γ) = 0 x 2 1 + tx 2 1 γ 2 x 2 2 + γ 3 tx 2 2 = 0 t = x 2 1 + γ2 x 2 2 x 2 1 + γ3 x 2 2 t(x 2 1 + γ 3 x 2 2 ) = x 2 1 + γ 2 x 2 2 = γ2 ( γ 1 γ+1 )2k + γ 2 ( γ 1 γ+1 )2k γ 2 ( γ 1 γ+1 )2k + γ 3 ( γ 1 γ+1 )2k = 2γ2 γ 2 + γ = 2 3 1 + γ [ ] x k+1 = x k t f (x k x k ) = 1 (1 t) (1 γt) x k 2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 184 / 228

10-8 III x_1^{k+1} = γ((γ − 1)/(γ + 1))^k ((γ − 1)/(1 + γ)) = γ((γ − 1)/(γ + 1))^{k+1}, x_2^{k+1} = (−(γ − 1)/(γ + 1))^k (1 − 2γ/(1 + γ)) = (−(γ − 1)/(γ + 1))^k (−(γ − 1)/(1 + γ)) = (−(γ − 1)/(γ + 1))^{k+1} Why is the gradient orthogonal to the tangent line of the contour curve? Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 185 / 228

10-8 IV Assume f(g(t)) is the contour with g(0) = x Then 0 = f(g(t)) − f(g(0)), so 0 = lim_{t→0} (f(g(t)) − f(g(0)))/t = lim_{t→0} ∇f(g(t))^T g'(t) = ∇f(x)^T g'(0) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 186 / 228

10-8 V where x + t g'(0) is the tangent line Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 187 / 228
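
A numpy sketch (mine; γ = 10 is an illustrative value) running gradient descent with exact line search on f(x) = (1/2)(x_1^2 + γx_2^2) from x^0 = (γ, 1) and checking the closed-form iterates discussed above (the sign of the second coordinate alternates):

import numpy as np

gamma = 10.0
x = np.array([gamma, 1.0])
r = (gamma - 1) / (gamma + 1)

for k in range(5):
    expected = np.array([gamma * r**k, (-r)**k])
    assert np.allclose(x, expected)
    g = np.array([x[0], gamma * x[1]])   # gradient of (x1^2 + gamma*x2^2)/2
    # exact line search: minimize f(x - t*g) over t (closed form for a quadratic)
    t = (g @ g) / (g[0]**2 + gamma * g[1]**2)
    x = x - t * g
print(x)    # after 5 steps: [gamma*r^5, (-r)^5]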

10-10 I linear convergence: from slide 10-7 f (x k ) p c k (f (x 0 ) p ) log(c k (f (x 0 ) p )) = k log c + log(f (x 0 ) p ) is a straight line. Note that now k is the x-axis Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 188 / 228

10-11: steepest descent method I (unnormalized) steepest descent direction: Δx_sd = ||∇f(x)||_* Δx_nsd Here ||·||_* is the dual norm We didn't discuss much about dual norms, but we can still explain some examples on 10-12 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 189 / 228

10-12 I Euclidean: Δx_nsd is obtained by solving min ∇f^T v subject to ||v|| = 1 ∇f^T v = ||∇f|| ||v|| cos θ = −||∇f|| when θ = π, i.e., v = −∇f(x)/||∇f(x)|| Δx_nsd = −∇f(x)/||∇f(x)||, Δx_sd = ||∇f(x)|| Δx_nsd = −∇f(x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 190 / 228

10-12 II Quadratic norm: x nsd is by solving min f T v subject to v T Pv = 1 Now v P = v T Pv, where P is symmetric positive definite Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 191 / 228

10-12 III Let w = P 1/2 v The optimization problem becomes min w f T P 1/2 w subject to w = 1 optimal w = P 1/2 f P 1/2 f = P 1/2 f f T P 1 f Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 192 / 228

10-12 IV optimal v = P 1 f f T P 1 f = x nsd Dual norm Therefore x sd z = P 1/2 z = f T P 1 f P 1 f f T P 1 f = P 1 f Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 193 / 228

10-12 V Explanation of the figure: ∇f(x)^T Δx_nsd = ||∇f(x)|| ||Δx_nsd|| cos θ ||∇f(x)|| is a constant. From a point Δx_nsd on the boundary, the projected point on ∇f(x) indicates ||Δx_nsd|| cos θ In the figure, we see that the chosen Δx_nsd has the largest ||Δx_nsd|| cos θ We omit the discussion of the l_1-norm Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 194 / 228

10-13 I The two figures are by using two P matrices The left one has faster convergence Gradient descent after change of variables min x x = P 1/2 x, x = P 1/2 x f (x) min f (P 1/2 x) x x x αp 1/2 x f (P 1/2 x) P 1/2 x P 1/2 x αp 1/2 x f (x) x x αp 1 x f (x) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 195 / 228

10-14 I f (x) x Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 196 / 228

10-14 II Solve f(x) = 0 Finding the tangent line at x_k (the current iterate): (y − f(x_k))/(x − x_k) = f'(x_k) Let y = 0: x_{k+1} = x_k − f(x_k)/f'(x_k) Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 197 / 228
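
A short Python sketch (mine; the equation x^2 − 2 = 0 is an illustrative example) of the update x_{k+1} = x_k − f(x_k)/f'(x_k) derived above:

def newton_root(f, fprime, x0, iters=8):
    """Newton's method for solving f(x) = 0."""
    x = x0
    for _ in range(iters):
        x = x - f(x) / fprime(x)
    return x

# example: solve x^2 - 2 = 0 starting from x0 = 1
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
print(root, root**2)      # approx. 1.41421356..., 2.0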

10-16 I f̂(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x)(y − x) To find inf_y f̂(y): 0 = ∇f(x) + ∇^2 f(x)(y − x), y − x = −∇^2 f(x)^{-1} ∇f(x), f̂(y) = f(x) − (1/2) ∇f(x)^T ∇^2 f(x)^{-1} ∇f(x), f(x) − inf_y f̂(y) = (1/2) λ(x)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 198 / 228

10-16 II Norm of the Newton step in the quadratic Hessian norm: Δx_nt = −∇^2 f(x)^{-1} ∇f(x), Δx_nt^T ∇^2 f(x) Δx_nt = ∇f(x)^T ∇^2 f(x)^{-1} ∇f(x) = λ(x)^2 Directional derivative in the Newton direction: lim_{t→0} (f(x + tΔx_nt) − f(x))/t = ∇f(x)^T Δx_nt = −∇f(x)^T ∇^2 f(x)^{-1} ∇f(x) = −λ(x)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 199 / 228

10-16 III Affine invariance: f̄(y) ≜ f(Ty) = f(x) Assume T is an invertible square matrix. Then λ̄(y) = λ(Ty) Proof: ∇f̄(y) = T^T ∇f(Ty), ∇^2 f̄(y) = T^T ∇^2 f(Ty) T Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 200 / 228

10-16 IV λ̄(y)^2 = ∇f̄(y)^T ∇^2 f̄(y)^{-1} ∇f̄(y) = ∇f(Ty)^T T T^{-1} ∇^2 f(Ty)^{-1} T^{-T} T^T ∇f(Ty) = ∇f(Ty)^T ∇^2 f(Ty)^{-1} ∇f(Ty) = λ(Ty)^2 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 201 / 228
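
A numpy check (mine; the convex test function f(x) = Σ_i exp(a_i^T x) and the matrix T are random illustrative choices) that the Newton decrement is affine invariant, i.e., λ̄(y) = λ(Ty) when f̄(y) = f(Ty):

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 6
A = rng.standard_normal((m, n))      # rows a_i; f(x) = sum_i exp(a_i^T x)

def grad(x):
    return A.T @ np.exp(A @ x)

def hess(x):
    return A.T @ (np.exp(A @ x)[:, None] * A)   # A^T diag(exp(Ax)) A

def decrement2(g, H):
    # lambda^2 = grad^T Hessian^{-1} grad
    return g @ np.linalg.solve(H, g)

T = rng.standard_normal((n, n))      # invertible with probability 1
y = rng.standard_normal(n)
x = T @ y

# chain rule: gradient of f(Ty) is T^T grad(Ty), Hessian is T^T Hess(Ty) T
lam2_y = decrement2(T.T @ grad(x), T.T @ hess(x) @ T)
lam2_x = decrement2(grad(x), hess(x))
print(np.isclose(lam2_y, lam2_x))    # True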

10-17 I Affine invariant y nt = 2 f (y) 1 f (y) = T 1 2 f (Ty) 1 f (Ty) = T 1 x nt Note that y k = T 1 x k so y k+1 = T 1 x k+1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 202 / 228

10-17 II But how about line search f (y) T y nt = f (Ty) T TT 1 x nt = f (x) T x nt Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 203 / 228