1-5: Least-squares I
A: k × n. Usually k > n; otherwise the minimum is easily zero.
Analytical solution:
f(x) = (Ax − b)^T (Ax − b) = x^T A^T A x − 2 b^T A x + b^T b
∇f(x) = 2 A^T A x − 2 A^T b = 0
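As a quick sanity check (not part of the original slides; NumPy and the random test data are illustrative assumptions), the normal equations A^T A x = A^T b from the gradient above can be compared with a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # k = 20 > n = 5
b = rng.standard_normal(20)

# solve the normal equations A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# compare with the library least-squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_lstsq))   # True
```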
1-5: Least-squares II
Regularization, weights:
(1/2) λ x^T x + w_1 (Ax − b)_1^2 + ⋯ + w_k (Ax − b)_k^2
2-4: Convex Combination and Convex Hull I
Convex hull is convex: take two points of the hull
x = θ_1 x_1 + ⋯ + θ_k x_k,  x̄ = θ̄_1 x_1 + ⋯ + θ̄_k x_k
Then
αx + (1 − α)x̄ = αθ_1 x_1 + ⋯ + αθ_k x_k + (1 − α)θ̄_1 x_1 + ⋯ + (1 − α)θ̄_k x_k
2-4: Convex Combination and Convex Hull II
Each coefficient is nonnegative and
αθ_1 + ⋯ + αθ_k + (1 − α)θ̄_1 + ⋯ + (1 − α)θ̄_k = α + (1 − α) = 1
2-7: Euclidean Balls and Ellipsoid I
We prove that any x = x_c + Au with ‖u‖_2 ≤ 1 satisfies
(x − x_c)^T P^{−1} (x − x_c) ≤ 1
Let A = P^{1/2}, which is possible because P is symmetric positive definite. Then
u^T A^T P^{−1} A u = u^T P^{1/2} P^{−1} P^{1/2} u ≤ 1
2-10: Positive Semidefinite Cone I
S^n_+ is a convex cone. Let θ_1 ≥ 0, θ_2 ≥ 0, X_1, X_2 ∈ S^n_+. For any z,
z^T (θ_1 X_1 + θ_2 X_2) z = θ_1 z^T X_1 z + θ_2 z^T X_2 z ≥ 0
2-10: Positive Semidefinite Cone II
Example:
[x y; y z] ∈ S^2_+ is equivalent to x ≥ 0, z ≥ 0, xz − y^2 ≥ 0
If x > 0 (or z > 0) is fixed, we can see that z ≥ y^2/x has a parabolic shape
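A small numerical sketch of this 2×2 example (NumPy and the random sampling are illustrative assumptions, not from the slides): the eigenvalue test and the condition x ≥ 0, z ≥ 0, xz − y^2 ≥ 0 agree.

```python
import numpy as np

def psd_by_eig(x, y, z):
    # [x y; y z] is PSD iff its smallest eigenvalue is >= 0
    return np.linalg.eigvalsh(np.array([[x, y], [y, z]])).min() >= -1e-12

def psd_by_slide(x, y, z):
    # the condition stated on the slide
    return x >= -1e-12 and z >= -1e-12 and x * z - y**2 >= -1e-12

rng = np.random.default_rng(1)
samples = rng.uniform(-1, 1, (1000, 3))
print(all(psd_by_eig(*s) == psd_by_slide(*s) for s in samples))   # True
```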
2-12: Intersection I
When t is fixed,
{(x_1, x_2) | −1 ≤ x_1 cos t + x_2 cos 2t ≤ 1}
gives a region between two parallel lines
This region is convex
2-13: Affine Function I
f(S) is convex: Let f(x_1) ∈ f(S), f(x_2) ∈ f(S)
αf(x_1) + (1 − α)f(x_2) = α(Ax_1 + b) + (1 − α)(Ax_2 + b) = A(αx_1 + (1 − α)x_2) + b ∈ f(S)
2-13: Affine Function II
f^{−1}(C) convex: x_1, x_2 ∈ f^{−1}(C) means that
Ax_1 + b ∈ C, Ax_2 + b ∈ C
Because C is convex,
α(Ax_1 + b) + (1 − α)(Ax_2 + b) = A(αx_1 + (1 − α)x_2) + b ∈ C
Thus αx_1 + (1 − α)x_2 ∈ f^{−1}(C)
2-13: Affine Function III
Scaling: αS = {αx | x ∈ S}
Translation: S + a = {x + a | x ∈ S}
Projection: T = {x_1 ∈ R^m | (x_1, x_2) ∈ S for some x_2 ∈ R^n}, S ⊆ R^m × R^n
Scaling, translation, and projection are all affine functions
2-13: Affine Function IV
For example, for projection
f(x) = [I 0][x_1; x_2] + 0, I: identity matrix
Solution set of a linear matrix inequality:
C = {S | S ⪰ 0} is convex
f(x) = x_1 A_1 + ⋯ + x_m A_m − B "= Ax + b"
f^{−1}(C) = {x | f(x) ⪰ 0} is convex
2-13: Affine Function V
But this isn't rigorous because of some problems in arguing f(x) = Ax + b
A more formal explanation:
C = {s ∈ R^{p^2} | mat(s) ∈ S^p and mat(s) ⪰ 0} is convex
f(x) = x_1 vec(A_1) + ⋯ + x_m vec(A_m) − vec(B) = [vec(A_1) ⋯ vec(A_m)] x + (−vec(B))
2-13: Affine Function VI
f^{−1}(C) = {x | mat(f(x)) ∈ S^p and mat(f(x)) ⪰ 0} is convex
Hyperbolic cone:
C = {(z, t) | z^T z ≤ t^2, t ≥ 0}
is convex (by drawing a figure in 2- or 3-dimensional space)
2-13: Affine Function VII
We have that
f(x) = [P^{1/2} x; c^T x] = [P^{1/2}; c^T] x
is affine. Then
f^{−1}(C) = {x | f(x) ∈ C} = {x | x^T P x ≤ (c^T x)^2, c^T x ≥ 0}
is convex
Perspective and linear-fractional function I
Image convex: if S is convex, check whether {P(x, t) | (x, t) ∈ S} is convex or not
Note that S is in the domain of P
Perspective and linear-fractional function II
Assume (x_1, t_1), (x_2, t_2) ∈ S. We hope
α x_1/t_1 + (1 − α) x_2/t_2 = P(A, B), where (A, B) ∈ S
Perspective and linear-fractional function III
We have
α x_1/t_1 + (1 − α) x_2/t_2
= (α t_2 x_1 + (1 − α) t_1 x_2) / (t_1 t_2)
= (α t_2 x_1 + (1 − α) t_1 x_2) / (α t_1 t_2 + (1 − α) t_1 t_2)
= [ αt_2/(αt_2 + (1 − α)t_1) · x_1 + (1 − α)t_1/(αt_2 + (1 − α)t_1) · x_2 ]
  / [ αt_2/(αt_2 + (1 − α)t_1) · t_1 + (1 − α)t_1/(αt_2 + (1 − α)t_1) · t_2 ]
Perspective and linear-fractional function IV
Let
θ = αt_2 / (αt_2 + (1 − α)t_1)
We have
(θx_1 + (1 − θ)x_2) / (θt_1 + (1 − θ)t_2) = A/B = P(A, B)
Further, (A, B) ∈ S because (x_1, t_1), (x_2, t_2) ∈ S
Perspective and linear-fractional function V
and S is convex
Inverse image is convex: Given a convex set C,
P^{−1}(C) = {(x, t) | P(x, t) = x/t ∈ C}
is convex
Perspective and linear-fractional function VI
Let
(x_1, t_1): x_1/t_1 ∈ C, (x_2, t_2): x_2/t_2 ∈ C
Do we have
θ(x_1, t_1) + (1 − θ)(x_2, t_2) ∈ P^{−1}(C)?
That is,
(θx_1 + (1 − θ)x_2) / (θt_1 + (1 − θ)t_2) ∈ C?
Perspective and linear-fractional function VII
Earlier we had
(θx_1 + (1 − θ)x_2) / (θt_1 + (1 − θ)t_2) = α x_1/t_1 + (1 − α) x_2/t_2, θ = αt_2/(αt_2 + (1 − α)t_1)
Then
(α(t_2 − t_1) + t_1)θ = αt_2
Perspective and linear-fractional function VIII
t_1 θ = αt_2 − αt_2 θ + αt_1 θ
α = t_1 θ / (t_1 θ + (1 − θ)t_2)
2-16: Generalized inequalities I
K contains no line: x with x ∈ K and −x ∈ K ⟹ x = 0
Nonnegative polynomial on [0, 1]: when n = 2 the cone is
{x | x_1 + tx_2 ≥ 0, ∀t ∈ [0, 1]}
t = 1: (figure)
2-16: Generalized inequalities II
t = 0: (figure)
2-16: Generalized inequalities III
t ∈ [0, 1]: intersecting over all t, it really becomes a proper cone
2-17: I
Properties:
x ⪯_K y, u ⪯_K v implies x + u ⪯_K y + v
y − x ∈ K, v − u ∈ K
From the definition of a convex cone,
(y − x) + (v − u) ∈ K
Then x + u ⪯_K y + v
2-18: Minimum and minimal elements I
The minimum element: S ⊆ x_1 + K
A minimal element: (x_2 − K) ∩ S = {x_2}
2-19: Separating hyperplane theorem I
We consider a simplified situation and omit part of the proof
Assume
inf_{u ∈ C, v ∈ D} ‖u − v‖ > 0
and the minimum is attained at c, d
We will show that
a = d − c, b = (‖d‖^2 − ‖c‖^2)/2
forms a separating hyperplane a^T x = b
2-19: Separating hyperplane theorem II
(Figure: c, d and the hyperplane a^T x = b between them)
a = d − c
b = (b_1 + b_2)/2, where b_1 = (d − c)^T d, b_2 = (d − c)^T c,
so b = (d^T d − c^T c)/2
2-19: Separating hyperplane theorem III
Assume the result is wrong, so there is u ∈ D such that
a^T u − b < 0
We will derive a point in D that is closer to c than d is; that is, its distance to c is less than ‖d − c‖
Then we have a contradiction
The concept:
2-19: Separating hyperplane theorem IV
(Figure: c, u, d and the hyperplane a^T x = b)
a^T u − b = (d − c)^T u − (d^T d − c^T c)/2 < 0
implies that
(d − c)^T (u − d) + (1/2)‖d − c‖^2 < 0
2-19: Separating hyperplane theorem V
d/dt ‖d + t(u − d) − c‖^2 |_{t=0} = 2(d + t(u − d) − c)^T (u − d) |_{t=0} = 2(d − c)^T (u − d) < 0
There exists a small t ∈ (0, 1) such that
‖d + t(u − d) − c‖ < ‖d − c‖
However, d + t(u − d) ∈ D, so there is a contradiction
2-20: Supporting hyperplane theorem I
Case 1: C has an interior
Consider 2 sets: the interior of C versus {x_0}, where x_0 is any boundary point
If C is convex, then the interior of C is also convex
Then both sets are convex
We can apply the results on slide 2-19 so that there exists a such that
a^T x ≤ a^T x_0, ∀x ∈ interior of C
2-20: Supporting hyperplane theorem II
Then for every boundary point x we also have
a^T x ≤ a^T x_0
because any boundary point is the limit of interior points
Case 2: C has no interior
In this situation, C is like a line in R^3 (so no interior). Then of course it has a supporting hyperplane
3-4: Examples on R^n and R^{m×n} I
A, X ∈ R^{m×n}
tr(A^T X) = Σ_j (A^T X)_{jj} = Σ_j Σ_i (A^T)_{ji} X_{ij} = Σ_i Σ_j A_{ij} X_{ij}
3-7: First-order Condition I
An open set: for any x, there is a ball covering x such that this ball is in the set
Global underestimator:
(z − f(x)) / (y − x) = ∇f(x), z = f(x) + ∇f(x)(y − x)
3-7: First-order Condition II
⇒: Because domain f is convex, for all 0 < t ≤ 1,
x + t(y − x) ∈ domain f
From convexity,
f(x + t(y − x)) ≤ (1 − t)f(x) + tf(y)
f(y) ≥ f(x) + (f(x + t(y − x)) − f(x)) / t
When t → 0,
f(y) ≥ f(x) + ∇f(x)^T (y − x)
3-7: First-order Condition III
⇐: For any 0 ≤ θ ≤ 1, let z = θx + (1 − θ)y
f(x) ≥ f(z) + ∇f(z)^T (x − z) = f(z) + ∇f(z)^T (1 − θ)(x − y)
f(y) ≥ f(z) + ∇f(z)^T (y − z) = f(z) + ∇f(z)^T θ(y − x)
θf(x) + (1 − θ)f(y) ≥ f(z)
3-7: First-order Condition IV
First-order condition for a strictly convex function: f is strictly convex if and only if
f(y) > f(x) + ∇f(x)^T (y − x), ∀x ≠ y
⇐: it's easy by directly modifying ≥ to >:
f(x) > f(z) + ∇f(z)^T (x − z) = f(z) + ∇f(z)^T (1 − θ)(x − y)
f(y) > f(z) + ∇f(z)^T (y − z) = f(z) + ∇f(z)^T θ(y − x)
3-7: First-order Condition V
⇒: Assume the result is wrong. From the 1st-order condition of a convex function, ∃ x, y such that x ≠ y and
∇f(x)^T (y − x) = f(y) − f(x)   (1)
For this (x, y), from the strict convexity,
f(x + t(y − x)) − f(x) < tf(y) − tf(x) = ∇f(x)^T t(y − x), ∀t ∈ (0, 1)
3-7: First-order Condition VI
Therefore,
f(x + t(y − x)) < f(x) + ∇f(x)^T t(y − x), ∀t ∈ (0, 1)
However, this contradicts the first-order condition:
f(x + t(y − x)) ≥ f(x) + ∇f(x)^T t(y − x), ∀t ∈ (0, 1)
This proof was given by a student of this course before
3-8: Second-order condition I
Proof of the 2nd-order condition. We consider only the simpler situation of n = 1
⇒: from the 1st-order condition, f(x + t) ≥ f(x) + f′(x)t, so
0 ≤ lim_{t→0} 2(f(x + t) − f(x) − f′(x)t) / t^2 = lim_{t→0} 2(f′(x + t) − f′(x)) / (2t) = f″(x)
3-8: Second-order condition II
⇐: by a Taylor expansion,
f(x + t) = f(x) + f′(x)t + (1/2)f″(x̃)t^2 ≥ f(x) + f′(x)t,
so f is convex by the 1st-order condition
The extension to general n is straightforward
If ∇^2 f(x) ≻ 0, then f is strictly convex
3-8: Second-order condition III
Using the 1st-order condition for strictly convex functions:
f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x̃)(y − x) > f(x) + ∇f(x)^T (y − x)
Note that in our proof of the 1st-order condition for strictly convex functions we used the 2nd-order condition of a convex function, so there is no contradiction here
3-8: Second-order condition IV
It's possible that f is strictly convex but ∇^2 f(x) is not positive definite
Example: f(x) = x^4
Details omitted
3-9: Examples I
Quadratic-over-linear f(x, y) = x^2/y:
∂f/∂x = 2x/y, ∂f/∂y = −x^2/y^2
∂^2 f/∂x^2 = 2/y, ∂^2 f/∂x∂y = −2x/y^2, ∂^2 f/∂y^2 = 2x^2/y^3
∇^2 f = (2/y^3) [y; −x][y; −x]^T = (2/y^3) [y^2, −xy; −xy, x^2] ⪰ 0
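A minimal check of this Hessian formula (NumPy is assumed; the sample points are arbitrary): the matrix (2/y^3)[y; −x][y; −x]^T is positive semidefinite whenever y > 0.

```python
import numpy as np

def qol_hessian(x, y):
    # Hessian of f(x, y) = x^2 / y for y > 0, as computed on this slide
    return (2.0 / y**3) * np.array([[y * y, -x * y],
                                    [-x * y, x * x]])

rng = np.random.default_rng(0)
for _ in range(3):
    x, y = rng.uniform(-2, 2), rng.uniform(0.1, 2.0)
    print(np.linalg.eigvalsh(qol_hessian(x, y)).min() >= -1e-12)   # True: PSD
```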
3-10 I
f(x) = log Σ_k exp x_k
∇f(x) = (1/Σ_k e^{x_k}) [e^{x_1}; …; e^{x_n}]
∇^2_{ii} f = ((Σ_k e^{x_k}) e^{x_i} − e^{x_i} e^{x_i}) / (Σ_k e^{x_k})^2
∇^2_{ij} f = −e^{x_i} e^{x_j} / (Σ_k e^{x_k})^2, i ≠ j
Note that if z_k = exp x_k
3-10 II
then
(zz^T)_{ij} = z_i (z^T)_j = z_i z_j
Cauchy-Schwarz inequality:
(a_1 b_1 + ⋯ + a_n b_n)^2 ≤ (a_1^2 + ⋯ + a_n^2)(b_1^2 + ⋯ + b_n^2)
Note that we take
a_k = v_k √z_k, b_k = √z_k, z_k > 0
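The argument above can be sketched numerically (NumPy assumed; the vectors are arbitrary test data): the Hessian diag(z)/1^Tz − zz^T/(1^Tz)^2 is PSD, and the Cauchy-Schwarz step with a_k = v_k√z_k, b_k = √z_k holds.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
z = np.exp(x)
s = z.sum()

# Hessian of log-sum-exp from the formulas above: diag(z)/s - z z^T / s^2
H = np.diag(z) / s - np.outer(z, z) / s**2
print(np.linalg.eigvalsh(H).min() >= -1e-12)     # True: the Hessian is PSD

# Cauchy-Schwarz step: (sum v_k z_k)^2 <= (sum v_k^2 z_k)(sum z_k)
v = rng.standard_normal(6)
print((v @ z) ** 2 <= (v**2 @ z) * s + 1e-12)    # True
```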
3-12: Jensen's inequality I
General form:
f(∫ p(z) z dz) ≤ ∫ p(z) f(z) dz
Discrete situation:
f(Σ p_i z_i) ≤ Σ p_i f(z_i), Σ p_i = 1
3-12: Jensen's inequality II
Proof:
f(p_1 z_1 + p_2 z_2 + p_3 z_3)
≤ (1 − p_3) f((p_1 z_1 + p_2 z_2)/(1 − p_3)) + p_3 f(z_3)
≤ (1 − p_3) (p_1/(1 − p_3) f(z_1) + p_2/(1 − p_3) f(z_2)) + p_3 f(z_3)
= p_1 f(z_1) + p_2 f(z_2) + p_3 f(z_3)
Note that
p_1/(1 − p_3) + p_2/(1 − p_3) = (1 − p_3)/(1 − p_3) = 1
Positive weighted sum & composition with affine function I
Composition with affine function: We know f(x) is convex. Is
g(x) = f(Ax + b)
Positive weighted sum & composition with affine function II
convex?
g((1 − α)x_1 + αx_2)
= f(A((1 − α)x_1 + αx_2) + b)
= f((1 − α)(Ax_1 + b) + α(Ax_2 + b))
≤ (1 − α)f(Ax_1 + b) + αf(Ax_2 + b)
= (1 − α)g(x_1) + αg(x_2)
3-15: Pointwise maximum I
Proof of the convexity:
f((1 − α)x_1 + αx_2)
= max(f_1((1 − α)x_1 + αx_2), …, f_m((1 − α)x_1 + αx_2))
≤ max((1 − α)f_1(x_1) + αf_1(x_2), …, (1 − α)f_m(x_1) + αf_m(x_2))
≤ (1 − α) max(f_1(x_1), …, f_m(x_1)) + α max(f_1(x_2), …, f_m(x_2))
= (1 − α)f(x_1) + αf(x_2)
3-15: Pointwise maximum II
For
f(x) = x_[1] + ⋯ + x_[r]
consider all (n choose r) combinations
3-16: Pointwise supremum I
The proof is similar to pointwise maximum
Support function of a set C: when y is fixed, f(x, y) = y^T x is linear (convex) in x
Maximum eigenvalue of a symmetric matrix: f(X, y) = y^T X y is a linear function of X when y is fixed
3-19: Minimization I
Proof: Let ε > 0. ∃ y_1, y_2 ∈ C such that
f(x_1, y_1) ≤ g(x_1) + ε, f(x_2, y_2) ≤ g(x_2) + ε
g(θx_1 + (1 − θ)x_2)
= inf_{y ∈ C} f(θx_1 + (1 − θ)x_2, y)
≤ f(θx_1 + (1 − θ)x_2, θy_1 + (1 − θ)y_2)
≤ θf(x_1, y_1) + (1 − θ)f(x_2, y_2)
≤ θg(x_1) + (1 − θ)g(x_2) + ε
3-19: Minimization II
Note that the first inequality uses the property that C is convex to have θy_1 + (1 − θ)y_2 ∈ C
Because the above inequality holds for all ε > 0,
g(θx_1 + (1 − θ)x_2) ≤ θg(x_1) + (1 − θ)g(x_2)
3-19: Minimization III
First example: the goal is to prove
A − BC^{−1}B^T ⪰ 0
Instead of a direct proof, here we use the property in this slide. First we have that f(x, y) is convex in (x, y) because
[A B; B^T C] ⪰ 0
Consider min_y f(x, y)
3-19: Minimization IV
Because C ≻ 0, the minimum occurs at
2Cy + 2B^T x = 0
Then y = −C^{−1}B^T x and
g(x) = x^T A x − 2x^T B C^{−1} B^T x + x^T B C^{−1} C C^{−1} B^T x = x^T (A − BC^{−1}B^T) x
3-19: Minimization V
is convex. The second-order condition implies that
A − BC^{−1}B^T ⪰ 0
3-21: the conjugate function I
This function is useful later
When y is fixed, the maximum happens at
y = f′(x)   (2)
by taking the derivative with respect to x
3-21: the conjugate function II
Explanation of the figure: when y is fixed, z = xy is a straight line passing through the origin, where y is the slope of the line
Check under which x, yx and f(x) have the largest distance
From the figure, the largest distance happens when (2) holds
3-21: the conjugate function III
About the point (0, −f*(y)): the tangent line at x_0 is
(z − f(x_0)) / (x − x_0) = f′(x_0),
where x_0 is the point satisfying y = f′(x_0)
When x = 0,
z = −x_0 f′(x_0) + f(x_0) = −x_0 y + f(x_0) = −f*(y)
3-21: the conjugate function IV
f* is convex: Given x, y^T x − f(x) is linear (convex) in y. Then we apply the property of pointwise supremum
3-22: examples I
Negative logarithm f(x) = −log x
∂/∂x (xy + log x) = y + 1/x = 0
If y < 0, the picture of xy + log x (figure)
3-22: examples II
Then
sup_x (xy + log x) = −1 − log(−y)
3-22: examples III
Strictly convex quadratic f(x) = (1/2) x^T Q x
Qx = y, x = Q^{−1}y
y^T x − (1/2) x^T Q x = y^T Q^{−1} y − (1/2) y^T Q^{−1} Q Q^{−1} y = (1/2) y^T Q^{−1} y
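A short numerical sketch (NumPy assumed; the random Q and y are illustrative): solving Qx = y attains the supremum, and the value equals (1/2) y^T Q^{−1} y.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = M @ M.T + np.eye(4)            # symmetric positive definite
y = rng.standard_normal(4)

x_star = np.linalg.solve(Q, y)      # maximizer of y^T x - 0.5 x^T Q x
val = y @ x_star - 0.5 * x_star @ Q @ x_star
print(np.isclose(val, 0.5 * y @ np.linalg.solve(Q, y)))   # f*(y) = 0.5 y^T Q^{-1} y

# any perturbation of x_star gives a smaller value (it is the supremum)
for _ in range(5):
    x = x_star + 0.1 * rng.standard_normal(4)
    print(y @ x - 0.5 * x @ Q @ x <= val + 1e-12)          # True
```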
3-23: quasiconvex functions I
Figure on slide: S_α = [a, b], S_β = (−∞, c]
Both are convex
The figure is an example showing that a quasiconvex function may not be convex
3-26: properties of quasiconvex functions I
Modified Jensen inequality: f quasiconvex if and only if
f(θx + (1 − θ)y) ≤ max{f(x), f(y)}, ∀x, y, θ ∈ [0, 1]
⇒: Let α = max{f(x), f(y)}. Then x ∈ S_α, y ∈ S_α. Because S_α is convex,
θx + (1 − θ)y ∈ S_α
3-26: properties of quasiconvex functions II
f(θx + (1 − θ)y) ≤ α and the result is obtained
⇐: If the result were wrong, there exists α such that S_α is not convex. That is, ∃ x, y, θ with x, y ∈ S_α, θ ∈ [0, 1] such that
θx + (1 − θ)y ∉ S_α
3-26: properties of quasiconvex functions III
Then
f(θx + (1 − θ)y) > α ≥ max{f(x), f(y)}
This violates the assumption
First-order condition (this is exercise 3.43):
3-26: properties of quasiconvex functions IV
⇒: assume f(y) ≤ f(x). Then
f((1 − t)x + ty) ≤ max(f(x), f(y)) = f(x)
(f(x + t(y − x)) − f(x)) / t ≤ 0
lim_{t→0} (f(x + t(y − x)) − f(x)) / t = ∇f(x)^T (y − x) ≤ 0
3-26: properties of quasiconvex functions V
⇐: If the result were wrong, there exists α such that S_α is not convex. ∃ x, y, θ with x, y ∈ S_α, θ ∈ [0, 1] such that
θx + (1 − θ)y ∉ S_α
Then
f(θx + (1 − θ)y) > α ≥ max{f(x), f(y)}   (3)
3-26: properties of quasiconvex functions VI
Because f is differentiable, it is continuous. Without loss of generality, we have
f(z) ≥ f(x), f(y) for z between x and y
Then for z = x + θ(y − x), θ ∈ (0, 1),
∇f(z)^T (−θ(y − x)) ≤ 0
∇f(z)^T (y − x − θ(y − x)) ≤ 0
⟹ ∇f(z)^T (y − x) = 0, ∀θ ∈ (0, 1)
3-26: properties of quasiconvex functions VII
f(x + θ(y − x)) = f(x) + ∇f(z̃)^T θ(y − x) = f(x), θ ∈ [0, 1)
This contradicts (3)
3-27: Log-concave and log-convex functions I
Powers: log(x^a) = a log x, and log x is concave
Probability densities:
log f(x) = −(1/2)(x − x̄)^T Σ^{−1} (x − x̄) + constant
Σ^{−1} is positive definite. Thus log f(x) is concave
3-27: Log-concave and log-convex functions II
Cumulative Gaussian distribution
Φ(x) ∝ ∫_{−∞}^x e^{−u^2/2} du
d/dx log Φ(x) = e^{−x^2/2} / ∫_{−∞}^x e^{−u^2/2} du
d^2/dx^2 log Φ(x) = ( (∫_{−∞}^x e^{−u^2/2} du) e^{−x^2/2} (−x) − e^{−x^2/2} e^{−x^2/2} ) / ( ∫_{−∞}^x e^{−u^2/2} du )^2
3-27: Log-concave and log-convex functions III
Need to prove that
(∫_{−∞}^x e^{−u^2/2} du) x + e^{−x^2/2} > 0
Because
x ≥ u for all u ∈ (−∞, x],
3-27: Log-concave and log-convex functions IV
we have
(∫_{−∞}^x e^{−u^2/2} du) x + e^{−x^2/2}
= ∫_{−∞}^x x e^{−u^2/2} du + e^{−x^2/2}
≥ ∫_{−∞}^x u e^{−u^2/2} du + e^{−x^2/2}
= −e^{−u^2/2} |_{−∞}^x + e^{−x^2/2}
= −e^{−x^2/2} + e^{−x^2/2} = 0
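A small numerical illustration of this conclusion (the standard-library erf is assumed; the evaluation points and step size h are arbitrary): finite-difference estimates of d^2/dx^2 log Φ(x) come out negative, consistent with log-concavity.

```python
from math import erf, sqrt, log

def log_phi(x):
    # log of the standard normal CDF
    return log(0.5 * (1.0 + erf(x / sqrt(2.0))))

h = 1e-4
for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    d2 = (log_phi(x + h) - 2 * log_phi(x) + log_phi(x - h)) / h**2
    print(x, d2)   # the second-derivative estimates are all negative
```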
3-27: Log-concave and log-convex functions V This proof was given by a student (and polished by another student) of this course before Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 81 / 199
4-3: Optimal and locally optimal points I
f_0(x) = 1/x
f_0(x) = x log x
4-3: Optimal and locally optimal points II
f_0(x) = x^3 − 3x
f_0′(x) = 1 + log x = 0, x = e^{−1} = 1/e
4-3: Optimal and locally optimal points III
f_0′(x) = 3x^2 − 3 = 0, x = ±1
4-9: Optimality criterion for differentiable f_0 I
⇐: easy. From the first-order condition,
f_0(y) ≥ f_0(x) + ∇f_0(x)^T (y − x)
Together with
∇f_0(x)^T (y − x) ≥ 0
we have
f_0(y) ≥ f_0(x), for all feasible y
4-9: Optimality criterion for differentiable f_0 II
⇒: Assume the result is wrong. Then there is a feasible y with
∇f_0(x)^T (y − x) < 0
Let z(t) = ty + (1 − t)x
d/dt f_0(z(t)) = ∇f_0(z(t))^T (y − x)
d/dt f_0(z(t)) |_{t=0} = ∇f_0(x)^T (y − x) < 0
4-9: Optimality criterion for differentiable f_0 III
There exists t such that f_0(z(t)) < f_0(x)
Note that z(t) is feasible because
f_i(z(t)) ≤ tf_i(y) + (1 − t)f_i(x) ≤ 0
and
Az(t) = tAy + (1 − t)Ax = tb + (1 − t)b = b
4-9: Optimality criterion for differentiable f 0 IV Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 88 / 199
4-10 I
Unconstrained problem: Let
y = x − t∇f_0(x)
It is feasible (unconstrained problem). The optimality condition implies
∇f_0(x)^T (y − x) = −t‖∇f_0(x)‖^2 ≥ 0
Thus ∇f_0(x) = 0
Equality constrained problem:
4-10 II
⇐: Easy. For any feasible y, Ay = b
∇f_0(x)^T (y − x) = −ν^T A(y − x) = −ν^T (b − b) = 0 ≥ 0
So x is optimal
⇒: more complicated. We only do a rough explanation
4-10 III
From the optimality condition,
∇f_0(x)^T ν = ∇f_0(x)^T ((x + ν) − x) ≥ 0, ∀ν ∈ N(A)
N(A) is a subspace. Thus
ν ∈ N(A) ⟹ −ν ∈ N(A)
4-10 IV
(Figure: ∇f_0 is orthogonal to N(A))
4-10 V
We have
∇f_0(x)^T ν = 0, ∀ν ∈ N(A)
∇f_0(x) ⊥ N(A), so ∇f_0(x) ∈ R(A^T)
∃ν such that ∇f_0(x) + A^T ν = 0
Minimization over the nonnegative orthant
⇐: Easy
4-10 VI
For any y ⪰ 0,
∇_i f_0(x)(y_i − x_i) = { ∇_i f_0(x) y_i ≥ 0 if x_i = 0; 0 if x_i > 0 }
Therefore,
∇f_0(x)^T (y − x) ≥ 0
and x is optimal
4-10 VII
⇒: If x_i = 0, we claim ∇_i f_0(x) ≥ 0
Otherwise, ∇_i f_0(x) < 0. Let y = x except that y_i > 0. Then
∇f_0(x)^T (y − x) = ∇_i f_0(x)(y_i − x_i) = ∇_i f_0(x) y_i < 0
This violates the optimality condition
4-10 VIII
If x_i > 0, we claim ∇_i f_0(x) = 0
Otherwise, assume ∇_i f_0(x) > 0
Consider y = x except y_i = x_i/2 > 0
4-10 IX
It is feasible. Then
∇f_0(x)^T (y − x) = ∇_i f_0(x)(y_i − x_i) = −∇_i f_0(x) x_i/2 < 0
violates the optimality condition. The situation for ∇_i f_0(x) < 0 is similar
4-23: examples I
c̄ = E(C), Σ = E_C((C − c̄)(C − c̄)^T)
Var(C^T x) = E_C((C^T x − c̄^T x)(C^T x − c̄^T x)) = E_C(x^T (C − c̄)(C − c̄)^T x) = x^T Σ x
4-25: second-order cone programming I
The cone was defined on slide 2-8:
{(x, t) | ‖x‖ ≤ t}
4-35: generalized inequality constraint I
f_i: R^n → R^{k_i} is K_i-convex:
f_i(θx + (1 − θ)y) ⪯_{K_i} θf_i(x) + (1 − θ)f_i(y)
See page 3-31
4-37: LP and SOCP as SDP I
LP and equivalent SDP:
Ax = [a_11 ⋯ a_1n; ⋮; a_m1 ⋯ a_mn][x_1; ⋮; x_n]
x_1 diag(a_11, …, a_m1) + ⋯ + x_n diag(a_1n, …, a_mn) ⪯ diag(b_1, …, b_m)
4-37: LP and SOCP as SDP II
For SOCP and SDP we will use the result in 4-39:
[tI_{p×p} A_{p×q}; A^T tI_{q×q}] ⪰ 0 ⟺ A^T A ⪯ t^2 I_{q×q}, t ≥ 0
Now
p = m, q = 1, A = A_i x + b_i, t = c_i^T x + d_i
Thus
‖A_i x + b_i‖^2 ≤ (c_i^T x + d_i)^2, c_i^T x + d_i ≥ 0 ⟺ ‖A_i x + b_i‖ ≤ c_i^T x + d_i
4-39: matrix norm minimization I
Following 4-38, we have the following equivalent problem
min t subject to ‖A‖_2 ≤ t
We then use
‖A‖_2 ≤ t ⟺ A^T A ⪯ t^2 I, t ≥ 0 ⟺ [tI A; A^T tI] ⪰ 0
4-39: matrix norm minimization II
to have the SDP
min t subject to [tI A(x); A(x)^T tI] ⪰ 0
Next we prove
[tI_{p×p} A_{p×q}; A^T tI_{q×q}] ⪰ 0 ⟺ A^T A ⪯ t^2 I_{q×q}, t ≥ 0
4-39: matrix norm minimization III
⇒: we immediately have t ≥ 0
If t > 0,
[v^T A^T  −tv^T] [tI A; A^T tI] [Av; −tv]
= [v^T A^T  −tv^T] [tAv − tAv; A^T Av − t^2 v]
= t(t^2 v^T v − v^T A^T A v) ≥ 0
⟹ v^T (t^2 I − A^T A) v ≥ 0, ∀v
4-39: matrix norm minimization IV
and hence
t^2 I − A^T A ⪰ 0
If t = 0,
[v^T A^T  −v^T] [0 A; A^T 0] [Av; −v] = [v^T A^T  −v^T] [−Av; A^T Av] = −2v^T A^T A v ≥ 0, ∀v
Therefore A^T A ⪯ 0
4-39: matrix norm minimization V
⇐: Consider
[u^T  v^T] [tI A; A^T tI] [u; v] = [u^T  v^T] [tu + Av; A^T u + tv] = tu^T u + 2v^T A^T u + tv^T v
We hope to have
tu^T u + 2v^T A^T u + tv^T v ≥ 0, ∀(u, v)
4-39: matrix norm minimization VI
If t > 0,
min_u tu^T u + 2v^T A^T u + tv^T v
has the optimum at
u = −Av/t
We have
tu^T u + 2v^T A^T u + tv^T v = tv^T v − v^T A^T A v / t = (1/t) v^T (t^2 I − A^T A) v ≥ 0
4-39: matrix norm minimization VII
Hence
[tI A; A^T tI] ⪰ 0
If t = 0, A^T A ⪯ 0. Thus
v^T A^T A v ≤ 0, v^T A^T A v = ‖Av‖^2 = 0, Av = 0, ∀v
[u^T  v^T] [0 A; A^T 0] [u; v] = [u^T  v^T] [Av; A^T u] = v^T A^T u + u^T Av = 0
4-39: matrix norm minimization VIII
Thus
[0 A; A^T 0] ⪰ 0
4-40: Vector optimization I
Though f_0(x) is a vector, note that f_i(x) is still R^n → R^1
K-convex: see 3-31, though we didn't discuss it earlier
4-41: optimal and pareto optimal points I
Optimal: O ⊆ {x} + K
Pareto optimal: (x − K) ∩ O = {x}
5-3: Lagrange dual function I
Note that g is concave no matter whether the original problem is convex or not
f_0(x) + Σ λ_i f_i(x) + Σ ν_i h_i(x) is convex (linear) in (λ, ν) for each x
5-3: Lagrange dual function II
Use pointwise supremum on 3-16:
sup_x (−f_0(x) − Σ λ_i f_i(x) − Σ ν_i h_i(x))
is convex. Hence
inf_x (f_0(x) + Σ λ_i f_i(x) + Σ ν_i h_i(x))
is concave. Note that
sup(−⋯) = convex ⟹ inf(⋯) = −sup(−⋯) = concave
5-8: Lagrange dual and conjugate function I
f_0*(−A^T λ − C^T ν) = sup_x ((−A^T λ − C^T ν)^T x − f_0(x)) = −inf_x (f_0(x) + (A^T λ + C^T ν)^T x)
5-10: weak and strong duality I We don t discuss the SDP problem on this slide because we omitted 5-7 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 116 / 199
5-11: Slater's constraint qualification I
We omit the proof because of no time
"linear inequalities do not need to hold with strict inequality": for linear inequalities we DO NOT need a constraint qualification
We will see some explanation later
5-12: inequality from LP I If we have only linear constraints, then constraint qualification holds Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 118 / 199
5-15: geometric interpretation I
Explanation of g(λ): when λ is fixed,
λu + t = Δ
is a line. We lower Δ until the line touches the boundary of G
The value Δ then becomes g(λ)
When u = 0, t = Δ, so we see the point marked as g(λ) on the t-axis
5-15: geometric interpretation II
We have λ ≥ 0, so the line λu + t = Δ must have a non-positive slope in the (u, t) plane rather than a positive one
5-15: geometric interpretation III
Explanation of p*: in G, only points satisfying u ≤ 0 are feasible
5-16 I
We do not discuss a formal proof of "Slater condition ⟹ strong duality"
Instead, we explain this result by figures
Reason for using A: G may not be convex
Example:
min x^2 subject to x + 2 ≤ 0
5-16 II
This is a convex optimization problem
G = {(x + 2, x^2) | x ∈ R}
is only a quadratic curve
5-16 III
The curve is not convex
However, A is convex
(Figure: the set A)
5-16 IV
Primal problem: x = −2, optimal objective value = 4
Dual problem:
g(λ) = min_x x^2 + λ(x + 2), x = −λ/2
max_{λ≥0} −λ^2/4 + 2λ, optimal λ = 4
5-16 V
optimal objective value = −16/4 + 8 = 4
Proving that A is convex:
(u_1, t_1) ∈ A, (u_2, t_2) ∈ A
∃ x_1, x_2 such that
f_1(x_1) ≤ u_1, f_0(x_1) ≤ t_1
f_1(x_2) ≤ u_2, f_0(x_2) ≤ t_2
Consider
x = θx_1 + (1 − θ)x_2
5-16 VI
We have
f_1(x) ≤ θu_1 + (1 − θ)u_2
f_0(x) ≤ θt_1 + (1 − θ)t_2
So
[u; t] = θ[u_1; t_1] + (1 − θ)[u_2; t_2] ∈ A
Note that we have "Slater condition ⟹ strong duality"
However, it's possible that the Slater condition doesn't hold but strong duality holds
5-16 VII
Example from exercise 5.22:
min x subject to x^2 ≤ 0
Slater condition doesn't hold because no x satisfies x^2 < 0
G = {(x^2, x) | x ∈ R}
5-16 VIII
(Figure: G in the (u, t) plane)
There is only one feasible point, (0, 0)
5-16 IX
(Figure: the set A in the (u, t) plane)
5-16 X
Dual problem:
g(λ) = min_x x + λx^2
x = −1/(2λ) if λ > 0, g(λ) = −∞ if λ = 0
max_{λ≥0} −1/(4λ), objective value → 0
d* = 0, p* = 0
Strong duality holds
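A rough numerical check of both examples on these slides (NumPy and the grids of λ values are illustrative assumptions):

```python
import numpy as np

# slide 5-16: min x^2 s.t. x + 2 <= 0; dual is max_{lam>=0} -lam^2/4 + 2*lam
lam = np.linspace(0.0, 10.0, 100001)
print((-lam**2 / 4 + 2 * lam).max())   # ~4, equal to the primal optimum at x = -2

# exercise 5.22: min x s.t. x^2 <= 0; dual is max_{lam>0} -1/(4*lam)
lam = np.linspace(0.1, 1e5, 100001)
print((-1 / (4 * lam)).max())          # approaches 0 = p*, so d* = p* although Slater fails
```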
5-17: complementary slackness I
In deriving the inequality we use h_i(x*) = 0 and f_i(x*) ≤ 0
Complementary slackness: compare with the earlier results in 4-10
5-19: KKT conditions for convex problem I
For the problem on p5-16, neither the Slater condition nor the KKT condition holds:
1 + λ · 0 ≠ 0
Therefore, for convex problems, KKT ⟹ optimality, but not vice versa
Next we explain why for linear constraints we don't need a constraint qualification
5-19: KKT conditions for convex problem II
Consider the situation of inequality constraints only:
min f_0(x) subject to f_i(x) ≤ 0, i = 1, …, m
Consider an optimal solution x. We would like to prove that x satisfies the KKT condition
Because x is optimal, from the optimality condition on slide 4-9, for any feasible direction δx,
∇f_0(x)^T δx ≥ 0
5-19: KKT conditions for convex problem III
A feasible δx means
f_i(x + δx) ≤ 0, ∀i
Because
f_i(x + δx) ≈ f_i(x) + ∇f_i(x)^T δx
from f_i(x) ≤ 0, ∀i, we require
∇f_i(x)^T δx ≤ 0 if f_i(x) = 0
5-19: KKT conditions for convex problem IV
We claim that
−∇f_0(x) = Σ_{i: f_i(x)=0} λ_i ∇f_i(x), λ_i ≥ 0   (4)
Assume the result is wrong. First let's consider
∇f_0(x) = linear combination of {∇f_i(x) | f_i(x) = 0} + Δ,
where Δ ≠ 0 and Δ^T ∇f_i(x) = 0, ∀i: f_i(x) = 0
5-19: KKT conditions for convex problem V
Then there exists α < 0 such that
δx = αΔ
satisfies
∇f_i(x)^T δx = 0 if f_i(x) = 0
f_i(x + δx) ≤ 0 if f_i(x) < 0
and
∇f_0(x)^T δx = αΔ^T Δ < 0
This contradicts the optimality condition
5-19: KKT conditions for convex problem VI
By a similar setting we can further prove (4). Assume
−∇f_0(x) = Σ_{i: f_i(x)=0} λ_i ∇f_i(x)
and there exists ī such that
λ_ī < 0, ∇f_ī(x) ≠ 0, and f_ī(x) = 0
5-19: KKT conditions for convex problem VII
Let
λ̄ = arg min_λ ‖∇f_ī(x) − Σ_{i: i≠ī, f_i(x)=0} ∇f_i(x) λ_i‖
Δ = ∇f_ī(x) − Σ_{i: i≠ī, f_i(x)=0} λ̄_i ∇f_i(x)
Then
∇f_i(x)^T Δ = 0, ∀i ≠ ī, f_i(x) = 0   (5)
5-19: KKT conditions for convex problem VIII
We have
∇f_ī(x)^T Δ ≥ 0
If
∇f_ī(x)^T Δ = 0,
then λ_ī ∇f_ī(x) can be rearranged to a linear combination of {∇f_i(x) | i ≠ ī, f_i(x) = 0}
5-19: KKT conditions for convex problem IX
Otherwise,
∇f_ī(x)^T Δ > 0
Let
δx = αΔ, α < 0
From (5),
∇f_i(x)^T δx = 0, ∀i ≠ ī, f_i(x) = 0
∇f_ī(x)^T δx = α ∇f_ī(x)^T Δ < 0
Hence δx is a feasible direction
5-19: KKT conditions for convex problem X
However,
∇f_0(x)^T δx = −αλ_ī ∇f_ī(x)^T Δ < 0
contradicts the optimality condition
This proof is not rigorous because of ≈
For linear f_i the proof becomes rigorous
5-25 I
Explanation of −f_0*(ν):
inf_y (f_0(y) − ν^T y) = −sup_y (ν^T y − f_0(y)) = −f_0*(ν)
where f_0*(ν) is the conjugate function
5-26 I
The original problem:
min ‖x‖ subject to Ax = b
g(ν) = inf_x (‖x‖ + ν^T(Ax − b)) = constant + inf_x (‖x‖ + (A^T ν)^T x)
Dual norm:
‖ν‖_* = sup{ν^T y | ‖y‖ ≤ 1}
If ‖ν‖_* > 1, ∃ y* with ‖y*‖ ≤ 1 and ν^T y* > 1
5-26 II
For y = −y*,
‖y‖ + ν^T y = ‖y*‖ − ν^T y* < 0
Hence ‖ty‖ + ν^T(ty) → −∞ as t → ∞, and
inf_y ‖y‖ + ν^T y = −∞
If ‖ν‖_* ≤ 1, we claim that
inf_y ‖y‖ + ν^T y = 0
y = 0 ⟹ ‖y‖ + ν^T y = 0
5-26 III
If ∃ y such that
‖y‖ + ν^T y < 0,
then ‖y‖ < −ν^T y
We can scale y so that
sup{ν^T y | ‖y‖ ≤ 1} > 1,
but this causes a contradiction
5-27: implicit constraint I
The dual function:
c^T x + ν^T(Ax − b) = −b^T ν + x^T(A^T ν + c)
inf_{−1 ≤ x_i ≤ 1} x_i (A^T ν + c)_i = −|(A^T ν + c)_i|
5-30: semidefinite program I
From 5-29 we need that Z is nonnegative in the dual cone of S^k_+
The dual cone of S^k_+ is S^k_+ (we didn't discuss dual cones, so we assume this result)
Why tr(Z(⋯))? We are supposed to do a component-wise product between Z and
x_1 F_1 + ⋯ + x_n F_n − G
5-30: semidefinite program II
Trace is the component-wise product:
tr(AB) = Σ_i (AB)_{ii} = Σ_i Σ_j A_{ij} B_{ji} = Σ_i Σ_j A_{ij} B_{ij}
Note that we use the property that B is symmetric
7-4 I
Uniform noise:
p(z) = 1/(2a) if |z| ≤ a, 0 otherwise
8-10: Dual of maximum margin problem I
Lagrangian:
‖a‖/2 − Σ_i λ_i(a^T x_i + b − 1) + Σ_i μ_i(a^T y_i + b + 1)
= ‖a‖/2 + a^T(−Σ_i λ_i x_i + Σ_i μ_i y_i) + b(−Σ_i λ_i + Σ_i μ_i) + Σ_i λ_i + Σ_i μ_i
8-10: Dual of maximum margin problem II
Because of b(−Σ_i λ_i + Σ_i μ_i), we have
inf_{a,b} L = { inf_a ‖a‖/2 − Σ_i λ_i a^T x_i + Σ_i μ_i a^T y_i  if Σ_i λ_i = Σ_i μ_i;  −∞ if Σ_i λ_i ≠ Σ_i μ_i }
For
inf_a ‖a‖/2 − Σ_i λ_i a^T x_i + Σ_i μ_i a^T y_i
8-10: Dual of maximum margin problem III
we can denote it as
inf_a ‖a‖/2 + v^T a,
where v is a vector. We cannot take the derivative because ‖a‖ is not differentiable. Formal solution:
8-10: Dual of maximum margin problem IV
Case 1: If ‖v‖ ≤ 1/2:
a^T v ≥ −‖a‖ ‖v‖ ≥ −‖a‖/2,
so
‖a‖/2 + v^T a ≥ 0
However, a = 0 gives
‖a‖/2 + v^T a = 0
8-10: Dual of maximum margin problem V
Therefore
inf_a ‖a‖/2 + v^T a = 0
If ‖v‖ > 1/2, let a = −tv/‖v‖. Then
‖a‖/2 + v^T a = t/2 − t‖v‖ = t(1/2 − ‖v‖) → −∞ as t → ∞
8-10: Dual of maximum margin problem VI
Thus
inf_a ‖a‖/2 + v^T a = −∞
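A small numerical sketch of the two cases above (NumPy assumed; v and the trial points are arbitrary test data, not from the slides):

```python
import numpy as np

def obj(a, v):
    return np.linalg.norm(a) / 2 + v @ a

rng = np.random.default_rng(0)
u = rng.standard_normal(3)
u /= np.linalg.norm(u)

# case ||v|| <= 1/2: the infimum is 0 (no random trial point goes below it)
v = 0.4 * u
trials = 10 * rng.standard_normal((20000, 3))
print(min(obj(a, v) for a in trials) >= 0)      # True

# case ||v|| > 1/2: unbounded below along a = -t v / ||v||
v = 0.8 * u
for t in [1.0, 10.0, 100.0]:
    print(obj(-t * v / np.linalg.norm(v), v))   # equals t*(1/2 - ||v||), goes to -inf
```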
8-14. θ = vec(p) z i z j q, F (z) =. r z i. 1 Chih-Jen Lin (National Taiwan Univ.) Optimization and Machine Learning 157 / 199
10-3: initial point and sublevel set I
The condition: S is closed if
f(x) → ∞ as x → boundary of domain f
Proof: if S were not closed, consider {x_i} ⊆ S with x_i → a boundary point of domain f. Then
f(x_i) → ∞ > f(x_0),
a contradiction
10-3: initial point and sublevel set II
Example:
f(x) = log(Σ_{i=1}^m exp(a_i^T x + b_i)), domain = R^n
10-3: initial point and sublevel set III
Example:
f(x) = −Σ_i log(b_i − a_i^T x)
We use the condition that domain ≠ R^n and
f(x) → ∞ as x → boundary of domain f
10-4: strong convexity and implications I
S is bounded. Otherwise, there exists a set {y_i | y_i = x + Δ_i} ⊆ S satisfying ‖Δ_i‖ → ∞
Then
f(y_i) ≥ f(x) + ∇f(x)^T Δ_i + (m/2)‖Δ_i‖^2 → ∞
This contradicts
f(y_i) ≤ f(x_0)
10-4: strong convexity and implications II
Proof of
p* > −∞ and f(x) − p* ≤ (1/2m)‖∇f(x)‖^2
From
f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)‖x − y‖^2
Minimize the right-hand side with respect to y:
∇f(x) + m(y − x) = 0
10-4: strong convexity and implications III
ỹ = x − ∇f(x)/m
f(y) ≥ f(x) + ∇f(x)^T(ỹ − x) + (m/2)‖ỹ − x‖^2 = f(x) − (1/2m)‖∇f(x)‖^2, ∀y
Then
p* ≥ f(x) − (1/2m)‖∇f(x)‖^2
and
f(x) − p* ≤ (1/2m)‖∇f(x)‖^2
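A quick check of the bound f(x) − p* ≤ ‖∇f(x)‖^2/(2m) on a strongly convex quadratic (NumPy and the random data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Q = M @ M.T + np.eye(5)                 # f(x) = 0.5 x^T Q x - b^T x, strongly convex
b = rng.standard_normal(5)

m = np.linalg.eigvalsh(Q).min()         # strong convexity constant
x_star = np.linalg.solve(Q, b)
p_star = 0.5 * x_star @ Q @ x_star - b @ x_star

for _ in range(5):
    x = rng.standard_normal(5)
    f_x = 0.5 * x @ Q @ x - b @ x
    g = Q @ x - b
    print(f_x - p_star <= g @ g / (2 * m) + 1e-9)   # the bound on this slide holds
```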
10-5: descent methods I
If
f(x + tΔx) < f(x),
then
∇f(x)^T Δx < 0
Proof: From the first-order condition of a convex function,
f(x + tΔx) ≥ f(x) + t∇f(x)^T Δx
Then
t∇f(x)^T Δx ≤ f(x + tΔx) − f(x) < 0
10-6: line search types I
Why α ∈ (0, 1/2)? The use of 1/2 is for convergence, though we won't discuss details
Finite termination of backtracking line search. We argue that ∃ t* > 0 such that
f(x + tΔx) < f(x) + αt∇f(x)^T Δx, ∀t ∈ (0, t*)
Otherwise, ∃ {t_k} → 0
10-6: line search types II
such that
f(x + t_k Δx) ≥ f(x) + αt_k ∇f(x)^T Δx, ∀k
However,
lim_{t_k → 0} (f(x + t_k Δx) − f(x)) / t_k = ∇f(x)^T Δx ≥ α∇f(x)^T Δx
causes a contradiction because ∇f(x)^T Δx < 0 and α < 1
10-6: line search types III
Geometric interpretation: the tangent line passes through (0, f(x)), so its equation is
(y − f(x)) / (t − 0) = ∇f(x)^T Δx
Because ∇f(x)^T Δx < 0, we see that the line of
f(x) + αt∇f(x)^T Δx
10-6: line search types IV
is above that of
f(x) + t∇f(x)^T Δx
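A minimal sketch of the backtracking rule discussed on slide 10-6 (NumPy, the test function, and the parameter values α = 0.3, β = 0.8 are illustrative assumptions, not from the slides):

```python
import numpy as np

def backtracking(f, grad, x, dx, alpha=0.3, beta=0.8):
    # shrink t until f(x + t*dx) < f(x) + alpha*t*grad(x)^T dx
    t, fx, g = 1.0, f(x), grad(x)
    while f(x + t * dx) >= fx + alpha * t * (g @ dx):
        t *= beta
    return t

# one gradient step on a smooth convex test function
f = lambda x: np.log(np.exp(x[0]) + np.exp(-x[0]) + np.exp(x[1]))
grad = lambda x: np.array([np.exp(x[0]) - np.exp(-x[0]), np.exp(x[1])]) / \
                 (np.exp(x[0]) + np.exp(-x[0]) + np.exp(x[1]))
x = np.array([1.0, 1.0])
dx = -grad(x)
t = backtracking(f, grad, x, dx)
print(t, f(x + t * dx) < f(x))   # accepted step size; the objective decreased
```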
10-7 I
Linear convergence. We consider exact line search; the proof for backtracking line search is more complicated
S closed and bounded, ∇^2 f(x) ⪯ MI, ∀x ∈ S
f(y) ≤ f(x) + ∇f(x)^T(y − x) + (M/2)‖y − x‖^2
Solve
min_t f(x) − t∇f(x)^T ∇f(x) + (t^2 M/2)∇f(x)^T ∇f(x)
10-7 II
t = 1/M
f(x_next) ≤ f(x − (1/M)∇f(x)) ≤ f(x) − (1/2M)∇f(x)^T ∇f(x)
The first inequality is from the fact that we use exact line search
f(x_next) − p* ≤ f(x) − p* − (1/2M)∇f(x)^T ∇f(x)
From slide 10-4,
‖∇f(x)‖^2 ≥ 2m(f(x) − p*)
10-7 III
Hence
f(x_next) − p* ≤ (1 − m/M)(f(x) − p*)
10-8 I
Assume
x_1^k = γ((γ − 1)/(γ + 1))^k, x_2^k = (−(γ − 1)/(γ + 1))^k
∇f(x_1, x_2) = [x_1; γx_2]
Exact line search:
min_t (1/2)((x_1 − tx_1)^2 + γ(x_2 − tγx_2)^2) = min_t (1/2)(x_1^2(1 − t)^2 + γx_2^2(1 − tγ)^2)
10-8 II
−x_1^2(1 − t) − γ^2 x_2^2(1 − tγ) = 0
−x_1^2 + tx_1^2 − γ^2 x_2^2 + γ^3 tx_2^2 = 0
t = (x_1^2 + γ^2 x_2^2) / (x_1^2 + γ^3 x_2^2)
= (γ^2((γ−1)/(γ+1))^{2k} + γ^2((γ−1)/(γ+1))^{2k}) / (γ^2((γ−1)/(γ+1))^{2k} + γ^3((γ−1)/(γ+1))^{2k})
= 2γ^2 / (γ^2 + γ^3) = 2 / (1 + γ)
x^{k+1} = x^k − t∇f(x^k) = [x_1^k(1 − t); x_2^k(1 − γt)]
10-8 III
x_1^{k+1} = γ((γ − 1)/(γ + 1))^k (1 − 2/(1 + γ)) = γ((γ − 1)/(γ + 1))^{k+1}
x_2^{k+1} = (−(γ − 1)/(γ + 1))^k (1 − 2γ/(1 + γ)) = (−(γ − 1)/(γ + 1))^k (−(γ − 1)/(γ + 1)) = (−(γ − 1)/(γ + 1))^{k+1}
Why is the gradient orthogonal to the tangent line of the contour curve?
10-8 IV
Assume f(g(t)) is the contour with g(0) = x. Then
0 = f(g(t)) − f(g(0))
0 = lim_{t→0} (f(g(t)) − f(g(0))) / t = ∇f(g(t))^T g′(t) |_{t=0} = ∇f(x)^T g′(0)
10-8 V
where
x + tg′(0)
is the tangent line
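The closed form above can be reproduced numerically (NumPy assumed; γ = 10 and x^0 = (γ, 1) follow the slide's setup; for this quadratic, exact line search gives t = g^T g / g^T Q g with Q = diag(1, γ)):

```python
import numpy as np

gamma = 10.0
grad = lambda x: np.array([x[0], gamma * x[1]])

x = np.array([gamma, 1.0])                          # x^0 = (gamma, 1)
r = (gamma - 1) / (gamma + 1)
for k in range(1, 6):
    g = grad(x)
    t = (g @ g) / (g[0]**2 + gamma * g[1]**2)       # exact line search step
    x = x - t * g
    print(np.allclose(x, [gamma * r**k, (-r)**k]))  # matches the closed form above
```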
10-10 I
Linear convergence: from slide 10-7,
f(x^k) − p* ≤ c^k(f(x^0) − p*)
log(c^k(f(x^0) − p*)) = k log c + log(f(x^0) − p*)
is a straight line. Note that now k is the x-axis
10-11: steepest descent method I
(Unnormalized) steepest descent direction:
Δx_sd = ‖∇f(x)‖_* Δx_nsd
Here ‖·‖_* is the dual norm
We didn't discuss much about the dual norm, but we can still explain some examples on 10-12
10-12 I
Euclidean: Δx_nsd is obtained by solving
min ∇f^T v subject to ‖v‖ = 1
∇f^T v = ‖∇f‖ ‖v‖ cos θ = −‖∇f‖ when θ = π
Δx_nsd = −∇f(x)/‖∇f(x)‖
Δx_sd = ‖∇f(x)‖_* Δx_nsd = ‖∇f(x)‖ (−∇f(x)/‖∇f(x)‖) = −∇f(x)
10-12 II
Quadratic norm: Δx_nsd is obtained by solving
min ∇f^T v subject to v^T P v = 1
Now ‖v‖_P = √(v^T P v), where P is symmetric positive definite
10-12 III
Let
w = P^{1/2} v
The optimization problem becomes
min_w ∇f^T P^{−1/2} w subject to ‖w‖ = 1
optimal w = −P^{−1/2}∇f / ‖P^{−1/2}∇f‖ = −P^{−1/2}∇f / √(∇f^T P^{−1} ∇f)
10-12 IV
optimal v = −P^{−1}∇f / √(∇f^T P^{−1} ∇f)
Dual norm:
‖z‖_* = ‖P^{−1/2} z‖
Therefore
Δx_sd = √(∇f^T P^{−1} ∇f) · (−P^{−1}∇f / √(∇f^T P^{−1} ∇f)) = −P^{−1}∇f
10-12 V
Explanation of the figure:
−∇f(x)^T Δx_nsd = ‖∇f(x)‖ ‖Δx_nsd‖ cos θ
‖∇f(x)‖ is a constant. From a point Δx_nsd on the boundary, the projected point on −∇f(x) indicates ‖Δx_nsd‖ cos θ
In the figure, we see that the chosen Δx_nsd has the largest ‖Δx_nsd‖ cos θ
We omit the discussion of the ℓ1-norm
10-13 I
The two figures are obtained by using two P matrices
The left one has faster convergence
Gradient descent after change of variables:
min_x f(x), x̄ = P^{1/2}x, x = P^{−1/2}x̄
min_{x̄} f(P^{−1/2}x̄)
x̄ ← x̄ − αP^{−1/2}∇f(P^{−1/2}x̄)
P^{1/2}x ← P^{1/2}x − αP^{−1/2}∇f(x)
x ← x − αP^{−1}∇f(x)
10-14 I
(Figure: the curve f(x) versus x for Newton's method of solving f(x) = 0)
10-14 II
Solve f(x) = 0
Finding the tangent line at x_k (x_k: the current iterate):
(y − f(x_k)) / (x − x_k) = f′(x_k)
Let y = 0:
x_{k+1} = x_k − f(x_k)/f′(x_k)
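A minimal sketch of this one-dimensional iteration (the example equation x^2 − 2 = 0 and the iteration count are illustrative assumptions):

```python
def newton_root(f, fprime, x0, iters=8):
    # x_{k+1} = x_k - f(x_k) / f'(x_k)
    x = x0
    for _ in range(iters):
        x = x - f(x) / fprime(x)
    return x

print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, 1.0))   # ~1.4142135..., sqrt(2)
```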
10-16 I
f̂(y) = f(x) + ∇f(x)^T(y − x) + (1/2)(y − x)^T ∇^2 f(x)(y − x)
∇f̂(y) = 0 ⟹ ∇f(x) + ∇^2 f(x)(y − x) = 0
y − x = −∇^2 f(x)^{−1} ∇f(x)
inf_y f̂(y) = f(x) − (1/2)∇f(x)^T ∇^2 f(x)^{−1} ∇f(x)
f(x) − inf_y f̂(y) = (1/2)λ(x)^2
10-16 II
Norm of the Newton step in the quadratic Hessian norm:
Δx_nt = −∇^2 f(x)^{−1} ∇f(x)
Δx_nt^T ∇^2 f(x) Δx_nt = ∇f(x)^T ∇^2 f(x)^{−1} ∇f(x) = λ(x)^2
Directional derivative in the Newton direction:
lim_{t→0} (f(x + tΔx_nt) − f(x)) / t = ∇f(x)^T Δx_nt = −∇f(x)^T ∇^2 f(x)^{−1} ∇f(x) = −λ(x)^2
10-16 III
Affine invariance: let
f̄(y) ≡ f(Ty) = f(x)
Assume T is an invertible square matrix. Then
λ̄(y) = λ(Ty)
Proof:
∇f̄(y) = T^T ∇f(Ty), ∇^2 f̄(y) = T^T ∇^2 f(Ty) T
10-16 IV
λ̄(y)^2 = ∇f̄(y)^T ∇^2 f̄(y)^{−1} ∇f̄(y)
= ∇f(Ty)^T T T^{−1} ∇^2 f(Ty)^{−1} T^{−T} T^T ∇f(Ty)
= ∇f(Ty)^T ∇^2 f(Ty)^{−1} ∇f(Ty) = λ(Ty)^2
10-17 I
Affine invariance:
Δy_nt = −∇^2 f̄(y)^{−1} ∇f̄(y) = −T^{−1} ∇^2 f(Ty)^{−1} ∇f(Ty) = T^{−1} Δx_nt
Note that y^k = T^{−1} x^k, so y^{k+1} = T^{−1} x^{k+1}
10-17 II
But how about line search?
∇f̄(y)^T Δy_nt = ∇f(Ty)^T T T^{−1} Δx_nt = ∇f(x)^T Δx_nt
10-19 I
η ∈ (0, m^2/L)
‖∇f(x_k)‖ ≤ η ≤ m^2/L ⟹ (L/2m^2)‖∇f(x_k)‖ ≤ 1/2
10-20 I
f(x_l) − f(x*) ≤ (1/2m)‖∇f(x_l)‖^2 ≤ (1/2m)(4m^4/L^2)(1/2)^{2^{l−k}·2} = (2m^3/L^2)(1/2)^{2^{l−k+1}} ≤ ε
ε_0 = 2m^3/L^2 (from p10-4)
10-20 II
In at most
(f(x_0) − p*)/γ + log_2 log_2(ε_0/ε)
iterations, we have f(x_l) − f(x*) ≤ ε:
log_2 ε_0 − 2^{l−k+1} ≤ log_2 ε
2^{l−k+1} ≥ log_2(ε_0/ε)
l − k ≥ −1 + log_2 log_2(ε_0/ε)
k ≤ (f(x_0) − p*)/γ
10-29: implementation I
λ(x) = (∇f(x)^T ∇^2 f(x)^{−1} ∇f(x))^{1/2} = (g^T L^{−T} L^{−1} g)^{1/2} = ‖L^{−1} g‖_2
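A small check of this identity (NumPy assumed; H and g are random stand-ins for the Hessian and gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + np.eye(5)          # stands in for the Hessian (positive definite)
g = rng.standard_normal(5)       # stands in for the gradient

lam_direct = np.sqrt(g @ np.linalg.solve(H, g))

L = np.linalg.cholesky(H)        # H = L L^T
lam_chol = np.linalg.norm(np.linalg.solve(L, g))   # ||L^{-1} g||_2

print(np.isclose(lam_direct, lam_chol))   # True: the two expressions agree
```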
10-30: example of dense Newton systems with structure I
For f(x) = Σ_i ψ_i(x_i) + ψ_0(Ax + b):
∇f(x) = [ψ_1′(x_1); ⋮; ψ_n′(x_n)] + A^T ∇ψ_0(Ax + b)
∇^2 f(x) = diag(ψ_1″(x_1), …, ψ_n″(x_n)) + A^T ∇^2 ψ_0(Ax + b) A
method 2:
Δx = D^{−1}(−g − A^T L_0 w)
10-30: example of dense Newton systems with structure II
L_0^T A D^{−1}(−g − A^T L_0 w) = w
(I + L_0^T A D^{−1} A^T L_0) w = −L_0^T A D^{−1} g
Cost:
L_0: p × p
A^T L_0: n × p, cost: O(np)
(L_0^T A) D^{−1} (A^T L_0): O(p^2 n)
10-30: example of dense Newton systems with structure III
Note that the Cholesky factorization of H_0 costs (1/3)p^3 ≤ p^2 n, as p ≤ n
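A minimal sketch of method 2 for this structured system (NumPy assumed; the sizes n = 200, p = 5 and the random data are illustrative): eliminating Δx and solving the small p × p system gives the same Newton step as a dense solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
A = rng.standard_normal((p, n))
d = rng.uniform(1.0, 2.0, n)                  # diag(psi_i''(x_i)) > 0
M = rng.standard_normal((p, p))
H0 = M @ M.T + np.eye(p)                      # nabla^2 psi_0, positive definite
g = rng.standard_normal(n)

H = np.diag(d) + A.T @ H0 @ A                 # the structured Hessian
dx_direct = np.linalg.solve(H, -g)            # dense O(n^3) solve

# method 2: factor H0 = L0 L0^T and solve the small p x p system for w
L0 = np.linalg.cholesky(H0)
B = L0.T @ A                                  # L0^T A, p x n
S = np.eye(p) + (B / d) @ B.T                 # I + L0^T A D^{-1} A^T L0
w = np.linalg.solve(S, -B @ (g / d))          # right-hand side -L0^T A D^{-1} g
dx_method2 = (-g - B.T @ w) / d               # D^{-1}(-g - A^T L0 w)
print(np.allclose(dx_direct, dx_method2))     # True
```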